Re: [Linux-HA] odd cluster failure
On Thu, Feb 9, 2017 at 2:28 AM, Ferenc Wágner wrote:
> Looks like your VM resource was destroyed (maybe due to the xen balloon
> errors above), and the monitor operation noticed this.

Thank you for helping me interpret that. I think what happened is that the VM in question (radnets) is the only one that did not have maxmem specified in its config file. It probably suffered memory pressure, and the hypervisor tried to give it more memory, but ballooning is turned off in the hypervisor. That's probably where the balloon errors come from. The VM probably got hung up because it ran out of memory, causing the monitor to fail.

There is a little guesswork going on here, because I do not fully understand how Xen ballooning works (or is supposed to work), but it seems like I should set maxmem for this VM like all the others, and I increased its available memory as well. Now I just wait and see if it happens again.

--Greg

___
Linux-HA mailing list is closing down. Please subscribe to us...@clusterlabs.org instead.
http://clusterlabs.org/mailman/listinfo/users
___
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
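[Editor's note: a hedged sketch of the fix described above, as a Xen domain config fragment. The file path and memory sizes are illustrative, not taken from the post; `memory` and `maxmem` are standard xm/xl domain config options.]

```
# /etc/xen/radnets.cfg (hypothetical path; sizes are examples only)
name   = "radnets"
memory = 2048     # MiB the guest boots with
maxmem = 2048     # maxmem == memory leaves the balloon driver no headroom,
                  # so the hypervisor never attempts to add memory
```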
[Linux-HA] odd cluster failure
For the second time in a few weeks, we have had one node of a particular cluster getting fenced. It isn't totally clear why this is happening. On the surviving node I see:

Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
Feb 2 16:49:00 vmc1 kernel: igb :03:00.1 eth3: igb: eth3 NIC Link is Down
Feb 2 16:49:00 vmc1 kernel: xenbr0: port 1(eth3) entered disabled state
Feb 2 16:49:01 vmc1 corosync[2846]: [TOTEM ] A processor failed, forming new configuration.

OK, so from this point of view, it looks like the link was lost between the two hosts, resulting in fencing. The link is a crossover cable, so no networking hardware other than the host NICs and the cable. On the other side I see:

Feb 2 16:46:46 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:46 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:47 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:48 vmc2 kernel: device vif17.0 left promiscuous mode
Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:48 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:49 vmc2 crmd[4191]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: fail-count-VM-radnets (1)
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 37: fail-count-VM-radnets=1
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: last-failure-VM-radnets (1486079209)
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 39: last-failure-VM-radnets=1486079209
Feb 2 16:46:50 vmc2 pengine[4190]: notice: On loss of CCM Quorum: Ignore
Feb 2 16:46:50 vmc2 pengine[4190]: warning: Processing failed op monitor for VM-radnets on vmc2.ucar.edu: not running (7)
Feb 2 16:46:50 vmc2 pengine[4190]: notice: Recover VM-radnets#011(Started vmc2.ucar.edu)
Feb 2 16:46:50 vmc2 pengine[4190]: notice: Calculated Transition 2914: /var/lib/pacemaker/pengine/pe-input-317.bz2
Feb 2 16:46:50 vmc2 crmd[4191]: notice: Initiating action 15: stop VM-radnets_stop_0 on vmc2.ucar.edu (local)
Feb 2 16:46:51 vmc2 Xen(VM-radnets)[1016]: INFO: Xen domain radnets will be stopped (timeout: 80s)
Feb 2 16:46:52 vmc2 kernel: device vif21.0 entered promiscuous mode
Feb 2 16:46:52 vmc2 kernel: IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is not ready
Feb 2 16:46:57 vmc2 kernel: xen-blkback:ring-ref 9, event-channel 10, protocol 1 (x86_64-abi)
Feb 2 16:46:57 vmc2 kernel: vif vif-21-0 vif21.0: Guest Rx ready
Feb 2 16:46:57 vmc2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link becomes ready
Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
Feb 2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state

(and then there are a bunch of null bytes, and the log resumes with reboot)

More messages about networking, except that xenbr1 is not the bridge device associated with the NIC in question. I don't see any reason why the link between the hosts should suddenly stop working, so I am suspecting a hardware problem that only crops up rarely (but will most likely get worse over time). Is there anything anyone can see in the log that would suggest otherwise?

Thank you,
--Greg
[Linux-HA] Corosync 1 - 2
I notice that the network:ha-clustering:Stable repo for CentOS 6 now contains Corosync 2.3.3-1. I am currently running 1.4.1-17. Is it safe to just run this update? Are there configuration changes I have to make in order for the new version to work? (If there is a document or wiki page describing how to convert from Corosync 1 to 2, I would be happy to be pointed to it.)

Thanks,
--Greg

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Corosync 1 - 2
On Wed, Oct 1, 2014 at 8:44 AM, Digimer li...@alteeve.ca wrote:
> Personally, I would not upgrade. If you do, you will want to test
> outside of production first.

Of course, I would always do that anyway, even without a major version number change.

> Corosync needed cman to be a quorum provider in the 1.x series. In 2.x,
> it became its own quorum provider and cman was no longer needed. Last I
> heard upstream, pacemaker on EL6 is only supported on corosync 1.4 + cman.

There is a pacemaker update too, to 1.1.12+git20140723.483f48a-1.1

> I'm sure you're not concerned about paid support, but it does mean that
> the corosync 1.4 stack is much better tested on EL6 than 2.x is.

OK, thanks. What I am really trying to figure out is exactly what the network_ha-clustering_Stable repo is for. Presumably, from the name, they wouldn't put anything in there that isn't ready for production?

--Greg
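[Editor's note: the quorum-provider change mentioned above is the main configuration difference when moving to corosync 2.x. A hedged corosync.conf fragment, with illustrative values, showing the votequorum service that replaces cman as quorum provider:]

```
# corosync.conf fragment for corosync 2.x (values illustrative)
quorum {
    provider: corosync_votequorum
    two_node: 1        # relax quorum rules for a two-node cluster
}
```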
Re: [Linux-HA] Corosync 1 - 2
On Wed, Oct 1, 2014 at 2:04 PM, Digimer li...@alteeve.ca wrote:
> Who runs the repo? It's not a name I am familiar with.

It comes from opensuse.org. I'm pretty sure I got it out of one of the documents on the clusterlabs site, but I would have to go back and verify that to be certain.

--Greg
Re: [Linux-HA] Multiple colocation with same resource group
On Fri, 2014-02-21 at 12:37 +, Tony Stocker wrote:
> colocation inf_ftpd inf: infra_group ftpd
>
> or do I need to use an 'order' statement instead, i.e.:
>
> order ftp_infra mandatory: infra_group:start ftpd

I'm far from a leading expert on this, but in my experience, colocation and order are completely separate concepts. If you want both, you have to state both. So I would say you need both colocation and order statements to get what you want.

I have similar scenarios, where virtual machines depend on the underlying DRBD device, so I colocate the DRBD master-slave resource and the VM resource, then have an order statement to say the DRBD resource must be started before the VM.

--Greg
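[Editor's note: the DRBD-plus-VM pattern described above can be sketched in crm shell syntax. The resource names (ms_drbd_vm, VM-guest) are illustrative, not from the post:]

```
# Keep the VM on the node where the DRBD master runs...
colocation vm_with_drbd_master inf: VM-guest ms_drbd_vm:Master
# ...and only start the VM after the DRBD resource has been promoted.
order drbd_before_vm inf: ms_drbd_vm:promote VM-guest:start
```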
Re: [Linux-HA] drbd disks in secondary/secondary diskless/diskless mode
On 08/14/2013 02:12 PM, Fredrik Hudner wrote:
> I have tried to make one node primary but only get:
> 0: State change failed: (-2) Need access to UpToDate data
> Command 'drbdsetup primary 0' terminated with exit code 17

When you've suffered a sudden disconnect, you can get into a situation where both sides think their information is outdated. To recover, you have to tell the cluster which node can throw away its data in favor of what the other node has.

http://www.drbd.org/users-guide/s-resolve-split-brain.html

--Greg
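[Editor's note: the manual recovery described in the linked guide looks roughly like the following. "r0" is an illustrative resource name, and the exact option syntax varies between DRBD versions (8.3 used `drbdadm -- --discard-my-data connect r0`):]

```
# On the node whose data will be DISCARDED (the split-brain "victim"):
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On the node whose data survives, if it has already dropped the connection:
drbdadm connect r0
```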
[Linux-HA] heartbeat 'ERROR' messages
I have two clusters that are both running CentOS 5.6 and heartbeat-3.0.3-2.3.el5 (from the clusterlabs repo). They are running slightly different pacemaker versions (pacemaker-1.0.9.1-1.15.el5 on the first one and pacemaker-1.0.12-1.el5 on the other). They both have identical ha.cf files except that the bcast device names are different (and they are correct for each case, I checked), like this:

udpport 694
bcast eth2
bcast eth1
use_logd off
logfile /var/log/halog
debugfile /var/log/hadebug
debug 1
keepalive 2
deadtime 15
initdead 60
node vmd1.ucar.edu
node vmd2.ucar.edu
auto_failback off
respawn hacluster /usr/lib64/heartbeat/ipfail
crm respawn

On one of them (which may or may not coincidentally be having some problems), I get these messages logged about every 2 seconds in /var/log/halog; on the other I don't see them:

May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG: Dumping message with 10 fields
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[0] : [t=NS_ackmsg]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[1] : [dest=vmx2.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[2] : [ackseq=3a0]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[3] : [(1)destuuid=0x5ceb280(37 28)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] : [src=vmx1.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] : [(1)srcuuid=0x5ceb390(36 27)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] : [hg=4c97c17a]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] : [ts=51a13435]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] : [ttl=3]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] : [auth=1 23b556bcb61a08abecf87cb6411c62e62cf99f0d]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG: Dumping message with 12 fields
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[0] : [t=status]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[1] : [st=active]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[2] : [dt=3a98]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[3] : [protocol=1]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] : [src=vmx1.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] : [(1)srcuuid=0x5ceb390(36 27)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] : [seq=17b]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] : [hg=4c97c17a]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] : [ts=51a13435]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] : [ld=0.27 0.41 0.26 1/315 19183]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[10] : [ttl=3]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[11] : [auth=1 3d3da4df831636f7c274395041ffb49bbf215170]

The questions are: what do these messages actually mean, why is one cluster logging them and not the other, and is this something I should be worried about?

Thanks for any info,
--Greg
Re: [Linux-HA] heartbeat 'ERROR' messages
I know it's tacky to reply to myself, but I can answer one of my questions after another 15 minutes or so of poring through logs:

On Tue, 2013-05-28 at 10:37 -0600, Greg Woods wrote:
> The questions are what do these messages actually mean, why is one
> cluster logging them and not the other, and is this something I should
> be worried about?

The answer to the last one is that this is definitely a problem, because after nearly half an hour, this is logged:

May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] : [src=vmx1.ucar.edu]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] : [(1)srcuuid=0x5ceb390(36 27)]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] : [seq=3a4]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] : [hg=4c97c17a]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] : [ts=51a13888]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] : [ld=0.50 0.33 0.28 3/316 13859]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[10] : [ttl=3]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[11] : [auth=1 feb94da356847a538290ea75f27423c996c0a595]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: write_child: Exiting due to persistent errors: No such device
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: WARN: Managed HBWRITE process 5689 exited with return code 1.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: ERROR: HBWRITE process died. Beginning communications restart process for comm channel 1.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth4 - Status: 1
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: WARN: Managed HBREAD process 5690 killed by signal 9 [SIGKILL - Kill, unblockable].
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: ERROR: Both comm processes for channel 1 have died. Restarting.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth4
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth4 - Status: 1
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: Communications restart succeeded.
May 25 16:17:45 vmx1.ucar.edu heartbeat: [5683]: info: Link vmx2.ucar.edu:eth4 up.

And VMs stop being reachable, etc. The only way to stabilize things is to not start heartbeat on one of the nodes (vmx1, arbitrarily chosen) and run all resources on a single node (vmx2 in this case).

--Greg
Re: [Linux-HA] heartbeat 'ERROR' messages
On Wed, 2013-05-29 at 07:50 +1000, Andrew Beekhof wrote:
> > respawn hacluster /usr/lib64/heartbeat/ipfail
> > crm respawn
>
> I don't know about the rest, but definitely do not use both ipfail and
> crm. Pick one :)

I guess I will have to look into what ipfail really does. I have a half dozen clusters that have virtually the same ha.cf files, and they have been running for 2+ years with it specified this way.

--Greg
Re: [Linux-HA] Antw: Re: vm live migration without shared storage
On Fri, 2013-05-24 at 10:45 +0200, Ulrich Windl wrote:
> You are still mixing total migration time (which may be minutes) with
> virtual stand-still time (which is a few seconds).

Correct. It was not clear (to me) that when the time to migrate was several minutes, the actual service outage was only a few seconds. This point has now been made (several times), and it is a big difference.

--Greg
Re: [Linux-HA] vm live migration without shared storage
On Thu, 2013-05-23 at 15:00 -0400, David Vossel wrote:
> Migration time, depending on network speed and hardware, is much longer
> than the shared storage option (minutes vs. seconds).

This is just one data point (of course), but for the vast majority of services that I run, if the live migration time is as long as it takes to shut down a VM and boot it on another server, then there isn't much of an advantage to doing the live migration. Especially if we're talking about an option that is a long way from being battle-tested, and critical services such as DNS and authentication. Most of these critical services do not use long-lived connections.

I can see a few VMs that exist to provide ssh logins where a minutes-long live migration would be clearly preferable to a shutdown and reboot, but in most cases, if it's as slow as rebooting, it isn't going to be of any advantage to me. It will be interesting, though, to see how many applications people come up with where a minutes-long live migration is preferable to shutdown and reboot.

--Greg
Re: [Linux-HA] Antw: DRBD NetworkFailure
On Wed, 2013-04-24 at 08:48 +0200, Ulrich Windl wrote:
> > Greg Woods wo...@ucar.edu schrieb am 23.04.2013 um 21:20 in Nachricht:
> > Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
> > Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
> > Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( NetworkFailure - Unconnected )
>
> You should use ethtool to check the interface statistics; otherwise I'd
> vote for a software issue...

Ethtool doesn't show any errors, but it's possible that the errors don't start occurring until just before DRBD detects the issue. Unfortunately I can't access the system once the problems start occurring, so I can't run ethtool at that point.

If it's a software issue, what is it likely to be? I have to find some way to debug this; I'm getting some flak about the outages this is causing, even though, so far, they have been three weeks apart. And it won't be long before this happens at 3 AM, which will really suck.

--Greg
Re: [Linux-HA] drbd error message decoding help
On Wed, 2013-04-24 at 12:11 +0200, Lars Ellenberg wrote:
> drbd[25887]:2013/04/19_17:02:07 DEBUG: vmgroup2: Calling /usr/sbin/crm_master -Q -l reboot -v 1

I apologize for the noise about this. Further checks of the logs on all my clusters show that this is normal behavior. I started a different thread, "DRBD NetworkFailure", which is hopefully closer to the mark.

--Greg
Re: [Linux-HA] clean shutdown procedure?
On Mon, 2013-04-22 at 09:50 -0600, Greg Woods wrote:
> On Mon, 2013-04-22 at 10:12 +1000, Andrew Beekhof wrote:
> > On Saturday, April 20, 2013, Greg Woods wrote:
> > > Often one of the nodes gets stuck at Stopping HA Services
> >
> > That means pacemaker is waiting for one of your resources to stop. Do
> > you have anything that would take a long time (or fail to stop)?
>
> Not that I am aware of. But some things that came up during this
> weekend's powerdown make me think that some of the stop actions are
> failing

This particular issue has been solved. It turns out that this is one of those perfect-storm situations.

Because of the coming powerdown, our HPSS (High Performance Storage System) was shut down several hours prior to the HA clusters going down. The HA clusters do not directly depend on the HPSS, but they do run backups to it. The incremental backup script works by taking an LVM snapshot of the logical volume that the file system containing the virtual machine images is mounted on, then mounting the virtual disk images from the snapshot, and finally running our standard system backup script on the mounted images. The system backup script will normally run a find on the file system(s) to be backed up and package it up into multiple cpio archives (as many as it takes for either the full file system or just the files that have changed in the past two days). Once an archive file has been created, it gets sent to the HPSS.

It turns out that the script will try multiple times to send the file if the first attempt fails, which can actually cause it to continue running and retrying for many hours. While it is running, the snapshot is still in place. The cluster resource stop failed on one of the LVM resources, saying that the volume group could not be deactivated because there was still an active logical volume: the snapshot. So that caused the fence.

This still doesn't fully explain the original issue of why the shutdown process can hang trying to stop the heartbeat service. Or does it? Since I wasn't looking for this, I can't be certain that the HPSS wasn't offline during the times I have observed these hangs, so I'll have to start checking for that. In the meantime, I'll have to create a shutdown script that checks for a hung backup, kills it, and deletes the snapshot before issuing the /sbin/shutdown command.

--Greg
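[Editor's note: a hedged sketch of the pre-shutdown cleanup described above. The backup script name, mount point, volume group, and snapshot LV names are hypothetical placeholders, not from the post:]

```
#!/bin/sh
# Kill a hung backup run, if any (script name is a placeholder)
pkill -f hpss-backup.sh
# Unmount and remove the LVM snapshot so the VG can deactivate cleanly
umount /mnt/backup-snapshot 2>/dev/null
lvremove -f /dev/vg_vm/snap_vm
# Only then shut the node down
exec /sbin/shutdown -h now
```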
[Linux-HA] DRBD NetworkFailure
Here's a new issue. We have had two outages, about 3 weeks apart, on one of our Heartbeat/Pacemaker/DRBD two-node clusters. In both cases, this was logged:

Apr 19 17:02:22 vmn2 kernel: block drbd0: PingAck did not arrive in time.
Apr 19 17:02:22 vmn2 kernel: block drbd0: peer( Primary - Unknown ) conn( Connected - NetworkFailure ) pdsk( UpToDate - DUnknown )
Apr 19 17:02:22 vmn2 kernel: block drbd0: asender terminated
Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( NetworkFailure - Unconnected )
Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver terminated
Apr 19 17:02:22 vmn2 kernel: block drbd0: Restarting receiver thread
Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver (re)started
Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( Unconnected - WFConnection )
Apr 19 17:02:27 vmn2 kernel: block drbd1: PingAck did not arrive in time.
Apr 19 17:02:27 vmn2 kernel: block drbd1: peer( Secondary - Unknown ) conn( Connected - NetworkFailure ) pdsk( UpToDate - DUnknown )
Apr 19 17:02:27 vmn2 kernel: block drbd1: new current UUID 37CF642BD875CB67:901912BD41972B81:FC8B5D00E5B5988E:FC8A5D00E5B5988F
Apr 19 17:02:27 vmn2 kernel: block drbd1: asender terminated
Apr 19 17:02:27 vmn2 kernel: block drbd1: Terminating asender thread
Apr 19 17:02:27 vmn2 kernel: block drbd1: Connection closed
Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( NetworkFailure - Unconnected )
Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver terminated
Apr 19 17:02:27 vmn2 kernel: block drbd1: Restarting receiver thread
Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver (re)started
Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( Unconnected - WFConnection )

This looks like a long-winded way of saying that the DRBD devices went offline due to a network failure. One time this was logged on one node, and the other time it was logged on the other node, so that would seem to rule out any issue internal to one node (such as bad memory). In both cases, nothing else is logged in any of the HA logs or the /var/log/messages file. Obviously, the VMs stop providing services, and this is how the problem is noticed (DNS server not responding, etc.). It doesn't appear that Pacemaker or Heartbeat ever even notices that anything is wrong, since nothing is logged after the above until the restart messages when I finally cycle the power via IPMI (which was almost half an hour later).

The two nodes are connected by a crossover cable, and that is the link used for DRBD replication. So it seems as though the only possibilities are a flaky NIC or a flaky cable, but in that case, wouldn't I see some sort of hardware error logged? Anybody else ever seen something like this?

Thanks,
--Greg
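[Editor's note: the "PingAck did not arrive in time" transition above is governed by DRBD's net-section keepalive knobs. A hedged drbd.conf fragment with the DRBD 8.x defaults, shown here only to make the timing concrete; values are illustrative:]

```
# drbd.conf "net" section (DRBD 8.x syntax; values are the usual defaults)
net {
    ping-int     10;   # send a keepalive ping every 10 seconds
    ping-timeout  5;   # wait 0.5s (units of 1/10s) for the PingAck
                       # before declaring NetworkFailure
    timeout      60;   # 6s general network timeout (units of 1/10s)
}
```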
Re: [Linux-HA] clean shutdown procedure?
On Fri, 2013-04-19 at 16:43 +0200, Florian Crouzat wrote:
> crm configure property

OK, thanks for the suggestions. What is the difference between maintenance-mode=true and stop-all-resources=true? I tried the latter first, and all the resources do stop, except that all the stonith resources are still running. I'm just worried about the possibility of a STONITH death match occurring at the next reboot; I'd rather see the stonith resources stopped too. Or is there some reason why that would not be desirable?

--Greg
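[Editor's note: a hedged sketch of the two cluster-wide properties being compared, in crm shell syntax. The behavioral summaries in the comments are as generally understood; verify against your Pacemaker version's documentation:]

```
# maintenance-mode=true: Pacemaker stops managing (and monitoring) all
# resources, but leaves whatever is currently running in place.
crm configure property maintenance-mode=true

# stop-all-resources=true: Pacemaker actively stops every managed
# resource (stonith devices may remain active, as observed above).
crm configure property stop-all-resources=true
```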
[Linux-HA] drbd error message decoding help
I realize that nobody can solve a problem based on a single log entry, but I am trying to understand what happened with a cluster problem today. A similar thing happened with this cluster about 3 weeks ago, so this is one of those hard-to-solve intermittent issues. But it might help me now if I understood better what this message actually means:

drbd[25887]:2013/04/19_17:02:07 DEBUG: vmgroup2: Calling /usr/sbin/crm_master -Q -l reboot -v 1

It looks like a drbd process is calling a CRM process? Or is it the other way around (which would make more sense)?

Thanks,
--Greg
Re: [Linux-HA] Heartbeat IPv6addr OCF
On Sun, 2013-03-24 at 01:36 -0700, tubaguy50035 wrote:
> params ipv6addr=2600:3c00::0034:c007 nic=eth0:3 \

Are you sure that's a valid IPV6 address? I get headaches every time I look at these, but it seems a valid address is 8 groups, and you've got 5 there. Maybe you mean 2600:3c00::0034:c007?

--Greg
Re: [Linux-HA] Using a Ping Daemon (or Something Better) to PreventSplit Brain
On Thu, 2013-01-31 at 02:09 +, Robinson, Eric wrote:
> the secondary should wait for a manual command to become primary.

That can be accomplished with the meatware STONITH device. It requires a command to be run to tell the wannabe primary that the secondary is really dead (and, of course, you had better be sure that the secondary is really dead before the command is run, to avoid split brain).

> 2. The secondary should refuse to become primary even if manually
> ordered to do so if it cannot communicate with DataCenterC.

I don't know any way to do that exactly, but you might be able to use order constraints to require some sort of ping-based resource to be successfully started before the other resources can start.

--Greg
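[Editor's note: a hedged sketch of the meatware device mentioned above, in crm shell syntax; the resource and node names are illustrative. Meatware blocks fencing until an operator confirms the peer is really down:]

```
# Define a meatware stonith resource covering both nodes
primitive st-meat stonith:meatware params hostlist="nodeA nodeB"

# After verifying by hand that the peer is truly dead, confirm with:
#   meatclient -c nodeA
```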
Re: [Linux-HA] how to diagnose stonith death match?
On Thu, 2013-01-10 at 08:35 +1100, Andrew Beekhof wrote:
> On Wed, Jan 9, 2013 at 4:16 PM, Greg Woods wo...@ucar.edu wrote:
> > I got the cluster running with xend by moving the heartbeat to a
> > different interface. Having heartbeat start after the bridge is
> > created _should_ also work. Obviously that can't work if xend is a
> > cluster resource.
>
> Can you split up the networking part from the other pieces?

Not without hacking around in the Xen script files (which of course are part of the distro's packages, and my changes would get overwritten every time I had to update). It is easier to just use an interface that doesn't have a Xen bridge on it for heartbeat.

--Greg
Re: [Linux-HA] how to diagnose stonith death match?
On Tue, 2013-01-08 at 09:18 +1100, Andrew Beekhof wrote:
> > On Fri, 2012-12-28 at 14:54 -0700, Greg Woods wrote:
> > The problem is that either node can come up and run all the resources,
> > but as soon as I bring the other node online, it briefly looks normal,
> > but as soon as the stonith resource starts, the currently running node
> > gets fenced and the new node takes over all the resources. Then the
> > fenced node comes up, fences the other node and takes over, etc. Death
> > match.
>
> That's odd. Normally it's a firewall issue. Did you happen to choose a
> different port perhaps?

Close, but not quite. I did finally figure out what was going on, as the death match started again as I was reconfiguring the cluster from scratch, but this time I knew more about what was causing it. It started as soon as I added xend as a resource. A little trial and error showed that the heartbeat does not work if it is on an interface that also has a Xen bridge attached to it. This is unexpected because all the other kinds of networking on that interface work fine with the bridge active (e.g. ssh connections, IPMI connections, etc.); only heartbeat is affected. But it was absolutely reproducible. If I started xend by hand instead of having it as a cluster resource, again I got a death match. A careful reading of the logs did show that heartbeat was declaring the other node dead. So for some reason, heartbeat communication was lost as soon as the bridge was activated.

I got the cluster running with xend by moving the heartbeat to a different interface. This is less than ideal because that interface is attached to a network that is also used for different things and has other hosts attached to it, but since this is only a test cluster, that's acceptable.

--Greg
Re: [Linux-HA] how to diagnose stonith death match?
On Wed, 2013-01-09 at 13:15 +1100, Andrew Beekhof wrote:
> IIRC, part of the activation involves tearing down the normal interface
> and creating the bridge. At this point the device heartbeat was talking
> to is gone.

I hadn't thought of that, because afterwards, ethX looks exactly the same as it did before, same IP and other settings. It just has xenbrX attached to it. But I admit I don't know exactly what happens there.

> > I got the cluster running with xend by moving the heartbeat to a
> > different interface.
>
> Having heartbeat start after the bridge is created _should_ also work.

Obviously that can't work if xend is a cluster resource. I suppose xend could be started outside the cluster before heartbeat, but then I don't get to have it monitored by Pacemaker.

So this will be in the archives as a warning to people running clusters for Xen virtual machines (or anything else that sets up bridged networking). In my case, the only solution is to use an interface for heartbeat that is not touched by Xen networking. I suppose people who are using something other than bridged networking may not have this issue either.

--Greg
Re: [Linux-HA] how to diagnose stonith death match?
On Fri, 2012-12-28 at 14:54 -0700, Greg Woods wrote:
> The problem is that either node can come up and run all the resources,
> but as soon as I bring the other node online, it briefly looks normal,
> but as soon as the stonith resource starts, the currently running node
> gets fenced and the new node takes over all the resources. Then the
> fenced node comes up, fences the other node and takes over, etc. Death
> match.

After spending way too much time on this, I finally gave up, completely removed and reinstalled heartbeat and pacemaker, cleared out the contents of /var/lib/heartbeat/crm, and reconfigured the cluster from scratch. It is now working. I don't have all the resources in yet, but I believe it will work properly when I am done.

--Greg
Re: [Linux-HA] Some novice questions?
On Tue, 2013-01-01 at 14:58 +0330, Ali Masoudi wrote:
> Is it mandatory to use same ha.cf on both nodes?

I don't think it is absolutely mandatory, but it is best practice. Unless you really know what you are doing, you can run into difficulties getting heartbeat to work properly if the ha.cf files are different.

> if names of network interfaces are different, what is best to do?

I have never run a cluster where this was so. Since my hardware is identical on both nodes for all of my clusters, so are the network interface names. I imagine you could get it to work if the ha.cf files were the same except for the network interface names, but I haven't tried this.

--Greg
Re: [Linux-HA] Some novice questions?
On Mon, 2012-12-31 at 15:09 +0330, Ali Masoudi wrote:
> ucast eth3 192.168.50.17

If you are using ucast, then you need one line for each node's IP in the ha.cf file, or else different ha.cf files on each node. What is needed is the IP of the other node, but heartbeat is smart enough to ignore ucast IPs that refer to the node it is running on, so the usual practice is to include two ucast lines, one for each node's IP. That way you can use the same ha.cf file on both nodes.

--Greg
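[Editorial note] A minimal sketch of the two-ucast-line convention described above, assuming the two nodes hold 192.168.50.16 and 192.168.50.17 on eth3 (only .17 appears in the thread; the .16 address is an invented example):

```
# /etc/ha.d/ha.cf -- identical copy on both nodes
ucast eth3 192.168.50.16   # ignored by the node that owns this IP
ucast eth3 192.168.50.17   # ignored by the node that owns this IP
```

Each node silently skips the line naming its own address, so one shared file serves both peers.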
[Linux-HA] how to diagnose stonith death match?
I did some reconfiguration of the NICs and IP addresses on my 2-node test cluster (running heartbeat and Pacemaker on CentOS 5; slightly old versions, but they have been working fine up to now on this and several other clusters). I am sure that the NIC configuration is correct and that the CIB has the correct modified data in it. The ha.cf file is also correct. (I even tried switching from bcast to ucast, but that did not change the behavior.)

The problem is that either node can come up and run all the resources, but as soon as I bring the other node online, it briefly looks normal; then, as soon as the stonith resource starts, the currently running node gets fenced and the new node takes over all the resources. Then the fenced node comes up, fences the other node and takes over, etc. Death match.

What I am looking for is just a hint about how to diagnose this. I have tried looking in the log file, but as everyone knows, those logs are incredibly voluminous, so I would like a hint about what to look for.

Thank you,
--Greg
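[Editorial note] When hunting for why a node was fenced, filtering the log for fencing-related lines is usually faster than reading it linearly. A sketch against a fabricated sample log (the log content below is invented purely for illustration):

```shell
# Fabricated ha-log excerpt, for demonstration only.
cat > /tmp/ha-log.sample <<'EOF'
Dec 28 09:19:18 node1 lrmd: [7514]: info: rsc:vmgroup1:0:30: stop
Dec 28 09:19:20 node1 stonithd: [7515]: info: node node2 needs to be reset
Dec 28 09:19:25 node1 crmd: [7518]: info: State transition S_IDLE -> S_POLICY_ENGINE
EOF

# Keep only the lines that mention fencing activity.
grep -iE 'stonith|fenc|reset' /tmp/ha-log.sample
```

Against real /var/log/ha-log files the same pattern narrows thousands of lines down to the fencing decisions and their immediate triggers.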
Re: [Linux-HA] Custom resource agent script assistance
On Thu, 2011-12-01 at 13:25 -0400, Chris Bowlby wrote:
> Hi Everyone, I'm in the process of configuring a 2 node + DRBD enabled DHCP cluster

This doesn't really address your specific question, but I got dhcpd to work by using the ocf:heartbeat:anything RA:

primitive dhcp ocf:heartbeat:anything \
    params binfile=/usr/sbin/dhcpd cmdline_options="-f -cf /vmgroup2/rep/dhcpd.conf -lf /vmgroup2/rep/dhcpd/dhcpd.leases" \
    op monitor interval=10 timeout=50 depth=0 \
    op start interval=0 timeout=90s \
    op stop interval=0 timeout=100s \
    meta target-role=Started

The -cf and -lf arguments are just to ensure that the config file and the leases file are located within a DRBD-replicated partition. No doubt 10 people will surface to explain why this is a horrible way to do it, but it does work.

--Greg
Re: [Linux-HA] Monitoring only across WAN
On Mon, 2011-06-20 at 17:47 +0800, Emmanuel Noobadmin wrote:
> The objective is to achieve sub minute monitoring of services like httpd and exim/dovecot so that I can run a script to notify/SMS myself when one of the machines fails to respond. Right now I'm just running a cron script every few minutes to ping the servers, but the problem is that I discovered that the server could respond to pings while services are dead to the world.

It sounds like HA may be the wrong tool for what you want. You might be better off with some type of monitoring/notification tool such as Nagios. Those tools can do more than just ping; they can connect to the web server and verify that it is operating properly. While it might be possible to make the cluster software work over a WAN, it was never really designed to operate that way. Ideally you need more than one connection between nodes and a way for one node to fence the other (STONITH) in order for the cluster software to work properly.

--Greg
Re: [Linux-HA] cat /dev/ttyS0
On Mon, 2011-05-23 at 13:59 -0700, Hai Tao wrote:
> this might not be too close to HA, but I am not sure if someone has seen this before: I use a serial cable between two nodes, and I am testing the heartbeat with:
>
> server2$ cat /dev/ttyS0
> server1$ echo hello > /dev/ttyS0
>
> instead of receiving hello on server2, I see some hashed code there. Does someone have an idea why I do not receive the hello in clear text?

This normally means there is something wrong with your tty settings (see man stty). Either your settings at each end do not match, or the settings you are using will not work with the cable you have. Or perhaps the pinouts on the cable you are using are incorrect, but if you are getting something across, it's more likely stty settings than cable pinouts. I am not an expert on serial communications so this is about all the help I can give, but I do know that seeing garbage on a serial tty usually means the stty settings are wrong.

I can also say that I have used serial heartbeats in the past with success, but some things (like certain USB-to-serial adapters) I could just never get to work. But I've never had any trouble getting a serial cable between two on-board serial ports to work.

--Greg
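[Editorial note] To rule out mismatched settings, force both ends to the same explicit parameters before retesting. A sketch assuming 9600 baud, 8 data bits, no parity, one stop bit; the device path and speed are assumptions, not from the thread:

```shell
# Run on BOTH nodes before repeating the cat/echo test
stty -F /dev/ttyS0 9600 cs8 -cstopb -parenb raw
```

`raw` disables line editing and character translations, so whatever one side writes is exactly what the other side reads.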
Re: [Linux-HA] Does heartbeat only use ping to check health of otherserver?
On Mon, 2011-04-04 at 11:44 -0500, Neil Aggarwal wrote:
> From what I can figure out from the ha.cf file, heartbeat uses ping to tell if the peer is up.

Not really. It uses special heartbeat packets to tell if the peer is up. Ping is used to tell the difference between a dead peer and a bad NIC or cable. If the NIC or cable is bad, the remote peer would not respond, but neither would any of the ping targets. The other node would see its remote node dead but the ping targets alive, so it would know to take over resources. This is a crude method of avoiding split brain compared to a real STONITH device, but it works surprisingly well in a number of situations. We ran a number of critical services on heartbeat-v1 clusters for years, until we switched over to using Pacemaker last year when it became obvious that no one is supporting heartbeat-v1 configurations any more (we were dragged kicking and screaming into the much more complicated but also much more flexible and reliable world of Pacemaker).

> I want to switch the virtual IP if the ldirectord process is not running or locked up. That may happen even if the network card is ok. Is there a way to do that?

You don't say whether or not you are using Pacemaker. If you are, then you can set up ldirectord as a Pacemaker resource and let Pacemaker handle the monitoring. If you are not, then you will need something external to do the monitoring. That is a limitation of heartbeat-v1 resources in general: the individual resources are not monitored, so it is possible to get into a situation where one or more resources are hung or crashed, but the heartbeat is still running so no failover occurs. The only solutions to that involve some sort of external monitor outside heartbeat (of which Pacemaker seems to be the recommended one).

--Greg
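[Editorial note] The decision logic described above — peer dead plus ping targets alive means "take over", everything unreachable means "suspect our own hardware" — can be sketched as a tiny shell function. This is an invented illustration; heartbeat implements this internally, not via a script like this:

```shell
# Hypothetical sketch of the failover decision the ping directive enables.
decide() {
  peer=$1            # yes/no: can we hear the peer's heartbeat packets?
  ping_targets=$2    # yes/no: can we reach the ping nodes?
  if [ "$peer" = no ] && [ "$ping_targets" = yes ]; then
    echo takeover    # peer looks genuinely dead; our network is fine
  elif [ "$peer" = no ]; then
    echo hold        # we can reach nothing: suspect our own NIC or cable
  else
    echo normal      # peer is alive; nothing to do
  fi
}

decide no yes   # -> takeover
decide no no    # -> hold
```

The "hold" branch is exactly the split-brain protection: a node that has lost all connectivity refrains from grabbing resources it can no longer safely serve.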
Re: [Linux-HA] Does heartbeat only use ping to check health of otherserver?
On Mon, 2011-04-04 at 13:38 -0500, Neil Aggarwal wrote:
> crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
>     params ip=192.168.9.101 cidr_netmask=32 \
>     op monitor interval=30s
>
> Does that mean heartbeat is being used to detect when to move the IP address to the standby server?

Heartbeat is only used to detect situations that require a complete failover of all resources, i.e. to make sure the other node(s) is still up and running the cluster software. It is Pacemaker's job to monitor individual resources and move/restart them if necessary. This may be a bit oversimplified, and I'm sure the cluster guys will jump in and correct this if I said something wrong.

--Greg
Re: [Linux-HA] problem with DRBD-based resource
On Wed, 2010-12-29 at 12:56 +0100, Dejan Muhamedagic wrote:
> > Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op: Performing key=21:2:0:fb701221-ba59-4de8-88dc-032cab9ec090 op=vmgroup1:0_stop_0 )
> > Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info: rsc:vmgroup1:0:30: stop
> > Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op: Performing key=50:2:0:fb701221-ba59-4de8-88dc-032cab9ec090 op=vmgroup2:0_stop_0 )
> > Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info: rsc:vmgroup2:0:31: stop
> > Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed vmgroup1:0:stop process 8088 exited with return code 6.
> > Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: process_lrm_event: LRM operation vmgroup1:0_stop_0 (call=30, rc=6, cib-update=36, confirmed=true) not configured
> > Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed vmgroup2:0:stop process 8089 exited with return code 6.
>
> No messages from the drbd RA?

Nothing that I can see. It looks, however, like the same kind of error is occurring with many or all of the resources. I have attached the complete halog entries for the time period in question.

> This smells like a bug found in 1.0.9 which should've been fixed a while ago:
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=2458

After reading that report, it doesn't look like the same problem to me, but I will freely admit that the logs are hard for me to interpret. There are entries like this showing what appear to be the correct parameters:

Dec 28 09:19:13 vmserve.scd.ucar.edu lrmd: [7514]: notice: max_child_count (4) reached, postponing execution of operation monitor[10] on ocf::LVM::DRBDVG0 for client 7518, its parameters: volgrpname=[DRBDVG0] CRM_meta_timeout=[2] crm_feature_set=[3.0.1] by 1000 ms

> If it's not a resource problem (i.e. drbd), please either reopen the bugzilla above or open a new one if it looks like a different problem. Don't forget to attach hb_report.
If you don't see anything obvious in the attached more complete log, I will gladly do so. In the meantime, I may have to downgrade pacemaker so that I can get my cluster back. We are running in non-HA mode right now.

--Greg

Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn directive: hacluster /usr/lib64/heartbeat/ipfail
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: Pacemaker support: respawn
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: WARN: File /etc/ha.d//haresources exists.
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: WARN: This file is not used because crm is enabled
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn directive: hacluster /usr/lib64/heartbeat/ccm
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn directive: hacluster /usr/lib64/heartbeat/cib
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn directive: root /usr/lib64/heartbeat/lrmd -r
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn directive: root /usr/lib64/heartbeat/stonithd
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn directive: hacluster /usr/lib64/heartbeat/attrd
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn directive: hacluster /usr/lib64/heartbeat/crmd
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: AUTH: i=1: key = 0xd472250, auth=0x2abe40ad76f0, authname=sha1
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: Pacemaker support: false
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: **
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: Configuration validated. Starting heartbeat 3.0.2
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: Heartbeat Hg Version: node: 7153d58dcb99ff4251449c5404754e26ee1af48e
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: heartbeat: version 3.0.2
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: Heartbeat generation: 1265221099
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth0
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth0 - Status: 1
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth3
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth3 - Status: 1
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: ping group heartbeat started.
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: ping group heartbeat started.
Dec 28
Re: [Linux-HA] problem with DRBD-based resource
On Tue, Dec 28, 2010 at 03:18:06PM -0700, Greg Woods wrote:
> I updated one of my clusters today, and among other things, I updated from pacemaker-1.0.9 to 1.0.10. I don't know if that is directly related or not.

Turns out it is. I downgraded the idle node to 1.0.9 and started heartbeat there. I then had a working cluster. I then tried disabling heartbeat on the 1.0.10 node, and got another mutual stonith which ended up with all the resources on the 1.0.9 node. Then I downgraded the other node to 1.0.9, and the cluster is now working again in HA mode. I now feel more confident that this is a bug in 1.0.10, so I will file a bugzilla.

--Greg
[Linux-HA] problem with DRBD-based resource
I updated one of my clusters today, and among other things, I updated from pacemaker-1.0.9 to 1.0.10. I don't know if that is directly related or not. The problem is that I cannot get the cluster to come up clean. Right now all resources are running on one node and it is OK that way. As soon as I start heartbeat on the second node, it goes into a stonith death match. What I see is some failed actions involving trying to stop a DRBD resource group. Here is a log snippet:

Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op: Performing key=21:2:0:fb701221-ba59-4de8-88dc-032cab9ec090 op=vmgroup1:0_stop_0 )
Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info: rsc:vmgroup1:0:30: stop
Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op: Performing key=50:2:0:fb701221-ba59-4de8-88dc-032cab9ec090 op=vmgroup2:0_stop_0 )
Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info: rsc:vmgroup2:0:31: stop
Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed vmgroup1:0:stop process 8088 exited with return code 6.
Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: process_lrm_event: LRM operation vmgroup1:0_stop_0 (call=30, rc=6, cib-update=36, confirmed=true) not configured
Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed vmgroup2:0:stop process 8089 exited with return code 6.

In this example, vmgroup1 and vmgroup2 are DRBD resources, then set up as clones, which is the standard way to do this. Looks like this in crm shell:

primitive vmgroup1 ocf:linbit:drbd \
    params drbd_resource=vmgroup1 \
    op monitor interval=59s role=Master timeout=30s \
    op monitor interval=60s role=Slave timeout=20s \
    op start interval=0 timeout=240s \
    op stop interval=0 timeout=100s
[...]
ms ms-vmgroup1 vmgroup1 \
    meta clone-max=2 notify=true globally-unique=false target-role=Started

This has always worked fine until today. Any ideas what I can do to further debug this? I am running on CentOS 5.5 using the clusterlabs repos.
--Greg
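[Editorial note] The rc=6 in the stop failures above is the standard OCF exit code OCF_ERR_CONFIGURED, which matches the "not configured" text crmd logs for the same operations. The standard OCF exit-code names can be sketched as a small lookup (an illustrative helper, not part of heartbeat or Pacemaker):

```shell
# Map a numeric OCF exit code to its standard symbolic name.
ocf_rc_name() {
  case "$1" in
    0) echo OCF_SUCCESS ;;
    1) echo OCF_ERR_GENERIC ;;
    2) echo OCF_ERR_ARGS ;;
    3) echo OCF_ERR_UNIMPLEMENTED ;;
    4) echo OCF_ERR_PERM ;;
    5) echo OCF_ERR_INSTALLED ;;
    6) echo OCF_ERR_CONFIGURED ;;
    7) echo OCF_NOT_RUNNING ;;
    *) echo UNKNOWN ;;
  esac
}

ocf_rc_name 6   # -> OCF_ERR_CONFIGURED
ocf_rc_name 7   # -> OCF_NOT_RUNNING
```

OCF_ERR_CONFIGURED from a stop is treated as a hard failure, which is why the cluster escalates to fencing rather than retrying.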
Re: [Linux-HA] strange crm behavior
On Tue, 2010-12-21 at 12:09 +0100, Dejan Muhamedagic wrote:
> Could it be that the status shown below is part of a node status which is not in the cluster any more? Or a node which is down?

No, that is not possible. This is a two-node cluster and both nodes have been up for many days and are both currently running resources. There have never been any other nodes that were part of this cluster.

--Greg
Re: [Linux-HA] strange crm behavior
On Mon, 2010-12-20 at 12:40 +0100, Dejan Muhamedagic wrote:
> That's strange. resource cleanup should definitely remove the LRM (status) part. Can you please try again and then do:
> # cibadmin -Q | grep VM-paranfsvm

It seems like it is not removing status info for old removed resources:

[r...@vmserve sbin]# crm resource cleanup VM-paranfsvm
Cleaning up VM-paranfsvm on vmserve2.scd.ucar.edu
Cleaning up VM-paranfsvm on vmserve.scd.ucar.edu
[r...@vmserve sbin]# cibadmin -Q | grep VM-paranfsvm
<lrm_resource id="VM-paranfsvm" type="Xen" class="ocf" provider="heartbeat">
<lrm_rsc_op id="VM-paranfsvm_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-key="15:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559" transition-magic="0:7;15:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559" call-id="118" rc-code="7" op-status="0" interval="0" last-run="1292610207" last-rc-change="1292610207" exec-time="250" queue-time="0" op-digest="d84dd793335cf339b4757a9041f005ac"/>
<lrm_resource id="VM-paranfsvm" type="Xen" class="ocf" provider="heartbeat">
<lrm_rsc_op id="VM-paranfsvm_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-key="17:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559" transition-magic="0:7;17:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559" call-id="67" rc-code="7" op-status="0" interval="0" last-run="1292610208" last-rc-change="1292610208" exec-time="240" queue-time="0" op-digest="d84dd793335cf339b4757a9041f005ac"/>
<lrm_rsc_op id="VM-paranfsvm_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-key="146:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559" transition-magic="0:0;146:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559" call-id="68" rc-code="0" op-status="0" interval="0" last-run="1292610209" last-rc-change="1292610209" exec-time="2540" queue-time="0" op-digest="d84dd793335cf339b4757a9041f005ac"/>
<lrm_rsc_op id="VM-paranfsvm_monitor_1" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-key="147:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559" transition-magic="0:0;147:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559" call-id="69" rc-code="0" op-status="0" interval="1" last-run="1292610213" last-rc-change="1292610213" exec-time="290" queue-time="0" op-digest="e507fbd4a0eb54917c1cb1e51bafbd7f"/>
<lrm_rsc_op id="VM-paranfsvm_stop_0" operation="stop" crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-key="145:4558:0:eabcd13b-33aa-4216-a517-bf5ece092559" transition-magic="0:0;145:4558:0:eabcd13b-33aa-4216-a517-bf5ece092559" call-id="70" rc-code="0" op-status="0" interval="0" last-run="1292610224" last-rc-change="1292610224" exec-time="5690" queue-time="30" op-digest="d84dd793335cf339b4757a9041f005ac"/>
Re: [Linux-HA] Multiple stonith and Heartbeat 2.1.4
On Thu, 2010-11-18 at 14:46 +0100, Sébastien Prud'homme wrote:
> I'm using meatware as a second stonith resource

I'm doing this and it works fine.

> Unfortunately after several tests, I didn't find a way to make it work: only the first stonith resource is used (and fails), the cluster enters a loop (trying to use only the first stonith resource) and no resource migration is done.

You did run the meatclient, right? What was the command you used and the output of it?

--Greg
Re: [Linux-HA] debugging resource configuration
On Wed, 2010-11-03 at 11:13 +0100, Dejan Muhamedagic wrote:
> > ERROR with rpm_check_debug vs depsolve: heartbeat-ldirectord conflicts with ldirectord-1.0.3-2.6.el5.x86_64
> > Complete! (1, [u'Please report this error in https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%205&component=yum'])
>
> Hardly an RPM expert here, but didn't it ask you to report a problem with yum?

I suppose I could go through the motions of doing this, but Red Hat will likely (and correctly) point out that clusterlabs is a third-party repo. Since the RHEL package (or at least the downstream CentOS version of it) will install and run just fine on systems not using the clusterlabs repo, this really doesn't seem to be a Red Hat or CentOS problem (at least from their point of view). In any case, I have worked around the problem by using a vanilla CentOS virtual machine to run ldirectord instead of trying to do it on my Pacemaker host OS.

--Greg
Re: [Linux-HA] debugging resource configuration
On Tue, 2010-11-02 at 11:11 +0100, Dejan Muhamedagic wrote:
> If you're using resource-agents, the package should be named ldirectord not heartbeat-ldirectord. The two packages should also have the same release numbers, probably something like 1.0.3-x.

I figured as much. But there appears to be a problem with the ldirectord package from clusterlabs, as explained in an earlier message from Masashi Yamaguchi yamag...@gmail.com:

> I think ldirectord rpm package's spec for RedHat/CentOS is inconsistent.
>
> $ rpm -qp --provides ldirectord-1.0.3-2.el5.x86_64.rpm
> config(ldirectord) = 1.0.3-2.el5
> heartbeat-ldirectord
> ldirectord = 1.0.3-2.el5
> $ rpm -qp --conflicts ldirectord-1.0.3-2.el5.x86_64.rpm
> heartbeat-ldirectord
> $
>
> ldirectord package PROVIDES heartbeat-ldirectord and CONFLICTS with heartbeat-ldirectord. ldirectord package's spec has a self-conflict. This is a patch for the problem.
>
> --- resource-agents.spec
> +++ resource-agents.spec
> @@ -71,7 +71,6 @@
>  Requires: %{SSLeay} perl-libwww-perl ipvsadm
>  Provides: heartbeat-ldirectord
>  Obsoletes: heartbeat-ldirectord
> -Conflicts: heartbeat-ldirectord
>  Requires: perl-MailTools
>  %if 0%{?suse_version}
>  Requires: logrotate

I installed the modified ldirectord package successfully.

--Greg
Re: [Linux-HA] debugging resource configuration
On Tue, 2010-11-02 at 22:24 +0100, Lars Ellenberg wrote:
> > ldirectord package PROVIDES heartbeat-ldirectord and CONFLICTS with heartbeat-ldirectord. ldirectord package's spec has a self-conflict. This is a patch for the problem.
> >
> > --- resource-agents.spec
> > +++ resource-agents.spec
> > @@ -71,7 +71,6 @@
> >  Requires: %{SSLeay} perl-libwww-perl ipvsadm
> >  Provides: heartbeat-ldirectord
> >  Obsoletes: heartbeat-ldirectord
> > -Conflicts: heartbeat-ldirectord
> >  Requires: perl-MailTools
> >  %if 0%{?suse_version}
> >  Requires: logrotate
>
> That's incorrect, to the best of my knowledge. Though I'm certainly not an RPM wizard. That seems to be standard procedure for package name changes: package used to be named some-package, package is renamed to other-package; other-package now provides, obsoletes, and conflicts with some-package. If you have a good pointer to some rpm packaging doc saying otherwise, please let us know.

I do not claim to be an RPM expert either, I was only repeating what someone else said. According to his report, modifications were needed to the ldirectord package in order for it to install. What I do know is that I cannot install it on my CentOS 5 system even though I have made sure that heartbeat-ldirectord is not already installed. Here is the result:

[r...@vmserve2 woods]# yum install ldirectord.x86_64
Loaded plugins: dellsysid, fastestmirror
Loading mirror speeds from cached hostfile
 * addons: mirror.ubiquityservers.com
 * extras: mirrors.versaweb.com
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package ldirectord.x86_64 0:1.0.3-2.6.el5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

 Package      Arch     Version         Repository    Size
Installing:
 ldirectord   x86_64   1.0.3-2.6.el5   clusterlabs   55 k

Transaction Summary
Install   1 Package(s)
Upgrade   0 Package(s)

Total download size: 55 k
Is this ok [y/N]: y
Downloading Packages:
ldirectord-1.0.3-2.6.el5.x86_64.rpm | 55 kB 00:00
Running rpm_check_debug
ERROR with rpm_check_debug vs depsolve:
heartbeat-ldirectord conflicts with ldirectord-1.0.3-2.6.el5.x86_64
Complete! (1, [u'Please report this error in https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%205&component=yum'])
[r...@vmserve2 woods]# rpm -q heartbeat-ldirectord
package heartbeat-ldirectord is not installed

I can install heartbeat-ldirectord, but unsurprisingly it does not work properly with Pacemaker. For now I gave up installing this on the Pacemaker box, and instead created a virtual machine, installed heartbeat-ldirectord on it, and wrote myself a crude monitoring script. This setup is working.

--Greg
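[Editorial note] For reference, the usual spec-file pattern for a package rename uses versioned Provides/Obsoletes and no self-Conflicts. A sketch of that convention (not the actual clusterlabs spec):

```
# Sketch: handling the heartbeat-ldirectord -> ldirectord rename
Provides:  heartbeat-ldirectord = %{version}-%{release}
Obsoletes: heartbeat-ldirectord < %{version}-%{release}
# No Conflicts line: a package that Provides a name must not also
# Conflict with that same name, or it conflicts with itself.
```

Versioning the Provides/Obsoletes pair lets the resolver replace any older heartbeat-ldirectord cleanly without tripping over the new package's own virtual provide.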
Re: [Linux-HA] debugging resource configuration
On Thu, 2010-10-28 at 18:38 -0600, Eric Schoeller wrote: Just a shot in the dark here kind of ... but I know that when I had this type of problem with a stonith device it was timeout related. You could try boosting your timeouts all around, or even check what # time /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start reports back. [r...@vmx1 log]# time /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start real0m0.261s user0m0.188s sys 0m0.068s (after which it is working fine; I can connect to the virtual service and get properly redirected to the real server). I am convinced now that either there is a bug in the resource agent so that the monitor process just doesn't work right, there is something obvious and stupid that I just don't see in my configuration that is wrong, or else the ldirectord script that I have (which came from the CentOS heartbeat-ldirectord package) is incompatible with what the resource agent is expecting. If timeouts aren't it, I would start breaking out parts of the cluster config and trying it again until it works I still haven't been able to make it work, but I have eliminated a number of variables. I got rid of all the IPAddr resources, order statements, and colocation statements. All that is there now that is relevant to ldirectord is: primitive ldirectord ocf:heartbeat:ldirectord \ op start interval=20s timeout=15s \ op stop interval=20s timeout=15s \ op monitor interval=20s timeout=20s (I have actually tried different interval and timeout numbers but the result is always the same). That's it. Then I configured the eth1:0 interface manually to correspond with the IP address of the virtual server configured in ldirectord.cf, and ran crm resource ldirectord start. The result is the same start, stop, FAILED scenario repeated. 
The logs appear to show that it is running the status check every 2 seconds or so, despite my interval and timeout settings: [Fri Oct 29 10:11:06 2010|ldirectord.cf|19214] Starting Linux Director v1.186-ha-2.1.3 as daemon [Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Added virtual server: 128.117.64.127:25 [...] [Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Quiescent real server: 128.117.64.123:25 (128.117.64.127:25 ) (Weight set to 0) [...] [Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Restored real server: 128.117.64.123:25 (128.117.64.127:25) (Weight set to 1) (there are similar pairs of entries for all the declared real servers) So far so good, now comes the problem: [Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] Invoking ldirectord invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status [Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] ldirectord for /etc/ha.d/ldirectord.cf is running with pid: 19216 [Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] Exiting from ldirectord status [Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] Invoking ldirectord invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status [Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] ldirectord for /etc/ha.d/ldirectord.cf is running with pid: 19216 [Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] Exiting from ldirectord status [Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] Invoking ldirectord invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status [Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] ldirectord for /etc/ha.d/ldirectord.cf is running with pid: 19216 [Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] Exiting from ldirectord status [Fri Oct 29 10:11:08 2010|ldirectord.cf|19416] Invoking ldirectord invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf stop The status check should have succeeded, but the monitor process thinks it failed. Also as can be seen, the status check is repeated only 2 seconds later. 
The corresponding log for lrmd shows:

Oct 29 10:11:05 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5526: start
Oct 29 10:11:06 vmx1.ucar.edu lrmd: [4842]: info: RA output: (ldirectord:start:stdout) /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start
Oct 29 10:11:06 vmx1.ucar.edu lrmd: [4842]: info: Managed ldirectord:start process 19203 exited with return code 0.
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5527: start
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906: operation start[5527] on ocf::ldirectord::ldirectord for client 4845, its parameters: CRM_meta_interval=[2] CRM_meta_timeout=[15000] crm_feature_set=[3.0.1] CRM_meta_name=[start] for rsc is already running.
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2916: postponing all ops on resource ldirectord by 1000 ms
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906: operation start[5527] on ocf::ldirectord::ldirectord for client 4845, its parameters: CRM_meta_interval=[2] CRM_meta_timeout=[15000] crm_feature_set=[3.0.1] CRM_meta_name=[start] for rsc is already running.
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2910: operations on
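One detail worth noticing in the lrmd output above, and this is my reading rather than something stated in the thread: the start operation itself carries CRM_meta_interval=[2], i.e. it is recurring, which would match a start/stop loop every couple of seconds. The conventional way to declare these operations gives start and stop an interval of 0 so that only the monitor repeats; a sketch, with the timeouts carried over from the configuration quoted earlier:

```
primitive ldirectord ocf:heartbeat:ldirectord \
        op start interval="0" timeout="15s" \
        op stop interval="0" timeout="15s" \
        op monitor interval="20s" timeout="20s"
```

Later messages in this digest (the dhcpd and Xen examples) use exactly this interval=0 form for their start and stop operations.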
Re: [Linux-HA] ldirectord on CentOS 5
On Fri, 2010-10-29 at 12:09 +0900, Masashi Yamaguchi wrote: I think the ldirectord rpm package's spec for RedHat/CentOS is inconsistent.

$ rpm -qp --provides ldirectord-1.0.3-2.el5.x86_64.rpm
config(ldirectord) = 1.0.3-2.el5
heartbeat-ldirectord
ldirectord = 1.0.3-2.el5
$ rpm -qp --conflicts ldirectord-1.0.3-2.el5.x86_64.rpm
heartbeat-ldirectord
$

The ldirectord package PROVIDES heartbeat-ldirectord and CONFLICTS with heartbeat-ldirectord; the ldirectord package's spec has a self-conflict. This is a patch for the problem:

--- resource-agents.spec
+++ resource-agents.spec

I don't quite get this. Is your patch for the resource-agents package or the ldirectord package? I presume the idea is that you get the src rpm, extract it, apply the patch, and rebuild the RPM? (I haven't been able to find the src rpm for ldirectord.) If I understand this correctly, it looks like a bug that should be fixed in the clusterlabs repo. Do they have a place to officially report bugs? I did try extracting the ldirectord script from the clusterlabs ldirectord package, and it segfaults, so I suspect I really have to find a way to install the entire package in order to use that script and get the heartbeat/pacemaker monitoring to work properly.

--Greg

___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] ldirectord on CentOS 5
I currently have an old heartbeat v1 cluster that I am moving to a newer Pacemaker/heartbeat v3 cluster. That is, I am moving the functionality of the old cluster to the new one so that the old one can be phased out. The new cluster is running all the latest stuff from the clusterlabs repo under CentOS 5.5. One thing the old one does is run Linux Virtual Server and ipvsadm to farm out incoming SMTP connections to multiple mail processing nodes (virus scanning, spamassassin scanning, alias lookup, etc.). I would like to have the new cluster do this. From what I have read, it appears that the right way to do this is to install ldirectord and set up an ldirectord resource in Pacemaker. The problem is that I can't get ldirectord to install. There is an ldirectord package in the clusterlabs repo, and a heartbeat-ldirectord package in the CentOS-extras repo, and they conflict. Neither one is installed now, but I still get this error when I try to install ldirectord:

ERROR with rpm_check_debug vs depsolve:
heartbeat-ldirectord conflicts with ldirectord-1.0.3-2.6.el5.x86_64
Complete! (1, [u'Please report this error in https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%205component=yum'])

The same thing happens if I disable the extras repo, and even if I do yum clean all first. If instead I try to install heartbeat-ldirectord and disable the clusterlabs repo (which might result in a package that doesn't work right in any event), I get a different error:

Transaction Check Error:
file /usr/lib/ocf/resource.d/heartbeat/ldirectord from install of heartbeat-ldirectord-2.1.3-3.el5.centos.x86_64 conflicts with file from package resource-agents-1.0.3-2.6.el5.x86_64

Is going to the source the only way to get ldirectord to install on this system, or has someone else seen this before and knows of a workaround?
Thanks, --Greg
Re: [Linux-HA] ldirectord on CentOS 5
The same thing happens if I disable the extras repo, and even if I do yum clean all first. If instead I try to install heartbeat-ldirectord and disable the clusterlabs repo (which might result in a package that doesn't work right in any event), I get a different error: Transaction Check Error: file /usr/lib/ocf/resource.d/heartbeat/ldirectord from install of heartbeat-ldirectord-2.1.3-3.el5.centos.x86_64 conflicts with file from package resource-agents-1.0.3-2.6.el5.x86_64

Try to get rid of the file if it is still there. Try it again afterwards.

I am a little confused. Can I actually install the heartbeat-ldirectord package from CentOS extras and expect it to work with all the clusterlabs stuff? The clusterlabs repo also has an ldirectord package. The situation I have now is that the resource agent script is present (it's in the resource-agents package), but the actual ldirectord script is not. So I actually copied the /usr/sbin/ldirectord binary from another CentOS 5 machine that doesn't have clusterlabs but does have heartbeat-ldirectord, and then tried to configure an ocf:heartbeat:ldirectord resource, but when I did the commit, I got this error reported by crm_mon:

Failed actions:
    ldirectord_monitor_0 (node=vmx2.ucar.edu, call=137, rc=5, status=complete): not installed
    ldirectord_monitor_0 (node=vmx1.ucar.edu, call=79, rc=5, status=complete): not installed

Seems like there is something in the package besides just the ldirectord script that is needed.

--Greg
Re: [Linux-HA] ldirectord on CentOS 5
On Thu, 2010-10-28 at 14:52 -0600, Greg Woods wrote: I am a little confused. I was actually more confused than I thought. When I got this error:

Failed actions:
    ldirectord_monitor_0 (node=vmx2.ucar.edu, call=137, rc=5, status=complete): not installed
    ldirectord_monitor_0 (node=vmx1.ucar.edu, call=79, rc=5, status=complete): not installed

I carefully inspected the logs and determined that what this really meant was that ldirectord couldn't find the config file (it was in a different place than it was expecting to find it). So I was actually able to copy over the ldirectord script from another system and get an ldirectord resource to start, once I put the config file in the correct place and created an IPAddr resource for the virtual service address. Running ipvsadm shows that it is working as expected (the virtual and real servers are correctly reported), and ifconfig shows that the virtual service address is present. But when I try to connect to the virtual service, I get connection refused, although I can connect to the real servers just fine. This is a problem that is most likely outside the HA software, and hopefully I will be able to solve it (I did check firewall rules first). I still would like to find a solution to the original question, though (how to install an ldirectord package), just for the purposes of making it easier to keep things updated going forward.

--Greg
[Linux-HA] debugging resource configuration
This is a continuation of trying to get ldirectord working under pacemaker. I have a working installation of ldirectord. I know this because if I manually configure the eth0:0 pseudo-interface with the virtual server address, and manually start ldirectord with

# /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start

...then everything works. I can connect to the virtual service address and port, and I get properly redirected to one of the real servers. ipvsadm shows normal output. All looks good. However, if I try to start the ldirectord resource, it starts, then fails, then starts, then fails, etc. This will continue until I issue a resource ldirectord stop command in the CRM shell. So it has to be something with how I configured it, but I'm damned if I can figure it out. Here is what I have that involves this resource:

primitive ldirectord ocf:heartbeat:ldirectord \
        op start interval=20 timeout=15 \
        op stop interval=20 timeout=15 \
        op monitor interval=20 timeout=20
colocation vdir-ipi-with-ldirectord inf: vdir-ipi ldirectord
order vdir-ipi-before-ldirectord inf: vdir-ipi ldirectord

The vdir-ipi is an IPAddr resource that will start fine and results in the eth0:0 alias interface being configured and brought up. When I issue a resource start ldirectord command from the crm shell, what I get from lrmd is repeats of this sequence:

Oct 28 18:12:24 vmx1.ucar.edu lrmd: [4842]: info: rsc:vdir-ipi:5464: start
Oct 28 18:12:24 vmx1.ucar.edu lrmd: [4842]: info: Managed vdir-ipi:start process 4923 exited with return code 0.
Oct 28 18:12:25 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5466: start
Oct 28 18:12:25 vmx1.ucar.edu lrmd: [4842]: info: RA output: (ldirectord:start:stdout) /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start
Oct 28 18:12:26 vmx1.ucar.edu lrmd: [4842]: info: Managed ldirectord:start process 5103 exited with return code 0.
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5467: start
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906: operation start[5467] on ocf::ldirectord::ldirectord for client 4845, its parameters: CRM_meta_interval=[2] CRM_meta_timeout=[15000] crm_feature_set=[3.0.1] CRM_meta_name=[start] for rsc is already running.
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2916: postponing all ops on resource ldirectord by 1000 ms
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906: operation start[5467] on ocf::ldirectord::ldirectord for client 4845, its parameters: CRM_meta_interval=[2] CRM_meta_timeout=[15000] crm_feature_set=[3.0.1] CRM_meta_name=[start] for rsc is already running.
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2910: operations on resource ldirectord already delayed
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: Managed ldirectord:start process 5221 exited with return code 0.
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5468: stop
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: Managed ldirectord:stop process 5226 exited with return code 0.
Oct 28 18:12:28 vmx1.ucar.edu lrmd: [4842]: WARN: Managed ldirectord:monitor process 5265 exited with return code 7.
Oct 28 18:12:29 vmx1.ucar.edu lrmd: [4842]: info: cancel_op: operation monitor[5469] on ocf::ldirectord::ldirectord for client 4845, its parameters: CRM_meta_interval=[2] CRM_meta_timeout=[2] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled
Oct 28 18:12:29 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5470: stop
Oct 28 18:12:29 vmx1.ucar.edu lrmd: [4842]: info: Managed ldirectord:stop process 5296 exited with return code 0.

And then it repeats:

Oct 28 18:12:31 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5471: start

etc. How can I figure out what I have done wrong here?
Thanks, --Greg
Re: [Linux-HA] heartbeat with postgresql
On Fri, 2010-10-22 at 18:32 +0200, Andrew Beekhof wrote: if you're just using v1 - thats not a cluster, thats a prayer. Then God must answer my prayers, because I have been using some simple heartbeat v1/DRBD clusters for YEARS, for critical services like DNS. They have worked flawlessly and always failed over properly when individual servers developed problems or had to be taken offline for maintenance. I work in an environment where four or five nines are not required. As I see it, heartbeat v1 is only suitable in situations where you have a small number of resources and the resource start order can be defined strictly linearly. In those cases, it works quite well, and I have a number of cases like that. This is my last contribution to this thread. It is obvious that some people have already decided that heartbeat v1 can't ever work. I am obviously not going to change anyone's mind, and the obvious fact is that you can no longer get any support for running heartbeat v1. I continue to use v1 on some clusters because it is already configured and working and has done what I needed it to do. My newer clusters are more complicated and therefore do use pacemaker. My old clusters are due to be replaced with virtual machines that run on the new clusters, so I expect in a few months I will have completely phased out v1 anyway. --Greg
Re: [Linux-HA] heartbeat with postgresql
On Wed, 2010-10-20 at 08:13 +0200, Andrew Beekhof wrote: Um, maybe because heartbeat v1 has a much much much much less steep learning curve? I dispute that: http://theclusterguy.clusterlabs.org/post/178680309/configuring-heartbeat-v1-was-so-simple This addresses the fact that Pacemaker has many features that heartbeat v1 lacks. That is not in dispute, but it completely sidesteps the point that heartbeat v1 is sufficient for many uses and much easier to get working. I have not said that heartbeat v1 is better than Pacemaker, only that it is easier to get working. The question was asked why anyone would want to use heartbeat v1. Here is one valid answer to that question. This point has been made on this list before by myself and others, and yet the question of why anyone would want to use heartbeat v1 continues to be asked. I understand that nobody has any interest in developing heartbeat v1 any more. I accept this; I have moved on to v3 and Pacemaker. But that does not invalidate the answer to the original question. --Greg
Re: [Linux-HA] heartbeat with postgresql
On Tue, 2010-10-19 at 10:01 -0600, Serge Dubrouski wrote: Any particular reason for using Heartbeat v1 instead of CRM/Pacemaker? Um, maybe because heartbeat v1 has a much much much much less steep learning curve? If you have a simple two-node cluster where one node is just a hot spare, it is way way way way easier to get it working with heartbeat v1. The first time I ever set up a high availability cluster, going in knowing nothing at all about it, I had a heartbeat v1 cluster working in a couple of days. Already having had considerable heartbeat v1 experience, it took me a couple of months to get a cluster working under heartbeat v3/Pacemaker. The pace of development is also high enough that the documentation often lags behind reality. That is not a criticism; I know how hard it is to keep documentation up to date (I am already in that mode now with these new clusters; nobody else knows how they work, so I can't even take a vacation now that I have some production services running on them, until I finish writing up some administration procedures). Yes, no doubt a Pacemaker cluster is far more flexible, but when one doesn't need all that flexibility and just wants a simple two-node HA cluster, the simplicity of heartbeat v1 is very attractive. This shouldn't be as big a mystery as it seems to be. Face up to it: learning and properly configuring Pacemaker is HARD, even for experienced sysadmins. And unless you need the additional flexibility that Pacemaker offers, it seems like a lot of extra effort. Will I use Pacemaker all the time in the future? Yes, because I have already put in the effort to learn and configure it. Setting up a new cluster, where I had an existing one to use as a template, took less than a week. But that first time, it was difficult, time consuming, and often frustrating. --Greg
Re: [Linux-HA] Standby Node Refuses to Take Over
On Mon, 2010-09-27 at 09:43 -0700, Robinson, Eric wrote: I went so far as to turn off the primary, but the standby still never took over. Do you have STONITH configured? I have run into this too. The standby will not take over unless it is told somehow that the primary is really and truly dead. If you have a real STONITH device such as IPMI, it will cause the secondary to forcibly power off the primary, providing the guarantee it needs to take over. On my test cluster where I don't have a working STONITH device yet, I use the meatware pseudo-device, which allows me to run a program on one node to inform it that the other node is really dead and that it is OK to take over. My old heartbeat v1 clusters used to work just fine without STONITH. DRBD split brain would occur every once in a while if both nodes lost power at the same time, but I could live with this. I wouldn't be surprised if the newer Pacemaker clusters pretty much require STONITH in order to work. Maybe someone in the know can confirm or deny this? --Greg
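For anyone wanting to try the meatware approach described above, here is a minimal sketch of what it can look like in the crm shell; the resource name and host list are made-up placeholders, not taken from the thread:

```
primitive fence-meat stonith:meatware \
        params hostlist="node1 node2"
clone fence-meat-clone fence-meat
```

When it fires, an operator verifies that the failed node really is down and then confirms on the surviving node with meatclient -c <nodename>, after which the takeover proceeds.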
Re: [Linux-HA] Standby Node Refuses to Take Over
On Mon, 2010-09-27 at 12:16 -0700, Robinson, Eric wrote: Not sure if you noticed in my previous message that I did physically power down the primary but the standby refused to take any action. Yes, I did notice that. My point is that I have noted on my clusters that simply powering it down (i.e. having it suddenly go away) may not be enough. That requires it to simply assume that the primary has gone away, and that it's not just a cable or NIC failure. STONITH is a method of *assuring* that the other node has gone away. It is designed to prevent both nodes from trying to run the same resources, which can have disastrous consequences. As I noted, I am not certain whether or not using STONITH is absolutely required now, but I have observed the same symptoms as you, and I ended up having to configure STONITH in order to get failovers to work properly. Usually though, if I explicitly set one node to standby, the other one will take over, because they can exchange messages that will convince the remaining node that the standby node will not be running any resources. So I really don't know if STONITH is your problem or would fix your problem. I only note that I have seen the same symptoms and that was how I fixed it for my clusters. --Greg
Re: [Linux-HA] node standby attribute and crm (SOLVED Partially)
On Fri, 2010-09-24 at 11:34 -0600, Greg Woods wrote:

# crm node show
vmserve2.scd.ucar.edu(16fde08d-b4b6-4550-adfb-b3aab83f706f): normal
        standby: off
vmserve.scd.ucar.edu(6f5ced83-a790-4519-8449-3d4cf43275b0): normal
        standby: off

On the second cluster:

# crm node show
vmx1.ucar.edu(62cf0a44-5d0f-475e-a0ac-689537f98f58): normal
vmx2.ucar.edu(8ad9076e-c571-499b-91e9-4d513fd5be61): normal

This difference can be corrected by running:

# crm node attribute vmx1.ucar.edu set standby off
# crm node attribute vmx2.ucar.edu set standby off

But I don't recall having to do this before, so this does not explain why the difference occurred in the first place. I also don't know if this change will last across a reboot, but since it's part of the CIB, hopefully it will. --Greg
Re: [Linux-HA] Adding DHCPD and NAMED as resources
On Thu, 2010-09-09 at 16:35 +0100, Daniel Machado Grilo wrote: Another way to do this is if you choose LSB instead of OCF category primitives. That way you just select the init script from your init.d and thats it.

You do need to ensure that your init script is LSB compliant. This includes, but is not limited to, returning success when a stop is attempted while the service is already stopped. Some init scripts I have seen do not do this right, which means this may or may not work correctly. As an example, on CentOS 5.5, on a system that is running neither service right now:

[r...@vmserve woods]# ps ax | fgrep named
10329 pts/7    S+     0:00 fgrep named
[r...@vmserve woods]# ps ax | fgrep dhcpd
15451 pts/7    S+     0:00 fgrep dhcpd
[r...@vmserve woods]# service named stop
Stopping named: [  OK  ]
[r...@vmserve woods]# echo $?
0
[r...@vmserve woods]# service dhcpd stop
[r...@vmserve woods]# echo $?
7
[r...@vmserve woods]#

The named script does the right thing but the dhcpd script does not. --Greg
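The check above can be wrapped in a small script. This is only a sketch; check_lsb_stop is a made-up helper name, and it assumes the init script is safe to run "stop" against twice:

```shell
#!/bin/sh
# check_lsb_stop: verify the LSB rule illustrated above --
# "stop" on an already-stopped service must still exit 0.
check_lsb_stop() {
    script="$1"
    # First stop: the service may or may not be running; ignore the result.
    "$script" stop >/dev/null 2>&1 || true
    # Second stop: the service is definitely stopped now, so an LSB-compliant
    # script must return 0 here.
    if "$script" stop >/dev/null 2>&1; then
        echo "$script: stop is idempotent (rc=0), usable as an LSB cluster resource"
    else
        rc=$?
        echo "$script: stop on a stopped service returned rc=$rc, not LSB compliant"
    fi
}

# Example usage on the two scripts from the transcript above:
#   check_lsb_stop /etc/init.d/named     # the transcript suggests this one passes
#   check_lsb_stop /etc/init.d/dhcpd     # the transcript suggests this one fails
```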
Re: [Linux-HA] Adding DHCPD and NAMED as resources
On Wed, 2010-09-08 at 14:18 -0500, Bradley Leduc wrote: Am trying to add NAMED and DHCPD services as a resource on a heartbeat-3.0.1-1.el5 cluster with no luck. I was wondering if anyone would know of an easy way to do this. Any help would be great.

Are you running pacemaker or just a heartbeat v1-style config? I've done it both ways. For v1, all I did was add dhcpd to haresources. For pacemaker, I use the ocf:heartbeat:anything resource, since I couldn't find one specific to named or dhcpd anywhere. So I have config lines like this:

primitive dhcp ocf:heartbeat:anything \
        params binfile=/usr/sbin/dhcpd cmdline_options=-f \
        op monitor interval=10 timeout=50 depth=0 \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=100s \
        meta target-role=Started

named works similarly. For v1, you may need to create a resource.d script that properly returns 0 if you try to stop a daemon that is already stopped; the standard init.d startup scripts don't always do this. --Greg
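By analogy, a named primitive using the same ocf:heartbeat:anything agent might look like the sketch below; the -f flag (run named in the foreground) and the timeout values are assumptions modeled on the dhcp example above, not taken from the thread:

```
primitive named ocf:heartbeat:anything \
        params binfile=/usr/sbin/named cmdline_options="-f" \
        op monitor interval=10 timeout=50 depth=0 \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=100s \
        meta target-role=Started
```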
Re: [Linux-HA] problem with static routes
On Sun, 2010-08-22 at 10:25 -0600, Greg Woods wrote: The basic problem is that when I reboot a node in my cluster, it comes back up without its static routes. I have determined through experimentation that it is the setup/teardown of Xen networking that is causing this. The static routes also go away if I just put a node on standby (which shuts down Xen networking), or even if I put a standby node back online. So I will take this to the xen-users list. It doesn't look like it has anything to do with the HA code itself. --Greg
[Linux-HA] problem with static routes
OS: CentOS 5.5
heartbeat: heartbeat-3.0.3-2.3.el5 (latest from clusterlabs)
pacemaker: pacemaker-1.0.9.1-1.15.el5 (latest from clusterlabs)

If it matters, this cluster is primarily used to run Xen virtual machines (xen-3.0.3-105.el5_5.5, kernel-2.6.18-194.11.1.el5xen, latest from CentOS).

I have been looking off and on for the source of this problem for quite a while without finding what is causing it. The basic problem is that when I reboot a node in my cluster, it comes back up without its static routes. Adding them back in manually works; they stay until the next reboot. These are defined in /etc/sysconfig/static-routes and are added by the network service at boot time. I have been able to pretty much rule out the boot process itself as the source of the problem. I added a netstat -r -n > /tmp/static-routes command to the rc.local file, which is the very last thing run at boot time, and the routes are there. I have also tried putting nodes into standby (crm node standby) and back online, and the routes stay there through that. But once I log in after a reboot, the static routes are gone and I have to manually re-add them. I can probably work around this using a hideous kludge like having the rc.local file run a background job that sleeps for a couple of minutes and then adds the routes, but that doesn't really fix the issue and isn't guaranteed to work reliably (obviously high reliability is important or I wouldn't be using HA in the first place). Has anyone ever seen this before, or have any clue where I can look to troubleshoot this? Thanks in advance, --Greg
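For context, /etc/sysconfig/static-routes on CentOS 5 is read by the network init script at boot and holds one route per line. A hypothetical example of its format (the addresses here are invented):

```
# /etc/sysconfig/static-routes -- parsed by /etc/init.d/network at boot
any net 192.168.10.0 netmask 255.255.255.0 gw 10.0.0.254
any host 192.168.20.5 gw 10.0.0.254
```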
Re: [Linux-HA] Question about grouping with a clone inside group ?
On Wed, 2010-08-11 at 17:09 +0200, Alain.Moulle wrote: crm configure colocation coloc1 +INFINITY:group1 clone-fs1 This says that group1 and clone-fs1 have to be on the same machine. That prohibits starting clone-fs1 on a machine where group1 is not running. That isn't what you meant. I think all you need is the order directive to make sure clone-fs1 is started before group1. --Greg
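A sketch of that order constraint, using the resource names from the question (the constraint id itself is made up):

```
crm configure order fs1-before-group1 inf: clone-fs1 group1
```

With only the order constraint in place, group1 waits for clone-fs1 to start first, but clone-fs1 instances are still free to run on nodes where group1 is not.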
Re: [Linux-HA] time to fork heartbeat?
On Wed, 2010-08-11 at 17:52 -0400, Peter Sylvester wrote: I do have to agree. I've actually been working for almost 4 business days now on trying to get Heartbeat and Pacemaker working together It took me six months to build a decent cluster, starting as one who was very experienced with heartbeat v1 master-hotspare pairs. But to be fair, heartbeat and pacemaker were not the only things I was learning, I also built the DRBD volumes on top of LVM volumes, then put LVM volumes on top of the DRBD volumes. That was very complicated to get working, but provides huge flexibility in that I can increase the size of DRBD volumes or individual file systems mounted on the DRBD volumes without major reconfiguration. Then it's on to Xen, heartbeat, and pacemaker. Eventually I spent quite a bit of time writing myself a management program that makes it easy to do things like add a new virtual machine (takes care of running the CRM shell to add the necessary config lines and that sort of thing). But it was incredibly difficult to get this working. Failing to configure a single resource properly can start a stonith death match and bring down the entire cluster. I do see the advantages of the extra flexibility, and I have begun using some of it. But there are a lot of use cases where a simple heartbeat v1 configuration is just fine and far easier to understand and implement. --Greg
Re: [Linux-HA] Am I even on the right track here with Heartbeat?
On Wed, 2010-08-11 at 17:13 -0500, Dimitri Maziuk wrote: So is it not practical to run RHEL or CentOS 5.x where you'd get this version and several more years of distro maintenance? It's not practical if you want to have both distro maintenance or cluster support. I run CentOS 5.5, and there are maintained RPMs in the clusterlabs repo: http://www.clusterlabs.org/rpm/epel-5 I am running heartbeat 3.0.2 and I notice they have a 3.0.3 now. --Greg
Re: [Linux-HA] time to fork heartbeat?
On Thu, 2010-08-12 at 00:38 +0200, Dejan Muhamedagic wrote: On Wed, Aug 11, 2010 at 09:53:01PM -, Yan Seiner wrote: Heck, it really should just take two things:

1. IP of remote computer
2. Device to use

Device? Bang, it just works. For many of us this would be sufficient. Hmm, I don't think HA can be that easy.

Probably not, but that doesn't mean it has to be as hard as getting pacemaker to work currently is, either. --Greg
Re: [Linux-HA] Am I even on the right track here with Heartbeat?
On Wed, 2010-08-11 at 20:01 -0500, Dimitri Maziuk wrote: 1) there are installations where throwing in a package from a 3rd party repo will cost you a lot. Like tech support on a very very expensive piece of hardware. (Think giant hadron collider type of hardware.) Sure, there are some situations where this is not a good option. I only know it works for me. The other issue with packages from 3rd party repos is, of course 2) so how many times did you have to unfsck yum update conflicts so far? In this particular case: never. There are only a few RPMs in the clusterlabs repo and they all relate to heartbeat/corosync/pacemaker. That aside, the real problem for me is I haven't seen V2-style docs that actually made sense yet. I found the clusterlabs documents useful, but I too had to learn much through the school of hard knocks. This is fairly typical of open source projects; geeks want to code, not write documentation, so often the documentation does not keep up with the code. --Greg
Re: [Linux-HA] Waiting for confirmation before failover on the backup server
On Fri, 2010-07-23 at 06:24 -0700, Mahadevan Iyer wrote: When using only heartbeat (no pacemaker), is there a way to do the following: set up a backup server such that when it tries to take over due to loss of connectivity with the main server, it waits for confirmation from an operator? This is exactly what the meatware STONITH plugin is for. http://www.clusterlabs.org/doc/crm_fencing.html ...near the bottom of the page is the description of the meatware plugin. --Greg
Re: [Linux-HA] Manual intervention on failover
On Thu, 2010-07-15 at 07:36 -0700, Pushkar Pradhan wrote: Hi, I have a strange requirement: I don't want failover to happen unless an operator says go ahead or a big timeout has occurred (e.g. 1 hour). I am using a Heartbeat R1 style cluster with 2 nodes. Is this possible, or do I need to write some custom plugin? This may not be the most elegant solution, but you could do this with the meatware stonith device, which does exactly this: someone has to manually confirm that the other machine is really and truly dead before a failover will happen. This would be easy to set up if you are already using stonith, and a non-trivial learning curve otherwise. --Greg
Re: [Linux-HA] 3 node cluster keeps failing after domU image is started
On Mon, 2010-06-28 at 10:47 +0200, Dejan Muhamedagic wrote: (drbd_xen2:1:probe:stderr) DRBD module version: 8.3.8, userland version: 8.3.6 you should upgrade your drbd tools! I guess that you should follow this advice. Just one data point: I get this message in my logs too, but DRBD works fine anyway (using the native version from CentOS 5.5). --Greg
Re: [Linux-HA] 3 node cluster keeps failing after domU image is started
On Sun, 2010-06-27 at 03:02 -0700, Joe Shang wrote: Failed actions: drbd_xen2:1_start_0 (node=xen1.box.com, call=10, rc=5, status=complete): not installed This is one of the things that I don't like about heartbeat/pacemaker: a minor error (misconfiguring a single resource) can cause major problems (like a stonith death match that brings down the entire cluster). One thing I have seen with Xen VMs is that the default timeouts are too short. That may not be your particular problem, but you probably need to increase them anyway. This is an example of what I have:

primitive VM-ldap ocf:heartbeat:Xen \
        params xmfile=/etc/xen/ldap \
        op monitor interval=10 timeout=120 depth=0 target-role=Stopped \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=120s \
        meta is-managed=true target-role=Started

Before I added the explicit op start and op stop timeouts, I would get failed stop or start operations, and any attempt to fail over would start a death match. --Greg
Re: [Linux-HA] 3 node cluster keeps failing after domU image is started
On Sun, 2010-06-27 at 07:57 -0700, Joe Shang wrote: Jun 27 10:51:49 xen1 lrmd: [3949]: info: RA output: (drbd_xen2:1:probe:stderr) 'xen2' not defined in your config. This looks like an error in your DRBD configuration. What is in drbd.conf? What does drbd-overview or drbdadm state all show? --Greg
Re: [Linux-HA] 3 node cluster keeps failing after domU image is started
You could try making one of them primary: # drbdadm primary xen1 If that doesn't work, you may have encountered a split brain situation. In that case, you have to tell DRBD that it is OK for one of the machines to discard the data it has so that the other one can become primary. Look here: http://www.drbd.org/users-guide/s-resolve-split-brain.html One thing is for certain: you must resolve the low level DRBD problem before there is any chance of bringing your cluster software back up. --Greg
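For reference, the recovery procedure at that URL boils down to a short command sequence; a sketch assuming a DRBD 8.x resource named xen1 (the resource name, and the choice of which node's data gets discarded, are yours to make):

```shell
# On the node whose changes will be DISCARDED (the split-brain "victim"):
drbdadm secondary xen1
drbdadm -- --discard-my-data connect xen1

# On the surviving node, only if it is in StandAlone state:
drbdadm connect xen1
```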
Re: [Linux-HA] stonith external/rackpdu question
On Thu, 2010-05-20 at 18:30 +0100, Alexander Fisher wrote: I think I'll use IPMI and rackpdu in the same configuration. That is exactly what I will eventually try (assuming I ever get any time to work on my test cluster some more). It is clear that, no matter what I do, I cannot prepare for every possible thing that could happen. There will always be a scenario in which the stonith may fail to work. But the more unlikely that scenario is, the safer we are. In our case, the IPMI devices are connected via a crossover cable, so no switch failure can knock them out. There is still the possibility of a failure of the IPMI device itself (including via a complete power loss to one of the cluster nodes) or a cable failure. To insulate against that, I will use the PDU as a second stonith device. It will only be used if the IPMI stonith fails to work. As Alex pointed out, a switch failure (or switch port failure) could cause the PDU stonith to also fail, but the chances of that *and* a failure of IPMI happening at the same time are quite remote. At least I won't have my entire set of cluster resources depending on a single ethernet cable, and there will be something in place that allows the remaining node to take over if one node suffers a complete power loss (the original scenario I was worried about that started me down the multiple stonith device path in the first place). --Greg
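The layered arrangement described above can be expressed as two stonith primitives with different priorities in crm configuration. This is a sketch only, not my actual configuration: the parameter lists are elided and the resource names invented:

```
primitive st-ipmi stonith:external/ipmi \
        params ... priority=10    # tried first: IPMI over the crossover cable
primitive st-pdu stonith:external/rackpdu \
        params ... priority=20    # fallback: switch off the PDU outlets
```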
Re: [Linux-HA] stonith external/rackpdu question
Do you know that on APC PDUs, you can group outlets across several physical PDUs? I've got a bit more testing to do, but this seems to work ok. The plugin is configured to talk to just one outlet on one of the PDUs and the PDU does the rest. No, I didn't know you could do this. I will have to investigate it. --Greg
Re: [Linux-HA] stonith external/rackpdu question
On Wed, 2010-05-05 at 13:29 +0200, Dejan Muhamedagic wrote: If these servers have a lights-out device and the power distribution is fairly reliable, that could be an alternative for fencing. They do have an IPMI device and it does work. I am trying to insulate against a failure of the NIC or cable by having a second stonith device. The cluster I have now is primarily for testing, but eventually we will be implementing critical services (e.g. DNS, e-mail, DHCP, and authentication) in virtual machines running on a cluster like this one, so part of the testing process is to learn what can and can't be done and where the potential gotchas are. I have discovered that if I simulate a cable failure by removing it, bad things happen because stonith cannot succeed. I would not want my DNS system to be vulnerable to a single cable failing, so I am looking for ways to guard against it. A complete power outage on one of the nodes also results in bad things when using IPMI. Again stonith cannot succeed and so the remaining server will not take over the resources. Yes, these are dual power supply servers so it is unlikely that something would happen that causes only one of the servers to completely lose power other than human error (possibly a motherboard failure as well?) but I am still looking to determine if there is a way to guard against this. Right now I have a meatware stonith device set up so that I can at least log in remotely and manually force the remaining server to take over, but I am looking for something more automatic. It would be nice to avoid those 3AM phone calls )-: I may take a shot at modifying the external/rackpdu stonith plugin at some point. We can't be the only ones in the world using dual power supply servers. I'll probably start by unplugging one of the power supplies on each server and making sure I understand how to use the plugin in single-outlet mode, then try doing the modifications to support dual outlets. 
--Greg
[Linux-HA] stonith external/rackpdu question
We have a pair of servers in a cluster plugged into a pair of APC rack-mounted PDUs of the sort that can be controlled by this stonith plugin. My problem is that these are dual power supply servers, which means I would have to shut down two outlets on two different PDUs to completely power off one of the nodes. Is it possible to use this stonith plugin to do that? The documentation on configuring outlet numbers (from crm ra info stonith:external/rackpdu) is a bit sparse; it isn't clear that what I want to do is even possible. --Greg
Re: [Linux-HA] Heartbeat doesn't see other nodes in cluster
On Wed, 2010-04-14 at 16:24 -0700, Stephen Punak wrote: Heartbeat appears to start just fine on all nodes, but none of them see each other. Any chance there is a firewall blocking the heartbeat packets? You'd still see them with wireshark, but they would be blocked from getting to the listening application. --Greg
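Two quick ways to check for that, assuming heartbeat's default UDP port of 694 (the interface name here is illustrative):

```shell
node1# iptables -L -n -v                  # look for rules dropping UDP 694
node1# tcpdump -ni eth1 udp port 694      # confirm heartbeats arrive on the wire
```

If tcpdump shows the packets but heartbeat never sees the other nodes, a host firewall is the prime suspect.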
Re: [Linux-HA] manual fencing
On Thu, 2010-04-08 at 17:46 +0200, Dejan Muhamedagic wrote: Does this help? $ crm ra info stonith:meatware Yes, it does! Thank you! --Greg
Re: [Linux-HA] trouble with CRM/XEN
On Wed, 2010-04-07 at 15:39 +0200, Andrew Beekhof wrote: I increased the timeout even further (to 120s instead of the minimum recommended 60) and it seems to be working. Curious though, because when it does work, the logs show that the entire stop operation, including a live migration, takes only about 7 seconds. It depends on what else the machine is doing. Are there any other Xen instances that might be migrating too? The test cluster currently has two Xen VMs. One is tied to a particular DRBD volume, so it has colocation and order constraints: it must shut down, wait for the DRBD/Filesystem/LVM stack to fail over, and restart. Still, even that doesn't take more than 60 seconds. The other VM is stored on an NFS volume so that it can live migrate (allow-migrate=true). I have seen failures of the stop operation on both of them prior to increasing the timeout. Surely it's not handling the resources sequentially? That would be a disaster if we get to where I want to be going, which may involve dozens or even hundreds of VMs on a cluster. I realize I may have to adjust the timeout up higher, for the simple reason that a few dozen VMs shutting down in parallel will take longer than one or two due to sharing of host OS resources, but hopefully the timeout won't be a linear function of the number of VMs. --Greg
Re: [Linux-HA] manual fencing
On Wed, 2010-04-07 at 12:13 +0200, Dejan Muhamedagic wrote: There's not much magic, you just configure two stonith resources and assign different priority, then they'll be tried in a round-robin fashion. For instance:

primitive st-real stonith:ipmilan \
        params ... priority=10   # it will try this one first
primitive st-meat stonith:meatware \
        params ... priority=20

Yes, but if you don't KNOW that, then it's magic :-) Finding out what the parameters are for a given resource definition and what they do is the magic part; I often cannot find good documentation on this. The help features in the crm shell are useful; they often tell me what the parameters *are*, but not what they *do*, or what a reasonable value might be. Fortunately we have the mailing list; thanks again. --Greg
Re: [Linux-HA] order directives and pacemaker update
On Tue, 2010-04-06 at 12:29 +0200, Dejan Muhamedagic wrote: There's a crm shell bug in 1.0.8-2 in the validation process. Either revert to the earlier pacemaker or apply this patch: http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/422fed9d8776 OK, that's a relief. I chose to apply the patch because there are some features of 1.0.8-2 that I like (such as warning me when I have forgotten to explicitly set the start/stop timeouts). But after applying the patch, and figuring out how to compile Python into bytecode, it now works! Thank you very much. Presumably the next released version will have this patch in it so this is a one-time thing. --Greg
[Linux-HA] manual fencing
I'm looking for a good way to deal with the total power drop case. I am using an iDRAC 6 as a stonith device on a pair of Dell R710 servers. I tried the power drop test today by simply unplugging the power on one of the nodes. What happens in this case is that the attempt by the other node to stonith the dead node fails, so the other node refuses to take over resources. Since this is a fairly rare scenario (the machines have dual power supplies and use the same pair of power circuits, so the chances that one node completely loses power and the other doesn't are almost nonexistent; the most likely way this could happen is a human accidentally powering off the wrong machine), I'd be willing to deal with it in manual mode as long as it can be done remotely. Is there any way to manually fence a node that I know is dead? I.e., to tell the still-running node: "I know the other node is dead even though you can't stonith it; please pretend the stonith succeeded and take over resources"? Thanks, --Greg
Re: [Linux-HA] manual fencing
On Tue, 2010-04-06 at 14:58 -0700, Tony Gan wrote: I think the solution is using a UPS or PDU as the STONITH device. That could improve things in some scenarios, but it does not completely solve the problem. The cluster is still vulnerable to having the entire power strip for one node unplugged or turned off. No matter what the stonith device is, there is always the possibility of failure of the stonith device itself. My goal is to be able to recover from something like this remotely, before I can actually get there to correct the real problem. In fact, the chance that stonith will fail because one of the nodes has completely lost power due to hardware failure while the other one still has power is extremely small. They both have dual power supplies and they both use the same two circuits, so the only likely way I could get into the state I am concerned about would be human error. Unfortunately we do have a lot of people with machine room access, which makes someone powering off the wrong machine by mistake a real possibility. The chance that two power supplies would fail at the same time is remote. Unfortunately, human error is also possible using a controllable power strip as the stonith device, so that doesn't really solve my problem. I do think I found something that might work. I'm not sure yet, but it looks like I can create a stonith:meatware resource in addition to the stonith:ipmilan resource. That would allow me to manually confirm that the powerless node is in fact dead and have the remaining node take over. That confirmation can be done by logging in to the live node remotely, so it will serve my needs if I can figure out the magic incantation to configure this correctly. --Greg
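For what it's worth, the manual-confirmation step for a meatware device is done with the meatclient tool that ships with the stonith package; assuming the node to be fenced is node2, from a remote login on the surviving node:

```shell
node1# meatclient -c node2
```

meatclient asks you to confirm that node2 is really dead; once you answer yes, the pending stonith operation is reported as successful and the takeover proceeds.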
Re: [Linux-HA] trouble with stonith [SOLVED]
On Sat, 2010-04-03 at 22:55 +0200, Dejan Muhamedagic wrote: That should've probably caused a connection timeout and this message: IPMI operation timed out... :( Was there such a message in the log? Now that I know to look for it, yes. So far I am having a great deal of difficulty sifting through the logs to find the messages that are relevant to whatever problem I am trying to solve, then interpreting what they actually mean when I do find them. I have a big learning curve still to climb. --Greg
Re: [Linux-HA] trouble with CRM/XEN
On Sat, 2010-04-03 at 22:45 +0200, Dejan Muhamedagic wrote: I spoke too soon; now I am getting failures when stopping the Xen resources manually as well. I can't get both nodes online at the same time unless I disable stonith. There should be something in the logs. grep for lrmd and the lines containing the resource name. What I see is that the stop operation timed out. I increased the timeout even further (to 120s instead of the minimum recommended 60) and it seems to be working. Curious though, because when it does work, the logs show that the entire stop operation, including a live migration, takes only about 7 seconds. --Greg
[Linux-HA] order directives and pacemaker update
Since I applied the most recent Pacemaker update last Friday (now running pacemaker-1.0.8-2.el5.x86_64 on CentOS 5), I can no longer enter order directives. I am using the exact same syntax that I used previously, and the syntax matches some existing directives, but the crm shell won't take it. Here is an example:

crm(live)configure# primitive VM-cfvmserve ocf:heartbeat:Xen params xmfile=/etc/xen/cfvmserve op monitor interval=10 timeout=120 depth=0 op start interval=0 timeout=60s op stop interval=0 timeout=120s meta allow-migrate=true target-role=Stopped
crm(live)configure# order vmnfs-before-cfvmserve inf: vmnfs-cl VM-cfvmserve
crm(live)configure# verify
ERROR: cib-bootstrap-options: attribute dc-version does not exist
ERROR: cib-bootstrap-options: attribute cluster-infrastructure does not exist
ERROR: cib-bootstrap-options: attribute last-lrm-refresh does not exist

I don't know why I am getting these errors, and I am not sure they are relevant to the problem I am seeing. But here's what happens later:

crm(live)configure# commit
element rsc_order: validity error : IDREF attribute first references an unknown ID vmnfs-cl

vmnfs-cl is a clone resource of a file system mount. That resource is, and always has been, present. I also have a number of order directives that reference it that are already in the CIB and are working. Here's a snippet from crm configure show:

primitive vmnfs ocf:heartbeat:Filesystem \
        params directory=/vmnfs device=phantom.ucar.edu:/vol/dsgtest fstype=nfs \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s
clone vmnfs-cl vmnfs \
        meta target-role=Started
order vmnfs-before-linstall inf: vmnfs-cl VM-linstall

(this last is an example of one that is already present and works).
Lastly, another part of the configure show output addressing the ERROR messages above:

property $id=cib-bootstrap-options \
        dc-version=1.0.8-3225fc0d98c8fcd0f7b24f0134e89967136a9b00 \
        cluster-infrastructure=Heartbeat \
        stonith-enabled=true \
        last-lrm-refresh=1270484103 \
        default-resource-stickiness=

This is currently preventing me from being able to add any more virtual machines. The ones that are already in are working (including proper failovers and migration). So is this a bug in the new code, or is it something I was doing wrong that is now being flagged and that I just luckily got by with before? I can send the full configuration if that is deemed necessary, but I would have to sanitize it to remove idrac passwords, local IP addresses, and so forth, so I won't do that unless it's the only way to figure this out. --Greg
Re: [Linux-HA] trouble with stonith [SOLVED]
On Thu, 2010-04-01 at 15:38 -0600, Greg Woods wrote: node1# stonith -t ipmilan -F node2-param-file node2 This works both ways; the remote node reboots. So I should be able to rule out DRAC configuration issues. I have also checked, double-checked, and triple-checked that the parameters in the stonith resources are specified correctly ...but I still missed one. It sure would be nice if the log would tell me something other than "it failed" when there is a mistake in the parameters. I literally looked at it 4 times before I noticed that the port number was wrong. --Greg
[Linux-HA] trouble with CRM/XEN
I am having difficulty achieving a clean failover in a Pacemaker 1.0.7 cluster that is mainly there to run Xen virtual machines. I realize that nobody can tell me exactly what is wrong without seeing an awful lot of configuration detail; what I am looking for is more like some general methods I can use to debug this. In a nutshell: if I manually stop all the Xen resources first with a command like crm resource stop vmname, then failover works perfectly, and restarting them all manually after a failover also works and everything appears to be running fine. However, if I just stop heartbeat on node1, then restart it, then the attempts to stop Xen resources on node2 (preparatory to moving them back to node1) all fail, resulting in a stonith of node2 from node1. node1 will start up all the resources, but when node2 reboots, the process repeats: attempts to stop the Xen resources on node1 fail, resulting in a stonith of node1 from node2. Kind of a delayed death match. The only way to break the cycle is to manually stop the Xen resources before bringing a recovered node back online. Stop works fine when invoked manually, but fails when invoked automatically as a result of an attempt to move resources back to a recovered node. I have already tried setting allow-migrate=false on all the Xen resource definitions, just to eliminate one more complication until I can figure this out. Any ideas on how I can debug this? The HA logs don't seem to be terribly helpful; they only indicate that the stop operation failed but say nothing as to why it failed. --Greg
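One debugging approach is to run the Xen resource agent by hand with the same OCF environment variables lrmd would pass it, so that its stderr is visible. A sketch, with illustrative paths and parameters:

```shell
node2# export OCF_ROOT=/usr/lib/ocf
node2# export OCF_RESKEY_xmfile=/etc/xen/vmname
node2# /usr/lib/ocf/resource.d/heartbeat/Xen stop; echo rc=$?
```

A nonzero exit code together with the agent's own error output is usually far more informative than the bare "stop failed" in the HA logs.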
[Linux-HA] where is CRM 'default timeout' set?
In the process of trying to fix two other problems, I messed something up badly. Now when I go into the crm shell to edit the configuration, on verify I get a message like this for every one of my configured resources: WARNING: vm-ip1: default timeout 20s for start is smaller than the advised 90 The 20s is the same for every one but the advised value varies for different types of resources. I have never seen a message like this before so I have no idea why it suddenly started (although I did update the pacemaker package today, so that could be the reason why I haven't seen it before). Where is this 20s default timeout being set? What does this message *really* mean? --Greg
Re: [Linux-HA] trouble with CRM/XEN
On Fri, 2010-04-02 at 13:02 -0600, Greg Woods wrote: In a nutshell: if I manually stop all the Xen resources first with a command like crm resource stop vmname, then failover works perfectly, I spoke too soon; now I am getting failures when stopping the Xen resources manually as well. I can't get both nodes online at the same time unless I disable stonith. --Greg
Re: [Linux-HA] where is CRM 'default timeout' set? [SOLVED]
On Fri, 2010-04-02 at 13:53 -0600, Greg Woods wrote: WARNING: vm-ip1: default timeout 20s for start is smaller than the advised 90 Found the answer for this one in the gossamer-threads for the pacemaker list. Should have thought of looking there first. For those who are struggling with the documentation as much as I am: it is recommended that the warnings like this be eliminated by setting the start and stop timeouts for the individual resource. This is done in the crm shell inside the primitive command that defines the resource. Inserting a line like this would get rid of the above warning: op start timeout=90s \ Easy once you know the magic incantation. --Greg
[Linux-HA] trouble with stonith
I'm trying to get stonith to work on a two-node cluster using Dell iDRAC. If I run stonith manually with a command like: node1# stonith -t ipmilan -F node2-param-file node2 it works both ways; the remote node reboots. So I should be able to rule out DRAC configuration issues. I have also checked, double-checked, and triple-checked that the parameters in the stonith resources are specified correctly and match those from the param-files (almost, see below). However, when the cluster starts, the stonith resources fail to start. If I run a cleanup command to clear out the old status, here is what happens:

Apr 01 15:20:13 vmserve.scd.ucar.edu lrmd: [13093]: debug: stonithd_receive_ops_result: begin
Apr 01 15:20:13 vmserve.scd.ucar.edu stonithd: [5607]: debug: Child process unknown_stonith-vm1_monitor [13094] exited, its exit code: 7 when signo=0.
Apr 01 15:20:13 vmserve.scd.ucar.edu stonithd: [5607]: debug: stonith-vm1's (ipmilan) op monitor finished. op_result=7
Apr 01 15:20:13 vmserve.scd.ucar.edu stonithd: [5607]: debug: client STONITH_RA_EXEC_13093 (pid=13093) signed off
Apr 01 15:20:13 vmserve.scd.ucar.edu lrmd: [5606]: WARN: Managed stonith-vm1:monitor process 13093 exited with return code 7. confirmed=true) not running

One possible issue is that the param-files specify reset_method=power_cycle, but if I try to set this with crm edit, it says that reset_method is an unknown parameter: ERROR: stonith-vm1: parameter reset_method does not exist This happens inside crm immediately on exiting the editor. Any ideas on how I can repair this so that the stonith resources will start properly? Any other information I should provide? Thank you, --Greg
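When a parameter is rejected like this, the stonith command line tool can list what a plugin actually accepts, which helps catch bad names before they go into the CIB. As I recall from stonith(8) (verify the flags on your own system):

```shell
node1# stonith -L                  # list the available plugin types
node1# stonith -t ipmilan -n       # list the parameter names ipmilan expects
```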
Re: [Linux-HA] NFS and DRBD
On one node, I can get all services to start (and they work fine), but whenever failover occurs, there are NFS-related handles left open, thus inhibiting/hanging the failover; more specifically, the file systems fail to unmount. If you are referring to file systems on the server that are made available for NFS mounting that hang on unmount (it's not clear from the above whether your cluster nodes are NFS servers or clients), then you need to unexport the file systems first; then you can umount them. I handled this by writing my own nfs-exports RA that basically just does an exportfs -u with the appropriate parameters, and used an order line in the crm shell to make sure that the Filesystem resource is ordered before the nfs-exports resource. The nfs-exports resource will export the file system on start, and unexport it on stop. --Greg
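The nfs-exports RA is essentially a wrapper around exportfs. Here is a heavily simplified sketch of the idea (not the actual agent: the function and variable names are invented, and EXPORTFS is overridable so the logic can be dry-run without touching the kernel's export table):

```shell
EXPORTFS=${EXPORTFS:-exportfs}

nfs_exports() {
    action=$1; clients=$2; dir=$3
    case "$action" in
        start) "$EXPORTFS" -o rw,sync "$clients:$dir" ;;  # export on start
        stop)  "$EXPORTFS" -u "$clients:$dir" ;;          # unexport on stop
    esac
}

# Dry run, substituting echo for exportfs:
EXPORTFS=echo
nfs_exports stop 192.168.1.0/24 /vmnfs    # prints: -u 192.168.1.0/24:/vmnfs
```

A real agent would also implement monitor/status and return proper OCF exit codes.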
Re: [Linux-HA] getting stonith working [SOLVED]
On Mon, 2010-03-08 at 16:56 +0100, Dejan Muhamedagic wrote: Hi, On Fri, Mar 05, 2010 at 03:07:45PM -0700, Greg Woods wrote: Partially solved, anyway. Glad you got it solved, but why do you say partially? Because I managed to get it working without ever figuring out exactly what it was I had done wrong. --Greg
[Linux-HA] getting stonith working
I am in the process of climbing the learning curve for Pacemaker. I'm using RPMs from clusterlabs on CentOS 5: heartbeat-3.0.2-2.el5 pacemaker-1.0.7-4.el5 It has been a long hard struggle, but I have mostly gotten my two-node cluster working. But I've hit a wall trying to get stonith to work. I used these commands (sanitized) in the CRM shell:

crm(live)configure# primitive stonith-vm1 stonith:ipmilan params auth=straight hostname=node1.fqdn ipaddr=** login=* password=* priv=admin port=23
crm(live)configure# location nosuicide-vm1 stonith-vm1 rule -inf: #uname eq node1.fqdn

Committing seems to work, but it fails to start the stonith resource. The error I get in the logs is: Mar 05 12:30:55 node2.fqdn stonithd: [6982]: WARN: start stonith-vm1 failed, because its hostlist is empty I have Googled up previous e-mail messages about this error message, but no solution was posted. Where is the hostlist set? If I try to use that as a parameter, I get an error that there is no such parameter. Just for grins I tried the equivalent thing (some of the parameter names are slightly different) using an external/ipmi stonith device and got the same error. I must be missing something very fundamental. Thanks for any pointers, --Greg
Re: [Linux-HA] getting stonith working [SOLVED]
Partially solved, anyway. On Fri, 2010-03-05 at 12:52 -0700, Greg Woods wrote:

crm(live)configure# primitive stonith-vm1 stonith:ipmilan params auth=straight hostname=node1.fqdn ipaddr=** login=* password=* priv=admin port=23
crm(live)configure# location nosuicide-vm1 stonith-vm1 rule -inf: #uname eq node1.fqdn

Committing seems to work, but it fails to start the stonith resource. The error I get in the logs is: Mar 05 12:30:55 node2.fqdn stonithd: [6982]: WARN: start stonith-vm1 failed, because its hostlist is empty It appears that this is a generic error that can happen if there is any kind of error in the values of the parameters that can't be detected at resource creation time. In this example, it turns out that auth=straight isn't supported. After an hour or so of playing around with the stonith command, I finally got pointed to the README.ipmilan file so that I could create a config file that worked for invoking the stonith command manually. That is where I discovered that auth=straight does not work on my systems, but auth=md2 does (it doesn't really matter what auth type I use, since the IPMI devices are connected by a crossover cable and are not on a public net). Changing the value of the auth parameter from straight to md2 got rid of the empty hostlist error. --Greg
Re: [Linux-HA] Backup of SVN repositories
On Wed, 2008-12-03 at 21:23 +, Todd, Conor (PDX) wrote: I can't do this using a crontab because one never knows which host will be running the SVN service (and have the disks mounted for it). Has anyone else tackled this issue yet? You may not know in advance whether a given host is the master, but you can check at run time. I do this and it works fine; I have a variety of cron jobs on several different heartbeat/DRBD clusters that I want to run only on the master, so I just check for the presence of something that will only be there when the shared storage area is mounted: * * * * * [ -d /rep/mysql ] && cron-script This is for when the DRBD shared disk is mounted as /rep, and /var/lib/mysql is a symlink to /rep/mysql for a MySQL service. The condition is true only when that host is the master, so cron-script only runs on the master. Kludgy but simple; works for me. --Greg
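The same guard, spelled out as a script rather than a one-line crontab entry (MOUNT_CHECK stands in for a path like /rep/mysql that exists only while the DRBD volume is mounted):

```shell
MOUNT_CHECK=${MOUNT_CHECK:-/rep/mysql}

# The directory exists only on the node currently holding the DRBD mount,
# so this decides master vs. standby at run time.
if [ -d "$MOUNT_CHECK" ]; then
    role=master
else
    role=standby
fi
echo "role: $role"
```

On a node without the mount this prints "role: standby" and the cron job would simply be skipped.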
Re: [Linux-HA] HA of slpad (openLDAP) with Heartbeat and IPAddr.
On Mon, 2008-11-24 at 18:47 +0530, Divyarajsinh Jadeja wrote: Hi, I am new to Heartbeat. How can we configure openLDAP with Heartbeat for high availability of authentication? I need to have slapd running on both machines, because LDAP replication needs slapd on both nodes. I tried it that way, and I never could figure out a reliable way to set things up without creating replication loops. It is far easier to use shared storage via DRBD to replicate the LDAP data rather than using LDAP replication. Then you do not need to run slapd on a node until it becomes the master, and slapd is then a standard heartbeat-manageable resource. I do this and it works great. --Greg
Re: [Linux-HA] Running OpenLDAP and MySQL w/Linux HA
On Wed, 2008-11-19 at 13:50 -0800, Rob Tanner wrote:

> The next thing to try is OpenLDAP and MySQL, both of which are
> critical services and both of which are far more complex. Is anyone
> running them on Linux-HA? Does it work reliably when you switch over,
> etc.? How do you have it all configured?

I run both of these under heartbeat v1, and it works quite well. What you need is some shared storage, so that you don't have to mess with LDAP or MySQL replication. I found that when I tried to set up one LDAP server as master and one as slave, with the slave able to take over as master, it was very easy to create infinite replication loops. MySQL replication that is truly bidirectional is also very difficult to get right. It was much easier to just create shared storage for the LDAP and MySQL database files using DRBD.

--Greg
Re: [Linux-HA] New user questions, config file locations and hb_gui
On Thu, 2008-10-23 at 10:48 -0600, Landon Cox wrote:

> b) apache, postgresql, mysql and some custom services are always
> running on both machines to reduce startup times on failover

You might want to consider the tradeoff here carefully. Getting two-way database replication to work reliably can be a huge headache. I have no experience with postgresql, but I've never been able to make this work with mysql. I found it was just easier to use DRBD to replicate the database at the disk-partition level and put up with the startup time on failover. Even with a good-sized database (one that stores several days' worth of e-mail for our 1200-employee organization), mysql startup takes at most a few seconds. That is a small price to pay to avoid the headaches associated with database-level replication. Do you really have an application where you can't afford even a few seconds of down time at failover?

It is also unclear to me that you can bind an application to an interface alias like eth0:0 that doesn't even exist when the application is started (it is created by heartbeat at failover time). Thus it might not even be possible to have your apps running before failover and have them listening on the service address after failover. Has anyone actually tried this?

--Greg
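On that last question: by default, Linux refuses to bind a socket to an address that is not configured on any interface, failing with EADDRNOTAVAIL (unless the net.ipv4.ip_nonlocal_bind sysctl is enabled). A quick way to see this, using a TEST-NET address that is assumed not to be configured on the machine running it:

```shell
# Attempt to bind to 192.0.2.1 (TEST-NET-1, assumed absent locally).
# python3 is used here only as a portable way to attempt the bind.
result=$(python3 - <<'EOF'
import errno, socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind(("192.0.2.1", 0))
    print("bound")
except OSError as e:
    print(errno.errorcode.get(e.errno, str(e.errno)))
s.close()
EOF
)
echo "$result"    # EADDRNOTAVAIL on a default Linux configuration
```

This suggests that "always running" daemons would have to listen on the wildcard address (0.0.0.0) to be able to answer on a service address that only appears at failover time.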
Re: [Linux-HA] New user questions, config file locations and hb_gui
On Thu, 2008-10-23 at 13:24 -0600, Landon Cox wrote:

> > Do you really have an application where you can't afford even a few
> > seconds of down time at failover?
>
> No. Anything sub-60 seconds would be tolerated.

In that case, I really think it will be easier to set up DRBD. That way you can automatically replicate anything -- web content, apache config files, databases, etc. -- just by creating the appropriate symlinks into the shared partition. Just be certain you never put anything in the shared partition that is needed at boot time or when the machine is not in primary mode (an obvious and particularly stupid example of this would be /etc/passwd).

> Controlling the order so IPAddr2 fires and finishes synchronously
> before starting apache or postgres, for example, is feasible,
> correct?

Yes. I personally have never used the XML-style configuration or hb_gui, so I can't tell you exactly how you would do it there. But in a v1-style haresources file, you specify the order in which resources are started: you always list the IPaddr2 resources first, followed by drbddisk and Filesystem, with your service daemons last.

--Greg
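For concreteness, a v1-style haresources line with that ordering might look like the sketch below; the node name, IP address, DRBD resource name, device, and mount point are all hypothetical placeholders:

```
node1 IPaddr2::10.0.0.50/24/eth0 drbddisk::r0 \
      Filesystem::/dev/drbd0::/rep::ext3 apache postgresql
```

Heartbeat starts the resources on a line left to right and stops them right to left, so the address and filesystem are in place before the daemons start, and the daemons are stopped first on failover.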
[Linux-HA] USB serial links?
I am aware that heartbeat can be done over USB links using USB-Ethernet interfaces. I specifically do not want to do that, because I am looking for a heartbeat link that is independent of the IP stack, but on machines that do not have native serial ports. So I got a couple of Keyspan USB-serial adapters, which I have used successfully on my laptop for various purposes. However, if I connect between two of these adapters the same null modem cable that works fine between two on-board serial ports, it fails the cat test: I don't see anything on the other side. Both machines properly detect the adapter and create /dev/ttyUSB0, but the link does not work.

Is a different sort of serial cable needed for this? Is there a problem with this particular type of USB-serial adapter, such that something else would work better? Has anyone successfully gotten a USB-serial heartbeat to work? This is CentOS 5 on x86_64 with heartbeat-2.1.3-3.el5.centos, if it matters.

Thanks,
--Greg
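For anyone reproducing the cat test mentioned above, the commands below are roughly what I mean, with one caveat: both ends must agree on line settings first. USB-serial adapters often come up with different defaults than on-board ports, so it is worth forcing raw mode and a fixed speed on both sides before concluding the link is dead (the device name and speed here are assumptions):

```
# on both machines, first force matching line settings
stty -F /dev/ttyUSB0 19200 raw -echo

# on machine A: read from the port
cat /dev/ttyUSB0

# on machine B: write to it; the text should appear on machine A
echo hello > /dev/ttyUSB0
```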
[Linux-HA] heartbeat gets into weird state
I've been using heartbeat for years, since the 1.0 days, and I've never seen anything quite like this before. I'm running heartbeat-2.1.3-3.el5.centos (RPM from the standard CentOS repository) on an x86_64 machine running (obviously) CentOS 5. I'm not using the v2 features, though; it's a standard v1 configuration. I have a shared partition with DRBD. It is a dual-homed machine, and both sides have a heartbeat-managed service address. The system runs a freeradius server (which listens only on one of the shared addresses, because otherwise we run into problems with the radius responses coming from a different IP address than the one the client sent them to, which doesn't work) and some local daemons that are started out of xinetd (both under heartbeat control). In practice, up until yesterday afternoon, this has worked very well, with failovers taking only a few seconds and everything coming up properly.

Yesterday, we started getting calls that radius was not working. I tried it and it worked fine. It took a while to figure out, but it turns out that radius was working -- only just for clients on the subnet directly connected to the service address. The same was true of pings: I could ping the service address only from the directly-connected subnet, so this is not a radius issue. Sounds like a lost default route, right? Wrong. The routing table looked fine. And, even weirder, I could ping the local address of the same interface from off net, and I could ping www.google.com from the affected host. Only the service address was unreachable from off net, though it worked fine for hosts on the local subnet. I screwed around with this for a bit while the users continued to pound on our customer service people, and finally decided: to hell with it, let's just fail over to the other machine and get things working again.

So I did a "service heartbeat stop" to cause a failover, and it hung on the dreaded:

WARN: Shutdown delayed until current resource activity finishes

It basically hung forever, until I hit the power button, at which point the other machine took over and all has been well since. But obviously I need to find out what happened here. Has anyone else ever seen anything like this, where the service address works only on the directly-connected subnet while the home address works from anywhere? I've also investigated the warning message, and all I find are people asking about it and getting no answer, or being told it's a known bug and they need to upgrade heartbeat. Is that the case for me too?

Thanks,
--Greg