Hi Yan,

>>>> I am trying to build a 2-node cluster serving DRBD+NFS, among other
>>>> things. It has been operational on Debian Sarge, with Heartbeat 1.2, but
>>>> recently, both machines were upgraded to Debian Etch, and today I
>>>> upgraded Heartbeat to 2.0.7. I maintained the R1 style configuration.
>>>> Heartbeat is running in an active/passive fashion.
>> [snip]
>>
>>> We run /etc/init.d/nfs-kernel-server status before starting it. If it
>>> says OK or running, then we don't start it because it's already running.
>>>
>>> See http://linux-ha.org/HeartbeatResourceAgent
>> Thank you for the information.
>>
>> There is one other problem that I haven't been able to solve, and I hope
>> someone can help me with that too.
>>
>> Sometimes it happens that Heartbeat tries to take over a resource group
>> that it's already running:
>>
>> [EMAIL PROTECTED]:~> cl_status rscstatus
>> all
>>
>> [EMAIL PROTECTED]:~> cl_status rscstatus
>> none
>>
>> Now, when I shutdown or reboot Vodka, I would expect nothing much to
>> happen in the cluster, but instead, Heartbeat on Whisky, the node that's
>> already running things, says:
>>
>> May 7 17:21:34 whisky mach_down[11872]: [11888]: info: Taking over
>> resource group 213.207.104.20
>> May 7 17:21:34 whisky ResourceManager[11889]: [11897]: info: Acquiring
>> resource group: vodka 213.207.104.20 ipvsadm mon drbddisk::all
>> Filesystem::/dev/drbd0::/extra1::ext3 nfs-kernel-server Delay::3::0
>> IPaddr::10.50.1.20/32/eth0 mysql
>>
>> and it starts running init scripts with the 'start' argument. This is
>> bound to fail, so:
>>
>> May 7 17:21:34 whisky ResourceManager[11889]: [12047]: debug: Starting
>> /etc/init.d/mon start
>> May 7 17:21:34 whisky ResourceManager[11889]: [12052]: debug:
>> /etc/init.d/mon start done. RC=1
>> May 7 17:21:34 whisky ResourceManager[11889]: [12053]: ERROR: Return
>> code 1 from /etc/init.d/mon
>> May 7 17:21:34 whisky ResourceManager[11889]: [12054]: CRIT: Giving up
>> resources due to failure of mon
>> May 7 17:21:34 whisky ResourceManager[11889]: [12055]: info: Releasing
>> resource group: vodka 213.207.104.20 ipvsadm mon drbddisk::all
>> Filesystem::/dev/drbd0::/extra1::ext3 nfs-kernel-server Delay::3::0
>> IPaddr::10.50.1.20/32/eth0 mysql
>>
>> ... and down goes my entire cluster!!!
>>
>> Why does Heartbeat want to start a resource group that it already runs?
>
> because mon (whatever that init script is) returned 1 on the start
> action. a return value of 1 indicates to heartbeat that the operation
> failed, and heartbeat can't safely do anything else with that resource.
>
> Basically, there is a problem with the Resource Agent (RA) "mon". See:
>
> http://www.linux-ha.org/LSBResourceAgent

No, you are missing the point. 'mon start' returns 1 because Mon is already
running, as it should be, since this is the active node.

The question is: why is Heartbeat trying to start Mon and all the other
resources when it is already running all of them?

Thank you.

Best regards,
Martijn Grendelman

_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
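
For reference, the status-before-start check described in the quoted reply can be sketched as a small shell wrapper. This is only a sketch under stated assumptions: the function name idempotent_start is hypothetical (it is not part of Heartbeat), and it assumes that the init script's 'status' action exits 0 exactly when the service is running, which not every init script guarantees. It would make a repeated 'start' harmless, but it does not answer why Heartbeat attempts the takeover in the first place.

```shell
#!/bin/sh
# Hypothetical helper, not part of Heartbeat: make "start" idempotent for an
# init script whose own "start" fails when the service is already up.
# Assumption: "<script> status" exits 0 if and only if the service is running.
idempotent_start() {
    script="$1"

    # Already running: report success without calling "start" again.
    if "$script" status >/dev/null 2>&1; then
        return 0
    fi

    # Not running: start it and pass the script's exit code through.
    "$script" start
}
```

With this in place, calling idempotent_start /etc/init.d/mon on the active node would return 0 instead of 1, provided the 'status' assumption above holds for that script.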
