Hi Yan,

>>>> I am trying to build a 2-node cluster serving DRBD+NFS, among other
>>>> things. It has been operational on Debian Sarge, with Heartbeat 1.2, but
>>>> recently, both machines were upgraded to Debian Etch, and today I
>>>> upgraded Heartbeat to 2.0.7. I maintained the R1 style configuration.
>>>> Heartbeat is running in an active/passive fashion.
>> [snip]
>>
>>> We run /etc/init.d/nfs-kernel-server status before starting it.  If it
>>> says OK or running, then we don't start it because it's already running.
>>>
>>> See  http://linux-ha.org/HeartbeatResourceAgent
>> Thank you for the information.
>>
>> There is one other problem that I haven't been able to solve, and I hope
>> someone can help me with that too.
>>
>> Sometimes it happens that Heartbeat tries to take over a resource group
>> that it's already running:
>>
>> [EMAIL PROTECTED]:~> cl_status rscstatus
>> all
>>
>> [EMAIL PROTECTED]:~> cl_status rscstatus
>> none
>>
>> Now, when I shut down or reboot Vodka, I would expect nothing much to
>> happen in the cluster, but instead, Heartbeat on Whisky, the node that's
>> already running things, says:
>>
>> May  7 17:21:34 whisky mach_down[11872]: [11888]: info: Taking over
>> resource group 213.207.104.20
>> May  7 17:21:34 whisky ResourceManager[11889]: [11897]: info: Acquiring
>> resource group: vodka 213.207.104.20 ipvsadm mon drbddisk::all
>> Filesystem::/dev/drbd0::/extra1::ext3 nfs-kernel-server Delay::3::0
>> IPaddr::10.50.1.20/32/eth0 mysql
>>
>> and it starts running init scripts with the 'start' argument. This is
>> bound to fail, so:
>>
>> May  7 17:21:34 whisky ResourceManager[11889]: [12047]: debug: Starting
>> /etc/init.d/mon  start
>> May  7 17:21:34 whisky ResourceManager[11889]: [12052]: debug:
>> /etc/init.d/mon  start done. RC=1
>> May  7 17:21:34 whisky ResourceManager[11889]: [12053]: ERROR: Return
>> code 1 from /etc/init.d/mon
>> May  7 17:21:34 whisky ResourceManager[11889]: [12054]: CRIT: Giving up
>> resources due to failure of mon
>> May  7 17:21:34 whisky ResourceManager[11889]: [12055]: info: Releasing
>> resource group: vodka 213.207.104.20 ipvsadm mon drbddisk::all
>> Filesystem::/dev/drbd0::/extra1::ext3 nfs-kernel-server Delay::3::0
>> IPaddr::10.50.1.20/32/eth0 mysql
>>
>> ... and down goes my entire cluster!!!
>>
>> Why does Heartbeat want to start a resource group that it already runs?
> 
> because mon (whatever that init script is) returned 1 on the start
> action. a return value of 1 indicates to heartbeat that the operation
> failed, and heartbeat can't safely do anything else with that resource.
> 
> Basically, there is a problem with the Resource Agent (RA) "mon". See:
> 
> http://www.linux-ha.org/LSBResourceAgent

No, you are missing the point.

'mon start' returns 1 because Mon is already running, as it should be,
since this is the active node. The question is: why is Heartbeat trying
to start Mon and all the other resources when it is already running all
of them?
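
For what it's worth, the start failure itself could be papered over with
a wrapper that checks the status first. That is also what LSB expects of
'start' on a service that is already running: it should report success.
Below is a minimal sketch, assuming the init script has a working
'status' action. idempotent_start is a hypothetical helper name, not
part of Heartbeat or Mon, and it would not explain the takeover itself:

```shell
#!/bin/sh
# Hypothetical helper (not part of Heartbeat or Mon).
# Usage: idempotent_start STATUS_CMD START_CMD
# Runs START_CMD only when STATUS_CMD exits non-zero ("not running").
idempotent_start() {
    status_cmd=$1
    start_cmd=$2
    if sh -c "$status_cmd" >/dev/null 2>&1; then
        # Already running: treat "start" as a success, as LSB expects.
        return 0
    fi
    # Not running: actually start, and pass the exit code through.
    sh -c "$start_cmd"
}
```

It would be called as, for example,
idempotent_start "/etc/init.d/mon status" "/etc/init.d/mon start",
again assuming that Mon's init script supports 'status'.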

Thank you.

Best regards,

Martijn Grendelman
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
