Hi,

On Mon, Dec 15, 2008 at 05:17:11PM -0800, Gary Stansbury II wrote:
> I am running heartbeat 2.1.4-04, the latest available on
> SLES10-SP2. I have configured riloe as my stonith plugin on a
> four-node cluster (we have HP DL-365 G5's with integrated
> ilo2), and muddled my way through attempting a clone setup and
> finally settled on a primitive resource setup with one riloe
> stonith resource per node. When I pkill heartbeat on any given
> node, all works well and the node is reset through iloe as
> expected... WHEN the stonith ilo resource for that node is
> active on the DC. When it is not active on the DC, the
> expected behaviour occurs in that the DC logs that it "want a
> STONITH operation RESET to node xxx", then "broadcast succeeded
> require others to stonith the node xxx" and the node that IS
> currently hosting the stonith resource for that node dutifully
> responds with "want a STONITH operation RESET to node xxx".
> The node hosting the stonith resource then successfully
> stonith's the dead node and attempts to notify the DC that it
> was successful (the return code from running iloe from the logs
> is 0, the stonithing node thinks it was successful, and
> successfully sends the notify to the DC), but the DC's logs show
> "received T_STITmsg from myself, ignoring" then a message from

This is the DC ignoring the broadcast stonith request message.

> the stonithing node with something to the effect of stonith
> operation was already complete when this message was received.
> This then continues indefinitely.
> 
> Net result is, if the stonith resource for a given node is NOT
> running on the DC, and that node fails, it winds up in an
> infinite reboot loop until I kill the stonith daemon on the
> node hosting that stonith resource, which totally confuses the
> cluster and I wind up having to reboot all the nodes.
> 
> 
> I will post the logs when I get to work tomorrow... this looks
> a lot like when the stonith daemon on the dc broadcasts for
> help to stonith a failed node, it gives up too quickly or
> ignores the success response from the "whodoit" node that
> actually (successfully!) performs the stonith.

Interesting. Can you please use hb_report to produce the full
report and open a bugzilla?

Thanks,

Dejan

> Thanks in advance for your help, 
> 
> Gary
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
