Hi,

On Mon, Dec 15, 2008 at 05:17:11PM -0800, Gary Stansbury II wrote:
> I am running heartbeat 2.1.4-04, the latest available on
> SLES10-SP2. I have configured riloe as my stonith plugin on a
> four-node cluster (we have HP DL-365 G5's with integrated
> ilo2), and muddled my way through attempting a clone setup and
> finally settled on a primitive resource setup with one riloe
> stonith resource per node. When I pkill heartbeat on any given
> node, all works well and the node is reset through iloe as
> expected... WHEN the stonith ilo resource for that node is
> active on the DC. When it is not active on the DC, the
> expected behaviour occurs in that the DC logs that it "want a
> STONITH operation RESET to node xxx", then "broadcast succeeded
> require others to stonith the node xxx" and the node that IS
> currently hosting the stonith resource for that node dutifully
> responds with "want a STONITH operation RESET to node xxx".
> The node hosting the stonith resource then successfully
> stonith's the dead node and attempts to notify the DC that it
> was successful (the return code from running iloe from the logs
> is 0, the stonithing node thinks it was successful, and
> successfully sends the notify to the DC), but the DC's logs show
> "received T_STITmsg from myself, ignoring" then a message from

This is the DC ignoring the broadcast stonith request message.

> the stonithing node with something to the effect of stonith
> operation was already complete when this message was received.
> This then continues indefinitely.
>
> Net result is, if the stonith resource for a given node is NOT
> running on the DC, and that node fails, it winds up in an
> infinite reboot loop until I kill the stonith daemon on the
> node hosting that stonith resource, which totally confuses the
> cluster and I wind up having to reboot all the nodes.
>
> I will post the logs when I get to work tomorrow... this looks
> a lot like when the stonith daemon on the DC broadcasts for
> help to stonith a failed node, it gives up too quickly or
> ignores the success response from the "whodoit" node that
> actually (successfully!) performs the stonith.

Interesting. Can you please use hb_report to produce the full
report and open a bugzilla.

Thanks,

Dejan

> Thanks in advance for your help,
>
> Gary
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
