Dejan Muhamedagic wrote:
> Hi,
>
> On Sun, Feb 24, 2008 at 10:51:25PM +0100, Johan Hoeke wrote:
>> Dejan Muhamedagic wrote:
>>
>>> On Fri, Feb 22, 2008 at 05:29:08PM +0100, Johan Hoeke wrote:
>>>> Dejan Muhamedagic wrote:
>>>>> Hi,
>>>> <snip>
>>>>> But the stonith resource is monitored? How did it fail?
>>
>> an incomplete iptables config was pushed by mistake; it caused errors
>> during an upgrade from 2.1.2 to 2.1.3:
>>
>> (2 node cluster, hosts cauchy and condorcet)
>>
>> 08:51 condorcet is updated to heartbeat 2.1.3 and is coming up from a
>> reboot. bad iptables rules are activated, condorcet can't see cauchy's
>> heartbeat:
>>
>> Feb 13 08:51:58 condorcet heartbeat: [3752]: WARN: node cauchy.uvt.nl:
>> is dead
>>
>> 08:52 condorcet shoots cauchy
>>
>> *The story would have ended here if the stonith action was power off or
>> if heartbeat wasn't started on reboot, but alas. We will choose one of
>> the two options to avoid future trouble.*
>>
>> Feb 13 08:52:36 condorcet pengine: [4549]: WARN: stage6: Scheduling Node
>> cauchy.uvt.nl for STONITH
>>
>> Feb 13 08:52:44 condorcet tengine: [4548]: info: te_fence_node:
>> Executing reboot fencing operation (34) on cauchy.uvt.nl (timeout=50000)
>>
>> Feb 13 08:52:44 condorcet stonithd: [4542]: info:
>> stonith_operate_locally::2375: sending fencing op (RESET) for
>> cauchy.uvt.nl to device external (rsc_id=R_ilo_cauchy:0, pid=4752)
>>
>> Feb 13 08:52:44 condorcet pengine: [4549]: notice: StartRsc:
>> condorcet.uvt.nl Start R_san_oradata
>>
>> condorcet starts the resource that mounts the SAN disk
>> *I would very much prefer that it waits until the stonith is done
>> and has succeeded before it does this!*
>
> This is obviously a bug. It has already been reported and
> supposedly fixed before 2.1.3. Please attach this report and
> reopen:
>
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1768
OK, will do.

>> Feb 13 08:52:44 condorcet pengine: [4549]: notice: StartRsc:
>> condorcet.uvt.nl Start R_san_oradata
>>
>> *only now has the stonith succeeded 08:52:49*
>>
>> Feb 13 08:52:49 condorcet stonithd: [4542]: info: Succeeded to STONITH
>> the node cauchy.uvt.nl: optype=RESET. whodoit: condorcet.uvt.nl
>>
>> condorcet continues to mount the SAN partition, as it should:
>>
>> Feb 13 08:52:44 condorcet Filesystem[4805]: [4835]: INFO: Running start
>> for /dev/mapper/san-oradata on /var/oracle/oradata
>>
>> BUT:
>>
>> 08:55 cauchy comes up, bad iptables settings activate, cauchy can't see
>> heartbeat from condorcet:
>>
>> Feb 13 08:55:42 cauchy heartbeat: [3801]: WARN: node condorcet.uvt.nl:
>> is dead
>>
>> 08:56 cauchy wants to mount the f/o attached SAN partition:
>> *Ideally, this should only be done if cauchy is sure that condorcet
>> is really dead! iow, after the stonith has succeeded*
>>
>> Feb 13 08:56:23 cauchy pengine: [4596]: notice: StartRsc: cauchy.uvt.nl
>> Start R_san_oradata
>>
>> The stonith action starts after the resource for the SAN partition:
>>
>> Feb 13 08:56:30 cauchy stonithd: [4589]: info: client tengine [pid:
>> 4595] want a STONITH operation RESET to node condorcet.uvt.nl.
>>
>> Feb 13 08:56:30 cauchy tengine: [4595]: info: te_fence_node: Executing
>> reboot fencing operation (32) on condorcet.uvt.nl (timeout=50000)
>>
>> *at this moment in time the filesystem is corrupted because it is
>> mounted on both nodes at the same time*
>>
>> Feb 13 08:56:30 cauchy Filesystem[4823]: [4852]: INFO: Running start for
>> /dev/mapper/san-oradata on /var/oracle/oradata
>>
>> this action times out:
>>
>> Feb 13 08:57:20 cauchy stonithd: [4589]: ERROR: Failed to STONITH the
>> node condorcet.uvt.nl: optype=RESET, op_result=TIMEOUT
>> Feb 13 08:57:20 cauchy tengine: [4595]: ERROR: tengine_stonith_callback:
>> Stonith of condorcet.uvt.nl failed (2)... aborting transition.
>
> And here the CRM waited for the stonith to finish. Strange.

According to the logs, cauchy was still on 2.1.2 at that point in time.
Maybe that explains the difference in behaviour. I still see it as not
waiting for the stonith, though: the SAN resource is started while the
stonith is still running.

>> but that is no longer relevant.
>>
>> conclusion:
>>
>> As Dejan mentioned, setting heartbeat not to start automatically, or
>> changing the stonith action to power off, would have saved the day.
>
> This should be documented as best practice for two node clusters.
> IIRC, there has already been discussion on the list on this issue.
>
>> I am curious about the timing of some of the actions though,
>> particularly that a node seems to continue with its start actions even
>> though the success or failure of the stonith action has not been
>> confirmed. Could be that I'm interpreting the logs incorrectly.
>
> Your interpretation's right.
>
> You should also try ciblint to check the cib.

OK, I'll try that as well.

> Thanks,
>
> Dejan

Thanks for your time as always,

regards,
Johan
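For the record, the two mitigations mentioned above could be applied roughly as follows. This is only a sketch: it assumes the usual heartbeat 2.x CRM tooling, and the exact option names may vary by version, so check the local man pages before relying on it.

```shell
# Mitigation 1: set the cluster-wide STONITH action to "poweroff" instead
# of the default "reboot", so a fenced node stays down rather than coming
# back up with the same broken configuration.
crm_attribute -t crm_config -n stonith-action -v poweroff

# Query the property afterwards to confirm it took effect:
crm_attribute -t crm_config -n stonith-action -G

# Mitigation 2: keep heartbeat from starting automatically at boot, so a
# rebooted (fenced) node only rejoins once an operator has looked at it.
chkconfig heartbeat off            # Red Hat-style init
# update-rc.d -f heartbeat remove  # Debian-style init
```

With either change in place, the bad-iptables reboot loop described above stops after the first fence instead of corrupting the shared filesystem.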
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
