Dejan Muhamedagic wrote:
> Hi,
> 
> On Sun, Feb 24, 2008 at 10:51:25PM +0100, Johan Hoeke wrote:
>> Dejan Muhamedagic wrote:
>>
>>> On Fri, Feb 22, 2008 at 05:29:08PM +0100, Johan Hoeke wrote:
>>>> Dejan Muhamedagic wrote:
>>>>> Hi,
>>>> <snip>
>>>>> But the stonith resource is monitored? How did it fail?
>> An incomplete iptables config was pushed by mistake; it caused errors
>> during an upgrade from 2.1.2 to 2.1.3:
>>
>> (2 node cluster, hosts cauchy and condorcet)
>>
>> 08:51 condorcet is updated to heartbeat 2.1.3 and is coming up from a
>> reboot. The bad iptables rules are activated, so condorcet can't see
>> cauchy's heartbeat:
>>
>> Feb 13 08:51:58 condorcet heartbeat: [3752]: WARN: node cauchy.uvt.nl:
>> is dead
>>
>> 08:52 condorcet shoots cauchy
>>
>> *The story would have ended here if the stonith action had been power
>> off, or if heartbeat hadn't been started on reboot, but alas. We will
>> choose one of the two options to avoid future trouble.*
>>
>> Feb 13 08:52:36 condorcet pengine: [4549]: WARN: stage6: Scheduling Node
>> cauchy.uvt.nl for STONITH
>>
>> Feb 13 08:52:44 condorcet tengine: [4548]: info: te_fence_node:
>> Executing reboot fencing operation (34) on cauchy.uvt.nl (timeout=50000)
>>
>> Feb 13 08:52:44 condorcet stonithd: [4542]: info:
>> stonith_operate_locally::2375: sending fencing op (RESET) for
>> cauchy.uvt.nl to device external (rsc_id=R_ilo_cauchy:0, pid=4752)
>>
>> Feb 13 08:52:44 condorcet pengine: [4549]: notice: StartRsc:
>> condorcet.uvt.nl  Start R_san_oradata
>>
>> condorcet starts the resource that mounts the SAN disk.
>> *I would very much prefer that it wait until the stonith has completed
>> successfully before it does this!*
> 
> This is obviously a bug. It has already been reported and
> supposedly fixed before 2.1.3. Please attach this report and
> reopen:
> 
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1768

OK, will do.

> 
>> Feb 13 08:52:44 condorcet pengine: [4549]: notice: StartRsc:
>> condorcet.uvt.nl  Start R_san_oradata
>>
>> *only now, at 08:52:49, does the stonith succeed*
>>
>> Feb 13 08:52:49 condorcet stonithd: [4542]: info: Succeeded to STONITH
>> the node cauchy.uvt.nl: optype=RESET. whodoit: condorcet.uvt.nl
>>
>> condorcet continues to mount the SAN partition, as it should:
>>
>> Feb 13 08:52:44 condorcet Filesystem[4805]: [4835]: INFO: Running start
>> for /dev/mapper/san-oradata on /var/oracle/oradata
>>
>> BUT:
>>
>> 08:55 cauchy comes up; the bad iptables settings activate, and cauchy
>> can't see heartbeat from condorcet:
>>
>> Feb 13 08:55:42 cauchy heartbeat: [3801]: WARN: node condorcet.uvt.nl:
>> is dead
>>
>> 08:56 cauchy wants to mount the f/o attached SAN partition:
>> *Ideally, this should only be done if cauchy is sure that condorcet is
>> really dead, i.e. after the stonith has succeeded!*
>>
>> Feb 13 08:56:23 cauchy pengine: [4596]: notice: StartRsc:  cauchy.uvt.nl
>>        Start R_san_oradata
>>
>> The stonith action starts after the resource for the SAN partition:
>>
>> Feb 13 08:56:30 cauchy stonithd: [4589]: info: client tengine [pid:
>> 4595] want a STONITH operation RESET to node condorcet.uvt.nl.
>>
>> Feb 13 08:56:30 cauchy tengine: [4595]: info: te_fence_node: Executing
>> reboot fencing operation (32) on condorcet.uvt.nl (timeout=50000)
>>
>> *at this point the filesystem gets corrupted, because it is mounted on
>> both nodes at the same time*
>>
>> Feb 13 08:56:30 cauchy stonithd: [4589]: info: client tengine [pid:
>> 4595] want a STONITH operation RESET to node condorcet.uvt.nl.
>>
>> Feb 13 08:56:30 cauchy tengine: [4595]: info: te_fence_node: Executing
>> reboot fencing operation (32) on condorcet.uvt.nl (timeout=50000)
>>
>> Feb 13 08:56:30 cauchy Filesystem[4823]: [4852]: INFO: Running start for
>> /dev/mapper/san-oradata on /var/oracle/oradata
>>
>> this action times out:
>>
>> Feb 13 08:57:20 cauchy stonithd: [4589]: ERROR: Failed to STONITH the
>> node condorcet.uvt.nl: optype=RESET, op_result=TIMEOUT
>> Feb 13 08:57:20 cauchy tengine: [4595]: ERROR: tengine_stonith_callback:
>> Stonith of condorcet.uvt.nl failed (2)... aborting transition.
> 
> And here the CRM waited for the stonith to finish. Strange.
> 

According to the logs, cauchy was still running 2.1.2 at that point in
time. Maybe that explains the difference in behavior. I still read it as
not waiting for the stonith, though: the SAN resource is started while
the stonith is still running.

>> but that is no longer relevant.
>>
>> conclusion:
>>
>> As Dejan mentioned, setting heartbeat not to start automatically, or
>> changing the stonith action to power off, would have saved the day.
> 
> This should be documented as best practice for two-node clusters.
> IIRC, there has already been discussion on the list about this issue.
> 
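
For the record, this is roughly what we have in mind for the two options
(just a sketch; the exact property name and tool syntax still need to be
verified against the 2.1.3 man pages before anyone copies it):

  # Option 1: don't start heartbeat automatically after a reboot
  chkconfig heartbeat off             # Red Hat/SUSE style init
  update-rc.d -f heartbeat remove     # Debian style init

  # Option 2: have the cluster power nodes off instead of rebooting them,
  # assuming the pengine property is called stonith-action here as it is
  # in later releases
  crm_attribute --type crm_config --attr-name stonith-action \
      --attr-value poweroff
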
>> I am curious about the timing of some of the actions though,
>> particularly that a node seems to continue with its start actions even
>> though the success or failure of the stonith action has not been
>> confirmed. Could be that I'm interpreting the logs incorrectly.
> 
> Your interpretation's right.
> 
> You should also try ciblint to check the CIB.

OK, I'll try that as well.
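
Something along these lines, I suppose (a rough sketch; I haven't looked
at ciblint's usage yet, so the exact invocation may need adjusting):

  # dump the live CIB; useful to attach to the bugzilla entry anyway
  cibadmin -Q > /tmp/cib.xml

  # then run ciblint and see what it complains about
  ciblint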

> 
> Thanks,
> 
> Dejan
> 

Thanks for your time as always,

regards,

Johan


