On Wed, Apr 16, 2008 at 9:20 AM, Alt, Martin <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have already tried to send this mail to the list yesterday, but I did not
> see it on the list, presumably because I had attached a 240k hb_report file.
> Anyway, my apologies if you receive this email twice.
>
> I am trying to set up a large cluster of 16 nodes using heartbeat-2.1.3 on
> RHEL5.1. The servers are HP Blades, and fencing is realised using a custom
> stonith agent that uses the HP Onboard Administrator to reset a node.
>
> I am currently testing fencing, and everything works quite well for single
> node failures. If a single node fails (simulated by 'killall -9 heartbeat'),
> it is correctly fenced and rebooted.
>
> However, if two or more nodes fail simultaneously (or at least within a few
> seconds), the stonith agents do not get called and the nodes are not fenced.
>
> Stonith is configured using four different stonith agents (each one being
> responsible for four blades), each running as a clone with four instances.
>
> Looking through the logs, I think the problem is as follows.
> When the first node (bladed1) fails, the following happens:
>
> ...
> pengine[7959]: 2008/04/15_11:58:26 WARN: stage6: Scheduling Node bladed1 for STONITH
> pengine[7959]: 2008/04/15_11:58:26 info: native_stop_constraints: stonith-hpoa-encla:1_stop_0 is implicit after bladed1 is fenced
> ...
> pengine[7959]: 2008/04/15_11:58:26 notice: NoRoleChange: Move resource stonith-hpoa-encla:1 (bladed1 -> bladeb2)
> pengine[7959]: 2008/04/15_11:58:26 notice: StopRsc: bladed1 Stop stonith-hpoa-encla:1
> pengine[7959]: 2008/04/15_11:58:26 notice: StartRsc: bladeb2 Start stonith-hpoa-encla:1
> ...
>
> At this point, bladed1 has not actually been fenced!
The PE logs only indicate (roughly) what will happen and don't imply any
ordering; for that you need to look at the tengine logs.

That said, you're right: the stonith agent will be started on the second
machine before the node is shot. However, this is for a very good reason -
what if that agent was the one that was needed to do the shooting?

Having said that, we should still be able to perform the necessary stonith
actions. Can you use hb_report to create an archive of the situation and
attach it to a bug?

> There was no log entry from stonithd and no stonith agent was called.
> However, the stonith clone that was running on bladed1 is apparently
> scheduled to run on bladeb2.
>
> Now the second node (bladed2) fails, and it seems heartbeat is trying to
> check which resources are running where, with the following results:
>
> ...
> pengine[7959]: 2008/04/15_11:58:31 ERROR: native_add_running: Resource stonith::external/hpoa-encla:stonith-hpoa-encla:1 appears to be active on 2 nodes.
> pengine[7959]: 2008/04/15_11:58:31 ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
> ...
> pengine[7959]: 2008/04/15_11:58:31 notice: native_print: stonith-hpoa-encla:1 (stonith:external/hpoa-encla)
> pengine[7959]: 2008/04/15_11:58:31 notice: native_print:     0 : bladed1
> pengine[7959]: 2008/04/15_11:58:31 notice: native_print:     1 : bladeb2
> ...
>
> So it appears that heartbeat thinks the stonith clone stonith-hpoa-encla:1
> is running on two nodes and therefore somehow refuses to actually use any
> of the remaining stonith clone instances...
>
> Is there any way to solve this problem? Or more specifically:
> - Has anybody else had any problems (or ever tried) with two nodes failing
>   simultaneously?
> - Why is heartbeat not using any of the remaining stonith agents? Or are
>   the log entries above unrelated to my problem?
> - Looking at http://linux-ha.org/v2/faq/resource_too_active, it says that
>   the problem could also result from the monitor action not being
>   implemented properly. My stonith agent does not implement monitor at all
>   (only 'status'), and I am not sure what it should do (since there is no
>   start or stop either).
>
> Many thanks and best regards,
> Martin
>
> --
> Dr. Martin Alt
> Systems and Software Architecture
> Plath GmbH
> Gotenstrasse 18
> D - 20097 Hamburg
> Tel: +49 40/237 34-361
> Fax: +49 40/237 34-173
> Email: [EMAIL PROTECTED]
> http://www.plath.de
>
> Hamburg HRB7401
> Managing Director: Dipl.-Kfm. Nico Scharfe
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
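On the monitor question: for external stonith plugins, stonithd drives the
plugin itself - as far as I recall, a monitor operation on a stonith resource
ends up as a 'status' call to the plugin, so implementing 'status' is the
right thing and you should not need a separate monitor action. For reference,
a minimal external plugin looks roughly like the sketch below (the
hpoa-encla name and node names follow your setup; the parameter names and
the Onboard Administrator commands are assumptions and stubbed out):

```shell
#!/bin/sh
# Sketch of an external stonith plugin (external/hpoa-encla) for
# heartbeat's stonithd. stonithd invokes the script as:
#   <plugin> <command> [node]
# and delivers configuration parameters as environment variables.
# All HP Onboard Administrator interaction is stubbed out here.

hpoa_dispatch() {
    case "$1" in
        gethosts)
            # Nodes this device can fence (hard-coded in this sketch).
            echo "bladed1 bladed2 bladed3 bladed4"
            ;;
        status)
            # Real agent: check that the OA is reachable and answering.
            return 0
            ;;
        reset|off|on)
            # Real agent: issue the OA command that power-cycles the
            # blade hosting the node named in "$2".
            return 0
            ;;
        getconfignames)
            # Hypothetical parameter names; values arrive as env variables.
            echo "oa_host oa_user oa_password"
            ;;
        getinfo-devid|getinfo-devname)
            echo "hpoa-encla"
            ;;
        getinfo-devdescr)
            echo "Fences HP blades via the enclosure Onboard Administrator"
            ;;
        getinfo-devurl)
            echo "http://www.hp.com/"
            ;;
        getinfo-xml)
            echo "<parameters/>"
            ;;
        *)
            return 1
            ;;
    esac
}

# stonithd always supplies the command as $1; the default here only keeps
# the sketch runnable standalone.
hpoa_dispatch "${1:-getinfo-devid}"
```

You can also exercise the plugin outside the cluster with the stonith(8)
command-line tool shipped with heartbeat (e.g. 'stonith -t
external/hpoa-encla ... bladed1'; see its man page for the exact parameter
syntax), which is a quicker way to verify 'status' and 'reset' than going
through the CRM.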
