Hi,

I already tried to send this mail to the list yesterday, but it did not appear, 
presumably because I had attached a 240k hb_report file. Anyway, my apologies if 
you receive this email twice.



I am trying to set up a large cluster of 16 nodes using heartbeat-2.1.3 on 
RHEL5.1. The servers are HP Blades, and fencing is realised by a custom stonith 
agent that uses the HP Onboard Administrator to reset a node.
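For reference, the agent follows the usual external STONITH plugin convention 
(operation passed as the first argument, configured parameters exported as 
environment variables), and its reset path boils down to something like the 
sketch below. The parameter names, the SSH call and the "RESET SERVER <bay>" 
command are simplified placeholders for illustration, not the literal agent code:

#!/usr/bin/env python
# Much simplified sketch of the reset path of an external STONITH plugin
# that drives an HP Onboard Administrator.  The parameter names
# (oa_address, oa_user, bay_map), the SSH invocation and the
# "RESET SERVER <bay>" command are assumptions for illustration only.
import os
import subprocess
import sys

def reset_host(host):
    oa_address = os.environ["oa_address"]
    oa_user = os.environ.get("oa_user", "Administrator")
    # bay_map is assumed to look like "bladed1:5 bladed2:6 ..."
    bay_map = dict(entry.split(":") for entry in os.environ["bay_map"].split())
    bay = bay_map[host]
    return subprocess.call(["ssh", "%s@%s" % (oa_user, oa_address),
                            "RESET SERVER %s" % bay])

if __name__ == "__main__":
    if len(sys.argv) > 2 and sys.argv[1] == "reset":
        sys.exit(reset_host(sys.argv[2]))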

I am currently testing fencing, and everything works fine for single-node 
failures: if a single node fails (simulated with 'killall -9 heartbeat'), it is 
correctly fenced and rebooted.

However, if two or more nodes fail simultaneously (or at least within a few 
seconds), the stonith agents do not get called and the nodes are not fenced.

Stonith is configured using four different stonith agents (each one being 
responsible for four blades), each running as a clone with four instances. 

Looking through the logs, I think the problem is as follows:
When the first node (bladed1) fails, the following happens:
...
pengine[7959]: 2008/04/15_11:58:26 WARN: stage6: Scheduling Node bladed1 for 
STONITH
pengine[7959]: 2008/04/15_11:58:26 info: native_stop_constraints: 
stonith-hpoa-encla:1_stop_0 is implicit after bladed1 is fenced
...
pengine[7959]: 2008/04/15_11:58:26 notice: NoRoleChange: Move  resource 
stonith-hpoa-encla:1    (bladed1 -> bladeb2)
pengine[7959]: 2008/04/15_11:58:26 notice: StopRsc:   bladed1    Stop 
stonith-hpoa-encla:1
pengine[7959]: 2008/04/15_11:58:26 notice: StartRsc:  bladeb2    Start 
stonith-hpoa-encla:1
...
At this point, bladed1 has not actually been fenced! There was no log entry 
from stonithd and no stonith agent was called. However, the stonith clone that 
was running on bladed1 is apparently scheduled to run on bladeb2.

Now the second node (bladed2) fails, and heartbeat apparently re-checks which 
resources are running where, with the following result:

...
pengine[7959]: 2008/04/15_11:58:31 ERROR: native_add_running: Resource 
stonith::external/hpoa-encla:stonith-hpoa-encla:1 appears to be active on 2 
nodes.
pengine[7959]: 2008/04/15_11:58:31 ERROR: See 
http://linux-ha.org/v2/faq/resource_too_active for more information.
...
pengine[7959]: 2008/04/15_11:58:31 notice: native_print:     
stonith-hpoa-encla:1       (stonith:external/hpoa-encla)
pengine[7959]: 2008/04/15_11:58:31 notice: native_print:        0 : bladed1
pengine[7959]: 2008/04/15_11:58:31 notice: native_print:        1 : bladeb2
...

So it appears that heartbeat thinks the stonith clone stonith-hpoa-encla:1 is 
running on two nodes and therefore somehow refuses to actually use any of the 
remaining stonith clone instances...

Is there any way to solve this problem? Or more specifically:
- Has anybody else run into problems with (or even tried) two nodes failing 
simultaneously?
- Why is heartbeat not using any of the remaining stonith agents? Or are the 
log entries above unrelated to my problem?
- Looking at http://linux-ha.org/v2/faq/resource_too_active, it says the 
problem could also result from a monitor action that is not implemented 
properly. My stonith agent does not implement monitor at all (only 'status'), 
and I am not sure what monitor should do, since there is no start or stop 
either (see the sketch below).
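To make the last point concrete: the only health check I can think of for such 
a device is verifying that the Onboard Administrator answers at all, roughly 
like the sketch below (simplified; the parameter name and the ping-based check 
are assumptions, not the actual agent):

#!/usr/bin/env python
# Sketch of a 'status' handler for an OA-based external STONITH plugin:
# it only verifies that the Onboard Administrator is reachable.
# The oa_address parameter name and the ping check are assumptions.
import os
import subprocess
import sys

def device_status():
    oa_address = os.environ.get("oa_address", "")
    if not oa_address:
        return 1
    # Exit code 0 means the device looks usable; anything else is a failure.
    return subprocess.call(["ping", "-c", "1", "-w", "2", oa_address])

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "status":
        sys.exit(device_status())

What I do not know is whether a 'monitor' on the clone resource is supposed to 
go beyond such a reachability check, or whether it is mapped to 'status' anyway.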

Many thanks and best regards,
Martin




--
Dr. Martin Alt
System und Softwarearchitektur
Plath GmbH
Gotenstrasse   18
D - 20097 Hamburg
Tel: +49 40/237 34-361
Fax: +49 40/237 34-173 
Email: [EMAIL PROTECTED]
http://www.plath.de

Hamburg HRB7401
Geschäftsführer: Dipl.-Kfm. Nico Scharfe
 
