Hi,

I already tried to send this mail to the list yesterday, but it never appeared, presumably because I had attached a 240k hb_report file. My apologies if you receive this email twice.
I am trying to set up a large cluster of 16 nodes using heartbeat-2.1.3 on RHEL5.1. The servers are HP Blades, and fencing is realised through a custom stonith agent that uses the HP Onboard Administrator to reset a node. I am currently testing fencing, and everything works fine for single-node failures: if a single node fails (simulated by 'killall -9 heartbeat'), it is correctly fenced and rebooted. However, if two or more nodes fail simultaneously (or at least within a few seconds of each other), the stonith agents do not get called and the nodes are not fenced.

Stonith is configured using four different stonith agents (each one responsible for four blades), each running as a clone with four instances.

Looking through the logs, I think the problem is as follows. When the first node (bladed1) fails, the following happens:

...
pengine[7959]: 2008/04/15_11:58:26 WARN: stage6: Scheduling Node bladed1 for STONITH
pengine[7959]: 2008/04/15_11:58:26 info: native_stop_constraints: stonith-hpoa-encla:1_stop_0 is implicit after bladed1 is fenced
...
pengine[7959]: 2008/04/15_11:58:26 notice: NoRoleChange: Move resource stonith-hpoa-encla:1 (bladed1 -> bladeb2)
pengine[7959]: 2008/04/15_11:58:26 notice: StopRsc: bladed1 Stop stonith-hpoa-encla:1
pengine[7959]: 2008/04/15_11:58:26 notice: StartRsc: bladeb2 Start stonith-hpoa-encla:1
...

At this point, bladed1 has not actually been fenced: there was no log entry from stonithd and no stonith agent was called. However, the stonith clone instance that was running on bladed1 is apparently scheduled to run on bladeb2. Now the second node (bladed2) fails, and heartbeat seems to check which resources are running where, with the following results:

...
pengine[7959]: 2008/04/15_11:58:31 ERROR: native_add_running: Resource stonith::external/hpoa-encla:stonith-hpoa-encla:1 appears to be active on 2 nodes.
pengine[7959]: 2008/04/15_11:58:31 ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
...
pengine[7959]: 2008/04/15_11:58:31 notice: native_print: stonith-hpoa-encla:1 (stonith:external/hpoa-encla)
pengine[7959]: 2008/04/15_11:58:31 notice: native_print:     0 : bladed1
pengine[7959]: 2008/04/15_11:58:31 notice: native_print:     1 : bladeb2
...

So it appears that heartbeat thinks the stonith clone stonith-hpoa-encla:1 is running on two nodes, and therefore somehow refuses to actually use any of the remaining stonith clone instances.

Is there any way to solve this problem? More specifically:

- Has anybody else had problems with (or ever tried) two nodes failing simultaneously?
- Why is heartbeat not using any of the remaining stonith agents? Or are the log entries above unrelated to my problem?
- http://linux-ha.org/v2/faq/resource_too_active says the problem could also result from a monitor action that is not implemented properly. My stonith agent does not implement monitor at all (only 'status'), and I am not sure what monitor should do, since there is no start or stop either.

Many thanks and best regards,
Martin

--
Dr. Martin Alt
System and Software Architecture
Plath GmbH
Gotenstrasse 18
D - 20097 Hamburg
Tel: +49 40/237 34-361
Fax: +49 40/237 34-173
Email: [EMAIL PROTECTED]
http://www.plath.de
Hamburg HRB 7401
Managing Director: Dipl.-Kfm. Nico Scharfe

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
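[Editor's note on the 'status' vs. 'monitor' question above: heartbeat's "external" stonith plugins are plain scripts that receive the requested action as their first argument, and stonithd typically maps the resource's periodic monitor operation onto the plugin's 'status' check. The following is only a minimal sketch of that dispatch convention, not the author's actual agent; the hostnames, the 'hpoa' device id, and the function name are placeholders.]

```shell
#!/bin/sh
# Minimal sketch of an "external" stonith plugin dispatcher.
# All hostnames and the device id below are placeholders.

hpoa_action() {
    case "$1" in
        gethosts)
            # print the nodes this plugin instance is able to fence
            echo "bladed1 bladed2 bladed3 bladed4"
            ;;
        status)
            # check that the fencing device (e.g. the HP Onboard
            # Administrator) is reachable; exit code 0 = device OK.
            # A periodic 'monitor' on the stonith resource is usually
            # satisfied by this same device check.
            return 0
            ;;
        reset|off|on)
            # contact the fencing device here to act on the node
            # named in $2 (stubbed out in this sketch)
            return 0
            ;;
        getinfo-devid)
            echo "hpoa"
            ;;
        *)
            # unknown action
            return 1
            ;;
    esac
}

hpoa_action "${1:-status}"
```

The point is that 'status' reports on the fencing *device*, not on any one node, which is why the plugin has no start or stop of its own.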
