On Thu, May 22, 2008 at 12:07 AM, David Livingstone <[EMAIL PROTECTED]> wrote:
>
> I have a test two node(hatest1/hatest2) heartbeat/drbd setup
> on HP Proliant DL380G4 hw with the following
> heartbeat pkgs loaded :
> heartbeat-gui-2.1.3-3.el5.centos
> heartbeat-2.1.3-3.el5.centos
> heartbeat-pils-2.1.3-3.el5.centos
> heartbeat-stonith-2.1.3-3.el5.centos
>
> I am using the ilo2 cards and external/riloe
> for STONITH. I can issue a stonith command
> correctly as follows :
> [EMAIL PROTECTED] tmp]# stonith -v -t external/riloe hostlist=hatest2
> ilo_hostname=hatest2-ilo ilo_user=Heartbeat ilo_password=entropyilo
> ilo_protocol=2.0 ilo_can_reset=0 ilo_powerdown_method=button -S
> stonith: external/riloe device OK.
>
> I have also tested stonith by killing all the
> heartbeat packages on hatest2 - it was correctly
> rebooted by hatest1.
>
> The problem I have run into is when I temporarily
> pulled the eth0(riloe connected) interface on
> hatest1. I kept the interface pulled for 60
> seconds before re-inserting. The problem is that
> the cloned stonith resource is not re-started.

Actually they are; you can see that from the failed actions section.
The problem is that those starts failed, and failed starts are
considered fatal: we don't try to start again until you clean things
up with crm_resource -C.

> I have attached the hb_report which covers
> starting heartbeat, pulling and re-inserting
> eth0. Here is the output of crm_mon :
>
> ============
> Last updated: Wed May 21 15:49:26 2008
> Current DC: hatest2 (040affcf-fef9-42ae-ab98-90f3e133da2f)
> 2 Nodes configured.
> 3 Resources configured.
> ============
>
> Node: hatest2 (040affcf-fef9-42ae-ab98-90f3e133da2f): online
> Node: hatest1 (ca34aa97-8cf5-41be-bf2f-2c3585a1661d): online
>
> Resource Group: group_1
>     IPaddr_165_115_204_197 (heartbeat::ocf:IPaddr): Started hatest1
>     drbddisk_2 (heartbeat:drbddisk): Started hatest1
>     Filesystem_3 (heartbeat::ocf:Filesystem): Started hatest1
>     rc.primary_5 (lsb:rc.primary): Started hatest1
> Clone Set: CL_stonithset_node01
>     CL_stonith_node01:0 (stonith:external/riloe): Started hatest2
>     CL_stonith_node01:1 (stonith:external/riloe): Stopped
>
> Failed actions:
>     CL_stonith_node02:0_start_0 (node=hatest1, call=22, rc=14): Error
>     CL_stonith_node02:1_start_0 (node=hatest1, call=24, rc=1): Error
> [EMAIL PROTECTED] tmp]#
>
> When I first had this problem my cib.xml entry for CL_stonith_node02 looked
> like this :
>
> <primitive id="CL_stonith_node02" class="stonith" type="external/riloe-iders" provider="heartbeat">
>   <operations>
>     <op name="monitor" interval="30s" timeout="20s" id="CL_stonith_node02_monitor"/>
>     <op name="start" timeout="60s" id="CL_stonith_node02_start"/>
>   </operations>
>
> I changed it to this thinking that the on_fail="restart" might have an
> effect but it doesn't.
>
> <primitive id="CL_stonith_node02" class="stonith" type="external/riloe" provider="heartbeat">
>   <operations>
>     <op name="monitor" interval="30s" timeout="20s" id="CL_stonith_node02_monitor" start_delay="0" disabled="false" role="Started" on_fail="restart"/>
>     <op name="start" timeout="60s" id="CL_stonith_node02_start" start_delay="0" disabled="false" role="Started" on_fail="restart"/>
>   </operations>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
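
To make the cleanup step concrete, here is a rough sketch (not part of
the original exchange) that reads the "Failed actions" section of
crm_mon output and prints one crm_resource -C command per failed
start. It only prints the commands; it does not run them. The -r
(resource) and -H (host) options are what I believe crm_resource
accepts in Heartbeat 2.1.x, so please check crm_resource --help on
your own nodes first. Whether a clone wants the instance id (with the
:0 suffix) or the plain primitive id is also an assumption here. The
names FAILED_ACTION, cleanup_commands and SAMPLE are made up for this
illustration.

```python
import re

# One "Failed actions" entry as printed by crm_mon in the quoted output, e.g.
#   CL_stonith_node02:0_start_0 (node=hatest1, call=22, rc=14): Error
FAILED_ACTION = re.compile(
    r"^\s*(?P<rsc>\S+)_(?P<op>[a-z]+)_(?P<interval>\d+)\s+"
    r"\(node=(?P<node>[^,\s]+),\s*call=(?P<call>\d+),\s*rc=(?P<rc>\d+)\)"
)


def cleanup_commands(crm_mon_text):
    """Return one crm_resource cleanup command per failed start in the text."""
    commands = []
    for line in crm_mon_text.splitlines():
        match = FAILED_ACTION.match(line)
        # Only failed *start* operations are of interest here.
        if match and match.group("op") == "start":
            # Assumption: -r <resource> and -H <host> are the right options.
            cmd = "crm_resource -C -r {} -H {}".format(
                match.group("rsc"), match.group("node")
            )
            if cmd not in commands:  # skip duplicates, keep order
                commands.append(cmd)
    return commands


# The two failed starts from the crm_mon output quoted above.
SAMPLE = """\
Failed actions:
    CL_stonith_node02:0_start_0 (node=hatest1, call=22, rc=14): Error
    CL_stonith_node02:1_start_0 (node=hatest1, call=24, rc=1): Error
"""

if __name__ == "__main__":
    for command in cleanup_commands(SAMPLE):
        print(command)
```

Once the failed start has been cleaned up, the cluster should be free
to attempt the start again, as described above.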
