On Thu, May 22, 2008 at 12:07 AM, David Livingstone
<[EMAIL PROTECTED]> wrote:
>
>
>
> I have a test two node(hatest1/hatest2) heartbeat/drbd setup
> on HP Proliant DL380G4 hw with the following
> heartbeat pkgs loaded :
> heartbeat-gui-2.1.3-3.el5.centos
> heartbeat-2.1.3-3.el5.centos
> heartbeat-pils-2.1.3-3.el5.centos
> heartbeat-stonith-2.1.3-3.el5.centos
>
> I am using the ilo2 cards and external/riloe
> for STONITH. I can issue a stonith command
> correctly as follows :
> [EMAIL PROTECTED] tmp]#  stonith -v -t external/riloe hostlist=hatest2
> ilo_hostname=hatest2-ilo ilo_user=Heartbeat ilo_password=entropyilo
> ilo_protocol=2.0 ilo_can_reset=0 ilo_powerdown_method=button -S
> stonith: external/riloe device OK.
>
> I have also tested stonith by killing all the
> heartbeat packages on hatest2 - it was correctly
> rebooted by hatest1.
>
> The problem I have run into is when I temporarily
> pulled the eth0(riloe connected) interface on
> hatest1. I kept the interface pulled for 60
> seconds before re-inserting. The problem is that
> the cloned stonith resource is not re-started.

Actually they are; you can see that from the "Failed actions" section
of your crm_mon output. The problem is that those starts failed, and a
failed start is considered fatal: we don't try to start the resource
again on that node until you clean things up with crm_resource -C.
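
As a rough sketch of what that cleanup could look like for the two
entries in your "Failed actions" list: the resource and node names below
are copied from your crm_mon output, cleanup_cmds is just a hypothetical
helper name, and I am assuming that -r selects the resource and -H the
host in this heartbeat release (newer versions may spell the host option
differently, so check crm_resource --help first). The helper only prints
the commands so you can review them before running anything.

```shell
#!/bin/sh
# Hypothetical helper: print one crm_resource cleanup command per failed
# resource instance on the given node. Nothing is executed here.
# Assumption: -r names the resource and -H names the host, as in
# heartbeat 2.1.x; verify with `crm_resource --help` on your system.
cleanup_cmds() {
    node="$1"
    shift
    for rsc in "$@"; do
        echo "crm_resource -C -r $rsc -H $node"
    done
}

# Names taken from the "Failed actions" section of the crm_mon output.
cleanup_cmds hatest1 CL_stonith_node02:0 CL_stonith_node02:1
```

If the printed commands look right, run them as root on one of the
nodes and then check crm_mon again to see whether the clone instance
gets started on hatest1.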

> I have attached the hb_report which covers
> starting heartbeat, pulling and re-inserting
> eth0. Here is the output of crm_mon :
>
> ============
> Last updated: Wed May 21 15:49:26 2008
> Current DC: hatest2 (040affcf-fef9-42ae-ab98-90f3e133da2f)
> 2 Nodes configured.
> 3 Resources configured.
> ============
>
> Node: hatest2 (040affcf-fef9-42ae-ab98-90f3e133da2f): online
> Node: hatest1 (ca34aa97-8cf5-41be-bf2f-2c3585a1661d): online
>
> Resource Group: group_1
>    IPaddr_165_115_204_197      (heartbeat::ocf:IPaddr):        Started hatest1
>    drbddisk_2  (heartbeat:drbddisk):   Started hatest1
>    Filesystem_3        (heartbeat::ocf:Filesystem):    Started hatest1
>    rc.primary_5        (lsb:rc.primary):       Started hatest1
> Clone Set: CL_stonithset_node01
>    CL_stonith_node01:0 (stonith:external/riloe):       Started hatest2
>    CL_stonith_node01:1 (stonith:external/riloe):       Stopped
>
> Failed actions:
>    CL_stonith_node02:0_start_0 (node=hatest1, call=22, rc=14): Error
>    CL_stonith_node02:1_start_0 (node=hatest1, call=24, rc=1): Error
> [EMAIL PROTECTED] tmp]#
>
> When I first had this problem my cib.xml entry for CL_stonith_node02 looked
> like this :
>
>        <primitive id="CL_stonith_node02" class="stonith" type="external/riloe-iders" provider="heartbeat">
>           <operations>
>             <op name="monitor" interval="30s" timeout="20s" id="CL_stonith_node02_monitor"/>
>             <op name="start" timeout="60s" id="CL_stonith_node02_start"/>
>           </operations>
>
> I changed it to this thinking that the on_fail="restart" might have an
> effect but it doesn't.
>
>
>         <primitive id="CL_stonith_node02" class="stonith" type="external/riloe" provider="heartbeat">
>           <operations>
>             <op name="monitor" interval="30s" timeout="20s" id="CL_stonith_node02_monitor" start_delay="0" disabled="false" role="Started" on_fail="restart"/>
>             <op name="start" timeout="60s" id="CL_stonith_node02_start" start_delay="0" disabled="false" role="Started" on_fail="restart"/>
>           </operations>
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>