Thanks for the response.

Yes, from the logs, in a 4 second interval while the nic was disconnected
lrmd attempts to start the STONITH resource twice.

A couple of observations/questions :

- Where is this behaviour/logic documented? I've had great difficulty
  finding information on ha-linux. I eventually stumbled across the
  pacemaker site, but this just raises more questions about what I should
  be running, i.e. if I'm looking at implementing a drbd/heartbeat system
  in the next 3 months, should I stick with the 2.1.3 heartbeat packages
  or go to pacemaker/heartbeat from the start?
- Why shouldn't we be able to change this behaviour with
  on_fail="restart", maybe also with the "interval" parameter?
  Wouldn't it be desirable to have the cluster recover without user
  intervention? In our case eth0 is connected to a switch which is only
  used for connecting to the ilo cards - I would far rather have the
  cluster recover than receive a call at 3am.
- After some trial I found that to clean up the resource I needed to
  specify :
  crm_resource -C -H hatest1 -r CL_stonith_node02:0    and
  crm_resource -C -H hatest1 -r CL_stonith_node02:1
  - shouldn't I just have to do : crm_resource -C -r CL_stonith_node02 ?
  - from hb_gui, however, I can just click on CL_stonithset_node02 and
    select "Cleanup Resource" and it does work. The problem in the gui
    is that the "Failed actions" shown by crm_mon are not displayed.
  - also the man page for crm_resource has :
    --cleanup, -C
                    Delete a resource from the LRM.
                    Requires: -r.  Optional: -H
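
For reference, the per-instance cleanup above could be scripted roughly as
follows. This is only a sketch: the function name cleanup_clone_cmds is
made up for illustration, the number of clone instances is passed in by
hand, and it assumes the instances are numbered from 0 as in the crm_mon
output. It only prints the crm_resource commands (the same -C -H -r form
used above) and does not run them.

```shell
#!/bin/sh
# Hypothetical helper: print one cleanup command per clone instance.
# Dry run only: the commands are echoed, never executed.
# Arguments: node name, clone primitive id, number of clone instances.
cleanup_clone_cmds() {
    node=$1
    rsc=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # Same form as the manual commands: -C (cleanup), -H node, -r resource:instance
        echo "crm_resource -C -H ${node} -r ${rsc}:${i}"
        i=$((i + 1))
    done
}

cleanup_clone_cmds hatest1 CL_stonith_node02 2
```

Once the printed commands look right they could be piped to sh on the
node in question.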

 

David K Livingstone
CN Signals and Communications
10229 127 Avenue floor 2
Edmonton, AB, T5E 0B9
Ph  : 780 472-3959 Fax : 780 472-3050
Email: [EMAIL PROTECTED] 



"Andrew Beekhof" <[EMAIL PROTECTED]> 
2008/05/22 00:46

To: [EMAIL PROTECTED], "General Linux-HA mailing list" <[email protected]>
cc:
Subject: Re: [Linux-HA] riloe does not restart after pulling/inserting nic

On Thu, May 22, 2008 at 12:07 AM, David Livingstone
<[EMAIL PROTECTED]> wrote:
>
>
>
> I have a test two node(hatest1/hatest2) heartbeat/drbd setup
> on HP Proliant DL380G4 hw with the following
> heartbeat pkgs loaded :
> heartbeat-gui-2.1.3-3.el5.centos
> heartbeat-2.1.3-3.el5.centos
> heartbeat-pils-2.1.3-3.el5.centos
> heartbeat-stonith-2.1.3-3.el5.centos
>
> I am using the ilo2 cards and external/riloe
> for STONITH. I can issue a stonith command
> correctly as follows :
> [EMAIL PROTECTED] tmp]#  stonith -v -t external/riloe hostlist=hatest2
> ilo_hostname=hatest2-ilo ilo_user=Heartbeat ilo_password=entropyilo
> ilo_protocol=2.0 ilo_can_reset=0 ilo_powerdown_method=button -S
> stonith: external/riloe device OK.
>
> I have also tested stonith by killing all the
> heartbeat packages on hatest2 - it was correctly
> rebooted by hatest1.
>
> The problem I have run into is when I temporarily
> pulled the eth0(riloe connected) interface on
> hatest1. I kept the interface pulled for 60
> seconds before re-inserting. The problem is that
> the cloned stonith resource is not re-started.

Actually they are; you can see that from the failed actions section.
The problem is that those starts failed, and failed starts are
considered fatal - we don't try to start again until you clean things
up with crm_resource -C

> I have attached the hb_report which covers
> starting heartbeat, pulling and re-inserting
> eth0. Here is the output of crm_mon :
>
> ============
> Last updated: Wed May 21 15:49:26 2008
> Current DC: hatest2 (040affcf-fef9-42ae-ab98-90f3e133da2f)
> 2 Nodes configured.
> 3 Resources configured.
> ============
>
> Node: hatest2 (040affcf-fef9-42ae-ab98-90f3e133da2f): online
> Node: hatest1 (ca34aa97-8cf5-41be-bf2f-2c3585a1661d): online
>
> Resource Group: group_1
>    IPaddr_165_115_204_197      (heartbeat::ocf:IPaddr):        Started hatest1
>    drbddisk_2  (heartbeat:drbddisk):   Started hatest1
>    Filesystem_3        (heartbeat::ocf:Filesystem):    Started hatest1
>    rc.primary_5        (lsb:rc.primary):       Started hatest1
> Clone Set: CL_stonithset_node01
>    CL_stonith_node01:0 (stonith:external/riloe):       Started hatest2
>    CL_stonith_node01:1 (stonith:external/riloe):       Stopped
>
> Failed actions:
>    CL_stonith_node02:0_start_0 (node=hatest1, call=22, rc=14): Error
>    CL_stonith_node02:1_start_0 (node=hatest1, call=24, rc=1): Error
> [EMAIL PROTECTED] tmp]#
>
> When I first had this problem my cib.xml entry for CL_stonith_node02 looked
> like this :
>
>        <primitive id="CL_stonith_node02" class="stonith"
> type="external/riloe-iders" provider="heartbeat">
>           <operations>
>             <op name="monitor" interval="30s" timeout="20s"
> id="CL_stonith_node02_monitor"/>
>             <op name="start" timeout="60s" id="CL_stonith_node02_start"/>
>           </operations>
>
> I changed it to this thinking that the on_fail="restart" might have an
> effect but it doesn't.
>
>
>         <primitive id="CL_stonith_node02" class="stonith"
> type="external/riloe" provider="heartbeat">
>           <operations>
>             <op name="monitor" interval="30s" timeout="20s"
> id="CL_stonith_node02_monitor" start_delay="0" disabled="false"
> role="Started" on_fail="restart"/>
>             <op name="start" timeout="60s" id="CL_stonith_node02_start"
> start_delay="0" disabled="false" role="Started" on_fail="restart"/>
>           </operations>
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>