Thanks for the response.
Yes: from the logs, in a 4-second interval while the NIC was disconnected,
lrmd attempts to start the STONITH resource twice.
A couple of observations/questions:

- Where is this behaviour/logic documented? I've had great difficulty
  finding information on Linux-HA. I eventually stumbled across the
  Pacemaker site, but this just raises more questions about what I should
  be running. For example, if I'm looking at implementing a DRBD/Heartbeat
  system in the next three months, should I stick with the 2.1.3 heartbeat
  packages or go to Pacemaker/Heartbeat from the start?

- Why shouldn't we be able to change this behaviour with
  on_fail="restart", perhaps combined with the "interval" parameter?
  Wouldn't it be desirable to have the cluster recover without user
  intervention? In our case eth0 is connected to a switch that is only
  used for connecting to the iLO cards; I would far rather have the
  cluster recover than receive a call at 3am.

- After some trial and error I found that to clean up the resource I
  needed to specify both:
      crm_resource -C -H hatest1 -r CL_stonith_node02:0
      crm_resource -C -H hatest1 -r CL_stonith_node02:1
  Shouldn't I just have to do: crm_resource -C -r CL_stonith_node02

- From hb_gui, however, I can just click on CL_stonithset_node02 and
  select "Cleanup Resource", and it does work. The problem in the GUI is
  that the "Failed actions" shown by crm_mon are not present.

- Also, the man page for crm_resource has:
      --cleanup, -C
          Delete a resource from the LRM.
          Requires: -r. Optional: -H
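As a sketch of the kind of knob I was hoping for: later Pacemaker releases
document a cluster-wide start-failure-is-fatal property that controls
whether a failed start is treated as fatal. Whether the 2.1.3 CRM accepts
this property at all is an open question, so treat this crm_config
fragment as hypothetical:

```xml
<!-- Hypothetical: property name taken from Pacemaker documentation;
     it may not be recognised by heartbeat 2.1.3's CRM. -->
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <attributes>
      <nvpair id="opt-start-failure-is-fatal"
              name="start-failure-is-fatal" value="false"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```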
David K Livingstone
CN Signals and Communications
10229 127 Avenue floor 2
Edmonton, AB, T5E 0B9
Ph : 780 472-3959 Fax : 780 472-3050
Email: [EMAIL PROTECTED]
"Andrew Beekhof" <[EMAIL PROTECTED]>
2008/05/22 00:46
To: [EMAIL PROTECTED], "General Linux-HA mailing list" <[email protected]>
Subject: Re: [Linux-HA] riloe does not restart after pulling/inserting nic
On Thu, May 22, 2008 at 12:07 AM, David Livingstone
<[EMAIL PROTECTED]> wrote:
>
>
>
> I have a test two node(hatest1/hatest2) heartbeat/drbd setup
> on HP Proliant DL380G4 hw with the following
> heartbeat pkgs loaded :
> heartbeat-gui-2.1.3-3.el5.centos
> heartbeat-2.1.3-3.el5.centos
> heartbeat-pils-2.1.3-3.el5.centos
> heartbeat-stonith-2.1.3-3.el5.centos
>
> I am using the ilo2 cards and external/riloe
> for STONITH. I can issue a stonith command
> correctly as follows :
> [EMAIL PROTECTED] tmp]# stonith -v -t external/riloe hostlist=hatest2
> ilo_hostname=hatest2-ilo ilo_user=Heartbeat ilo_password=entropyilo
> ilo_protocol=2.0 ilo_can_reset=0 ilo_powerdown_method=button -S
> stonith: external/riloe device OK.
>
> I have also tested stonith by killing all the
> heartbeat packages on hatest2 - it was correctly
> rebooted by hatest1.
>
> The problem I have run into is when I temporarily
> pulled the eth0(riloe connected) interface on
> hatest1. I kept the interface pulled for 60
> seconds before re-inserting. The problem is that
> the cloned stonith resource is not re-started.
Actually they are; you can see that from the failed actions section.
The problem is that those starts failed, and failed starts are
considered fatal: we don't try to start again until you clean things
up with crm_resource -C
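Since each instance of an anonymous clone has to be cleaned up
individually here, the cleanup can be sketched as a small loop. The
node, resource name, and instance count are taken from this thread; the
commands are echoed rather than executed so the sketch is safe to
dry-run (drop the leading 'echo' on a real cluster):

```shell
#!/bin/sh
# Clean up every instance of a cloned resource on one node.
# COUNT is the number of clone instances (clone_max in the CIB).
NODE=hatest1
RESOURCE=CL_stonith_node02
COUNT=2

i=0
while [ "$i" -lt "$COUNT" ]; do
  # Echo the command instead of running it; remove 'echo' to apply.
  echo crm_resource -C -H "$NODE" -r "${RESOURCE}:${i}"
  i=$((i + 1))
done
```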
> I have attached the hb_report which covers
> starting heartbeat, pulling and re-inserting
> eth0. Here is the output of crm_mon :
>
> ============
> Last updated: Wed May 21 15:49:26 2008
> Current DC: hatest2 (040affcf-fef9-42ae-ab98-90f3e133da2f)
> 2 Nodes configured.
> 3 Resources configured.
> ============
>
> Node: hatest2 (040affcf-fef9-42ae-ab98-90f3e133da2f): online
> Node: hatest1 (ca34aa97-8cf5-41be-bf2f-2c3585a1661d): online
>
> Resource Group: group_1
> IPaddr_165_115_204_197 (heartbeat::ocf:IPaddr): Started hatest1
> drbddisk_2 (heartbeat:drbddisk): Started hatest1
> Filesystem_3 (heartbeat::ocf:Filesystem): Started hatest1
> rc.primary_5 (lsb:rc.primary): Started hatest1
> Clone Set: CL_stonithset_node01
> CL_stonith_node01:0 (stonith:external/riloe): Started hatest2
> CL_stonith_node01:1 (stonith:external/riloe): Stopped
>
> Failed actions:
> CL_stonith_node02:0_start_0 (node=hatest1, call=22, rc=14): Error
> CL_stonith_node02:1_start_0 (node=hatest1, call=24, rc=1): Error
> [EMAIL PROTECTED] tmp]#
>
> When I first had this problem my cib.xml entry for CL_stonith_node02
> looked like this:
>
> <primitive id="CL_stonith_node02" class="stonith"
>            type="external/riloe-iders" provider="heartbeat">
>   <operations>
>     <op name="monitor" interval="30s" timeout="20s"
>         id="CL_stonith_node02_monitor"/>
>     <op name="start" timeout="60s" id="CL_stonith_node02_start"/>
>   </operations>
> </primitive>
>
> I changed it to this, thinking that the on_fail="restart" might have
> an effect, but it doesn't.
>
>
> <primitive id="CL_stonith_node02" class="stonith"
>            type="external/riloe" provider="heartbeat">
>   <operations>
>     <op name="monitor" interval="30s" timeout="20s"
>         id="CL_stonith_node02_monitor" start_delay="0" disabled="false"
>         role="Started" on_fail="restart"/>
>     <op name="start" timeout="60s" id="CL_stonith_node02_start"
>         start_delay="0" disabled="false" role="Started" on_fail="restart"/>
>   </operations>
> </primitive>
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>