Re: [Linux-HA] riloe does not restart after pulling/inserting nic

Andrew Beekhof Fri, 23 May 2008 05:31:43 -0700

On Thu, May 22, 2008 at 7:31 PM,  <[EMAIL PROTECTED]> wrote:
>
> Thanks for the response.
>
> Yes from the logs in a 4 second interval while the nic was disconnected lrmd
> attempts to start the STONITH resource twice.
>
> A couple of observations/questions :
>
> - Where is this behaviour/logic documented ?


you mean the bit about start failures being fatal?
hmmm - good question - I'm not sure we ever did document that, despite
it being part of the design from day 1.

> I've had great difficulty finding information on ha-linux. I eventually
> stumbled across the pacemaker site but this just raises more questions on
> what I should be running. ie If I'm looking at implementing a drbd/heartbeat
> system in the next 3 months should I stick with the 2.1.3 heartbeat packages
> or go to pacemaker/heartbeat from the start.

pacemaker/heartbeat from the start

2.1.3 was the last combined release before the CRM was split off to
become Pacemaker.
all CRM code has since been completely removed from the Heartbeat code-base.

> - Why shouldn't we be able to change this behaviour with the on_fail="restart",

basically, because "starts are considered special"
however this is likely to become configurable now that failures
(including failed starts) can be timed out

>   maybe also with the "interval" parameter ?
>   Wouldn't it be desirable to have the cluster recover without user
>   intervention ?
"how?"

we tried starting it everywhere, it failed.
retrying forever, causing extra load and potentially downtime for
other resources, isn't the smartest thing to do.

so instead we wait for the admin to clean up the problem and tell the
cluster it's ok to continue.

>    In our case eth0 is connected to a switch which is only used for connecting
>    to the ilo cards - I would far rather have the cluster recover than
>    receive a call at 3am.
> - After some trial I found that to cleanup the resource I needed to specify:
>   crm_resource -C -H hatest1 -r CL_stonith_node02:0    and
>   crm_resource -C -H hatest1 -r CL_stonith_node02:1
>   - shouldn't I just have to do : crm_resource -C  -r CL_stonith_node02
>   - from hb_gui however I can just click on CL_stonithset_node02 and
>     select "Cleanup Resource" and it does work. The problem in the gui
>     is that the "Failed actions" as shown by crm_mon are not present.
>   - also the man page for crm_resource has :
>     --cleanup, -C
>                     Delete a resource from the LRM.
>                     Requires: -r.  Optional: -H

I believe that the documentation is incorrect.
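
A minimal sketch of the per-instance cleanup described in the quoted text,
assuming the clone has two instances (:0 and :1) as in the commands above.
The NODE, RESOURCE, INSTANCES and DRY_RUN names are made up for this example;
only the "crm_resource -C -H <node> -r <resource>:<n>" form comes from this
thread. By default it only prints the commands instead of running them:

```shell
#!/bin/sh
# Sketch: clean up every instance of a cloned resource on one node.
# Values are taken from the commands quoted in this thread; the variable
# names and the DRY_RUN switch are hypothetical, added for illustration.
NODE="hatest1"
RESOURCE="CL_stonith_node02"
INSTANCES=2                 # assumed clone count (:0 and :1)
DRY_RUN="${DRY_RUN:-1}"     # 1 = print the commands only

# Emit one cleanup command per clone instance.
cleanup_cmds() {
    i=0
    while [ "$i" -lt "$INSTANCES" ]; do
        echo "crm_resource -C -H ${NODE} -r ${RESOURCE}:${i}"
        i=$((i + 1))
    done
}

if [ "$DRY_RUN" = "1" ]; then
    cleanup_cmds            # show what would be run
else
    cleanup_cmds | sh       # actually run the cleanup commands
fi
```

Run it once as-is to check the generated commands, then with DRY_RUN=0 on a
cluster node to execute them.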
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
