"Andrew Beekhof" <[EMAIL PROTECTED]> wrote on 2008/05/23 06:24:35:
> On Thu, May 22, 2008 at 7:31 PM, <[EMAIL PROTECTED]> wrote:
> >
> > Thanks for the response.
> >
> > Yes, from the logs, in a 4 second interval while the nic was disconnected
> > lrmd attempts to start the STONITH resource twice.
> >
> > A couple of observations/questions :
> >
> > - Where is this behaviour/logic documented ?
>
> you mean the bit about start failures being fatal?
> hmmm - good question - I'm not sure we ever did document that, despite
> it being part of the design from day 1.
>
> > I've had great difficulty finding information on ha-linux. I eventually
> > stumbled across the pacemaker site but this just raises more questions
> > about what I should be running, i.e. if I'm looking at implementing a
> > drbd/heartbeat system in the next 3 months, should I stick with the
> > 2.1.3 heartbeat packages or go to pacemaker/heartbeat from the start?
>
> pacemaker/heartbeat from the start

I will look at downloading the binaries and upgrading.

> 2.1.3 was the last combined release before the CRM was split off to
> become Pacemaker.
> all CRM code has since been completely removed from the Heartbeat code-base.
>
> > - Why shouldn't we be able to change this behaviour with
> > on_fail="restart",
>
> basically, because "starts are considered special"
> however this is likely to become configurable now that failures
> (including failed starts) can be timed out
>
> > maybe also with the "interval" parameter ?
> > Wouldn't it be desirable to have the cluster recover without user
> > intervention ?
>
> "how?"
> we tried starting it everywhere, it failed.
> retrying forever, causing extra load and potentially downtime for
> other resources isn't the smartest thing to do.

By using on_fail="restart" and some reasonable interval value, would this
really put a load on the system ?

Another option for a stonith resource would be not to stop the resource if
the monitor/status failed - just write an error message and update the
Failed Actions.
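To make the question concrete, this is roughly what I had in mind - a sketch
of the stonith resource in the heartbeat 2.1.x CIB syntax. The ids and the
external/riloe type are just placeholders for our actual ilo setup :

    <clone id="CL_stonith_node02">
      <primitive id="stonith_node02" class="stonith" type="external/riloe">
        <operations>
          <!-- placeholder values; the idea is to retry a failed
               monitor rather than stop the resource for good -->
          <op id="stonith_node02_mon" name="monitor" interval="30s"
              timeout="60s" on_fail="restart"/>
        </operations>
      </primitive>
    </clone>

As I understand the reply above, a failed start would still be fatal
regardless of on_fail, so this would only change what happens on monitor
failures.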
The first successful monitor/status would clear the failed actions.

I connected to the pacemaker site and have read through some of the
documentation, including Initial Configuration and Pacemaker Configuration
Explained. Two questions :

- In Initial Configuration, under Enabling Pacemaker, you specify using
  "crm respawn" rather than "crm yes" if you plan to enable STONITH. Why ?
- In Pacemaker Configuration Explained, under "How Should the Configuration
  be Updated", you talk about using XML editors. Do you have any
  recommendations ?

Thanks again

> so instead we wait for the admin to clean up the problem and tell the
> cluster it's ok to continue.
>
> > In our case eth0 is connected to a switch which is only used for
> > connecting to the ilo cards - I would far rather have the cluster
> > recover than receive a call at 3am.
> >
> > - After some trial I found that to clean up the resource I needed to
> >   specify :
> >     crm_resource -C -H hatest1 -r CL_stonith_node02:0 and
> >     crm_resource -C -H hatest1 -r CL_stonith_node02:1
> >   shouldn't I just have to do : crm_resource -C -r CL_stonith_node02 ?
> > - from hb_gui however I can just click on CL_stonithset_node02 and
> >   select "Cleanup Resource", and it does work. The problem in the gui
> >   is that the "Failed actions" as shown by crm_mon are not present
> >   from the gui.
> > - also the man page for crm_resource has :
> >     --cleanup, -C
> >         Delete a resource from the LRM.
> >         Requires: -r. Optional: -H
>
> I believe that the documentation is incorrect.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
