On Fri, May 23, 2008 at 10:38 PM,  <[EMAIL PROTECTED]> wrote:
>
> "Andrew Beekhof" <[EMAIL PROTECTED]> wrote on 2008/05/23 06:24:35:
>
>> On Thu, May 22, 2008 at 7:31 PM,  <[EMAIL PROTECTED]> wrote:
>> >
>> > Thanks for the response.
>> >
>> > Yes, from the logs: in a 4-second interval while the NIC was
>> > disconnected, lrmd attempts to start the STONITH resource twice.
>> >
>> > A couple of observations/questions :
>> >
>> > - Where is this behaviour/logic documented?
>
>> you mean the bit about start failures being fatal?
>> hmmm - good question - I'm not sure we ever did document that, despite
>> it being part of the design from day 1.
>
>> > I've had great difficulty finding information on ha-linux. I
>> > eventually stumbled across the pacemaker site but this just raises
>> > more questions on what I should be running, i.e. if I'm looking at
>> > implementing a drbd/heartbeat system in the next 3 months, should I
>> > stick with the 2.1.3 heartbeat packages or go to pacemaker/heartbeat
>> > from the start?
>
>> pacemaker/heartbeat from the start
>
>
> I will look at downloading the binaries and upgrading.
>
>
>> 2.1.3 was the last combined release before the CRM was split off to
>> become Pacemaker.
>> all CRM code has since been completely removed from the Heartbeat
>> code-base.
>
>> > - Why shouldn't we be able to change this behaviour with
>> > on_fail="restart",
>
>> basically, because "starts are considered special"
>> however this is likely to become configurable now that failures
>> (including failed starts) can be timed out
>
>> > maybe also with the "interval" parameter?
>> > Wouldn't it be desirable to have the cluster recover without user
>> > intervention?
>
>> "how?"
>
>> we tried starting it everywhere, and it failed.
>> retrying forever, causing extra load and potentially downtime for
>> other resources, isn't the smartest thing to do.
>
> By using on_fail="restart" and some reasonable interval value

This is the piece we had always missed - the ability to time out
failures after an interval and thus throttle the impact of a
non-transient error.  Without this, the only sane thing to do is make
start failures permanent.
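
A minimal sketch of how such a timeout could be expressed in the CIB
(the resource id, fence agent type and the 60s value are hypothetical;
failure-timeout is the relevant meta attribute):

```xml
<!-- Hypothetical stonith primitive: failure-timeout lets recorded
     failures (including failed starts) expire after the given
     interval instead of being treated as permanent. -->
<primitive id="fence-node1" class="stonith" type="external/ipmi">
  <meta_attributes id="fence-node1-meta">
    <nvpair id="fence-node1-meta-failure-timeout"
            name="failure-timeout" value="60s"/>
  </meta_attributes>
</primitive>
```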

> Would this really put a load on the system? Another option for a
> stonith resource would be to not stop the resource if the
> monitor/status failed - just write an error message and update the
> Failed Actions.

There is on_fail="ignore", which would do exactly this.
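
As a sketch (the ids, agent type and interval are made up for
illustration), that would be set on the monitor operation like so:

```xml
<!-- on_fail="ignore" on the monitor op: a failed monitor is logged
     and shows up in Failed Actions, but triggers no recovery. -->
<primitive id="fence-node1" class="stonith" type="external/ssh">
  <operations>
    <op id="fence-node1-monitor" name="monitor" interval="30s"
        on_fail="ignore"/>
  </operations>
</primitive>
```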

> The first successful monitor/status would clear the failed actions.
>
> I connected to the pacemaker site and have read through some of the
> documentation including Initial Configuration and Pacemaker
> Configuration Explained. Two questions :
> - In Initial Configuration under Enabling Pacemaker you specify
>   using "crm respawn" rather than "crm yes" if you plan to enable
>   STONITH. Why ?

Because later versions of heartbeat changed the semantics of "crm yes",
which caused the node to reboot if any of heartbeat's child processes
ever quit.  If STONITH is enabled, it is very debatable whether this
behavior provides any real benefit.
The downside of using "crm yes" is that if the reason the child quit
is persistent, then you have a window of about 2 seconds to debug and
fix the issue.
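
For reference, the relevant ha.cf fragment (the file path shown is
just the usual default location):

```
# /etc/ha.d/ha.cf (fragment)
# "crm respawn": restart a crashed CRM child process instead of
# rebooting the node - safer once STONITH is enabled.
crm respawn
```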

> - In Pacemaker Configuration Explained under "How Should the
>   Configuration be Updated" you talk about using XML editors.
>   Do you have any recommendations ?

I've used xmlspy in the past, but mostly I know the syntax pretty well
and just use emacs :-)
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
