Re: [Linux-HA] Re: Heartbeat and RA monitor functions

Andrew Beekhof Wed, 28 May 2008 03:46:46 -0700

On Wed, May 28, 2008 at 11:02 AM, Joe Bill <[EMAIL PROTECTED]> wrote:
>>> I also assume that HB performs a 'monitor
>>> check-level 0' *after* a successful "start"
>>> or "stop".
>>> Is this correct ?
>
>> No. If the RA exits with success (0) the action's
>> considered to have been successful. (Hmm, how does
>> that sound :)
>
> Ugh ! Terrible, really ( ;-) ) !


So when you stop or start an init.d service on the command line, do you
always run a status operation after it reports success?
No? Right, and neither do we.

It is the job of the RA to make sure it has finished doing whatever it
was asked to do before reporting the operation is complete.

Accurately reporting the service's true state is advisable after a
start operation and mandatory after a stop.
This is easily done by calling a level 0 check in a loop at the end of
both functions.
Calling more intensive checks is up to the RA writer and in the case
of stops, depends on the chances of the level 0 check being incorrect.
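
To illustrate the "level 0 check in a loop" idea, here is a minimal sketch.
It is in Python purely for illustration (real RAs are usually shell
scripts). The helper names (wait_for_state, launch, shutdown, check) and
the timeout and interval defaults are hypothetical; only the OCF return
codes 0, 1 and 7 are standard values.

```python
import time

# Standard OCF return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def wait_for_state(check, wanted, timeout=30.0, interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll the lightweight check until it returns `wanted` or time runs out.

    `check` is a hypothetical callable standing in for the RA's level 0
    monitor; it returns an OCF return code.
    """
    deadline = clock() + timeout
    while True:
        if check() == wanted:
            return OCF_SUCCESS
        if clock() >= deadline:
            return OCF_ERR_GENERIC
        sleep(interval)


def start(launch, check, **kw):
    """Launch the service, then report success only once it is really up."""
    launch()
    return wait_for_state(check, OCF_SUCCESS, **kw)


def stop(shutdown, check, **kw):
    """Shut the service down, then report success only once it is really gone."""
    shutdown()
    return wait_for_state(check, OCF_NOT_RUNNING, **kw)
```

In this sketch the RA gives up on its own after `timeout` seconds; a real
agent would more likely rely on the operation timeout configured in the
cluster, so treat the internal deadline as an assumption of the example.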

These are not excessive requirements.

> It suggests that the RA implementation of the 'START'
> and 'STOP' operations, should include the code as to
> perform, before exiting, ALL the tests that are
> carried out at ALL implemented check-levels of
> monitoring, as to reliably return a resource's status,
> which negates the advantage of having more than 1
> monitoring function.
>
>>> Or, I assume that, if HB performs a 'monitor
>>> check-level 0', and that operation returns a
>>> "ERROR" status, HB automatically performs
>>> another 'monitor' of the same resource but with
>>> the next check-level known by HB ...
>
>> No.
>
> Ugh!
>
> This means that repairable states of a resource should
> be tested for and repaired at *all* check-levels
> instead of expecting Heartbeat to gradually shift from
> check-level 0 to check-level 20.

All your checks are performed at the intervals you request.
The failure of any one check means the service is considered to have failed.
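
A rough sketch of what this means for the RA and for the cluster, again in
Python for illustration only. Reading the level from an OCF_CHECK_LEVEL
entry in an environment-like mapping, the `checks` table, and returning
"unimplemented" for an unknown level are all assumptions of the example;
service_state is a toy model of the rule above, not Heartbeat's code.

```python
# Standard OCF return codes.
OCF_SUCCESS = 0
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7


def monitor(env, checks):
    """Run only the check for the requested level; nothing escalates.

    `env` is an environment-like mapping and `checks` maps a check level
    (0, 10, 20) to a callable returning an OCF return code. Both are
    hypothetical names for this sketch.
    """
    level = int(env.get("OCF_CHECK_LEVEL", "0"))
    check = checks.get(level)
    if check is None:
        # Assumption: an unknown level is reported as unimplemented.
        return OCF_ERR_UNIMPLEMENTED
    return check()


def service_state(results):
    """Toy model: any single failed check marks the whole service failed.

    `results` maps a check level to the return code of its latest run.
    """
    if any(rc != OCF_SUCCESS for rc in results.values()):
        return "failed"
    return "ok"
```

Each level would be configured as its own monitor operation with its own
interval, so a failing level 0 check is reported as a failure and is not
followed up automatically by a level 10 or level 20 check.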

>
> This significantly increases the duration of
> monitoring at the lower check-levels, which beats the
> purpose of check-levels, *to least impact the QOS*, as
> described in the OCF spec:
>
> "3.5.3.1. Parameters specific to the 'monitor' action
>
> OCF_CHECK_LEVEL
> 0 The most lightweight check possible, which should
> not have an impact on the QoS...
>
> 10 A medium weight check, expected to be called
> multiple times per minute, which should not have a
> noticeable impact on the QoS...
>
> 20 A heavy weight check, called infrequently, which
> may impact system or service performance..."
>
>
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
