Re: [Linux-HA] Monitor Retry

Andreas Mock Wed, 30 Jan 2008 07:29:33 -0800

> -----Original Message-----
> From: General Linux-HA mailing list <[email protected]>
> Sent: 30.01.08 13:44:53
> To: General Linux-HA mailing list <[email protected]>
> co-incidentally, i've been thinking about such a feature recently...
> i'm inclined to think that this functionality should be in the LRM  
> (ie. its a threshold for escalating to the CRM).
> 
> thoughts?


My thoughts if anyone is interested:

The result of the monitor action should be:
a) Resource is running.
b) Resource is not running.

But does that imply it is running healthily? Since the result of
the monitor action determines what happens to the resource,
I would say "YES". So the question is:
Is the resource running in a way that lets it fulfill its service? (Yes/No)
But IMHO this question implies that the RA should do as much
as necessary to test serviceability. That can be quite a lot.
Sometimes too much, if the service is doing what it should: working hard!
Timeout means: Nothing, no answer.

What shall I do with this information? As I said before: should I assume
that something is wrong, or that everything is o.k.?
Is everything o.k. if a service is producing so much load that I'm not even
able to get the output of 'ps -ax'?

What is the difference between asking once with a long timeout and asking
multiple times with a short timeout? (If I measure the time, I know in both
cases that the RA needs (too) long to get an answer.)
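To make the comparison concrete, here is a minimal sketch in Python. It is not
how the LRM or any RA actually does it; the function name monitor() and its
parameters are made up for illustration. It runs a check command either once
with a long timeout or several times with a short one, and measures the total
time spent in both cases.

```python
import subprocess
import sys
import time


def monitor(cmd, timeout, attempts=1):
    """Run cmd up to `attempts` times, each with `timeout` seconds.

    Returns (result, seconds_elapsed), where result is 'running',
    'not running', or 'timeout'. Hypothetical helper, for illustration only.
    """
    start = time.monotonic()
    result = "timeout"
    for _ in range(attempts):
        try:
            rc = subprocess.run(cmd, timeout=timeout).returncode
        except subprocess.TimeoutExpired:
            continue  # no answer at all; ask again if attempts remain
        result = "running" if rc == 0 else "not running"
        break
    return result, time.monotonic() - start


if __name__ == "__main__":
    # Stand-in for a monitor action that hangs under heavy load.
    slow_check = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(monitor(slow_check, timeout=0.9))              # asked once, long timeout
    print(monitor(slow_check, timeout=0.3, attempts=3))  # asked three times, short timeout
```

In both calls the caller learns the same thing after roughly the same time:
no answer within about 0.9 seconds.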

IMHO the big problem is not the timeout of the monitor action itself, but the
chain reaction that can follow it:
monitor timeout because of heavy (regular) load => stop action triggered =>
stop action times out (RAs try to do a graceful shutdown) because of heavy load =>
MESS (node fencing or resource in unmanaged state).

My proposal: Make the timeout for monitor long enough. If a timeout occurs,
assume that something is really wrong, because a "simple" monitor action does
not work.
=> Stop the resource. Implement a two-step stop action (probably in the RA itself):
1) Try a graceful stop of the resource (e.g. db shutdown).
2) After an inner timeout, stop/kill the resource brutally (if possible).
3) If this doesn't work, signal a timeout to the upper instance, which results
in the known behaviour.
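
The stop sequence above could be sketched roughly like this in Python. This is
only an illustration under assumptions: the resource is taken to be a single
process we hold a handle to, SIGTERM stands in for the graceful shutdown, and
the names escalating_stop(), graceful_timeout and kill_timeout are made up.
A real RA would call the service's own shutdown command in step 1.

```python
import subprocess
import sys


def escalating_stop(proc, graceful_timeout, kill_timeout):
    """Stop `proc` gracefully first, then by force.

    Returns 'graceful', 'killed', or 'timeout'. Hypothetical helper:
    'timeout' is what would be reported to the upper instance.
    """
    # 1) Graceful stop: ask politely (SIGTERM on POSIX).
    proc.terminate()
    try:
        proc.wait(timeout=graceful_timeout)
        return "graceful"
    except subprocess.TimeoutExpired:
        pass  # inner timeout expired, escalate

    # 2) Brutal stop: SIGKILL cannot be caught or ignored.
    proc.kill()
    try:
        proc.wait(timeout=kill_timeout)
        return "killed"
    except subprocess.TimeoutExpired:
        # 3) Even the kill did not work: let the caller handle it.
        return "timeout"


if __name__ == "__main__":
    # Stand-in resource that ignores the graceful request.
    stubborn = subprocess.Popen(
        [sys.executable, "-c",
         "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); "
         "print('ready', flush=True); time.sleep(60)"],
        stdout=subprocess.PIPE,
    )
    stubborn.stdout.readline()  # wait until the signal handler is installed
    print(escalating_stop(stubborn, graceful_timeout=1, kill_timeout=5))
```

The inner timeout has to be shorter than the stop timeout configured in the
cluster, otherwise step 2 is never reached before the upper instance gives up.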

What would one gain? One resource is killed brutally, in the hope that all
other resources still remain intact.

Of course I'm interested to hear other aspects. :-)

Best regards
Andreas Mock


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
