Re: [Linux-HA] Monitor Retry

Andrew Beekhof Thu, 31 Jan 2008 03:15:29 -0800


On Jan 31, 2008, at 11:06 AM, Dejan Muhamedagic wrote:

Hi,

On Wed, Jan 30, 2008 at 04:05:41PM +0100, Andreas Mock wrote:

-----Original Message-----
From: General Linux-HA mailing list <[email protected]>
Sent: 30.01.08 13:44:53
To: General Linux-HA mailing list <[email protected]>
co-incidentally, i've been thinking about such a feature recently...
i'm inclined to think that this functionality should be in the LRM
(i.e. it's a threshold for escalating to the CRM).

thoughts?


My thoughts if anyone is interested:

The result of the monitor action should be:
a) Resource is running.
b) Resource is not running.

But does it imply that it is running healthy? As the result of
the monitor action determines what happens to the resource,
I would say "YES".


Right.

So the question is:
Does the resource run in a way that lets it fulfill its service? (Yes/No)
But IMHO this question implies that the RA should try to do as much
as necessary to test serviceability. That can be quite a lot.
Sometimes too much if the service is doing what it should: working hard!


That's why multiple levels of monitoring are available.


nod.  I think we should promote this more.
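
A rough sketch of what "multiple levels of monitoring" can look like inside an agent, assuming the requested depth arrives in the OCF_CHECK_LEVEL environment variable and that 0, 1 and 7 are the usual OCF success, generic-error and not-running codes. The monitor() function and the checks mapping are hypothetical names for illustration, not part of any existing RA:

```python
import os

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor(checks, environ=os.environ):
    """Run every check whose level is <= the requested check level.

    `checks` maps an integer level to a callable returning True (healthy)
    or False. Level 0 is assumed to be the cheap "is it there at all" test;
    higher levels are assumed to be progressively more expensive.
    """
    requested = int(environ.get("OCF_CHECK_LEVEL", "0"))
    for level in sorted(checks):
        if level > requested:
            break  # deeper checks were not asked for, so skip them
        if not checks[level]():
            # A failed basic check is reported as "not running",
            # a failed deeper check as a generic error.
            return OCF_NOT_RUNNING if level == 0 else OCF_ERR_GENERIC
    return OCF_SUCCESS
```

The cheap check can then run often with a short timeout, while the expensive one runs rarely with a generous timeout.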

Timeout means: Nothing, no answer.
That's equivalent to no service. We can't say why though.
What shall I do with this information? As I said before: shall I assume
that something is wrong or that everything is o.k.?
Is everything o.k. if a service is producing so much load that I'm not even
able to get the output of 'ps -ax'?

That's basically up to the resource to decide, and identifying the cause of the load is going to be part of that decision.

What is the difference between asking once with a long timeout and asking multiple times with a short timeout? (If I measure the time, I could know in both cases
that the RA needs (too) long to get an answer.)
IMHO the timeout of the monitor action is not the big problem, but the possible
chain reaction you get after this:
monitor timeout because of heavy (regular) load => stop action triggered => stop action times out (RAs try to do a graceful shutdown) because of heavy load =>
MESS (node fencing or resource in unmanaged state).
My proposal: Make the timeout for monitor long enough. If a timeout occurs, assume that there is really something wrong because a "simple" monitor action does not work.


Exactly.

=> Stop resource. Implement a two-step stop action (probably in the RA itself):
1) Try a graceful stop of the resource (e.g. db shutdown).
2) After an inner timeout, stop/kill the resource brutally (if possible).
3) If this doesn't work, signal a timeout to the upper instance, which results in known behaviour.
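
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the resource is a single child process on a POSIX system, SIGTERM triggers its graceful shutdown, and the function name, timeouts and return values are made up for the example. A real RA would use whatever shutdown command its service provides.

```python
import subprocess


def stop_with_escalation(proc, graceful_timeout=5.0, kill_timeout=2.0):
    """Stop `proc` (a subprocess.Popen) in two steps.

    Returns "graceful", "killed", or "failed". The inner timeouts are
    meant to add up to less than the stop timeout the LRM enforces.
    """
    # Step 1: ask for a graceful shutdown (SIGTERM on POSIX).
    proc.terminate()
    try:
        proc.wait(timeout=graceful_timeout)
        return "graceful"
    except subprocess.TimeoutExpired:
        pass

    # Step 2: inner timeout expired, kill it the hard way (SIGKILL on POSIX).
    proc.kill()
    try:
        proc.wait(timeout=kill_timeout)
        return "killed"
    except subprocess.TimeoutExpired:
        # Step 3: even the kill did not work; report failure upward.
        return "failed"
```

With something like this the stop action only fails when even the kill does not take effect, which is the case the upper layers already know how to handle.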


Again, you're spot on here.

The higher instance (LRM) has timeouts too. The RA should try to
stop the resource at any cost. Reasonable exceptions allowed. If
it can't stop the resource then it should let CRM deal with that
which typically results in a STONITH operation.

What would someone win: Kill one resource brutally in the hope that all other resources
still remain intact.

Of course interested to hear other aspects. :-)


There's still a certain set of resources which are (unfortunately)
unreliable and can occasionally time out. I can recall some STONITH
devices which would time out once out of a hundred times or so.
For those, a monitor failure could be considered transient.

Perhaps a database with occasional high load too. Under some
circumstances, i.e. if the software runs as it should and the
high load is not the result of a flaw, it would be preferable not to
failover but to wait.
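
To make the idea of tolerating transient monitor failures concrete, here is a minimal sketch of a threshold that escalates only after several consecutive failures. The class name and its interface are hypothetical; nothing like it is claimed to exist in the LRM or CRM.

```python
class FailureThreshold:
    """Escalate only after `threshold` consecutive monitor failures."""

    def __init__(self, threshold):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, ok):
        """Record one monitor result; return True if it should be escalated."""
        if ok:
            # Any success resets the count, so isolated glitches are forgiven.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

With a threshold of 3, a device that times out once in a hundred monitors would never be escalated, while a resource that stays down is reported on the third failed monitor in a row.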

Thanks,

Dejan


Best regards
Andreas Mock


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems