On Wed, Oct 02, 2013 at 09:37:10AM +0200, Ulrich Windl wrote:
> >>> Dejan Muhamedagic <[email protected]> schrieb am 01.10.2013 um 19:09 in
> Nachricht <[email protected]>:
> > Hi,
> > 
> > On Tue, Oct 01, 2013 at 05:01:52PM +0200, Moullé Alain wrote:
> >> Hi,
> >> 
> >> with stack Pacemaker/corosync;
> >> 
> >> suppose that a node in an HA cluster is so loaded (I/O, etc.) for
> >> longer than the heartbeat timeout, but only temporarily; so loaded
> >> that it can no longer handle heartbeat tokens. It is then fenced
> >> because it can't handle heartbeat tokens, although there is no
> >> real problem, just a temporarily overloaded node.
> >> 
> >> how do you / could we manage this type of problem?
> >> is there a way to always give the corosync traffic higher
> >> priority than any other load?
> > 
> > The corosync process should be running at a higher priority
> > (i.e. close to real-time). Doesn't it?
> 
> Once we come to I/O, scheduling priorities help nothing. Really.

That's why the cluster's core processes are not only realtime,
but also memlocked, so they won't need I/O,
at least not to answer cluster heartbeats.
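You can check both properties on a live node. This is a sketch inspecting the current shell for illustration; on a real cluster node you would substitute the corosync PID (e.g. `$(pidof corosync)`):

```shell
# Inspect scheduling policy and locked memory of a process.
# Using our own shell ($$) here only as a stand-in; on a cluster
# node use the corosync PID instead.
pid=$$

# Scheduling policy and realtime priority (corosync should report
# a realtime policy such as SCHED_RR, not SCHED_OTHER).
chrt -p "$pid"

# Memory locked into RAM via mlockall(), in kB (non-zero for
# a memlocked daemon).
grep VmLck "/proc/$pid/status"
```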

But yes, calling monitor actions of resource agents
may still involve I/O, and then time out.
In which case those timeouts are simply too small:
http://www.advogato.org/person/lmb/diary/108.html
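As a hypothetical illustration (resource name, agent, and values are invented, not recommendations), the crm shell lets you size operation timeouts so that an I/O load spike does not immediately look like a failure:

```shell
# Hypothetical example: configure a resource whose monitor timeout
# leaves headroom for temporary I/O overload.  Tune values to what
# your storage can actually deliver under worst-case load.
crm configure primitive p_db ocf:heartbeat:mysql \
    op monitor interval=30s timeout=120s \
    op start timeout=180s \
    op stop timeout=180s
```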

If you are paranoid enough (some are),
you put all cluster relevant stuff in a ram disk,
including shells, interpreters, and supporting libraries,
and audit all participating applications for memleaks
and similar.
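A minimal sketch of the ramdisk approach (mount point, size, and file list are illustrative assumptions, not a recipe):

```shell
# Sketch: keep cluster-critical executables and their libraries
# on a tmpfs so invoking them never has to touch a disk.
mount -t tmpfs -o size=256m,mode=0755 tmpfs /cluster-ram

# Copy in whatever the resource agents need at runtime, e.g.:
cp -a /bin/sh /cluster-ram/
# ...plus interpreters, agents, and the shared libraries they use
# (ldd tells you which), then arrange for them to be run from there.
```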

And tune some sysctls so that even when you are hard
out of memory, you still won't drop network packets.
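For example (values are illustrative only; appropriate settings depend on RAM size and workload):

```shell
# Keep a larger emergency reserve so the kernel can still satisfy
# atomic allocations (such as network RX buffers) under memory
# pressure.
sysctl -w vm.min_free_kbytes=131072

# Allow generous socket receive buffers so traffic bursts are
# absorbed rather than dropped.
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.rmem_default=8388608
```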

Restricting possible inputs to the system,
and auditing all its pieces properly
is a lot of work.  But it can be done.
And in areas where public health and safety are concerned,
it is, to some extent, even required by law.

On your average box with local user access,
a LAMP stack, possible runaway processes,
and home-grown "fork-bomb"-like application behaviour,
this does not apply.
HA clustering is not there to make up for crappy applications,
even if some try to use it that way.

But at some point,
if a node is so busy it cannot get anything done anymore, 
maybe it is better for overall behaviour to put it down.

Unless death-by-overload becomes a "frequent" problem,
in which case there is nothing the cluster can really do:
you then need to revisit capacity planning,
and put in some throttles in the right places
(which are typically application dependent)
to prevent overloading the system in the first place.


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems