Re: [Linux-HA] Antw: General question about heartbeat tokens and node overloaded.

Moullé Alain Wed, 02 Oct 2013 04:45:45 -0700

>>"There is one notable exception: If you have shared storage (SAN, NAS, NFS), the cause of the slowness may be external to the systems being monitored, thus fencing those will not improve the situation, most likely."

Yes, this is exactly the case I'm facing ...
Alain

On 02/10/2013 13:40, Ulrich Windl wrote:

Lars Marowsky-Bree <[email protected]> wrote on 02.10.2013 at 09:48 in message

<[email protected]>:

On 2013-10-02T09:36:14, Ulrich Windl <[email protected]> wrote:

In general I'm afraid you cannot handle this situation in a perfect way:

You have two types of problems:
1) A node, resource, or monitor is hanging, but a long timeout prevents this
from being recognized in time
2) A node, resource, or monitor is performing slower than usual, but a short
timeout causes the cluster to think there is a problem with the
node/resource/monitor
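
The two cases above can be illustrated with a minimal sketch. It is not how
heartbeat or corosync implement detection; the function names and the sample
numbers are hypothetical and chosen only to show that one timeout value cannot
serve both cases well.

```python
# Hypothetical illustration of a timeout-only failure detector.
# A response time of None means the monitor never answers (a hang);
# a number is the time the monitor takes to answer, in seconds.

def classify(response_time, timeout):
    """Return the verdict a timeout-based detector reaches for one call."""
    if response_time is None or response_time > timeout:
        return "failed"
    return "ok"

def detection_delay(response_time, timeout):
    """Return how many seconds pass before the detector reaches its verdict."""
    if response_time is None or response_time > timeout:
        return timeout
    return response_time

HUNG = None   # case 1: the monitor hangs forever
SLOW = 8.0    # case 2: the monitor is only slower than usual (assumed value)

for timeout in (5.0, 60.0):
    print("timeout", timeout,
          "| hung:", classify(HUNG, timeout), "after", detection_delay(HUNG, timeout),
          "| slow:", classify(SLOW, timeout), "after", detection_delay(SLOW, timeout))
```

With the short timeout the slow monitor is reported as failed although it would
have answered; with the long timeout the slow monitor passes, but the hang is
only noticed after the full timeout has elapsed.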

Yes, or to summarize, timeouts suck for failure detection, but for many
cases, we don't have anything better. Digging out my age-old post:
http://advogato.org/person/lmb/diary/108.html

A massively overloaded system is indistinguishable from a failing or
hung one. On the plus side, if a system is *that* overloaded that
corosync isn't being scheduled and its rather limited network traffic
presents a problem, it is likely so FUBAR'ed that fencing it doesn't
make things worse. So the misdiagnosis isn't necessarily a problem.

Hi!

There is one notable exception: If you have shared storage (SAN, NAS, NFS), the 
cause of the slowness may be external to the systems being monitored, thus 
fencing those will not improve the situation, most likely.

BTW: We experienced hanging I/O when one of our SAN devices had a
problem, but the others did not. Still the SLES11 SP2 kernel saw
stalled I/Os for obviously unaffected devices. The problem is being
investigated...

FC can be weird like that if it is routed through the same HBA or
switch. It's not always a kernel problem, the fabric isn't trivial
either. Good luck with finding the root cause :-/

You are arguing that a shared medium (like the Internet) may cause one
server to be slow if the other server is slow. That would only be plausible if
the client waits for a request to the slow server to complete before
starting a request to the faster server. If that is the case for disks instead
of servers, with an FC SAN as the shared medium, then the OS really has a
problem (and not the shared medium).
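
A minimal sketch of that argument, assuming two requests with made-up
latencies: the names and numbers are hypothetical and only model the
difference between a client that issues requests one after another and one
that issues them independently.

```python
# Hypothetical latencies in seconds; not measurements from this thread.
SLOW, FAST = 10.0, 0.1

def serial_completion(latencies):
    """Completion times when each request starts only after the previous one ends."""
    done, clock = [], 0.0
    for latency in latencies:
        clock += latency
        done.append(clock)
    return done

def concurrent_completion(latencies):
    """Completion times when all requests are issued at time 0 and run independently."""
    return list(latencies)

# Serialized client: the fast request is held up behind the slow one.
print("serial:    ", serial_completion([SLOW, FAST]))
# Independent requests: the fast request is unaffected by the slow one.
print("concurrent:", concurrent_completion([SLOW, FAST]))
```

Only in the serialized case does the slow target delay the fast one, which is
the situation described above as a problem of the client (here, the OS) rather
than of the shared medium.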

Regards,
Ulrich


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

