>>"There is one notable exception: If you have shared storage (SAN,
NAS, NFS), the cause of the slowness may be external to the systems
being monitored, thus fencing those will not improve the situation, most
likely."
Yes, this is exactly the case I'm facing ...
Alain
On 02/10/2013 13:40, Ulrich Windl wrote:
Lars Marowsky-Bree <[email protected]> wrote on 02.10.2013 at 09:48 in message
<[email protected]>:
On 2013-10-02T09:36:14, Ulrich Windl <[email protected]> wrote:
In general I'm afraid you cannot handle this situation in a perfect way:
You have two types of problems:
1) A node, resource, or monitor is hanging, but a long timeout prevents this
from being recognized in time
2) A node, resource, or monitor is performing slower than usual, but a short
timeout causes the cluster to think there is a problem with the
node/resource/monitor
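The two cases above can be sketched roughly as follows. This is only an illustration of the trade-off, not how any cluster manager implements its monitors; the function name `detection_delay` and all the numbers are made up for the example.

```python
import math


def detection_delay(response_time, timeout):
    """Return (verdict, seconds until the verdict) for one monitor call.

    response_time: seconds the monitor needs to answer (math.inf = it hangs).
    timeout: seconds the cluster waits before declaring a failure.
    Both values are hypothetical inputs for this sketch.
    """
    if response_time <= timeout:
        return ("ok", response_time)
    # The cluster cannot tell "hung" from "slow"; it only sees the timeout.
    return ("failed", timeout)


# Case 1: a hung monitor with a long timeout is noticed only after 600 s.
hung_long = detection_delay(math.inf, timeout=600)

# Case 2: a slow but working monitor (45 s) with a short timeout (30 s)
# is reported as failed although it would have answered.
slow_short = detection_delay(45, timeout=30)

print(hung_long, slow_short)
```

Whatever timeout is chosen, one of the two cases gets worse when the other gets better.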
Yes, or to summarize, timeouts suck for failure detection, but for many
cases, we don't have anything better. Digging out my age-old post:
http://advogato.org/person/lmb/diary/108.html
A massively overloaded system is indistinguishable from a failing or
hung one. On the plus side, if a system is so overloaded that
corosync isn't being scheduled and its rather limited network traffic
presents a problem, it is likely so FUBAR'ed that fencing it doesn't
make things worse. So the misdiagnosis isn't necessarily a problem.
Hi!
There is one notable exception: If you have shared storage (SAN, NAS, NFS), the
cause of the slowness may be external to the systems being monitored, thus
fencing those will not improve the situation, most likely.
BTW: We experienced hanging I/O when one of our SAN devices had a
problem, but the others did not. Still, the SLES11 SP2 kernel saw
stalled I/Os for obviously unaffected devices. The problem is being
investigated...
FC can be weird like that if it is routed through the same HBA or
switch. It's not always a kernel problem; the fabric isn't trivial
either. Good luck with finding the root cause :-/
You are arguing that a shared medium (like the Internet) may cause one
server to be slow if the other server is slow. That would only be plausible if
the client waits for a request to the slow server to complete before
starting a request to the faster server. If that's the case for disks instead
of servers and an FC-SAN as the shared medium, the OS really has a problem (and
not the shared medium).
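To make the argument concrete, here is a small sketch that compares when a request to a fast device completes if requests are issued one after another versus independently. It is a toy model only; the device names and latencies are invented, and it says nothing about how the kernel or an HBA actually queues I/O.

```python
def completion_serial(requests):
    """Each request starts only after the previous one has finished."""
    done = {}
    clock_ms = 0
    for name, latency_ms in requests:
        clock_ms += latency_ms
        done[name] = clock_ms
    return done


def completion_independent(requests):
    """All requests start at time 0 and do not wait for each other."""
    return {name: latency_ms for name, latency_ms in requests}


# Hypothetical latencies in milliseconds: one stalled device, one healthy.
requests = [("slow_disk", 30000), ("fast_disk", 10)]

serial = completion_serial(requests)
independent = completion_independent(requests)

print("serial:     ", serial)
print("independent:", independent)
```

Only in the serial case does the healthy device appear stalled, which is why I would look for whatever is serializing the requests.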
Regards,
Ulrich
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems