On Fri, 2009-10-16 at 10:26 -0700, Robinson, Eric wrote:
> > It is more a matter of misinterpretation than a
> > fault in the kernel.
>
> Fair enough. Can you help me understand how to correctly interpret the
> fact that heartbeat is having trouble sending status updates, yet sar
> output during some of those moments shows high idle and low iowait
> numbers?
You are assuming sar and top tell you everything about your system and
that is just plain wrong.
The packet could have been sent and dropped at a moment when there is a
very short peak in load that gets flattened by the load measurements. It
is a little like starting all your nightly cronjobs in the same minute
and only measuring load over the whole night.
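
To put rough numbers on the averaging effect, here is a minimal sketch.
The figures are invented for illustration (a 10-minute reporting
interval, 100 ms samples, a 5% baseline and a single 2-second burst);
they are not taken from your sar output.

```python
# Hypothetical numbers: one 10-minute reporting interval sampled every
# 100 ms, a 5% busy baseline and a single 2-second burst at 100%.
SAMPLES_PER_INTERVAL = 6000   # 600 s / 0.1 s
BURST_START = 3000            # arbitrary position of the burst
BURST_SAMPLES = 20            # 2 s / 0.1 s

samples = [5] * SAMPLES_PER_INTERVAL
for i in range(BURST_START, BURST_START + BURST_SAMPLES):
    samples[i] = 100


def interval_average(values):
    """Mean over the whole interval, i.e. what a coarse report shows."""
    return sum(values) / len(values)


def peak_window_average(values, width):
    """Highest mean over any run of `width` consecutive samples."""
    window_sum = sum(values[:width])
    best = window_sum
    for i in range(width, len(values)):
        # Slide the window one sample to the right.
        window_sum += values[i] - values[i - width]
        best = max(best, window_sum)
    return best / width


average = interval_average(samples)
peak = peak_window_average(samples, BURST_SAMPLES)
print(f"busy over the whole interval: {average:.1f}%")
print(f"busy in the worst 2 seconds:  {peak:.1f}%")
```

With these made-up numbers the interval average stays at about 5%
busy, while for two full seconds the machine was completely saturated,
which is plenty of time to delay or drop a heartbeat packet.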
Or is the delay caused by something that isn't directly related to the
load created by processes?
A very high number of very short interrupts could cripple your system
and be invisible in sar and top. All overhead due to interrupts and
scheduling is invisible afaik.
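
If you want to look at interrupt and scheduling activity directly, one
option is to sample the kernel counters yourself. The sketch below
assumes a Linux /proc/stat, where the first number on the "intr" line is
the total interrupt count since boot and the "ctxt" line is the total
number of context switches. The function names and the one-second
interval are my own choices, not part of any existing tool.

```python
import time


def parse_counters(stat_text):
    """Extract the cumulative 'intr' and 'ctxt' totals from /proc/stat text."""
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("intr", "ctxt"):
            # For 'intr' the first number is the total over all IRQs.
            counters[fields[0]] = int(fields[1])
    return counters


def rates(before, after, seconds):
    """Convert two cumulative snapshots into events per second."""
    return {name: (after[name] - before[name]) / seconds for name in before}


def sample_rates(path="/proc/stat", interval=1.0):
    """Take two snapshots `interval` seconds apart and return the rates."""
    with open(path) as f:
        before = parse_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_counters(f.read())
    return rates(before, after, interval)
```

Calling sample_rates() in a loop with a short interval while the
heartbeat warnings appear would show whether the interrupt or context
switch rate jumps at those moments. Keep in mind that this is still an
average over the interval you pick, so the same flattening applies if
the interval is too long.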
Another example would be DMA transfers over your system or memory bus.
This would only show up as %iowait if a process is actually waiting to
read or write on an fd; %iowait is not a direct indicator of how busy
the system is with IO.
To stay on the %iowait topic, I wonder how much %iowait a process using
poll or select or even RT signals would cause, since it only does IO
when it knows it won't block; the rest of the time it just sleeps. And
it can do that while saturating your disk or network (been there, done
that).
Or there could be external causes, like not having a dedicated,
hardware-level network for your heartbeat traffic, which would keep it
from being interrupted by other traffic on the wire or switch, etc.
J.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems