On Mon, Oct 19, 2009 at 09:52:52AM -0700, Robinson, Eric wrote:
> > You are assuming sar and top tell you everything about 
> > your system and that is just plain wrong.
> 
> I'm not assuming that at all. What I am saying is, given that heartbeat
> is having trouble sending status updates for up to 15 seconds at a time,
> yet sar is apparently not capturing a possible cause (even when set to 1
> second intervals) does anyone have a suggestion for a better way to
> identify the cause?

I'd expect this to be some sort of "memory shortage" problem,
or shortage of memory of a certain "type",
or blocking on some or other socket operation.

the latter could be logging ;), so try using logd, not blocking syslog
or even worse, direct logging to files.


the former may be, e.g., something in the networking stack that needs
buffer memory and tries to allocate some; that allocation then calls into
the VM write-out path and has to wait for some IO.


To improve the situation, I recommend increasing the network buffers,
specifically the following sysctl settings:

net.core.rmem_max = 1048576
net.core.wmem_max = 1048576

(or 10 MB, if you have enough RAM).
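
To see whether those limits actually matter on a given box, one can ask
for a large receive buffer on a UDP socket and look at what the kernel
reports back. This is only a rough sketch, not anything heartbeat itself
does; the helper name granted_rcvbuf is made up for illustration. On
Linux, as far as I know, the reported value is double the requested one
(bookkeeping overhead), and requests above net.core.rmem_max are clamped
rather than rejected.

```python
import socket


def granted_rcvbuf(requested: int) -> int:
    """Request `requested` bytes of receive buffer on a fresh UDP socket
    and return the size the kernel reports afterwards.

    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    # 64 KB, 1 MB and 10 MB, matching the sizes discussed above.
    for want in (64 * 1024, 1024 * 1024, 10 * 1024 * 1024):
        got = granted_rcvbuf(want)
        print(f"requested {want:>9} bytes, kernel reports {got:>9}")
```

If the reported size stops growing while the request keeps growing, the
socket is hitting the rmem_max cap, and raising the sysctl should help.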

if your kernel already has udp_rmem_min and udp_wmem_min (under
net.ipv4), raise those as well.
careful, udp_mem is a different thing altogether:
it is not per socket but a total for the whole subsystem, and it is not
in bytes but in pages.
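
Since the pages-versus-bytes difference is easy to trip over, here is a
small sketch that converts the three udp_mem fields (min, pressure, max)
into bytes. It assumes the usual /proc/sys/net/ipv4/udp_mem location;
the function name udp_mem_bytes is just something I picked for the
example.

```python
import os


def udp_mem_bytes(udp_mem_line: str, page_size: int) -> dict:
    """Convert the three udp_mem fields (given in pages) to bytes.

    Hypothetical helper for illustration only.
    """
    low, pressure, high = (int(field) for field in udp_mem_line.split())
    return {
        "min": low * page_size,
        "pressure": pressure * page_size,
        "max": high * page_size,
    }


if __name__ == "__main__":
    page_size = os.sysconf("SC_PAGE_SIZE")
    try:
        with open("/proc/sys/net/ipv4/udp_mem") as f:
            line = f.read()
    except OSError as err:
        # Not Linux, or the kernel has no udp_mem sysctl.
        print(f"cannot read udp_mem: {err}")
    else:
        for name, size in udp_mem_bytes(line, page_size).items():
            print(f"udp_mem {name:>8}: {size} bytes")
```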

rationale: the U in udp is "unreliable".
the system is free to let packets fall onto the floor if
it "feels" unable to process them _now_.

heartbeat (and corosync, as well, correct me if I'm wrong) mainly
utilize udp for communications.

increasing buffer space for udp sockets does help to have the system

also, consider increasing "vm.min_free_kbytes".
It sometimes helps with "strange" resource problems, especially on
machines with plenty of RAM (which, for that very reason, can produce a
HUGE amount of dirty pages in no time, which then hog the write out
paths, which in turn may make it difficult for other things to get done
in time, especially with older kernels)
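
To get a feeling for whether dirty pages are piling up while the
heartbeat stalls happen, something like the following could be run in a
loop next to sar. It is only a sketch: it assumes the standard
/proc/meminfo field names (MemFree, Dirty, Writeback) and the sysctl
file /proc/sys/vm/min_free_kbytes, and the function name meminfo_kb is
made up.

```python
def meminfo_kb(text: str, keys=("MemFree", "Dirty", "Writeback")) -> dict:
    """Pick selected fields (in kB) out of /proc/meminfo-style text.

    Hypothetical helper for illustration only.
    """
    out = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys and rest.split():
            out[name] = int(rest.split()[0])
    return out


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = meminfo_kb(f.read())
        with open("/proc/sys/vm/min_free_kbytes") as f:
            min_free = int(f.read())
    except OSError as err:
        # Not Linux, or /proc is not mounted.
        print(f"cannot read memory statistics: {err}")
    else:
        print(f"vm.min_free_kbytes = {min_free} kB")
        for name, value in info.items():
            print(f"{name:>10} = {value} kB")
```

A large and growing Dirty value together with MemFree hovering close to
min_free_kbytes would at least be consistent with the write-out theory
above; it does not prove it.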


those network recommendations are mainly for lost heartbeats or lost
cluster comm messages, not for Gmain_timeout_dispatch
thingies per se.

so maybe upgrading to a more recent kernel + pacemaker
is the _real_ solution?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
