On Mon, Oct 19, 2009 at 09:52:52AM -0700, Robinson, Eric wrote:
> > You are assuming sar and top tell you everything about
> > your system and that is just plain wrong.
>
> I'm not assuming that at all. What I am saying is, given that heartbeat
> is having trouble sending status updates for up to 15 seconds at a time,
> yet sar is apparently not capturing a possible cause (even when set to 1
> second intervals) does anyone have a suggestion for a better way to
> identify the cause?

I'd expect this to be some sort of "memory shortage" problem, or a
shortage of memory of a certain "type", or blocking on some socket
operation.

The latter could be logging ;), so try using logd, not blocking syslog
or, even worse, direct logging to files.

The former may be, e.g., something in the networking stack that needs
buffer memory and tries to allocate some; that allocation calls out to
the VM write-out path and has to wait for some IO.

To improve the situation, I recommend increasing the network buffers,
specifically the following sysctl settings:

  net.core.rmem_max = 1048576
  net.core.wmem_max = 1048576

(or 10 MB, if you have enough RAM).
If your kernel has them already, also raise udp_rmem_min and
udp_wmem_min.
Careful: udp_mem is a different thing altogether, not per socket but a
total for the whole subsystem, and not in bytes but in pages.

Rationale: the U in UDP is "unreliable". The system is free to let
packets fall on the floor if it "feels" unable to process them _now_.
Heartbeat (and corosync as well, correct me if I'm wrong) mainly uses
UDP for communication. Increasing buffer space for UDP sockets gives
the system more room to queue packets it cannot process right away.

Also, consider increasing vm.min_free_kbytes. It sometimes helps with
"strange" resource problems, especially on machines with plenty of RAM
(which, for that very reason, can produce a HUGE amount of dirty pages
in no time, which then hog the write-out paths, which in turn may make
it difficult for other things to get done in time, especially with
older kernels).

Those network recommendations are mainly for lost heartbeats or lost
cluster communication messages, not for Gmain_timeout_dispatch thingies
per se. So maybe upgrading to a more recent kernel + pacemaker is the
_real_ solution?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
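
As a rough illustration of the buffer settings named in the message above, the following Python sketch reads the current values from /proc/sys and reports which ones are below the suggested 1048576 bytes. It also converts udp_mem from pages to bytes, since that value is counted in pages rather than bytes. The helper names (read_sysctl, below_suggested, udp_mem_bytes) are made up for this example, and the sketch assumes a Linux system that exposes sysctls under /proc/sys; it only reads values and changes nothing.

```python
import os
from pathlib import Path

# Suggested values from the message above (1 MiB per-socket buffer ceiling).
SUGGESTED = {
    "net.core.rmem_max": 1048576,
    "net.core.wmem_max": 1048576,
}


def read_sysctl(name, root="/proc/sys"):
    """Return the raw string value of a sysctl key, or None if it is missing."""
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def below_suggested(suggested=SUGGESTED, root="/proc/sys"):
    """Return {key: (current, wanted)} for keys whose current value is lower."""
    low = {}
    for key, wanted in suggested.items():
        raw = read_sysctl(key, root)
        if raw is not None and int(raw) < wanted:
            low[key] = (int(raw), wanted)
    return low


def udp_mem_bytes(root="/proc/sys"):
    """Convert the three udp_mem page counts (min, pressure, max) to bytes."""
    raw = read_sysctl("net.ipv4.udp_mem", root)
    if raw is None:
        return None
    page = os.sysconf("SC_PAGE_SIZE")
    return tuple(int(n) * page for n in raw.split())


if __name__ == "__main__":
    for key, (cur, want) in below_suggested().items():
        print(f"{key} = {cur}, suggested >= {want}")
    print("udp_mem in bytes (min, pressure, max):", udp_mem_bytes())
```

Keys that a given kernel does not have (for example udp_rmem_min on older kernels) simply come back as None from read_sysctl, so the same script can be run on old and new systems alike.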

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
