On Thu, Jan 13, 2011 at 01:14:58PM -0600, Igor Chudov wrote:
> On Thu, Jan 13, 2011 at 10:55 AM, Lars Ellenberg
> <[email protected]> wrote:
>
> > On Thu, Jan 13, 2011 at 10:17:40AM -0600, Igor Chudov wrote:
> > > Again, after about 3-4 days of running, heartbeat master process dies with
> > > SIGXCPU.
> > >
> > > I was fortunate to run strace -p on it, so I captured strace. It looks like
> > > boring, garden variety regular work, and then heartbeat dies with SIGXCPU.
> > > The output is a bit lengthy.
> > >
> > > Is there some way to turn OFF the timeout on CPU?
> >
> > heartbeat sources,
> > heartbeat/heartbeat.c,
> > look out for cl_cpu_limit_setpercent
> > which itself is defined in glue sources,
> > glue/lib/clplumbing/cpulimits.c
> >
> > There the head comment block explains the intention of it:
> >  * This allows us to better catch runaway realtime processes that
> >  * might otherwise hang the whole system (if they're POSIX realtime
> >  * processes).
> >  *
> >  * We do this by getting a "lease" on CPU time, and then ....
> >
> > You could of course simply kill invokations of it.
> >
> > It would be interesting to know what heartbeat spends its cpu time on,
> > though, so maybe you can try to profile it?
> >
> > It should usually not consume that much cpu.
> >
> Lars, in my uneducated opinion, the bug is in setting CPU limit incorrectly.
>

If you read the cpulimits.c head comment completely, you will see that
"every now and then" the user of this feature is supposed to call
cl_cpu_limit_update(), which will extend the limit.

And they all do, with high priority in their mainloop,
and I think at various other places as well.

So exceeding the limit anyway means that the process used up "too much"
CPU time without returning to the mainloop.

> I did watch heartbeat a little bit with ps | grep and its CPU use is very
> low.

Well, while you were watching, it did not receive the SIGXCPU either,
so the limits were fine then, right? ;-)

But go ahead, increase those limits, and retry.

> I think that it is just a dumb, garden variety bug.

Entirely possible.
Would you track it down for us?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
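
For illustration only, the "lease" idea from the cpulimits.c head comment can be sketched with plain POSIX calls. This is not the glue code: cpu_seconds_used(), extend_cpu_lease() and disable_cpu_lease() are hypothetical names, and the sketch assumes the lease is a soft RLIMIT_CPU that the mainloop keeps pushing forward. When the soft limit is reached, the kernel delivers SIGXCPU, which is the signal seen in this thread.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper: whole CPU seconds (user + system) used so far,
 * or -1 if getrusage() fails. */
long cpu_seconds_used(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return (long)ru.ru_utime.tv_sec + (long)ru.ru_stime.tv_sec;
}

/* Hypothetical helper: renew the lease by moving the soft CPU limit to
 * "CPU time used so far + lease_seconds". Meant to be called from the
 * mainloop; if the process never gets back here, the soft limit is hit
 * and the kernel sends SIGXCPU. Returns 0 on success, -1 on error. */
int extend_cpu_lease(long lease_seconds)
{
    struct rlimit rl;
    long used = cpu_seconds_used();
    rlim_t want;

    if (used < 0 || lease_seconds <= 0)
        return -1;
    if (getrlimit(RLIMIT_CPU, &rl) != 0)
        return -1;

    /* +1 covers the fraction of a second dropped by cpu_seconds_used() */
    want = (rlim_t)(used + lease_seconds + 1);

    /* The soft limit may never exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;

    rl.rlim_cur = want;
    return setrlimit(RLIMIT_CPU, &rl);
}

/* Hypothetical helper: give up the lease by raising the soft limit to
 * the hard limit, so only the hard limit (if any) still applies. */
int disable_cpu_lease(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CPU, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CPU, &rl);
}
```

A mainloop built on this would call extend_cpu_lease() once per iteration. A larger lease_seconds tolerates longer stretches of work between iterations, and disable_cpu_lease() corresponds to turning the check off altogether.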
