On Fri, Jan 14, 2011 at 3:50 AM, Lars Ellenberg <[email protected]> wrote:

> On Thu, Jan 13, 2011 at 01:14:58PM -0600, Igor Chudov wrote:
> > On Thu, Jan 13, 2011 at 10:55 AM, Lars Ellenberg
> > <[email protected]> wrote:
> >
> > > On Thu, Jan 13, 2011 at 10:17:40AM -0600, Igor Chudov wrote:
> > > > Again, after about 3-4 days of running, heartbeat master process
> > > > dies with SIGXCPU.
> > > >
> > > > I was fortunate to run strace -p on it, so I captured strace. It
> > > > looks like boring, garden variety regular work, and then heartbeat
> > > > dies with SIGXCPU. The output is a bit lengthy.
> > > >
> > > > Is there some way to turn OFF the timeout on CPU?
> > >
> > > heartbeat sources,
> > > heartbeat/heartbeat.c,
> > > look out for cl_cpu_limit_setpercent
> > > which itself is defined in glue sources,
> > > glue/lib/clplumbing/cpulimits.c
> > >
> > > There the head comment block explains the intention of it:
> > >  * This allows us to better catch runaway realtime processes that
> > >  * might otherwise hang the whole system (if they're POSIX realtime
> > >  * processes).
> > >  *
> > >  * We do this by getting a "lease" on CPU time, and then ....
> > >
> > > You could of course simply kill invocations of it.
> > >
> > > It would be interesting to know what heartbeat spends its cpu time on,
> > > though, so maybe you can try to profile it?
> > >
> > > It should usually not consume that much cpu.
> > >
> >
> > Lars, in my uneducated opinion, the bug is in setting CPU limit
> > incorrectly.
>
> If you read the cpulimits.c head comment completely,
> you see that "every now and then" the user of this feature
> is supposed to call cl_cpu_limit_update(),
> which will extend the limit.
> And they all do with high priority in their mainloop,
> and I think at various other places as well.
>
> Exceeding the limit anyways means that it used up "too much" CPU time
> without returning to the mainloop.
>

Right, or, not setting the limit properly.

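To make the "lease" idea from the quoted comment concrete, here is a rough
sketch of how such a scheme could work. It is not the code in cpulimits.c.
It assumes the lease is implemented by moving the soft RLIMIT_CPU forward
from the CPU time already consumed, and the names lease_soft_limit,
cpu_seconds_used and renew_lease are made up for illustration. The relevant
fact is that RLIMIT_CPU counts cumulative CPU seconds, so even a process with
very low average CPU use will eventually hit a limit that is never renewed.

```python
import math
import resource


def lease_soft_limit(cpu_used_secs, lease_secs,
                     hard_limit=resource.RLIM_INFINITY):
    """Soft RLIMIT_CPU value that grants lease_secs more CPU seconds.

    Hypothetical helper: the limit is measured against total CPU time,
    so the new value is "used so far" plus the lease, capped by the
    hard limit when one is set.
    """
    wanted = math.ceil(cpu_used_secs) + lease_secs
    if hard_limit != resource.RLIM_INFINITY:
        wanted = min(wanted, hard_limit)
    return wanted


def cpu_seconds_used():
    """User plus system CPU time consumed by this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime


def renew_lease(lease_secs):
    """Hypothetical stand-in for a periodic call from the main loop.

    If this stops being called (or computes too small a value), the
    kernel delivers SIGXCPU once the soft limit is crossed.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
    new_soft = lease_soft_limit(cpu_seconds_used(), lease_secs, hard)
    resource.setrlimit(resource.RLIMIT_CPU, (new_soft, hard))
    return new_soft
```

Under these assumptions a main loop would call renew_lease(...) on every
iteration. Either a missed renewal or a miscalculated limit would produce the
same symptom: SIGXCPU after a long stretch of otherwise ordinary work.
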
> > I did watch heartbeat a little bit with ps | grep and its CPU use is
> > very low.
>
> Well, while you were watching, it did not receive the SIGXCPU, either,
> so there the limits have been good, right? ;-)
>

Right

> But go ahead, and increase those limits, and retry.
>
> > I think that it is just a dumb, garden variety bug.
>
> Entirely possible.
> Would you track it down for us?
>

I would like to, yes. In fact, I have no choice: I am a computer programmer,
and troubleshooting is what I am paid to do. I will be glad to.

i
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
