By the way, I am now restarting heartbeat every day on both nodes, at 22:00 on one node and at 23:00 on the other. I hope that will help.

i

On Fri, Jan 14, 2011 at 7:50 AM, Igor Chudov <[email protected]> wrote:

> On Fri, Jan 14, 2011 at 3:50 AM, Lars Ellenberg <[email protected]> wrote:
>
>> On Thu, Jan 13, 2011 at 01:14:58PM -0600, Igor Chudov wrote:
>> > On Thu, Jan 13, 2011 at 10:55 AM, Lars Ellenberg <[email protected]> wrote:
>> >
>> > > On Thu, Jan 13, 2011 at 10:17:40AM -0600, Igor Chudov wrote:
>> > > > Again, after about 3-4 days of running, heartbeat master process dies
>> > > > with SIGXCPU.
>> > > >
>> > > > I was fortunate to run strace -p on it, so I captured strace. It looks
>> > > > like boring, garden variety regular work, and then heartbeat dies with
>> > > > SIGXCPU.
>> > > > The output is a bit lengthy.
>> > > >
>> > > > Is there some way to turn OFF the timeout on CPU?
>> > >
>> > > heartbeat sources,
>> > > heartbeat/heartbeat.c,
>> > > look out for cl_cpu_limit_setpercent
>> > > which itself is defined in glue sources,
>> > > glue/lib/clplumbing/cpulimits.c
>> > >
>> > > There the head comment block explains the intention of it:
>> > > * This allows us to better catch runaway realtime processes that
>> > > * might otherwise hang the whole system (if they're POSIX realtime
>> > > * processes).
>> > > *
>> > > * We do this by getting a "lease" on CPU time, and then ....
>> > >
>> > > You could of course simply kill invocations of it.
>> > >
>> > > It would be interesting to know what heartbeat spends its cpu time on,
>> > > though, so maybe you can try to profile it?
>> > >
>> > > It should usually not consume that much cpu.
>> > >
>> > Lars, in my uneducated opinion, the bug is in setting CPU limit incorrectly.
>>
>> If you read the cpulimits.c head comment completely,
>> you see that "every now and then" the user of this feature
>> is supposed to call cl_cpu_limit_update(),
>> which will extend the limit.
>> And they all do with high priority in their mainloop,
>> and I think at various other places as well.
>>
>> Exceeding the limit anyways means that it used up "too much" CPU time
>> without returning to the mainloop.
>
> Right, or, not setting the limit properly.
>
>> > I did watch heartbeat a little bit with ps | grep and its CPU use is very
>> > low.
>>
>> Well, while you were watching, it did not receive the SIGXCPU, either,
>> so there the limits have been good, right? ;-)
>
> Right
>
>> But go ahead, and increase those limits, and retry.
>>
>> > I think that it is just a dumb, garden variety bug.
>>
>> Entirely possible.
>> Would you track it down for us?
>
> I would like to, yes, in fact, I have no choice.
>
> I am a computer programmer and troubleshooting is what I am paid to do.
>
> Will be glad to.
>
> i

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
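
For anyone following the thread, here is a rough sketch of the CPU "lease" idea described in the quoted cpulimits.c comment. It is not the cluster-glue code: cpu_seconds_used() and renew_cpu_lease() are hypothetical names made up for illustration, and the 60-second lease mentioned below is an arbitrary assumption. It uses only the standard getrusage(), getrlimit() and setrlimit() calls. The soft RLIMIT_CPU limit is moved a little ahead of the CPU time already consumed, so a process that stops renewing the lease eventually receives SIGXCPU.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper (not the cluster-glue API): whole seconds of
 * user + system CPU time this process has consumed so far. */
long cpu_seconds_used(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return (long)ru.ru_utime.tv_sec + (long)ru.ru_stime.tv_sec;
}

/* Hypothetical helper: renew the "lease" by moving the soft CPU limit
 * to (CPU seconds used so far + lease_seconds + 1). The hard limit is
 * left untouched, and the soft limit is never set above it.
 * Returns 0 on success, -1 on failure. */
int renew_cpu_lease(long lease_seconds)
{
    struct rlimit rl;
    rlim_t want;
    long used = cpu_seconds_used();

    if (used < 0 || lease_seconds < 0)
        return -1;
    if (getrlimit(RLIMIT_CPU, &rl) != 0)
        return -1;

    /* +1 because sub-second CPU time was truncated above. */
    want = (rlim_t)(used + lease_seconds + 1);
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;

    rl.rlim_cur = want;
    return setrlimit(RLIMIT_CPU, &rl);
}
```

A daemon built this way would call something like renew_cpu_lease(60) once per main-loop iteration. If the lease is too short for the work legitimately done between two renewals, or a code path runs a long time without renewing, the process gets SIGXCPU even though its average CPU use looks low in ps.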
