On Fri, Jan 14, 2011 at 3:50 AM, Lars Ellenberg
<[email protected]> wrote:

> On Thu, Jan 13, 2011 at 01:14:58PM -0600, Igor Chudov wrote:
> > On Thu, Jan 13, 2011 at 10:55 AM, Lars Ellenberg
> > <[email protected]> wrote:
> >
> > > On Thu, Jan 13, 2011 at 10:17:40AM -0600, Igor Chudov wrote:
> > > > Again, after about 3-4 days of running, heartbeat master process dies
> > > > with SIGXCPU.
> > > >
> > > > I was fortunate to run strace -p on it, so I captured strace. It looks
> > > > like boring, garden variety regular work, and then heartbeat dies with
> > > > SIGXCPU. The output is a bit lengthy.
> > > >
> > > > Is there some way to turn OFF the timeout on CPU?
> > >
> > > heartbeat sources,
> > >  heartbeat/heartbeat.c,
> > >  look out for cl_cpu_limit_setpercent
> > > which itself is defined in glue sources,
> > >  glue/lib/clplumbing/cpulimits.c
> > >
> > > There the head comment block explains the intention of it:
> > >  * This allows us to better catch runaway realtime processes that
> > >  * might otherwise hang the whole system (if they're POSIX realtime
> > >  * processes).
> > >  *
> > >  * We do this by getting a "lease" on CPU time, and then ....
> > >
> > > You could of course simply remove the invocations of it.
> > >
> > > It would be interesting to know what heartbeat spends its cpu time on,
> > > though, so maybe you can try to profile it?
> > >
> > > It should usually not consume that much cpu.
> > >
> > >
> > Lars, in my uneducated opinion, the bug is in setting the CPU limit
> > incorrectly.
> >
>
> If you read the cpulimits.c head comment completely,
> you see that "every now and then" the user of this feature
> is supposed to call cl_cpu_limit_update(),
> which will extend the limit.
> And they all do with high priority in their mainloop,
> and I think at various other places as well.
>
> Exceeding the limit anyway means that it used up "too much" CPU time
> without returning to the mainloop.
>
Right, or not setting the limit properly.
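
For reference, the "lease" idea described above can be sketched in a few
lines. This is only an illustration in Python, not the code from
cpulimits.c. It assumes the lease is enforced through the RLIMIT_CPU soft
limit (exceeding that limit is what normally delivers SIGXCPU), and the
names next_soft_limit and renew_cpu_lease, as well as the lease length,
are made up for the example.

```python
import resource


def cpu_seconds_used():
    """CPU time (user + system) consumed by this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime


def next_soft_limit(used_seconds, lease_seconds, hard_limit):
    """Soft limit granting lease_seconds more CPU time, capped at hard_limit.

    The extra second covers the fraction dropped by int(), because
    RLIMIT_CPU is expressed in whole seconds.
    """
    wanted = int(used_seconds) + lease_seconds + 1
    if hard_limit != resource.RLIM_INFINITY and wanted > hard_limit:
        return hard_limit
    return wanted


def renew_cpu_lease(lease_seconds):
    """Move the soft CPU limit ahead of current usage; meant for a main loop.

    If the process burns more than the lease without getting back here,
    the kernel sends SIGXCPU, whose default action terminates the process.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_CPU)
    soft = next_soft_limit(cpu_seconds_used(), lease_seconds, hard)
    resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))
    return soft
```

A main loop would call renew_cpu_lease(60) (or whatever lease fits) once
per iteration. If the renewal is skipped, or the new limit is computed
incorrectly, the process can be killed even though its average CPU use
looks low.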


> > I did watch heartbeat a little bit with ps | grep and its CPU use is very
> > low.
>
> Well, while you were watching, it did not receive the SIGXCPU, either,
> so there the limits have been good, right? ;-)
>
Right


> But go ahead, and increase those limits, and retry.
>
> > I think that it is just a dumb, garden variety bug.
>
> Entirely possible.
> Would you track it down for us?
>
>
I would like to, yes, in fact, I have no choice.

I am a computer programmer and troubleshooting is what I am paid to do.

Will be glad to.

i
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems