On Thu, Jan 13, 2011 at 01:14:58PM -0600, Igor Chudov wrote:
> On Thu, Jan 13, 2011 at 10:55 AM, Lars Ellenberg
> <[email protected]> wrote:
> 
> > On Thu, Jan 13, 2011 at 10:17:40AM -0600, Igor Chudov wrote:
> > > Again, after about 3-4 days of running, heartbeat master process dies
> > > with SIGXCPU.
> > >
> > > I was fortunate to run strace -p on it, so I captured strace. It looks
> > > like boring, garden variety regular work, and then heartbeat dies with
> > > SIGXCPU. The output is a bit lengthy.
> > >
> > > Is there some way to turn OFF the timeout on CPU?
> >
> > heartbeat sources,
> >  heartbeat/heartbeat.c,
> >  look out for cl_cpu_limit_setpercent
> > which itself is defined in glue sources,
> >  glue/lib/clplumbing/cpulimits.c
> >
> > There the head comment block explains the intention of it:
> >  * This allows us to better catch runaway realtime processes that
> >  * might otherwise hang the whole system (if they're POSIX realtime
> >  * processes).
> >  *
> >  * We do this by getting a "lease" on CPU time, and then ....
> >
> > You could of course simply kill invocations of it.
> >
> > It would be interesting to know what heartbeat spends its cpu time on,
> > though, so maybe you can try to profile it?
> >
> > It should usually not consume that much cpu.
> >
> >
> Lars, in my uneducated opinion, the bug is in setting CPU limit incorrectly.
> 

If you read the cpulimits.c head comment completely,
you will see that "every now and then" the user of this feature
is supposed to call cl_cpu_limit_update(),
which will extend the limit.
And they all do, with high priority in their mainloop,
and I think at various other places as well.

Exceeding the limit anyway means that the process used up "too much"
CPU time without returning to the mainloop.
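
To illustrate the lease idea only (this is not the cpulimits.c code): a
minimal Python sketch, assuming the limit in question is the soft
RLIMIT_CPU set via setrlimit(2), which is what makes the kernel send
SIGXCPU. The names extend_cpu_lease() and main_loop() are made up for
this example, and the 5 second lease is an arbitrary value.

```python
import math
import resource


def cpu_seconds_used():
    """Return the CPU time (user + system) this process has consumed so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime


def extend_cpu_lease(lease_seconds):
    """Set the soft RLIMIT_CPU to 'CPU used so far + lease_seconds'.

    Hypothetical stand-in for what a call like cl_cpu_limit_update() is
    described as doing; the real implementation may differ.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_CPU)
    new_soft = math.ceil(cpu_seconds_used()) + lease_seconds
    if hard != resource.RLIM_INFINITY:
        # The soft limit may never exceed the hard limit.
        new_soft = min(new_soft, hard)
    resource.setrlimit(resource.RLIMIT_CPU, (new_soft, hard))
    return new_soft


def main_loop(work_items, lease_seconds=5):
    """Renew the lease once per iteration, then handle one unit of work."""
    for work in work_items:
        extend_cpu_lease(lease_seconds)
        # If work() burns more than lease_seconds of CPU before returning
        # here, the soft limit is hit and the kernel delivers SIGXCPU.
        work()
```

Note that the limit counts consumed CPU time, not wall clock time, so a
process that mostly sleeps can run for days and still be killed by one
single busy stretch that does not get back to the loop in time.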

> I did watch heartbeat a little bit with ps | grep and its CPU use is very
> low.

Well, while you were watching, it did not receive the SIGXCPU, either,
so there the limits have been good, right? ;-)

But go ahead, and increase those limits, and retry.

> I think that it is just a dumb, garden variety bug.

Entirely possible.
Would you track it down for us?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
