Re: [Linux-HA] Another crash of heartbeat with SIGXCPU -- now I have strace!

Igor Chudov Fri, 14 Jan 2011 08:03:38 -0800

By the way... I am now restarting heartbeat every day on both nodes, at
22:00 on one node and at 23:00 on the other. I hope that will help.


i


On Fri, Jan 14, 2011 at 7:50 AM, Igor Chudov <[email protected]> wrote:

>
>
> On Fri, Jan 14, 2011 at 3:50 AM, Lars Ellenberg <[email protected]> wrote:
>
>> On Thu, Jan 13, 2011 at 01:14:58PM -0600, Igor Chudov wrote:
>> > On Thu, Jan 13, 2011 at 10:55 AM, Lars Ellenberg
>> > <[email protected]> wrote:
>> >
>> > > On Thu, Jan 13, 2011 at 10:17:40AM -0600, Igor Chudov wrote:
>> > > > Again, after about 3-4 days of running, heartbeat master process
>> > > > dies with SIGXCPU.
>> > > >
>> > > > I was fortunate to run strace -p on it, so I captured strace. It
>> > > > looks like boring, garden variety regular work, and then heartbeat
>> > > > dies with SIGXCPU.
>> > > > The output is a bit lengthy.
>> > > >
>> > > > Is there some way to turn OFF the timeout on CPU?
>> > >
>> > > heartbeat sources,
>> > >  heartbeat/heartbeat.c,
>> > >  look out for cl_cpu_limit_setpercent
>> > > which itself is defined in glue sources,
>> > >  glue/lib/clplumbing/cpulimits.c
>> > >
>> > > There the head comment block explains the intention of it:
>> > >  * This allows us to better catch runaway realtime processes that
>> > >  * might otherwise hang the whole system (if they're POSIX realtime
>> > >  * processes).
>> > >  *
>> > >  * We do this by getting a "lease" on CPU time, and then ....
>> > >
>> > > You could of course simply kill the invocations of it.
>> > >
>> > > It would be interesting to know what heartbeat spends its cpu time on,
>> > > though, so maybe you can try to profile it?
>> > >
>> > > It should usually not consume that much cpu.
>> > >
>> > >
>> > Lars, in my uneducated opinion, the bug is in setting CPU limit
>> > incorrectly.
>> >
>>
>> If you read the cpulimits.c head comment completely,
>> you see that "every now and then" the user of this feature
>> is supposed to call cl_cpu_limit_update(),
>> which will extend the limit.
>> And they all do with high priority in their mainloop,
>> and I think at various other places as well.
>>
>> Exceeding the limit anyways means that it used up "too much" CPU time
>> without returning to the mainloop.
>>
>> Right, or, not setting the limit properly.
>
>
>> > I did watch heartbeat a little bit with ps | grep and its CPU use is
>> > very low.
>>
>> Well, while you were watching, it did not receive the SIGXCPU, either,
>> so there the limits have been good, right? ;-)
>>
>> Right
>
>
>> But go ahead, and increase those limits, and retry.
>>
>> > I think that it is just a dumb, garden variety bug.
>>
>> Entirely possible.
>> Would you track it down for us?
>>
>>
> I would like to, yes, in fact, I have no choice.
>
> I am a computer programmer and troubleshooting is what I am paid to do.
>
> Will be glad to.
>
> i
>
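
For readers following the thread: the "lease" idea Lars quotes from the
cpulimits.c head comment can be sketched with plain POSIX calls. This is
only an illustration under assumptions, not the cluster-glue code. The
helper names cpu_seconds_used() and renew_cpu_lease() are hypothetical
and do not exist in heartbeat or glue; the real entry points mentioned
above are cl_cpu_limit_setpercent() and cl_cpu_limit_update(). The sketch
moves the soft RLIMIT_CPU to "CPU time used so far plus an allowance".
If the process burns through the allowance without renewing it, the
kernel delivers SIGXCPU, whose default action terminates the process.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper: whole CPU seconds (user + system) used so far.
 * Returns -1 if getrusage() fails. */
static long cpu_seconds_used(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return (long)(ru.ru_utime.tv_sec + ru.ru_stime.tv_sec);
}

/* Hypothetical helper: renew the "lease" by setting the soft CPU limit
 * to (used so far + lease_seconds). The hard limit is left untouched.
 * Returns 0 on success, -1 on failure. */
static int renew_cpu_lease(long lease_seconds)
{
    struct rlimit rl;
    rlim_t want;
    long used = cpu_seconds_used();

    if (used < 0 || getrlimit(RLIMIT_CPU, &rl) != 0)
        return -1;

    /* +1 because the fractional second already consumed was dropped. */
    want = (rlim_t)(used + lease_seconds + 1);

    /* An unprivileged process cannot raise the soft limit above the
     * hard limit, so clamp to it. */
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;

    rl.rlim_cur = want;
    return setrlimit(RLIMIT_CPU, &rl);
}
```

A main loop would call renew_cpu_lease() once per iteration. Under this
scheme a SIGXCPU means one of two things, matching the two readings in
the thread: either the process really used more than the allowance
between two renewals, or the renewal was computed or called incorrectly.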
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
