Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

Serge Dubrouski Tue, 04 Jan 2011 08:06:25 -0800

Are you sure that everything is all right with your network? It looks
like the processes responsible for UDP communication are taking
too much CPU time.
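
One way to check this, as a rough sketch rather than anything heartbeat ships: compare the CPU time a comm process has already used (utime + stime from /proc/<pid>/stat) against the soft "Max cpu time" limit in /proc/<pid>/limits. The function names below are made up for illustration, and the code assumes a Linux /proc layout.

```python
import os
from pathlib import Path
from typing import Optional, Tuple


def parse_cpu_limit(limits_text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (soft, hard) 'Max cpu time' in seconds; None means unlimited."""
    label = "Max cpu time"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max cpu time' row found")


def cpu_seconds_used(stat_text: str, ticks_per_second: int) -> float:
    """Return user + system CPU seconds from the text of /proc/<pid>/stat."""
    # The command name (field 2) may contain spaces, so split after its ')'.
    fields = stat_text.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # stat fields 14 and 15
    return (utime + stime) / ticks_per_second


def report(pid: int) -> str:
    """Summarise CPU time used versus the soft limit for one process."""
    soft, _hard = parse_cpu_limit(Path(f"/proc/{pid}/limits").read_text())
    used = cpu_seconds_used(Path(f"/proc/{pid}/stat").read_text(),
                            os.sysconf("SC_CLK_TCK"))
    limit = "unlimited" if soft is None else f"{soft}s"
    return f"pid {pid}: used {used:.1f}s of CPU, soft limit {limit}"
```

Running report() on the HBREAD and HBWRITE pids a few times, some minutes apart, should show whether their CPU time is really climbing toward the limit or staying flat.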

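For reference, the kernel mechanism behind the "signal 24" lines quoted below can be reproduced in isolation. This is only a sketch of the generic Linux behaviour (a process that exceeds its soft RLIMIT_CPU receives SIGXCPU); it says nothing about where or why heartbeat sets its own limit, and the function name is hypothetical.

```python
import os
import resource
import signal


def signal_after_cpu_limit(soft_seconds: int) -> int:
    """Fork a child, cap its CPU time, let it spin; return the signal that ended it (0 if none)."""
    pid = os.fork()
    if pid == 0:
        try:
            # SIGXCPU dumps core by default; suppress the core file for this demo.
            resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
            # Lower only the soft limit; the hard limit stays as inherited.
            _, hard = resource.getrlimit(resource.RLIMIT_CPU)
            resource.setrlimit(resource.RLIMIT_CPU, (soft_seconds, hard))
            while True:  # burn CPU until the kernel sends SIGXCPU
                pass
        finally:
            os._exit(1)  # only reached if setting the limits failed
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0


if __name__ == "__main__":
    sig = signal_after_cpu_limit(1)
    print("child ended by signal", sig, "(SIGXCPU is", int(signal.SIGXCPU), ")")
```

The limit counts CPU seconds consumed, not wall-clock time, so a mostly idle process should take a very long time to reach even a small limit.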

On Tue, Jan 4, 2011 at 8:47 AM, Igor Chudov <[email protected]> wrote:
> Steve, here's some data.
>
> The OS is Ubuntu 10.04.
>
> ~# apt-cache policy heartbeat
> heartbeat:
>  Installed: 1:3.0.3-1ubuntu1
>  Candidate: 1:3.0.3-1ubuntu1
>  Version table:
>  *** 1:3.0.3-1ubuntu1 0
>        500 http://us.archive.ubuntu.com/ubuntu/ lucid/universe Packages
>        100 /var/lib/dpkg/status
>
> I agree that it should not use too much CPU, and I think that it does not.
> But after a while it gets a SIGXCPU anyway.
>
> It also seems to die from something else.
>
> Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 1228 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process 1228 dumped core
> Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died.  Beginning communications restart process for comm channel 0.
> Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process 1227 killed by signal 9 [SIGKILL - Kill, unblockable].
> Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for channel 0 have died.  Restarting.
> Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
> Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: Communications restart succeeded.
> Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 6729 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process 6729 dumped core
> Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died.  Beginning communications restart process for comm channel 0.
> Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process 6728 killed by signal 9 [SIGKILL - Kill, unblockable].
> Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for channel 0 have died.  Restarting.
> Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
> Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: Communications restart succeeded.
> Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown: Master Control process died.
> Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 1196 with SIGTERM
> Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9866 with SIGTERM
> Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9867 with SIGTERM
> Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
>
> i
>
> On Tue, Jan 4, 2011 at 9:33 AM, Steve Davies <[email protected]> wrote:
>
>> On 4 January 2011 13:47, Igor Chudov <[email protected]> wrote:
>> > Further reading indicates that heartbeat itself sets a limit for itself
>> > every so often.
>> >
>> > Then it exceeds the limit (probably due to a bug). I am sure that that's
>> > why whoever wrote heartbeat set a CPU limit, instead of fixing their bugs.
>> >
>> > Then it dies with SIGXCPU, leaving everything in an extremely messy state,
>> > leading to split brain, destruction of shared resources (DRBD data).
>> >
>> > I was trying to be a little patient. A little forgiving. I must say that
>> > my patience is rapidly running out.
>> >
>> > I absolutely cannot use this "solution" as a basis of a high reliability
>> > cluster, because it is the opposite of reliability.
>> >
>> > We had an old cluster that works very well with heartbeat V1. But it is
>> > getting old, the disks are wearing out, the fans are not getting newer, etc.
>> > I set up a new cluster in summer, but never fully trusted it, and it looks
>> > like I will not be able to trust it. We never completed a switchover.
>> >
>> > At this point I feel rather desperate. Perhaps I should give "pacemaker"
>> > another go. I really have no idea and I am running out of options.
>> >
>> > i
>> >
>> > On Tue, Jan 4, 2011 at 7:32 AM, Igor Chudov <[email protected]> wrote:
>> >
>> >> A few weeks ago I reported that heartbeat died on one of the cluster
>> >> machines, due to SIGXCPU.
>> >>
>> >> Well, it happened again. Heartbeat died, now both machines had the shared
>> >> IP address up, what a god awful mess!!!
>> >>
>> >> Now they have split brain and the whole nine yards!
>> >>
>> >> I looked at /proc/<heartbeat_pid>/limits and found:
>> >>
>> >> Limit                     Soft Limit           Hard Limit           Units
>> >> Max cpu time              43                   unlimited            seconds
>> >>
>> >>
>> >> So, this process somehow has a limit set for it.
>> >>
>> >> Does anyone have ANY clue who would set a limit for this process??? WTF?
>> >> Does it do it for itself or what?
>> >>
>>
>> I cannot answer your question, but I suspect it might be useful if you
>> mentioned which version of heartbeat and what resource manager you are
>> using. Perhaps provide a copy of your heartbeat configuration.
>>
>> Is heartbeat using too much CPU? It should be pretty much idle
>> relative to the rest of the system - If not, it is worth finding out
>> why not.
>>
>> Regards,
>> Steve
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>>
>



-- 
Serge Dubrouski.