Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

Serge Dubrouski Tue, 04 Jan 2011 08:22:24 -0800

On Tue, Jan 4, 2011 at 9:14 AM, Igor Chudov <[email protected]> wrote:
> Serge, I am not sure of anything, but the heartbeat communication is
> supposed to be taking place over a single crossover cable between the
> second network cards of the servers (eth1).


Agreed, yet something strange and rather unusual is going on with your
setup. Could you post your ha.cf and the output of ifconfig eth1
and netstat -in?
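
As an aside on the limit Igor quotes further down: below is a minimal
sketch, assuming a Linux /proc filesystem, of reading the soft and hard
CPU-time limits of a running process. The names parse_limits and cpu_limit
are hypothetical helpers made up for illustration; they are not part of
heartbeat.

```python
import re


def parse_limits(text):
    """Parse the contents of /proc/<pid>/limits.

    Returns a dict mapping the limit name to (soft, hard, units).
    Values are kept as strings because they may be "unlimited".
    """
    limits = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        # Columns are padded with runs of spaces; names contain single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            name, soft, hard = fields[:3]
            units = fields[3] if len(fields) > 3 else ""
            limits[name] = (soft, hard, units)
    return limits


def cpu_limit(pid):
    """Return (soft, hard, units) for "Max cpu time" of the given pid, or None."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_limits(f.read()).get("Max cpu time")
```

Running cpu_limit() against the heartbeat pid on each node should give the
same values as the "Max cpu time" row in the quoted output, which makes it
easy to compare the two machines.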

>
> Igor
>
> On Tue, Jan 4, 2011 at 10:06 AM, Serge Dubrouski <[email protected]> wrote:
>
>> Are you sure that everything is all right with your network? It looks
>> like the processes responsible for UDP communication are taking
>> too much CPU time.
>>
>> On Tue, Jan 4, 2011 at 8:47 AM, Igor Chudov <[email protected]> wrote:
>> > Steve, here's some data.
>> >
>> > The OS is Ubuntu 10.04.
>> >
>> > ~# apt-cache policy heartbeat
>> > heartbeat:
>> >  Installed: 1:3.0.3-1ubuntu1
>> >  Candidate: 1:3.0.3-1ubuntu1
>> >  Version table:
>> >  *** 1:3.0.3-1ubuntu1 0
>> >        500 http://us.archive.ubuntu.com/ubuntu/ lucid/universe Packages
>> >        100 /var/lib/dpkg/status
>> >
>> > I agree that it should not use too much CPU, and I think that it does
>> > not. But after a while it gets a SIGXCPU anyway.
>> >
>> > It also seems to die from something else.
>> >
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process
>> > 1228 killed by signal 24 [SIGXCPU - CPU limit exceeded].
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process
>> > 1228 dumped core
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died.
>> >  Beginning communications restart process for comm channel 0.
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
>> > heartbeat closed on port 12694 interface eth1 - Status: 1
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process
>> > 1227 killed by signal 9 [SIGKILL - Kill, unblockable].
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes
>> > for channel 0 have died.  Restarting.
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
>> > heartbeat started on port 12694 (12694) interface eth1
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
>> > heartbeat closed on port 12694 interface eth1 - Status: 1
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: Communications restart
>> > succeeded.
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process
>> > 6729 killed by signal 24 [SIGXCPU - CPU limit exceeded].
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process
>> > 6729 dumped core
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died.
>> >  Beginning communications restart process for comm channel 0.
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
>> > heartbeat closed on port 12694 interface eth1 - Status: 1
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process
>> > 6728 killed by signal 9 [SIGKILL - Kill, unblockable].
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes
>> > for channel 0 have died.  Restarting.
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
>> > heartbeat started on port 12694 (12694) interface eth1
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
>> > heartbeat closed on port 12694 interface eth1 - Status: 1
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: Communications restart
>> > succeeded.
>> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown:
>> > Master Control process died.
>> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 1196 with
>> > SIGTERM
>> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9866 with
>> > SIGTERM
>> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9867 with
>> > SIGTERM
>> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown(MCP
>> > dead): Killing ourselves.
>> >
>> > i
>> >
>> > On Tue, Jan 4, 2011 at 9:33 AM, Steve Davies <[email protected]> wrote:
>> >
>> >> On 4 January 2011 13:47, Igor Chudov <[email protected]> wrote:
>> >> > Further reading indicates that heartbeat itself sets a CPU limit for
>> >> > itself every so often.
>> >> >
>> >> > Then it exceeds the limit (probably due to a bug). I am sure that is
>> >> > why whoever wrote heartbeat set a CPU limit, instead of fixing their
>> >> > bugs.
>> >> >
>> >> > Then it dies with SIGXCPU, leaving everything in an extremely messy
>> >> > state, leading to split brain and destruction of shared resources
>> >> > (DRBD data).
>> >> >
>> >> > I was trying to be a little patient. A little forgiving. I must say
>> >> > that my patience is rapidly running out.
>> >> >
>> >> > I absolutely cannot use this "solution" as the basis of a high
>> >> > reliability cluster, because it is the opposite of reliability.
>> >> >
>> >> > We have an old cluster that works very well with heartbeat V1. But it
>> >> > is getting old, the disks are wearing out, the fans are not getting
>> >> > newer, etc. I set up a new cluster in the summer, but never fully
>> >> > trusted it, and it looks like I will not be able to trust it. We never
>> >> > completed a switchover.
>> >> >
>> >> > At this point I feel rather desperate. Perhaps I should give
>> >> > "pacemaker" another go. I really have no idea and I am running out of
>> >> > options.
>> >> >
>> >> > i
>> >> >
>> >> > On Tue, Jan 4, 2011 at 7:32 AM, Igor Chudov <[email protected]> wrote:
>> >> >
>> >> >> A few weeks ago I reported that heartbeat died on one of the cluster
>> >> >> machines, due to SIGXCPU.
>> >> >>
>> >> >> Well, it happened again. Heartbeat died, and now both machines had
>> >> >> the shared IP address up, what a god awful mess!!!
>> >> >>
>> >> >> Now they have split brain and the whole nine yards!
>> >> >>
>> >> >> I looked at /proc/<heartbeat_pid>/limits and found:
>> >> >>
>> >> >> Limit                     Soft Limit           Hard Limit           Units
>> >> >> Max cpu time              43                   unlimited            seconds
>> >> >>
>> >> >>
>> >> >> So, this process somehow has a limit set for it.
>> >> >>
>> >> >> Does anyone have ANY clue who would set a limit for this process???
>> >> >> WTF? Does it do it for itself or what?
>> >> >>
>> >>
>> >> I cannot answer your question, but I suspect it might be useful if you
>> >> mentioned which version of heartbeat and what resource manager you are
>> >> using. Perhaps provide a copy of your heartbeat configuration.
>> >>
>> >> Is heartbeat using too much CPU? It should be pretty much idle
>> >> relative to the rest of the system - If not, it is worth finding out
>> >> why not.
>> >>
>> >> Regards,
>> >> Steve
>> >> _______________________________________________
>> >> Linux-HA mailing list
>> >> [email protected]
>> >> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> >> See also: http://linux-ha.org/ReportingProblems
>> >>
>>
>>
>>
>> --
>> Serge Dubrouski.



-- 
Serge Dubrouski.
