Serge, I am not sure of anything, but the self-communication is supposed to take place over a single crossover cable between the second network cards (eth1) of the servers.
Igor

On Tue, Jan 4, 2011 at 10:06 AM, Serge Dubrouski <[email protected]> wrote:

> Are you sure that everything is all right with your network? It looks
> like processes that are responsible for UDP communications are taking
> too much CPU time.
>
> On Tue, Jan 4, 2011 at 8:47 AM, Igor Chudov <[email protected]> wrote:
>
> > Steve, here's some data.
> >
> > The OS is Ubuntu 10.04.
> >
> > ~# apt-cache policy heartbeat
> > heartbeat:
> >   Installed: 1:3.0.3-1ubuntu1
> >   Candidate: 1:3.0.3-1ubuntu1
> >   Version table:
> >  *** 1:3.0.3-1ubuntu1 0
> >         500 http://us.archive.ubuntu.com/ubuntu/ lucid/universe Packages
> >         100 /var/lib/dpkg/status
> >
> > I agree that it should not use too much CPU, and I think that it does not.
> > But after a while it gets a SIGXCPU anyway.
> >
> > It also seems to die from something else.
> >
> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 1228 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process 1228 dumped core
> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died. Beginning communications restart process for comm channel 0.
> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process 1227 killed by signal 9 [SIGKILL - Kill, unblockable].
> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for channel 0 have died. Restarting.
> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: Communications restart succeeded.
> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 6729 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process 6729 dumped core
> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died. Beginning communications restart process for comm channel 0.
> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process 6728 killed by signal 9 [SIGKILL - Kill, unblockable].
> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for channel 0 have died. Restarting.
> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: Communications restart succeeded.
> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown: Master Control process died.
> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 1196 with SIGTERM
> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9866 with SIGTERM
> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9867 with SIGTERM
> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> >
> > i
> >
> > On Tue, Jan 4, 2011 at 9:33 AM, Steve Davies <[email protected]> wrote:
> >
> >> On 4 January 2011 13:47, Igor Chudov <[email protected]> wrote:
> >> > Further reading indicates that heartbeat itself sets a limit for itself
> >> > every so often.
> >> >
> >> > Then it exceeds the limit (probably due to a bug).
> >> > I am sure that's why whoever wrote heartbeat set a CPU limit
> >> > instead of fixing their bugs.
> >> >
> >> > Then it dies with SIGXCPU, leaving everything in an extremely messy
> >> > state, leading to split brain and destruction of shared resources
> >> > (DRBD data).
> >> >
> >> > I was trying to be a little patient. A little forgiving. I must say
> >> > that my patience is rapidly running out.
> >> >
> >> > I absolutely cannot use this "solution" as a basis of a high
> >> > reliability cluster, because it is the opposite of reliability.
> >> >
> >> > We had an old cluster that works very well with heartbeat V1. But it
> >> > is getting old, the disks are wearing out, the fans are not getting
> >> > newer, etc. I set up a new cluster in summer, but never fully trusted
> >> > it, and it looks like I will not be able to trust it. We never
> >> > completed a switchover.
> >> >
> >> > At this point I feel rather desperate. Perhaps I should give
> >> > "pacemaker" another go. I really have no idea and I am running out of
> >> > options.
> >> >
> >> > i
> >> >
> >> > On Tue, Jan 4, 2011 at 7:32 AM, Igor Chudov <[email protected]> wrote:
> >> >
> >> >> A few weeks ago I reported that heartbeat died on one of the cluster
> >> >> machines, due to SIGXCPU.
> >> >>
> >> >> Well, it happened again. Heartbeat died, now both machines had the
> >> >> shared IP address up, what a god awful mess!!!
> >> >>
> >> >> Now they have split brain and the whole nine yards!
> >> >>
> >> >> I looked at /proc/<heartbeat_pid>/limits and found:
> >> >>
> >> >> Limit                     Soft Limit           Hard Limit           Units
> >> >> Max cpu time              43                   unlimited            seconds
> >> >>
> >> >> So, this process somehow has a limit set for it.
> >> >>
> >> >> Does anyone have ANY clue who would set a limit for this process???
> >> >> WTF? Does it do it for itself or what?
> >> >>
> >>
> >> I cannot answer your question, but I suspect it might be useful if you
> >> mentioned which version of heartbeat and what resource manager you are
> >> using. Perhaps provide a copy of your heartbeat configuration.
> >>
> >> Is heartbeat using too much CPU? It should be pretty much idle
> >> relative to the rest of the system - If not, it is worth finding out
> >> why not.
> >>
> >> Regards,
> >> Steve
> >> _______________________________________________
> >> Linux-HA mailing list
> >> [email protected]
> >> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >> See also: http://linux-ha.org/ReportingProblems
> >>
>
> --
> Serge Dubrouski.
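
P.S. For anyone who wants to watch the limit on their own nodes, here is a minimal Python sketch, assuming a Linux /proc with the same "Max cpu time" row layout as the output quoted in this thread. It only reads the soft and hard CPU limits of a process; it does not say who set them. The names parse_cpu_limit, read_cpu_limit and SAMPLE are made up for illustration and are not part of heartbeat. On Linux, a process that uses more CPU seconds than its soft limit is sent SIGXCPU, which matches the log entries above.

```python
# Sketch only: parse the "Max cpu time" row of /proc/<pid>/limits.
# Function names and SAMPLE are hypothetical, not from heartbeat.

SAMPLE = """Limit                     Soft Limit           Hard Limit           Units
Max cpu time              43                   unlimited            seconds
"""


def parse_cpu_limit(limits_text):
    """Return (soft, hard) CPU limits in seconds; None means unlimited."""
    label = "Max cpu time"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # The label contains spaces, so strip it before splitting
            # the remaining columns on whitespace.
            fields = line[len(label):].split()
            soft, hard = fields[0], fields[1]
            return tuple(None if v == "unlimited" else int(v)
                         for v in (soft, hard))
    raise ValueError("no 'Max cpu time' row found")


def read_cpu_limit(pid):
    """Read the CPU limits of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as f:
        return parse_cpu_limit(f.read())


if __name__ == "__main__":
    print(parse_cpu_limit(SAMPLE))
```

Running read_cpu_limit() against the heartbeat pid from cron every few minutes and logging the result would at least show whether the soft limit changes over the life of the process.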
