On Tue, Jan 4, 2011 at 10:22 AM, Serge Dubrouski <[email protected]> wrote:
> On Tue, Jan 4, 2011 at 9:14 AM, Igor Chudov <[email protected]> wrote:
>
> > Serge, I am not sure of anything, but the self-communication is
> > supposed to be taking place on a single crossover cable between the
> > second network cards of the servers (eth1).
>
> Agree, yet something strange and pretty unique is going on with your
> setup. Could you publish your ha.cf and the outputs of ifconfig eth1
> and netstat -in?
>

It happened again. This time all I know from the logs is that MCP died.

My first question, which I want answered regardless of anything else, is
how to enable core dumps and debug the crash.

My second question is: can heartbeat be configured to restart itself in
case of such a failure?

My version is 3.0.3.

Anyway, here is the conf file.

use_logd on
udpport 12694
keepalive 1
warntime 15
deadtime 20
debug 1
initdead 30
bcast eth1
node pfs-srv3
node pfs-srv4
auto_failback on
crm off

> > Igor
> >
> > On Tue, Jan 4, 2011 at 10:06 AM, Serge Dubrouski <[email protected]> wrote:
> >
> >> Are you sure that everything is all right with your network? It looks
> >> like processes that are responsible for UDP communications are taking
> >> too much of CPU time.
> >>
> >> On Tue, Jan 4, 2011 at 8:47 AM, Igor Chudov <[email protected]> wrote:
> >> > Steve, here's some data.
> >> >
> >> > The OS is Ubuntu 10.04.
> >> >
> >> > ~# apt-cache policy heartbeat
> >> > heartbeat:
> >> >   Installed: 1:3.0.3-1ubuntu1
> >> >   Candidate: 1:3.0.3-1ubuntu1
> >> >   Version table:
> >> >  *** 1:3.0.3-1ubuntu1 0
> >> >         500 http://us.archive.ubuntu.com/ubuntu/ lucid/universe Packages
> >> >         100 /var/lib/dpkg/status
> >> >
> >> > I agree that it should not use too much CPU, and I think that it
> >> > does not. But after a while it gets a SIGXCPU anyway.
> >> >
> >> > It also seems to die from something else.
> >> >
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 1228 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process 1228 dumped core
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died. Beginning communications restart process for comm channel 0.
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process 1227 killed by signal 9 [SIGKILL - Kill, unblockable].
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for channel 0 have died. Restarting.
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: Communications restart succeeded.
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 6729 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process 6729 dumped core
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died. Beginning communications restart process for comm channel 0.
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process 6728 killed by signal 9 [SIGKILL - Kill, unblockable].
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for channel 0 have died. Restarting.
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: Communications restart succeeded.
> >> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown: Master Control process died.
> >> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 1196 with SIGTERM
> >> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9866 with SIGTERM
> >> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9867 with SIGTERM
> >> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> >> >
> >> > i
> >> >
> >> > On Tue, Jan 4, 2011 at 9:33 AM, Steve Davies <[email protected]> wrote:
> >> >
> >> >> On 4 January 2011 13:47, Igor Chudov <[email protected]> wrote:
> >> >> > Further reading indicates that heartbeat itself sets a limit for
> >> >> > itself every so often.
> >> >> >
> >> >> > Then it exceeds the limit (probably due to a bug). I am sure
> >> >> > that's why whoever wrote heartbeat set a CPU limit instead of
> >> >> > fixing their bugs.
> >> >> >
> >> >> > Then it dies with SIGXCPU, leaving everything in an extremely
> >> >> > messy state, leading to split brain and destruction of shared
> >> >> > resources (DRBD data).
> >> >> >
> >> >> > I was trying to be a little patient. A little forgiving. I must
> >> >> > say that my patience is rapidly running out.
> >> >> >
> >> >> > I absolutely cannot use this "solution" as a basis of a high
> >> >> > reliability cluster, because it is the opposite of reliability.
> >> >> >
> >> >> > We had an old cluster that works very well with heartbeat V1.
> >> >> > But it is getting old, the disks are wearing out, the fans are
> >> >> > not getting newer, etc. I set up a new cluster in summer, but
> >> >> > never fully trusted it, and it looks like I will not be able to
> >> >> > trust it. We never completed a switchover.
> >> >> >
> >> >> > At this point I feel rather desperate. Perhaps I should give
> >> >> > "pacemaker" another go. I really have no idea and I am running
> >> >> > out of options.
> >> >> >
> >> >> > i
> >> >> >
> >> >> > On Tue, Jan 4, 2011 at 7:32 AM, Igor Chudov <[email protected]> wrote:
> >> >> >
> >> >> >> A few weeks ago I reported that heartbeat died on one of the
> >> >> >> cluster machines, due to SIGXCPU.
> >> >> >>
> >> >> >> Well, it happened again. Heartbeat died, now both machines had
> >> >> >> the shared IP address up, what a god awful mess!!!
> >> >> >>
> >> >> >> Now they have split brain and the whole nine yards!
> >> >> >>
> >> >> >> I looked at /proc/<heartbeat_pid>/limits and found:
> >> >> >>
> >> >> >> Limit           Soft Limit   Hard Limit   Units
> >> >> >> Max cpu time    43           unlimited    seconds
> >> >> >>
> >> >> >> So, this process somehow has a limit set for it.
> >> >> >>
> >> >> >> Does anyone have ANY clue who would set a limit for this
> >> >> >> process??? WTF? Does it do it for itself or what?
> >> >> >>
> >> >>
> >> >> I cannot answer your question, but I suspect it might be useful if
> >> >> you mentioned which version of heartbeat and what resource manager
> >> >> you are using. Perhaps provide a copy of your heartbeat
> >> >> configuration.
> >> >>
> >> >> Is heartbeat using too much CPU? It should be pretty much idle
> >> >> relative to the rest of the system. If not, it is worth finding out
> >> >> why not.
> >> >>
> >> >> Regards,
> >> >> Steve
> >>
> >> --
> >> Serge Dubrouski.
>
> --
> Serge Dubrouski.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
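
For the /proc/<heartbeat_pid>/limits check quoted in this thread, here is a minimal Python sketch that may help when comparing limits across the heartbeat processes. It only parses the table, and it assumes the usual layout in which columns are separated by two or more spaces. The names parse_limits, soft_limit, read_limits and SAMPLE are made up for illustration. The "Max cpu time" row in SAMPLE mirrors the values quoted above; the "Max core file size" row is a hypothetical value, not taken from these logs. A soft "Max core file size" of 0 would generally mean no core file is written for that process, which bears on the core dump question.

```python
import re
from pathlib import Path

# Illustrative input in the layout of /proc/<pid>/limits.
# The "Max cpu time" row mirrors the values quoted in this thread;
# the "Max core file size" row is a hypothetical value for the example.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              43                   unlimited            seconds
Max core file size        0                    unlimited            bytes
"""


def parse_limits(text):
    """Return {limit name: (soft, hard, units)} from /proc/<pid>/limits text."""
    limits = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue  # blank or unexpected line
        units = fields[3] if len(fields) > 3 else ""
        limits[fields[0]] = (fields[1], fields[2], units)
    return limits


def soft_limit(limits, name):
    """Return the soft limit as an int, or None if it is 'unlimited' or missing."""
    soft = limits.get(name, ("unlimited",))[0]
    return None if soft == "unlimited" else int(soft)


def read_limits(pid):
    """Read and parse the limits of a live process (Linux only)."""
    return parse_limits(Path(f"/proc/{pid}/limits").read_text())
```

Calling read_limits(pid) for each heartbeat pid and comparing soft_limit(limits, "Max cpu time") between them would show which of the processes actually carry the CPU limit.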
