Hi,

On Tue, Jan 27, 2009 at 11:52:15AM -0600, Dimitri Maziuk wrote:
> Hi all,
>
> I'm trying to figure out why heartbeat shut itself down all of a sudden.
> Here's the story: I have 2 Fedora 9 machines with heartbeat 2.1.3 --
> very basic config running apache, proftpd, and mon, with crossover
> cable on 2nd NIC (bcast eth1).

But you should use the other media too.

> This morning the powers that be replaced an upstream switch (not the
> switch these boxes are hooked up to) and both nodes went down. According
> to the logs (below) first node1 shut itself down and node2 took over. Two
> minutes later node2 shut itself down, too.

Not exactly helpful.

> I don't see why they did that: there's no reason logged (very helpful),
> neither eth1 nor eth0 were down. They couldn't ping the gateway, but
> there are no ping groups defined. What am I missing?

Looked through your logs: no idea what happened. It's as if
somebody typed rcheartbeat stop. What about the configuration?

Thanks,

Dejan

> TIA
> Dima
>
> On node1:
> ---------------
> >Jan 27 06:06:30 moray heartbeat: [9240]: info: Heartbeat shutdown in
> >progress. (9240)
> >Jan 27 06:06:30 moray heartbeat: [28030]: info: Giving up all HA resources.
> >Jan 27 06:06:30 moray ResourceManager[28043]: info: Releasing resource
> >group: moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
> >Jan 27 06:06:30 moray ResourceManager[28043]: info: Running
> >/etc/ha.d/resource.d/mon stop
> >Jan 27 06:06:30 moray ResourceManager[28043]: info: Running
> >/etc/init.d/httpd stop
> >Jan 27 06:06:30 moray ResourceManager[28043]: info: Running
> >/etc/init.d/proftpd stop
> >Jan 27 06:06:31 moray ResourceManager[28043]: info: Running
> >/etc/ha.d/resource.d/IPaddr 144.92.217.20 stop
> >Jan 27 06:06:31 moray IPaddr[28172]: INFO: ifconfig eth0:0 down
> >Jan 27 06:06:31 moray IPaddr[28155]: INFO: Success
> >Jan 27 06:06:31 moray heartbeat: [28030]: info: All HA resources
> >relinquished.
> >Jan 27 06:06:31 moray heartbeat: [9240]: WARN: 1 lost packet(s) for
> >[swordfish.bmrb.wisc.edu] [502959:502961]
> >Jan 27 06:06:31 moray heartbeat: [9240]: info: No pkts missing from
> >swordfish.bmrb.wisc.edu!
> >Jan 27 06:06:33 moray ntpd[2363]: Deleting interface #10 eth0:0,
> >144.92.217.20#123, interface stats: received=0, sent=0, dropped=0,
> >active_time=397344 secs
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: killing HBFIFO process 9242
> >with signal 15
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: killing HBWRITE process 9243
> >with signal 15
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: killing HBREAD process 9244
> >with signal 15
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: Core process 9244 exited. 3
> >remaining
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: Core process 9242 exited. 2
> >remaining
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: Core process 9243 exited. 1
> >remaining
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: moray.bmrb.wisc.edu Heartbeat
> >shutdown complete.
> ---------------
>
> On node2:
> ---------------
> >Jan 27 06:06:31 swordfish heartbeat: [2778]: info: Received shutdown notice
> >from 'moray.bmrb.wisc.edu'.
> >Jan 27 06:06:31 swordfish heartbeat: [5390]: info: acquire local HA
> >resources (standby).
> >Jan 27 06:06:31 swordfish heartbeat: [2778]: info: Resources being acquired
> >from moray.bmrb.wisc.edu.
> >Jan 27 06:06:31 swordfish heartbeat: [5390]: info: local HA resource
> >acquisition completed (standby).
> >Jan 27 06:06:31 swordfish heartbeat: [5392]: info: No local resources
> >[/usr/share/heartbeat/ResourceManager listkeys swordfish.bmrb.wisc.edu] to
> >acquire.
> >Jan 27 06:06:31 swordfish heartbeat: [2778]: info: Standby resource
> >acquisition done [foreign].
> >Jan 27 06:06:31 swordfish harc[5416]: info: Running /etc/ha.d/rc.d/status
> >status
> >Jan 27 06:06:31 swordfish mach_down[5432]: info: Taking over resource group
> >144.92.217.20
> >Jan 27 06:06:31 swordfish ResourceManager[5458]: info: Acquiring resource
> >group: moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
> >Jan 27 06:06:31 swordfish IPaddr[5485]: INFO: Resource is stopped
> >Jan 27 06:06:31 swordfish ResourceManager[5458]: info: Running
> >/etc/ha.d/resource.d/IPaddr 144.92.217.20 start
> >Jan 27 06:06:31 swordfish IPaddr[5561]: INFO: Using calculated nic for
> >144.92.217.20: eth0
> >Jan 27 06:06:31 swordfish IPaddr[5561]: INFO: Using calculated netmask for
> >144.92.217.20: 255.255.255.192
> >Jan 27 06:06:31 swordfish IPaddr[5561]: INFO: eval ifconfig eth0:0
> >144.92.217.20 netmask 255.255.255.192 broadcast 144.92.217.63
> >Jan 27 06:06:32 swordfish IPaddr[5544]: INFO: Success
> >Jan 27 06:06:32 swordfish ResourceManager[5458]: info: Running
> >/etc/init.d/proftpd start
> >Jan 27 06:06:42 swordfish heartbeat: [2778]: WARN: node moray.bmrb.wisc.edu:
> >is dead
> >Jan 27 06:06:42 swordfish heartbeat: [2778]: info: Dead node
> >moray.bmrb.wisc.edu gave up resources.
> >Jan 27 06:06:42 swordfish heartbeat: [2778]: info: Link
> >moray.bmrb.wisc.edu:eth1 dead.
> >Jan 27 06:07:12 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD
> >1.3.1 (stable) (built Fri Jan 2 09:14:23 EST 2009) standalone mode STARTUP
> >Jan 27 06:07:12 swordfish ResourceManager[5458]: info: Running
> >/etc/init.d/httpd start
> >Jan 27 06:07:13 swordfish ResourceManager[5458]: info: Running
> >/etc/ha.d/resource.d/mon start
> >Jan 27 06:07:13 swordfish mach_down[5432]: info:
> >/usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
> >Jan 27 06:07:13 swordfish mach_down[5432]: info: mach_down takeover complete
> >for node moray.bmrb.wisc.edu.
> ...
> >Jan 27 06:09:14 swordfish heartbeat: [2778]: info: Heartbeat shutdown in
> >progress. (2778)
> >Jan 27 06:09:14 swordfish heartbeat: [5843]: info: Giving up all HA
> >resources.
> >Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Releasing resource
> >group: moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
> >Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running
> >/etc/ha.d/resource.d/mon stop
> >Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running
> >/etc/init.d/httpd stop
> >Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running
> >/etc/init.d/proftpd stop
> >Jan 27 06:09:14 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD
> >killed (signal 15)
> >Jan 27 06:09:14 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD
> >1.3.1 standalone mode SHUTDOWN
> >Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running
> >/etc/ha.d/resource.d/IPaddr 144.92.217.20 stop
> >Jan 27 06:09:14 swordfish IPaddr[5981]: INFO: ifconfig eth0:0 down
> >Jan 27 06:09:14 swordfish IPaddr[5964]: INFO: Success
> >Jan 27 06:09:14 swordfish heartbeat: [5843]: info: All HA resources
> >relinquished.
> >Jan 27 06:09:16 swordfish ntpd[2383]: Deleting interface #9 eth0:0,
> >144.92.217.20#123, interface stats: received=0, sent=0, dropped=0,
> >active_time=163 secs
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBREAD process
> >2805 with signal 15
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBFIFO process
> >2803 with signal 15
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBWRITE process
> >2804 with signal 15
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2803 exited.
> >3 remaining
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2805 exited.
> >2 remaining
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2804 exited.
> >1 remaining
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: swordfish.bmrb.wisc.edu
> >Heartbeat shutdown complete.
> ---------------
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
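
For anyone retracing the sequence in the logs above, here is a minimal Python sketch that pulls the heartbeat shutdown events out of syslog and shows how far apart the two nodes went down. It is only an illustration: the function name `shutdown_timeline` and the `year` parameter are made up for this example, and it assumes the raw, unwrapped syslog lines (one event per line), not the mail-wrapped copies quoted in this thread. The four sample lines are the shutdown messages from the logs above, rejoined onto single lines.

```python
import re
from datetime import datetime

# Classic syslog line as written by heartbeat, e.g.
#   Jan 27 06:06:30 moray heartbeat: [9240]: info: Heartbeat shutdown in progress. (9240)
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+"
    r"heartbeat: \[\d+\]: \w+: (?P<msg>.*)$"
)


def shutdown_timeline(lines, year=2009):
    """Return sorted (timestamp, host, kind) tuples for heartbeat shutdown events.

    Syslog timestamps carry no year, so the caller supplies one (assumption).
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a heartbeat daemon line
        msg = m.group("msg")
        if "Heartbeat shutdown in progress" in msg:
            kind = "shutdown_start"
        elif "Heartbeat shutdown complete" in msg:
            kind = "shutdown_complete"
        else:
            continue
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        events.append((ts, m.group("host"), kind))
    return sorted(events)


# Shutdown lines taken from the logs in this thread, unwrapped.
sample = [
    "Jan 27 06:06:30 moray heartbeat: [9240]: info: Heartbeat shutdown in progress. (9240)",
    "Jan 27 06:06:33 moray heartbeat: [9240]: info: moray.bmrb.wisc.edu Heartbeat shutdown complete.",
    "Jan 27 06:09:14 swordfish heartbeat: [2778]: info: Heartbeat shutdown in progress. (2778)",
    "Jan 27 06:09:16 swordfish heartbeat: [2778]: info: swordfish.bmrb.wisc.edu Heartbeat shutdown complete.",
]

events = shutdown_timeline(sample)
for ts, host, kind in events:
    print(ts.isoformat(), host, kind)

# Seconds between the first and second "shutdown in progress" messages.
starts = [e for e in events if e[2] == "shutdown_start"]
gap = (starts[1][0] - starts[0][0]).total_seconds()
print("seconds between node shutdowns:", gap)
```

Run against the full logs from both nodes, the same timeline can be lined up with whatever else was logged around those timestamps (cron jobs, init scripts, network events), since heartbeat itself recorded no reason for the shutdown.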
