Hi all, I'm trying to figure out why heartbeat shut itself down all of a sudden. Here's the story: I have 2 Fedora 9 machines with heartbeat 2.1.3 -- very basic config running apache, proftpd, and mon, with crossover cable on 2nd NIC (bcast eth1).
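
For reference, the setup boils down to roughly the following. This is a sketch reconstructed from the description above and from the resource group line in the logs below, not a verbatim copy of the files; timing directives (keepalive, deadtime, etc.) are left out rather than guessed.

---------------
# /etc/ha.d/ha.cf (sketch)
bcast eth1
node moray.bmrb.wisc.edu
node swordfish.bmrb.wisc.edu

# /etc/ha.d/haresources (sketch)
moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
---------------
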
This morning the powers that be replaced an upstream switch (not the switch these boxes are hooked up to) and both nodes went down. According to the logs (below), node1 shut itself down first and node2 took over. Two minutes later node2 shut itself down, too.

I don't see why they did that: there's no reason logged (very helpful), and neither eth1 nor eth0 was down. They couldn't ping the gateway, but there are no ping groups defined. What am I missing?

TIA
Dima

On node1:
---------------
>Jan 27 06:06:30 moray heartbeat: [9240]: info: Heartbeat shutdown in progress. (9240)
>Jan 27 06:06:30 moray heartbeat: [28030]: info: Giving up all HA resources.
>Jan 27 06:06:30 moray ResourceManager[28043]: info: Releasing resource group: moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
>Jan 27 06:06:30 moray ResourceManager[28043]: info: Running /etc/ha.d/resource.d/mon stop
>Jan 27 06:06:30 moray ResourceManager[28043]: info: Running /etc/init.d/httpd stop
>Jan 27 06:06:30 moray ResourceManager[28043]: info: Running /etc/init.d/proftpd stop
>Jan 27 06:06:31 moray ResourceManager[28043]: info: Running /etc/ha.d/resource.d/IPaddr 144.92.217.20 stop
>Jan 27 06:06:31 moray IPaddr[28172]: INFO: ifconfig eth0:0 down
>Jan 27 06:06:31 moray IPaddr[28155]: INFO: Success
>Jan 27 06:06:31 moray heartbeat: [28030]: info: All HA resources relinquished.
>Jan 27 06:06:31 moray heartbeat: [9240]: WARN: 1 lost packet(s) for [swordfish.bmrb.wisc.edu] [502959:502961]
>Jan 27 06:06:31 moray heartbeat: [9240]: info: No pkts missing from swordfish.bmrb.wisc.edu!
>Jan 27 06:06:33 moray ntpd[2363]: Deleting interface #10 eth0:0, 144.92.217.20#123, interface stats: received=0, sent=0, dropped=0, active_time=397344 secs
>Jan 27 06:06:33 moray heartbeat: [9240]: info: killing HBFIFO process 9242 with signal 15
>Jan 27 06:06:33 moray heartbeat: [9240]: info: killing HBWRITE process 9243 with signal 15
>Jan 27 06:06:33 moray heartbeat: [9240]: info: killing HBREAD process 9244 with signal 15
>Jan 27 06:06:33 moray heartbeat: [9240]: info: Core process 9244 exited. 3 remaining
>Jan 27 06:06:33 moray heartbeat: [9240]: info: Core process 9242 exited. 2 remaining
>Jan 27 06:06:33 moray heartbeat: [9240]: info: Core process 9243 exited. 1 remaining
>Jan 27 06:06:33 moray heartbeat: [9240]: info: moray.bmrb.wisc.edu Heartbeat shutdown complete.
---------------

On node2:
---------------
>Jan 27 06:06:31 swordfish heartbeat: [2778]: info: Received shutdown notice from 'moray.bmrb.wisc.edu'.
>Jan 27 06:06:31 swordfish heartbeat: [5390]: info: acquire local HA resources (standby).
>Jan 27 06:06:31 swordfish heartbeat: [2778]: info: Resources being acquired from moray.bmrb.wisc.edu.
>Jan 27 06:06:31 swordfish heartbeat: [5390]: info: local HA resource acquisition completed (standby).
>Jan 27 06:06:31 swordfish heartbeat: [5392]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys swordfish.bmrb.wisc.edu] to acquire.
>Jan 27 06:06:31 swordfish heartbeat: [2778]: info: Standby resource acquisition done [foreign].
>Jan 27 06:06:31 swordfish harc[5416]: info: Running /etc/ha.d/rc.d/status status
>Jan 27 06:06:31 swordfish mach_down[5432]: info: Taking over resource group 144.92.217.20
>Jan 27 06:06:31 swordfish ResourceManager[5458]: info: Acquiring resource group: moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
>Jan 27 06:06:31 swordfish IPaddr[5485]: INFO: Resource is stopped
>Jan 27 06:06:31 swordfish ResourceManager[5458]: info: Running /etc/ha.d/resource.d/IPaddr 144.92.217.20 start
>Jan 27 06:06:31 swordfish IPaddr[5561]: INFO: Using calculated nic for 144.92.217.20: eth0
>Jan 27 06:06:31 swordfish IPaddr[5561]: INFO: Using calculated netmask for 144.92.217.20: 255.255.255.192
>Jan 27 06:06:31 swordfish IPaddr[5561]: INFO: eval ifconfig eth0:0 144.92.217.20 netmask 255.255.255.192 broadcast 144.92.217.63
>Jan 27 06:06:32 swordfish IPaddr[5544]: INFO: Success
>Jan 27 06:06:32 swordfish ResourceManager[5458]: info: Running /etc/init.d/proftpd start
>Jan 27 06:06:42 swordfish heartbeat: [2778]: WARN: node moray.bmrb.wisc.edu: is dead
>Jan 27 06:06:42 swordfish heartbeat: [2778]: info: Dead node moray.bmrb.wisc.edu gave up resources.
>Jan 27 06:06:42 swordfish heartbeat: [2778]: info: Link moray.bmrb.wisc.edu:eth1 dead.
>Jan 27 06:07:12 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD 1.3.1 (stable) (built Fri Jan 2 09:14:23 EST 2009) standalone mode STARTUP
>Jan 27 06:07:12 swordfish ResourceManager[5458]: info: Running /etc/init.d/httpd start
>Jan 27 06:07:13 swordfish ResourceManager[5458]: info: Running /etc/ha.d/resource.d/mon start
>Jan 27 06:07:13 swordfish mach_down[5432]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
>Jan 27 06:07:13 swordfish mach_down[5432]: info: mach_down takeover complete for node moray.bmrb.wisc.edu.
...
>Jan 27 06:09:14 swordfish heartbeat: [2778]: info: Heartbeat shutdown in progress. (2778)
>Jan 27 06:09:14 swordfish heartbeat: [5843]: info: Giving up all HA resources.
>Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Releasing resource group: moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
>Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running /etc/ha.d/resource.d/mon stop
>Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running /etc/init.d/httpd stop
>Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running /etc/init.d/proftpd stop
>Jan 27 06:09:14 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD killed (signal 15)
>Jan 27 06:09:14 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD 1.3.1 standalone mode SHUTDOWN
>Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running /etc/ha.d/resource.d/IPaddr 144.92.217.20 stop
>Jan 27 06:09:14 swordfish IPaddr[5981]: INFO: ifconfig eth0:0 down
>Jan 27 06:09:14 swordfish IPaddr[5964]: INFO: Success
>Jan 27 06:09:14 swordfish heartbeat: [5843]: info: All HA resources relinquished.
>Jan 27 06:09:16 swordfish ntpd[2383]: Deleting interface #9 eth0:0, 144.92.217.20#123, interface stats: received=0, sent=0, dropped=0, active_time=163 secs
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBREAD process 2805 with signal 15
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBFIFO process 2803 with signal 15
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBWRITE process 2804 with signal 15
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2803 exited. 3 remaining
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2805 exited. 2 remaining
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2804 exited. 1 remaining
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: swordfish.bmrb.wisc.edu Heartbeat shutdown complete.
---------------

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
