Hi,

On Tue, Jan 27, 2009 at 11:52:15AM -0600, Dimitri Maziuk wrote:
> Hi all,
> 
> I'm trying to figure out why heartbeat shut itself down all of a sudden.
> Here's the story: I have 2 Fedora 9 machines with heartbeat 2.1.3 -- 
> very basic config running apache, proftpd, and mon, with crossover
> cable on 2nd NIC (bcast eth1).

You should run heartbeat over other media too (eth0, for instance), not
only the crossover link.
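
A rough sketch of what a second heartbeat path might look like in
/etc/ha.d/ha.cf, assuming eth0 can reach the peer. The peer address
below is a placeholder, not taken from your setup, and each node would
list the other node's real address:

```
# existing path: broadcast over the crossover cable
bcast eth1
# additional path (assumption): unicast to the peer over eth0
# 192.0.2.11 is a documentation placeholder for the peer's eth0 address
ucast eth0 192.0.2.11
```

With two paths, losing one link should not by itself make a node look
dead to its peer.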

> This morning the powers that be replaced an upstream switch (not the
> switch these boxes are hooked up to) and both nodes went down. According
> to the logs (below) first node1 shut itself down and node2 took over. Two
> minutes later node2 shut itself down, too.

Not exactly helpful.

> I don't see why they did that: there's no reason logged (very helpful),
> neither eth1 nor eth0 were down. They couldn't ping the gateway, but
> there are no ping groups defined. What am I missing?

I looked through your logs and have no idea what happened. It's as if
somebody typed rcheartbeat stop. What does your configuration look like?

Thanks,

Dejan

> TIA
> Dima
> 
> On node1:
> ---------------
> >Jan 27 06:06:30 moray heartbeat: [9240]: info: Heartbeat shutdown in 
> >progress. (9240)
> >Jan 27 06:06:30 moray heartbeat: [28030]: info: Giving up all HA resources.
> >Jan 27 06:06:30 moray ResourceManager[28043]: info: Releasing resource 
> >group: moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
> >Jan 27 06:06:30 moray ResourceManager[28043]: info: Running 
> >/etc/ha.d/resource.d/mon  stop
> >Jan 27 06:06:30 moray ResourceManager[28043]: info: Running 
> >/etc/init.d/httpd  stop
> >Jan 27 06:06:30 moray ResourceManager[28043]: info: Running 
> >/etc/init.d/proftpd  stop
> >Jan 27 06:06:31 moray ResourceManager[28043]: info: Running 
> >/etc/ha.d/resource.d/IPaddr 144.92.217.20 stop
> >Jan 27 06:06:31 moray IPaddr[28172]: INFO: ifconfig eth0:0 down
> >Jan 27 06:06:31 moray IPaddr[28155]: INFO:  Success
> >Jan 27 06:06:31 moray heartbeat: [28030]: info: All HA resources 
> >relinquished.
> >Jan 27 06:06:31 moray heartbeat: [9240]: WARN: 1 lost packet(s) for 
> >[swordfish.bmrb.wisc.edu] [502959:502961]
> >Jan 27 06:06:31 moray heartbeat: [9240]: info: No pkts missing from 
> >swordfish.bmrb.wisc.edu!
> >Jan 27 06:06:33 moray ntpd[2363]: Deleting interface #10 eth0:0, 
> >144.92.217.20#123, interface stats: received=0, sent=0, dropped=0, 
> >active_time=397344 secs
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: killing HBFIFO process 9242 
> >with signal 15
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: killing HBWRITE process 9243 
> >with signal 15
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: killing HBREAD process 9244 
> >with signal 15
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: Core process 9244 exited. 3 
> >remaining
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: Core process 9242 exited. 2 
> >remaining
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: Core process 9243 exited. 1 
> >remaining
> >Jan 27 06:06:33 moray heartbeat: [9240]: info: moray.bmrb.wisc.edu Heartbeat 
> >shutdown complete.
> ---------------
> 
> On node2:
> ---------------
> >Jan 27 06:06:31 swordfish heartbeat: [2778]: info: Received shutdown notice 
> >from 'moray.bmrb.wisc.edu'.
> >Jan 27 06:06:31 swordfish heartbeat: [5390]: info: acquire local HA 
> >resources (standby).
> >Jan 27 06:06:31 swordfish heartbeat: [2778]: info: Resources being acquired 
> >from moray.bmrb.wisc.edu.
> >Jan 27 06:06:31 swordfish heartbeat: [5390]: info: local HA resource 
> >acquisition completed (standby).
> >Jan 27 06:06:31 swordfish heartbeat: [5392]: info: No local resources 
> >[/usr/share/heartbeat/ResourceManager listkeys swordfish.bmrb.wisc.edu] to 
> >acquire.
> >Jan 27 06:06:31 swordfish heartbeat: [2778]: info: Standby resource 
> >acquisition done [foreign].
> >Jan 27 06:06:31 swordfish harc[5416]: info: Running /etc/ha.d/rc.d/status 
> >status
> >Jan 27 06:06:31 swordfish mach_down[5432]: info: Taking over resource group 
> >144.92.217.20
> >Jan 27 06:06:31 swordfish ResourceManager[5458]: info: Acquiring resource 
> >group: moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
> >Jan 27 06:06:31 swordfish IPaddr[5485]: INFO:  Resource is stopped
> >Jan 27 06:06:31 swordfish ResourceManager[5458]: info: Running 
> >/etc/ha.d/resource.d/IPaddr 144.92.217.20 start
> >Jan 27 06:06:31 swordfish IPaddr[5561]: INFO: Using calculated nic for 
> >144.92.217.20: eth0
> >Jan 27 06:06:31 swordfish IPaddr[5561]: INFO: Using calculated netmask for 
> >144.92.217.20: 255.255.255.192
> >Jan 27 06:06:31 swordfish IPaddr[5561]: INFO: eval ifconfig eth0:0 
> >144.92.217.20 netmask 255.255.255.192 broadcast 144.92.217.63
> >Jan 27 06:06:32 swordfish IPaddr[5544]: INFO:  Success
> >Jan 27 06:06:32 swordfish ResourceManager[5458]: info: Running 
> >/etc/init.d/proftpd  start
> >Jan 27 06:06:42 swordfish heartbeat: [2778]: WARN: node moray.bmrb.wisc.edu: 
> >is dead
> >Jan 27 06:06:42 swordfish heartbeat: [2778]: info: Dead node 
> >moray.bmrb.wisc.edu gave up resources.
> >Jan 27 06:06:42 swordfish heartbeat: [2778]: info: Link 
> >moray.bmrb.wisc.edu:eth1 dead.
> >Jan 27 06:07:12 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD 
> >1.3.1 (stable) (built Fri Jan 2 09:14:23 EST 2009) standalone mode STARTUP
> >Jan 27 06:07:12 swordfish ResourceManager[5458]: info: Running 
> >/etc/init.d/httpd  start
> >Jan 27 06:07:13 swordfish ResourceManager[5458]: info: Running 
> >/etc/ha.d/resource.d/mon  start
> >Jan 27 06:07:13 swordfish mach_down[5432]: info: 
> >/usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
> >Jan 27 06:07:13 swordfish mach_down[5432]: info: mach_down takeover complete 
> >for node moray.bmrb.wisc.edu.
> ...
> >Jan 27 06:09:14 swordfish heartbeat: [2778]: info: Heartbeat shutdown in 
> >progress. (2778)
> >Jan 27 06:09:14 swordfish heartbeat: [5843]: info: Giving up all HA 
> >resources.
> >Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Releasing resource 
> >group: moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
> >Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running 
> >/etc/ha.d/resource.d/mon  stop
> >Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running 
> >/etc/init.d/httpd  stop
> >Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running 
> >/etc/init.d/proftpd  stop
> >Jan 27 06:09:14 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD 
> >killed (signal 15)
> >Jan 27 06:09:14 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD 
> >1.3.1 standalone mode SHUTDOWN
> >Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running 
> >/etc/ha.d/resource.d/IPaddr 144.92.217.20 stop
> >Jan 27 06:09:14 swordfish IPaddr[5981]: INFO: ifconfig eth0:0 down
> >Jan 27 06:09:14 swordfish IPaddr[5964]: INFO:  Success
> >Jan 27 06:09:14 swordfish heartbeat: [5843]: info: All HA resources 
> >relinquished.
> >Jan 27 06:09:16 swordfish ntpd[2383]: Deleting interface #9 eth0:0, 
> >144.92.217.20#123, interface stats: received=0, sent=0, dropped=0, 
> >active_time=163 secs
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBREAD process 
> >2805 with signal 15
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBFIFO process 
> >2803 with signal 15
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBWRITE process 
> >2804 with signal 15
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2803 exited. 
> >3 remaining
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2805 exited. 
> >2 remaining
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2804 exited. 
> >1 remaining
> >Jan 27 06:09:16 swordfish heartbeat: [2778]: info: swordfish.bmrb.wisc.edu 
> >Heartbeat shutdown complete.
> ---------------
> -- 
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems