We recently had a split-brain incident in which a bonded interface did not come up after a reboot (a security guard pressed the emergency power off button in the data center... and as far as we can tell this fried the Intel NICs connecting the two servers).
As a result I'm looking for better ways to audit that each of our (now) redundant heartbeat links is in fact up. How is this best done? All I really see in the heartbeat logs are memory stats:

    Jun 1 22:01:04 odin heartbeat: [916]: info: MSG stats: 0/0 ms age 20643855300 [pid6231/HBREAD]
    Jun 1 22:01:04 odin heartbeat: [916]: info: ha_malloc stats: 500/816 62512/31219 [pid6231/HBREAD]
    Jun 1 22:01:04 odin heartbeat: [916]: info: RealMalloc stats: 63200 total malloc bytes. pid [6231/HBREAD]
    Jun 1 22:01:04 odin heartbeat: [916]: info: These are nothing to worry about.

What I'd really want to see is the status of each link configured in ha.cf as it goes up or down. Is something like this possible?

    Jun 1 22:01:04 odin heartbeat: [916]: info: link eth0 172.16.0.1 down
    Jun 1 22:01:04 odin heartbeat: [916]: info: link eth0 172.16.0.1 down 250 times
    Jun 1 22:01:04 odin heartbeat: [916]: info: link eth0 172.16.0.1 up

-----------------------------------------------------------------------------------------------------
logfacility local2    # local0=syslog.core local2=/var/log/ha.log -BN
keepalive 1
deadtime 10
warntime 5
initdead 30
bcast bond1
ucast eth0 172.16.0.1
ucast eth0 172.16.0.2
ucast eth2 172.16.0.3
ucast eth2 172.16.0.4
ucast bond0 10.100.2.101
ucast bond0 10.100.2.102
auto_failback off
node thor
node odin

-- 
Bryce Nesbitt
The Berkeley Electronic Press
bepress: 10 years of accelerating and enhancing the flow of scholarly ideas
