We recently had a split-brain incident in which a bonded interface did not
come up after a reboot. (A security guard pressed the emergency power-off
button in the data center, and as far as we can tell this fried the Intel
NICs connecting the two servers.)

As a result, I'm looking for better ways to audit that each of our (now)
redundant heartbeat links is in fact up. How is this best done? All I
really see in the heartbeat logs are memory stats:

Jun  1 22:01:04 odin heartbeat: [916]: info: MSG stats: 0/0 ms age 20643855300 [pid6231/HBREAD]
Jun  1 22:01:04 odin heartbeat: [916]: info: ha_malloc stats: 500/816  62512/31219 [pid6231/HBREAD]
Jun  1 22:01:04 odin heartbeat: [916]: info: RealMalloc stats: 63200 total malloc bytes. pid [6231/HBREAD]
Jun  1 22:01:04 odin heartbeat: [916]: info: These are nothing to worry about.

What I'd really like to see is the status of each link configured in ha.cf
as it goes up or down. Is something like this possible?

Jun  1 22:01:04 odin heartbeat: [916]: info: link eth0 172.16.0.1 down
Jun  1 22:01:04 odin heartbeat: [916]: info: link eth0 172.16.0.1 down 250 times
Jun  1 22:01:04 odin heartbeat: [916]: info: link eth0 172.16.0.1 up
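
Failing that, an external watcher could produce lines of roughly this shape.
Here is a minimal Python sketch of the idea, not anything heartbeat itself
provides. It assumes a Linux host where the kernel exposes link state in
/sys/class/net/<iface>/operstate. The function names (read_operstate,
transitions, format_line) are made up for illustration, and the message
format is simply the wished-for one above, not real heartbeat output.

```python
from pathlib import Path


def read_operstate(iface, sysfs_root="/sys/class/net"):
    """Return the kernel's operstate for iface ('up', 'down', ...).

    Returns 'missing' if the interface has no sysfs entry at all.
    """
    path = Path(sysfs_root) / iface / "operstate"
    try:
        return path.read_text().strip()
    except OSError:
        return "missing"


def transitions(previous, current):
    """Yield (iface, new_state) for each interface whose state changed.

    previous and current are dicts mapping interface name to state.
    An interface absent from previous counts as changed.
    """
    for iface, state in sorted(current.items()):
        if previous.get(iface) != state:
            yield iface, state


def format_line(iface, address, state):
    """Build a message in the hypothetical format from this post."""
    return f"info: link {iface} {address} {state}"
```

Polled from cron or a small loop, with read_operstate called once per
interface, this would only report the kernel's view of the link. It says
nothing about whether heartbeat packets are actually arriving from the peer.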

-----------------------------------------------------------------------------------------------------
logfacility     local2  # local0=syslog.core local2=/var/log/ha.log -BN
keepalive 1
deadtime 10
warntime 5
initdead 30
bcast   bond1
ucast eth0 172.16.0.1
ucast eth0 172.16.0.2
ucast eth2 172.16.0.3
ucast eth2 172.16.0.4
ucast bond0 10.100.2.101
ucast bond0 10.100.2.102
auto_failback off
node    thor
node    odin
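
For the audit itself, the list of links to check could be pulled straight
from a config like the one above. A rough Python sketch follows, assuming
only the bcast and ucast directive forms shown here. The function name
parse_links is hypothetical, and other media types (mcast, serial) are
simply ignored.

```python
def parse_links(ha_cf_text):
    """Return (kind, interface, address) tuples for bcast/ucast directives.

    address is None for bcast entries, which name only interfaces.
    """
    links = []
    for raw in ha_cf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if fields[0] == "bcast":
            # Treat every following field as an interface name.
            for iface in fields[1:]:
                links.append(("bcast", iface, None))
        elif fields[0] == "ucast" and len(fields) >= 3:
            links.append(("ucast", fields[1], fields[2]))
    return links
```

Fed the ha.cf above, this should yield one bcast entry for bond1 and six
ucast entries, which gives an audit script the set of interfaces (bond1,
eth0, eth2, bond0) it needs to watch.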


-- 
Bryce Nesbitt
The Berkeley Electronic Press
bepress: 10 years of accelerating and enhancing the flow of scholarly ideas
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
