Here's a new issue. We have had two outages, about 3 weeks apart, on one
of our Heartbeat/Pacemaker/DRBD two-node clusters. In both cases, this
was logged:

Apr 19 17:02:22 vmn2 kernel: block drbd0: PingAck did not arrive in time.
Apr 19 17:02:22 vmn2 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 19 17:02:22 vmn2 kernel: block drbd0: asender terminated
Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver terminated
Apr 19 17:02:22 vmn2 kernel: block drbd0: Restarting receiver thread
Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver (re)started
Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( Unconnected -> WFConnection )
Apr 19 17:02:27 vmn2 kernel: block drbd1: PingAck did not arrive in time.
Apr 19 17:02:27 vmn2 kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 19 17:02:27 vmn2 kernel: block drbd1: new current UUID 37CF642BD875CB67:901912BD41972B81:FC8B5D00E5B5988E:FC8A5D00E5B5988F
Apr 19 17:02:27 vmn2 kernel: block drbd1: asender terminated
Apr 19 17:02:27 vmn2 kernel: block drbd1: Terminating asender thread
Apr 19 17:02:27 vmn2 kernel: block drbd1: Connection closed
Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( NetworkFailure -> Unconnected )
Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver terminated
Apr 19 17:02:27 vmn2 kernel: block drbd1: Restarting receiver thread
Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver (re)started
Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( Unconnected -> WFConnection )

This looks like a long-winded way of saying that the DRBD devices went
offline because of a network failure. One time it was logged on one
node, and the other time on the other node, which would seem to rule
out any issue internal to a single node (such as bad memory).
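
For reference, my understanding is that "PingAck did not arrive in
time" means the peer did not answer DRBD's keep-alive ping within
ping-timeout. Those knobs live in the net section of drbd.conf; the
snippet below shows the stock defaults as I understand them (the
resource name r0 is made up, and these may not be the values we are
actually running):

  resource r0 {
    net {
      ping-int     10;  # seconds between keep-alive pings
      ping-timeout  5;  # wait for PingAck, in tenths of a second
    }
  }

With those defaults, a single ~500 ms hiccup on the replication link
is enough to trigger the state transitions above.
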
In both cases, nothing else is logged in any of the HA logs or
in /var/log/messages. Obviously, the VMs stop providing services, and
that is how the problem gets noticed (DNS server not responding,
etc.). It doesn't appear that Pacemaker or Heartbeat ever notices that
anything is wrong: nothing is logged after the above until the restart
messages from when I finally cycled the power via IPMI, almost half an
hour later.
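
For context, the DRBD resource is managed as a master/slave set with
monitor operations along the usual lines. The crm-shell sketch below
is the stock pattern from the DRBD documentation rather than our exact
configuration (names and intervals are illustrative):

  primitive p_drbd_r0 ocf:linbit:drbd \
      params drbd_resource="r0" \
      op monitor interval="15s" role="Master" \
      op monitor interval="30s" role="Slave"
  ms ms_drbd_r0 p_drbd_r0 \
      meta master-max="1" master-node-max="1" \
           clone-max="2" clone-node-max="1" notify="true"

I realize the drbd agent's monitor mostly checks the local role, and
the device stays Primary locally when the link drops, so perhaps that
is why no monitor failure shows up; I would still have expected
something in the logs, though.
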
The two nodes are connected by a crossover cable, and that is the link
used for DRBD replication. So it seems as though the only
possibilities are a flaky NIC or a flaky cable, but in that case,
wouldn't I see some sort of hardware error logged? Has anybody else
ever seen something like this?
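
In case it points at anything, this is roughly what I plan to run to
look for low-level trouble on the replication interface (eth1 here is
a placeholder for whichever NIC the crossover cable is on):

  ip -s link show eth1   # kernel's per-interface error/drop counters
  ethtool -S eth1        # driver-level stats, if the driver exposes them
  ethtool eth1           # link state and negotiated speed/duplex
  dmesg | grep -i eth1   # NIC-related kernel messages

My understanding is that a marginal cable can show up as CRC errors in
those counters without anything ever being written to syslog.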

Thanks,
--Greg


