Here's a new issue. We have had two outages, about 3 weeks apart, on one of our Heartbeat/Pacemaker/DRBD two-node clusters. In both cases, this was logged:
Apr 19 17:02:22 vmn2 kernel: block drbd0: PingAck did not arrive in time.
Apr 19 17:02:22 vmn2 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 19 17:02:22 vmn2 kernel: block drbd0: asender terminated
Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver terminated
Apr 19 17:02:22 vmn2 kernel: block drbd0: Restarting receiver thread
Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver (re)started
Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( Unconnected -> WFConnection )
Apr 19 17:02:27 vmn2 kernel: block drbd1: PingAck did not arrive in time.
Apr 19 17:02:27 vmn2 kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 19 17:02:27 vmn2 kernel: block drbd1: new current UUID 37CF642BD875CB67:901912BD41972B81:FC8B5D00E5B5988E:FC8A5D00E5B5988F
Apr 19 17:02:27 vmn2 kernel: block drbd1: asender terminated
Apr 19 17:02:27 vmn2 kernel: block drbd1: Terminating asender thread
Apr 19 17:02:27 vmn2 kernel: block drbd1: Connection closed
Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( NetworkFailure -> Unconnected )
Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver terminated
Apr 19 17:02:27 vmn2 kernel: block drbd1: Restarting receiver thread
Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver (re)started
Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( Unconnected -> WFConnection )

This looks like a long-winded way of saying that the DRBD devices went offline due to a network failure. One time this was logged on one node, and the other time it was logged on the other node, which would seem to rule out any issue internal to a single node (such as bad memory). In both cases, nothing else is logged in any of the HA logs or in /var/log/messages.
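For anyone following along: as I understand it, the "PingAck did not arrive in time" message means the peer failed to answer a DRBD keep-alive ping within the configured window, which is controlled by the net section of drbd.conf. A sketch with what I believe are the DRBD 8.x defaults (the resource name r0 is just a placeholder):

```
resource r0 {
  net {
    ping-int     10;  # seconds between keep-alive pings (default 10)
    ping-timeout  5;  # wait for PingAck, in tenths of a second (default 5 = 0.5s)
  }
}
```

So a half-second hiccup on the replication link is enough to trigger the state transitions above.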
Obviously, the VMs stop providing services, and that is how the problem is noticed (DNS server not responding, etc.). It doesn't appear that Pacemaker or Heartbeat ever notices that anything is wrong: nothing is logged after the above until the restart messages, when I finally cycle the power via IPMI (almost half an hour later).

The two nodes are connected by a crossover cable, and that is the link used for DRBD replication. So it seems the only remaining possibilities are a flaky NIC or a flaky cable, but in that case, wouldn't I see some sort of hardware error logged?

Has anybody else ever seen something like this?

Thanks,
--Greg
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
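P.S. In case it helps anyone reproducing this: a quick way to look for link-level trouble without waiting for the kernel to log a hardware error is to read the per-interface error counters under /sys/class/net. This is a generic sketch (interface names and available counters vary by kernel; carrier_changes only exists on newer kernels):

```shell
#!/bin/sh
# Print error counters for every network interface. Nonzero rx/tx errors,
# or a carrier_changes count that grows over time, would point at a bad
# NIC or cable on the replication link.
for dev in /sys/class/net/*; do
    name=$(basename "$dev")
    rx=$(cat "$dev/statistics/rx_errors")
    tx=$(cat "$dev/statistics/tx_errors")
    carrier=$(cat "$dev/carrier_changes" 2>/dev/null || echo "n/a")
    printf '%s: rx_errors=%s tx_errors=%s carrier_changes=%s\n' \
        "$name" "$rx" "$tx" "$carrier"
done
```

Running `ethtool -S <iface>` on the crossover interface should give driver-specific counters (CRC errors and the like) as well.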