>>> Greg Woods <wo...@ucar.edu> wrote on 23.04.2013 at 21:20 in message
<1366744806.4475.56.ca...@mongoliad.scd.ucar.edu>:
> Here's a new issue. We have had two outages, about 3 weeks apart, on one
> of our Heartbeat/Pacemaker/DRBD two-node clusters. In both cases, this
> was logged:
>
> Apr 19 17:02:22 vmn2 kernel: block drbd0: PingAck did not arrive in time.
> Apr 19 17:02:22 vmn2 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr 19 17:02:22 vmn2 kernel: block drbd0: asender terminated
> Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
> Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
> Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
> Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver terminated
> Apr 19 17:02:22 vmn2 kernel: block drbd0: Restarting receiver thread
> Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver (re)started
> Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( Unconnected -> WFConnection )
> Apr 19 17:02:27 vmn2 kernel: block drbd1: PingAck did not arrive in time.
> Apr 19 17:02:27 vmn2 kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr 19 17:02:27 vmn2 kernel: block drbd1: new current UUID 37CF642BD875CB67:901912BD41972B81:FC8B5D00E5B5988E:FC8A5D00E5B5988F
> Apr 19 17:02:27 vmn2 kernel: block drbd1: asender terminated
> Apr 19 17:02:27 vmn2 kernel: block drbd1: Terminating asender thread
> Apr 19 17:02:27 vmn2 kernel: block drbd1: Connection closed
> Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( NetworkFailure -> Unconnected )
> Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver terminated
> Apr 19 17:02:27 vmn2 kernel: block drbd1: Restarting receiver thread
> Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver (re)started
> Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( Unconnected -> WFConnection )
>
> This looks like a long-winded way of saying that the DRBD devices went
> offline due to a network failure. One time this was logged on one node,
> and the other time it was logged on the other node, so that would seem
> to rule out any issue internal to one node (such as bad memory). In both
> cases, nothing else is logged in any of the HA logs or the
> /var/log/messages file. Obviously, the VMs stop providing services and
> this is how the problem is noticed (DNS server not responding, etc.). It
> doesn't appear that Pacemaker or Heartbeat ever even notices that
> anything is wrong, since nothing is logged after the above until the
> restart messages when I finally cycle the power via IPMI (which was
> almost half an hour later). The two nodes are connected by a crossover
> cable, and that is the link used for DRBD replication. So it seems as
> though the only possibilities are a flaky NIC or a flaky cable, but in
> that case, wouldn't I see some sort of hardware error logged? Anybody
> else ever seen something like this?
>
> Thanks,
> --Greg
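For context: "PingAck did not arrive in time" is DRBD's own keep-alive
giving up. The peer did not answer a ping within the configured timeout,
so DRBD tears down the replication link and retries, which is exactly the
Connected -> NetworkFailure -> WFConnection sequence in your log. The
relevant knobs live in the net section of drbd.conf; a minimal sketch for
orientation only (the resource name "r0" is a placeholder, and the values
shown are the documented 8.3 defaults, so check the man page for your
version):

  resource r0 {
    net {
      ping-int     10;  # send a keep-alive ping every 10 seconds
      ping-timeout  5;  # wait for the PingAck, in TENTHS of a second (500 ms)
      timeout      60;  # generic reply timeout, also in tenths (6 s)
      ko-count      0;  # 0 = never expel a peer that stalls on writes
    }
  }

With the stock 500 ms ping-timeout, even a brief stall on the crossover
link is enough to trigger the transition you logged; the real puzzle is
why the devices then sat in WFConnection and never reconnected.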
You could use ethtool to check the interface statistics; otherwise I'd
vote for a software issue...
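Something along these lines, run on both nodes, would show whether the
NIC or the cable is eating frames ("eth1" is a hypothetical name here;
substitute whichever interface carries the DRBD crossover link):

  ethtool -S eth1        # per-driver counters: watch for CRC/alignment
                         #   errors, carrier losses, rx/tx drops
  ethtool eth1           # link state: did speed/duplex renegotiate?
  ip -s link show eth1   # the kernel's own RX/TX error counters,
                         #   as a cross-check

If those counters stay clean across one of the outages, that would point
back at software (the NIC driver or DRBD itself) rather than at the
hardware.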