>>> Greg Woods <wo...@ucar.edu> wrote on 23.04.2013 at 21:20 in message
<1366744806.4475.56.ca...@mongoliad.scd.ucar.edu>:
> Here's a new issue. We have had two outages, about 3 weeks apart, on one
> of our Heartbeat/Pacemaker/DRBD two-node clusters. In both cases, this
> was logged:
> 
> Apr 19 17:02:22 vmn2 kernel: block drbd0: PingAck did not arrive in
> time.
> Apr 19 17:02:22 vmn2 kernel: block drbd0: peer( Primary -> Unknown )
> conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
> Apr 19 17:02:22 vmn2 kernel: block drbd0: asender terminated
> Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
> Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
> Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected ) 
> Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver terminated
> Apr 19 17:02:22 vmn2 kernel: block drbd0: Restarting receiver thread
> Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver (re)started
> Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( Unconnected ->
> WFConnection ) 
> Apr 19 17:02:27 vmn2 kernel: block drbd1: PingAck did not arrive in
> time.
> Apr 19 17:02:27 vmn2 kernel: block drbd1: peer( Secondary -> Unknown )
> conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
> Apr 19 17:02:27 vmn2 kernel: block drbd1: new current UUID
> 37CF642BD875CB67:901912BD41972B81:FC8B5D00E5B5988E:FC8A5D00E5B5988F
> Apr 19 17:02:27 vmn2 kernel: block drbd1: asender terminated
> Apr 19 17:02:27 vmn2 kernel: block drbd1: Terminating asender thread
> Apr 19 17:02:27 vmn2 kernel: block drbd1: Connection closed
> Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( NetworkFailure ->
> Unconnected ) 
> Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver terminated
> Apr 19 17:02:27 vmn2 kernel: block drbd1: Restarting receiver thread
> Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver (re)started
> Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( Unconnected ->
> WFConnection ) 
> 
> This looks like a long-winded way of saying that the DRBD devices went
> offline due to a network failure. One time this was logged on one node,
> and the other time it was logged on the other node, so that would seem
> to rule out any issue internal to one node (such as bad memory). In both
> cases, nothing else is logged in any of the HA logs or
> the /var/log/messages file. Obviously, the VMs stop providing services
> and this is how the problem is noticed (DNS server not responding,
> etc.). It doesn't appear that Pacemaker or Heartbeat ever notices that
> anything is wrong: nothing is logged after the above until the restart
> messages, when I finally cycled the power via IPMI (almost half an hour
> later). The two nodes are connected by a crossover
> cable, and that is the link used for DRBD replication. So it seems as
> though the only possibilities are a flaky NIC or a flaky cable, but in
> that case, wouldn't I see some sort of hardware error logged? Anybody
> else ever seen something like this?

You should use ethtool to check the interface statistics; otherwise I'd vote
for a software issue...
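
For example, something along these lines (eth1 here is only a placeholder for
whatever interface carries the DRBD crossover link; counter names vary by
driver):

  # Link state, speed and duplex as the driver currently sees them
  ethtool eth1

  # NIC/driver counters; error, drop or CRC counters that keep growing
  # point at a bad cable, port or NIC
  ethtool -S eth1 | grep -iE 'err|drop|crc|fail'

  # Kernel-level RX/TX counters for the same interface
  ip -s link show eth1

If those counters stay at zero across two of these incidents, a flaky NIC or
cable becomes much less likely and a software problem more plausible.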

> 
> Thanks,
> --Greg
> 

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
