On Wed, Apr 24, 2013 at 10:27:18AM -0600, Greg Woods wrote:
> On Wed, 2013-04-24 at 08:48 +0200, Ulrich Windl wrote:
> > >>> Greg Woods <wo...@ucar.edu> wrote on 23.04.2013 at 21:20 in message
> >
> > > Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
> > > Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
> > > Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
> > >
> > You should use ethtool to check the interface statistics; otherwise I'd
> > vote for a software issue...
>
> Ethtool doesn't show any errors, but it's possible that the errors don't
> start occurring until just before DRBD detects the issue. Unfortunately
> I can't access the system once the problems start occurring so I can't
> run ethtool at that point.
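
Since the counters may only go bad shortly before the box becomes unreachable, one option is to sample them to local disk all the time and read the log after the reboot. What follows is only a minimal sketch in Python, assuming a Linux host that exposes per-interface counters under /sys/class/net/<iface>/statistics/. The function names, the log format and the "STALL?" heuristic are made up for illustration; they are not part of DRBD or ethtool.

```python
import os
import time

# Counter files the Linux kernel exposes under /sys/class/net/<iface>/statistics/.
COUNTERS = ("rx_packets", "tx_packets", "rx_errors", "tx_errors",
            "rx_dropped", "tx_dropped")


def read_counters(iface, root="/sys/class/net"):
    """Return {counter_name: int} for one interface."""
    stats_dir = os.path.join(root, iface, "statistics")
    values = {}
    for name in COUNTERS:
        with open(os.path.join(stats_dir, name)) as f:
            values[name] = int(f.read().strip())
    return values


def looks_stalled(prev, cur):
    """Hypothetical heuristic: rx or tx packet count did not move between samples.

    On a genuinely idle link this gives false positives.
    """
    return (cur["rx_packets"] == prev["rx_packets"]
            or cur["tx_packets"] == prev["tx_packets"])


def log_counters(iface, logfile, interval=10.0, samples=None,
                 root="/sys/class/net"):
    """Append one line per sample to logfile; samples=None means run forever."""
    prev = None
    taken = 0
    with open(logfile, "a") as out:
        while samples is None or taken < samples:
            cur = read_counters(iface, root)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            fields = " ".join(f"{k}={v}" for k, v in sorted(cur.items()))
            flag = ""
            if prev is not None and looks_stalled(prev, cur):
                flag = " STALL?"
            out.write(f"{stamp} {iface} {fields}{flag}\n")
            # Flush and fsync so the last samples survive a hard reset.
            out.flush()
            os.fsync(out.fileno())
            prev = cur
            taken += 1
            if samples is None or taken < samples:
                time.sleep(interval)
```

Something like log_counters("eth1", "/var/log/nic-counters.log") left running in the background would then leave a trail up to the moment of the hang; the interface name and log path here are placeholders. The log has to live on a local disk, not on the DRBD device or anything reached over the suspect NIC.
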
Hook up a "serial console" to some other system. If you don't have any
hardware to do it, a USB-to-serial adapter and something like "screen"
work OK most of the time. Or add another NIC (other brand, other driver),
or whatever else you can think of to get you access to the system.

If you really "can't access the system" anymore, then that to me
indicates your NICs just stop doing anything, and DRBD is merely
noticing this.

There have been various such bugs with various NIC brands, firmwares and
drivers, where they would just stop receiving or stop sending, silently.
Sometimes they stay that way until a hard reset, sometimes an ifdown/ifup
is all it takes to get things going again.

Google for your NIC brand and driver, plus some keywords like
network stall, tx hang, ...

> If it's a software issue, what is it likely to be? I have to find some
> way to debug this, I'm getting some flak about the outages this is
> causing, even though, so far, they have been three weeks apart. And it
> won't be long before this happens at 3AM, which will really suck.

If possible, just add another NIC (other brand or at least a different
driver!) to the servers, and use that for the DRBD link.

That's usually quick to do, costs little effort or money,
and gives fast results.

Either the issue persists, and you have ruled out NIC/driver issues
with high confidence. Or the issue goes away, and you can then
either blacklist the original hardware or driver, or look for
firmware or driver upgrades for it.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.