On Wed, Apr 24, 2013 at 10:27:18AM -0600, Greg Woods wrote:
> On Wed, 2013-04-24 at 08:48 +0200, Ulrich Windl wrote:
> > >>> Greg Woods <wo...@ucar.edu> wrote on 23.04.2013 at 21:20 in message
> 
> > > Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
> > > Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
> > > Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
> 
> > 
> > You could use ethtool to check the interface statistics; otherwise I'd
> > vote for a software issue...
> 
> Ethtool doesn't show any errors, but it's possible that the errors don't
> start occurring until just before DRBD detects the issue. Unfortunately
> I can't access the system once the problems start occurring so I can't
> run ethtool at that point.

Hook up a "serial console" to some other system.
If you don't have any hardware to do it, a USB-to-serial adapter
and something like "screen" works OK most of the time.

Or add another NIC (other brand, other driver),
or whatever else you can think of to get you access to the system.

If you really "can't access the system" anymore, then that to me
indicates your NICs just stop doing anything,
and DRBD is merely the first to notice.

There have been various such bugs with various NIC brands, firmwares
and drivers, where they would just stop receiving or stop sending,
silently.  Sometimes this lasts until a hard reset; sometimes an
ifdown/ifup is enough to get things going again.
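
Since you can't run ethtool once the problem has started, one option is
to record the counters continuously, so the last samples before the
hang are still on disk afterwards. Below is a rough sketch of that
idea, not a tested tool: it assumes a Linux system that exposes the
per-interface counters under /sys/class/net/<iface>/statistics/, and
the function names, counter selection and log format are all made up
for illustration. An unchanged packet counter only hints at a stall if
the link normally carries traffic, which a DRBD replication link
usually does.

```python
import os
import time
from pathlib import Path

# Counters the Linux kernel exposes under /sys/class/net/<iface>/statistics/.
# The selection here is an assumption; add whatever your driver makes useful.
COUNTERS = ("rx_packets", "tx_packets", "rx_errors", "tx_errors",
            "rx_dropped", "tx_dropped")


def read_counters(iface, root="/sys/class/net"):
    """Return the current value of each counter in COUNTERS for iface."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def stalled_directions(prev, curr):
    """Return the directions ("rx", "tx") whose packet counter is unchanged.

    Equality is used rather than "<=" so that a wrapped counter is not
    mistaken for a stall.
    """
    return [d for d in ("rx", "tx")
            if curr[d + "_packets"] == prev[d + "_packets"]]


def format_sample(stamp, iface, counters, stalled):
    """Build one log line; appends a STALLED marker when something stood still."""
    fields = " ".join("%s=%d" % (k, counters[k]) for k in COUNTERS)
    note = " STALLED:" + ",".join(stalled) if stalled else ""
    return "%s %s %s%s" % (stamp, iface, fields, note)


def monitor(iface, logfile, interval=10.0, samples=None, root="/sys/class/net"):
    """Append one sample per interval to logfile; runs forever if samples is None."""
    prev = None
    taken = 0
    with open(logfile, "a") as log:
        while samples is None or taken < samples:
            curr = read_counters(iface, root)
            stalled = stalled_directions(prev, curr) if prev is not None else []
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(format_sample(stamp, iface, curr, stalled) + "\n")
            log.flush()
            os.fsync(log.fileno())  # try to get the line on disk before a hard reset
            prev = curr
            taken += 1
            time.sleep(interval)
```

You would start it with something like monitor("eth1",
"/var/log/nic-counters.log"), where both the interface name and the
path are placeholders, and after the next outage look at the last few
lines of the log to see whether rx, tx or both stopped moving, and
whether error counters climbed first.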

Google for your NIC brand and driver,
plus some keywords like network stall, tx hang, ...

> If it's a software issue, what is it likely to be? I have to find some
> way to debug this, I'm getting some flak about the outages this is
> causing, even though, so far, they have been three weeks apart. And it
> won't be long before this happens at 3AM, which will really suck.

If possible, just add another NIC (other brand or at least different
driver!) into the servers, and use that.

That's usually quick to do, costs little effort or money,
and gives fast results. Either the issue persists, in which case you
have ruled out NIC/driver issues with high confidence, or the issue
goes away, and you can then proceed with either blacklisting the
original hardware or driver, or looking for firmware or driver
upgrades.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.