[gentoo-amd64] Re: kernel-2.6.29-gentoo, network device failing after maybe 10min

Duncan Thu, 26 Mar 2009 11:59:34 -0700

Tom <uebersh...@googlemail.com> posted
20090326182608.5da93...@viciousvincent, excerpted below, on  Thu, 26 Mar
2009 18:26:08 +0100:


> I've upgraded to the 2.6.29-gentoo sources. I've built everything as
> usual, and so far, everything seems to be working.
> Except that my network device 'dies' (not permanently) after working
> flawlessly for maybe 10min.
> 
> Booting a 2.6.28 kernel, I have no such issues. Restarting
> /etc/init.d/net.eth0 has no effect, and using ifconfig up/down eth0 just
> times out.
> 
> The drivers are all there as they should be, could this be some kind of
> weird regression? I'm using the Uli M526x driver, found under the
> 'tulip-family'

This is in fact a mainline regression, caused by one of the last patches
merged before the release.  That patch changed NAPI handling but
apparently has interrupt implications as well.  A reply to the 2.6.29
announcement on LKML reported the regression, several others confirmed
it, and the discussion since has been an effort to pin it down with
various patches and repeated tests.  They intend a fix for 2.6.29.1, even
if it's simply a revert of the late patch.  However, that patch was
itself a fix for a problem on other NICs, and other code intended to undo
its effects still ends up tickling the interrupt problem, so it's a bit
more complex than they anticipated.  But the normal rule is not to break
previously working hardware, so had that patch gone in even a day earlier
it would likely have been reverted before release, and if they can't
find a better solution, it almost certainly /will/ be reverted for .29.1.

That was one of two subthreads generated by the announcement.  The other
concerns the ext4 data corruption bug that made big news in the -rc
period.  It got a temporary fix for .29.  Now that the release is out,
they're trying to come up with a more permanent solution, but there's a
policy debate in the process over whether the (lack of) data stability
guarantees in POSIX in the event of an improper shutdown is acceptable.
The one side says POSIX doesn't require more and that the default
data=ordered stability of ext3 was an "accident", while the other says
that may be, but now that the stability expectation has been raised,
lowering it in the interest of "performance" isn't a good thing.

The other bit of the debate is just how "ordered" data=ordered has to be.
The performance side says that if metadata is synced every five seconds
(the default) while data is only synced every 30 seconds (again the
default) with delayed allocation, and a crash causes loss of data, tough,
it's POSIX compliant and the performance benefits are great.  The other
side says data=ordered means data=ordered: metadata MUST wait to sync
until after the data it covers is synced in data=ordered mode (the
default), REGARDLESS of delayed allocation, even if the cost is the loss
of some of the vaunted performance gains of ext4 over ext3.
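
For anyone wondering what this means for an application that can't wait
for the debate to settle, here is a minimal sketch of the usual
"write a temp file, sync it, then rename it over the original" approach.
It's an illustration only, not something taken from the LKML thread: the
function name atomic_write is made up, and the assumption is a POSIX
system where rename within one filesystem is atomic.  The point is that
the data is explicitly synced before the metadata change (the rename), so
the result doesn't depend on how "ordered" the filesystem chooses to be.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes), syncing data before the rename.

    Hypothetical helper for illustration; assumes a POSIX filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename never
    # crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Push the file contents to disk before touching the name.
            os.fsync(tmp.fileno())
        # Swap the new file into place under the old name.
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed before the rename completed: drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself is recorded on disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Something like atomic_write("settings.conf", b"key=value\n") would then
replace the file in one step.  The cost is an fsync per update, which is
exactly the kind of performance hit the delayed allocation side would
rather not impose by default.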

Basically what that second subthread boils down to for me and many others
is that despite the rename of ext4dev to ext4, supposedly indicating it's
stable now, it's NOT, at least not enough for mission-critical data that
in real life may or may not have up-to-date backups!  Ext3 (or for me
reiserfs in the same data=ordered default mode) continues to work well,
and it's not time to go moving everything to ext4 just yet.
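
If you want to check which journaling mode a filesystem is actually
mounted with, the options are listed in /proc/mounts.  Below is a rough
sketch of pulling them out; mount_options is a made-up helper name, and
the sample line in the note after it is invented for illustration.  One
caveat: the kernel may not list the default mode explicitly, so a missing
data= entry most likely just means the filesystem's default is in effect.

```python
def mount_options(mounts_text, mount_point):
    """Return the mount options of `mount_point` as a dict.

    `mounts_text` is text in /proc/mounts format.  Returns None when the
    mount point is not listed.  Hypothetical helper for illustration.
    """
    found = None
    for line in mounts_text.splitlines():
        # Fields: device, mount point, fs type, options, dump, pass.
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        options = {}
        for item in fields[3].split(","):
            # "data=ordered" -> ("data", "ordered"); "rw" -> ("rw", "").
            key, _, value = item.partition("=")
            options[key] = value
        # A later entry for the same mount point shadows an earlier one.
        found = options
    return found
```

Fed the contents of /proc/mounts and a mount point such as "/", this
returns a dict in which the "data" key, when present, holds the
journaling mode.  For a made-up line like
"/dev/sda1 / ext3 rw,noatime,data=ordered 0 0" that would be "ordered".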

For more, see the announcement thread on any LKML mirror, or the coverage
in some of the kernel news discussions.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
