On Monday, 18 March 2013 at 1:54 PM, Theodore Ts'o wrote:
> Thanks for reporting this. I thought we had fixed this in 3.0.
> Before then, when we had a tid wrap, it would result in kjournald
> spinning forever. I suspect this was your "spontaneous reboots" that
> you mentioned when you were using 2.6.39 --- did you
> have a hardware or software watchdog timer enabled by any chance?
> 
> 

Thank you for your prompt attention on this.  It's greatly appreciated!

We believe our previous spontaneous reboots were caused by 
https://bugzilla.kernel.org/show_bug.cgi?id=16991 which was resolved by our 
move to a 3.2 kernel (we were on a 2.6.38-bpo kernel ^1).  We do not presently 
use any watchdogs.

> Since we didn't have a good way of reproducing the problem at the
> time, I didn't realize that the problem had not been fully fixed;
> since while jbd2_log_start_commit() would no longer cause kjournald to
> spin forever, a subsequent call to jbd2_log_wait_commit() with a
> stale transaction id would wait for a very long time (possibly until
> the heat death of the universe :-)
> 
> 

This would mirror what we've seen, although our ops guys haven't been waiting 
around for any universes to die :)
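
For anyone following the thread, the wrap behaviour described above can be 
sketched in a few lines of C.  This is only an illustration under stated 
assumptions, not the jbd2 code: it assumes transaction ids are 32-bit counters 
that wrap, and the names tid_gt_naive(), tid_gt_wrapsafe() and must_wait() are 
hypothetical helpers invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: a transaction id is a 32-bit counter that wraps. */
typedef uint32_t tid_t;

/* Naive ordering: gives the wrong answer once the counter wraps past zero. */
static bool tid_gt_naive(tid_t x, tid_t y)
{
    return x > y;
}

/*
 * Wrap-aware ordering: take the unsigned difference and read it as signed.
 * This is correct only while the two ids are less than 2^31 apart.
 * (The cast relies on two's-complement conversion, which is what common
 * compilers provide.)
 */
static bool tid_gt_wrapsafe(tid_t x, tid_t y)
{
    return (int32_t)(x - y) > 0;
}

/*
 * Hypothetical waiter check: keep waiting while the target id still looks
 * newer than the most recently committed id.
 */
static bool must_wait(tid_t committed, tid_t target)
{
    return tid_gt_wrapsafe(target, committed);
}
```

With these definitions, an id of 5 correctly compares as newer than 
0xFFFFFFF0 under the wrap-aware test but not under the naive one.  The 
limitation is that a target more than 2^31 ids behind the committed id is 
indistinguishable from one that lies in the future, so must_wait() keeps 
returning true for it.
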
> I think a patch like this should fix things; I've run a stress test
> with a hack to increment the transaction id by 1 << 24 after each
> commit, to more quickly cause a tid wrap, and the regression tests
> seem to be passing without complaint.
> 
> 

Excellent news.  Again, thank you for your help in this regard.

@Ben - could you let me know what your preferred course of action would be 
here?  As I'm sure you can understand, I do not wish to maintain a forked 
kernel from Debian upstream.  Is this something you would be prepared to 
integrate into the 3.2 BPO kernels?

Best regards,

George

1. We moved to 2.6.38 in order to get access to the packet steering patches 
which were put into the kernel in ~.33 or .34 from memory.  This gave us quite 
a nice performance bump to our storage speed and hence we didn't want to lose 
it by going back to .32 to get the 200d uptime bug fix.
