Last night I discovered a bug in my subvol removal script on one of my 
servers.  I fixed the bug and ran the script to delete ~1500 subvols and the 
system locked up.  It would respond to pings and accept TCP connections but 
nothing else.  Existing ssh sessions didn't respond and connections to port 22 
got a TCP connection but not even the start of the ssh handshake (i.e. sshd was 
hung).

Today I visited the server and connected a keyboard, but I couldn't get any 
response, not even to unblank the screen (it's at a virtual console and X isn't 
installed).

I rebooted the system and found that the bug fix to the subvol removal script 
was lost; the old version of the file was in place.  So all changes made to the 
filesystem from at least 10 seconds before the subvol removal were discarded.

INFO: rcu_sched self detected stall on CPU { 1}  (t=5250 jiffies g=7989 c=7988 q=3)
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]
INFO: rcu_sched self detected stall on CPU { 1}  (t=21003 jiffies g=7989 c=7988 q=3)
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]

After booting up I ran the new script and told it to delete only ~440 subvols 
and got the above messages on the console.  Those messages keep repeating with 
the only differences being the value of the t= parameter and the number of 
seconds (which varies between 22 and 23 seconds).  At that time the keyboard 
didn't get any response from the system and sshd didn't even start its 
protocol (user-space seems dead).
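
For anyone comparing a similar console capture, the repeating soft lockup 
messages can be tallied per CPU and task.  The following is only a minimal 
sketch: the function name summarize_soft_lockups and the SAMPLE constant are 
made up for illustration, and the regular expression is written against the 
message format quoted above, not against any kernel specification.

```python
import re
from collections import Counter

# Pattern for lines such as:
#   BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]
# Written against the messages quoted in this post (an assumption, other
# kernel versions may format the line differently).
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Count soft lockup reports, keyed by (cpu, task name, pid)."""
    counts = Counter()
    for match in SOFT_LOCKUP.finditer(log_text):
        cpu, _seconds, task, pid = match.groups()
        counts[(int(cpu), task, int(pid))] += 1
    return counts


# A few lines copied from the console output above.
SAMPLE = """\
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]
"""

if __name__ == "__main__":
    for (cpu, task, pid), n in sorted(summarize_soft_lockups(SAMPLE).items()):
        print(f"CPU#{cpu} {task} (pid {pid}): {n} reports")
```

Keying on the task name makes it easy to see that the same two tasks (sync 
and the btrfs transaction thread) are the ones reported each time.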

I had to do a hardware reset as CTRL-ALT-DEL and briefly pressing the power 
button had no effect.

After that I installed kernel 3.14.4 and deleted 442 subvols and then soon 
after another 1104 without any problems.
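
For anyone who wants to repeat this kind of test more cautiously, one option 
is to delete the subvols in smaller batches with a pause between them.  The 
following is a minimal sketch and not the removal script mentioned above; the 
function names, the batch size of 50 and the 30 second pause are all 
arbitrary choices for illustration.  Whether batching would have avoided the 
lockup on the older kernel is untested.

```python
import subprocess
import time
from typing import Callable, Iterator, List


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_btrfs_delete(paths: List[str]) -> None:
    """Delete the given subvolumes with one `btrfs subvolume delete` call.

    Needs root and real subvolume paths; raises CalledProcessError on failure.
    """
    subprocess.run(["btrfs", "subvolume", "delete", *paths], check=True)


def delete_in_batches(
    paths: List[str],
    size: int = 50,        # hypothetical batch size
    pause: float = 30.0,   # hypothetical pause in seconds between batches
    delete: Callable[[List[str]], None] = run_btrfs_delete,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete `paths` in batches and return how many were passed to `delete`."""
    done = 0
    chunks = list(batches(paths, size))
    for index, chunk in enumerate(chunks):
        delete(chunk)
        done += len(chunk)
        if index < len(chunks) - 1:
            sleep(pause)  # give the filesystem time to settle between batches
    return done
```

The delete and sleep parameters exist so the batching logic can be tried 
with a stand-in (for example a function that only prints the paths) before 
pointing it at a real filesystem.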

I've noticed before that generally newer kernels have been fixing the various 
crashes related to snapshot creation and removal, but this is the first time 
I've done a clean repeatable test to show 3.14 working where 3.13 failed.

I had been hesitant to upgrade to 3.14 because I've seen it fail horribly with 
Xen, but in this case it worked well with Xen.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
