Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
Re: debugging frequent kernel panics on 8.2-RELEASE (originally on freebsd-stable)
Re: System hang in USB umass module while processing panic (originally on freebsd-usb)

Hello Andriy and Hans,

Sorry for tying in so many discussions on this topic, but I think I have an 
explanation for the problems we have been reporting* with hanging coredumps on 
multicore systems on 8.2-RELEASE, and it has implications for Andriy's proposed 
scheduler patch** and for USB.

In today's 8.X and 9.X branches, nothing that I can find stops the other CPUs 
when the kernel panics, but many parts of the locking code get disabled (grep 
on 'panicstr').  The 'bufwrite: buffer is not busy???' panic is caused by the 
syncer encountering an error.  If that happens when it's on the dumping CPU 
everything hangs.  If it's running on a different CPU, it will be blocked and 
hidden by the panic_cpu spinlock in panic(), and the dump continues, polling 
every attached keyboard for a Ctrl-C.
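
For anyone who hasn't looked at that part of panic(), here is a rough userland model of the panic_cpu gate as I understand it. This is only a sketch using C11 atomics: panic_try_claim() is a made-up name, and the real kernel code spins on the variable instead of returning false, which is why a second panicking CPU simply goes quiet.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NOCPU (-1)	/* sentinel: no CPU owns the panic yet */

/* Userland stand-in for the kernel's panic_cpu variable. */
static atomic_int panic_cpu = NOCPU;

/*
 * Hypothetical helper: try once to claim the panic for `cpuid`.
 * Returns true if this CPU owns the panic (first claimant, or a
 * re-entry on the same CPU).  A CPU that gets false here is the one
 * that would be left spinning, hidden behind the dumping CPU.
 */
static bool
panic_try_claim(int cpuid)
{
	int expected = NOCPU;

	assert(cpuid != NOCPU);
	if (atomic_load(&panic_cpu) == cpuid)
		return (true);
	return (atomic_compare_exchange_strong(&panic_cpu, &expected, cpuid));
}
```

In this model, the first CPU to call panic_try_claim() wins and every other CPU is refused until the owner lets go, which matches the "blocked and hidden" behavior described above.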

But, the new 8.X USB stack relies on multithreading.  (The new stack is the 
variable that broke coredumps for us in the 7.1->8.2 transition, I think.)  SVN 
224223 fixes a hang that would happen when dumpsys() polls the USB keyboard 
(IPMI KVM, in our case).  That helps, but it only gets as far as usb_process(), 
where it hangs in a loop around a cv_wait() call.  This is easy to reproduce by 
adding code to the watchdog to break into the debugger if panicstr is set.

I am experimenting with Andriy's patch** to stop the scheduler and it seems to 
be most of the way there, stopping the CPUs and disabling the rest of locking.  
There are a few places that still reference panicstr, but that's minor.  These 
are the changes I made to the patch:
 * Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is true, 
so that we don't hang in USB.  ukbd_yield() locks up in DROP_GIANT(), and 
if you skip ukbd_yield(), usbd_transfer_poll() locks up trying to drop mutexes.
 * Changed the call to spinlock_enter() back to critical_enter(), so that 
interrupts stay enabled and the hardclock still functions.
 * Added code in the beginning of panic() to switch to CPU 0, so that we're 
able to service the hardclock interrupts and so that watchdog panics get 
through.
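
To illustrate the first change, here is a minimal userland sketch of the early return. It is not the actual diff: scheduler_stopped, poll_with_locks() and kbd_do_poll_model() are hypothetical stand-ins, and the SCHEDULER_STOPPED() macro here only imitates the one from Andriy's patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state the patch sets during panic(). */
static bool scheduler_stopped = false;
#define SCHEDULER_STOPPED()	(scheduler_stopped)

static int lock_ops;	/* counts simulated trips through the locking path */

/*
 * Stand-in for the yield and transfer-poll path that drops and
 * retakes locks.  Reaching it with the scheduler stopped is the hang
 * described above, modeled here as an assertion failure.
 */
static void
poll_with_locks(void)
{
	assert(!SCHEDULER_STOPPED());
	lock_ops++;
}

/*
 * Model of the modified polling routine: bail out before touching any
 * locks once the scheduler is stopped.  Returns true if polling ran.
 */
static bool
kbd_do_poll_model(void)
{
	if (SCHEDULER_STOPPED())
		return (false);
	poll_with_locks();
	return (true);
}
```

The cost of this approach is the one noted below: with the check in place the keyboard is never polled during the dump, so a Ctrl-C from a USB keyboard is ignored.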

This has worked 100% for me so far, although anyone using a USB keyboard or 
dump device would still be out of luck.

Thoughts?  It seems like stopping all of the other CPUs is the right thing to 
do on a panic (what are they doing otherwise?).  Are the USB issues fixable?  
If Andriy's patch gets committed, the fix might just involve short-circuiting all of 
the locking in the polling path, but I haven't gotten that far yet.  I bet 
dumping to NFS will have the same problem.

Thanks,
  Andrew

* - http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/155421
** - http://people.freebsd.org/~avg/stop_scheduler_on_panic.8.x.diff
--------------------------------------------------
Andrew Boyer    [email protected]



