Hello. Today I ran into the following situation.
While a print filter script was writing data to a USB printer (effectively running 'cat data > /dev/usb/lp0'), an unexpected USB disconnect happened. Our printer is somewhat buggy and sometimes behaves as if it had been disconnected; in such cases we usually restart it and it works again.

This time the result was that both the 'cat' and 'khubd' processes hung in a busy loop: top showed each in state R using 100% CPU, and keventd/2 was using another 20% (this is a dual-Xeon system). It was not possible to kill the hung 'cat' process. The system log received several hundred copies of the message 'usb0: error -19 reading printer status'; it looks like the printk buffer then overflowed and the messages stopped. Restarting klogd produced some more copies of the message.

The error message helped me identify the kernel loop where it hung. It is in usblp_write(), in the following code:

	while (writecount < count) {
		if (!usblp->wcomplete) {
			...
		}

		down (&usblp->sem);
		if (!usblp->present) {
			up (&usblp->sem);
			return -ENODEV;
		}

		if (usblp->writeurb->status != 0) {
			if (usblp->quirks & USBLP_QUIRK_BIDIR) {
				if (!usblp->wcomplete)
					err("usblp%d: error %d writing to printer",
						usblp->minor, usblp->writeurb->status);
				err = usblp->writeurb->status;
			} else
				err = usblp_check_status(usblp, err);
			up (&usblp->sem);

			/* if the fault was due to disconnect, let khubd's
			 * call to usblp_disconnect() grab usblp->sem ...
			 */
			schedule ();
			continue;
		}
		...
	}

It looks like (!usblp->wcomplete) was false, (!usblp->present) was false, and (usblp->writeurb->status != 0) was true, so the code just went around this loop forever, ignoring any signals.

Since this was a production server running several user X sessions, I tried to 'fix' the situation without a reboot by writing a tiny kernel module that locates the 'usblp' object used by that code and clears 'usblp->present'. When I insmoded it, the busy loop really was broken and the 'cat' process at last got its SIGKILL (which somewhat confirms my guess about where it hung), but khubd got an oops.
Later attempts to recover from the situation failed: rmmod of the USB modules hung on semaphores, so I started forcing the semaphores up by insmoding more code, but at some point I probably mistyped a binary address and the whole system crashed.

Anyway, this looks like a bug in the code quoted above. It is clear that a busy loop is possible there. Maybe it should at least check for signals after returning from schedule()?

The kernel is 2.6.10 from the Debian package kernel-image-2.6.10-1-686-smp, version 2.6.10-6.
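
To illustrate the signal check suggested above, here is a minimal user-space sketch of the retry loop. It is not a patch against usblp.c. The struct, the function names and the 'signal_after' counter are all hypothetical stand-ins: model_signal_pending() plays the role of the kernel's signal_pending(current), and incrementing 'retries' plays the role of schedule().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the fields the loop inspects. */
struct model_dev {
	int present;      /* mirrors usblp->present */
	int urb_status;   /* mirrors usblp->writeurb->status */
	int signal_after; /* retries before a signal becomes pending; < 0 means never */
	int retries;      /* retries performed so far */
};

/* Stand-in for signal_pending(current) in the kernel. */
static int model_signal_pending(const struct model_dev *dev)
{
	return dev->signal_after >= 0 && dev->retries >= dev->signal_after;
}

/*
 * Model of the write loop with the proposed check added.
 * Returns the number of bytes "written" or a negative errno value.
 */
static long model_write(struct model_dev *dev, size_t count)
{
	size_t written = 0;

	assert(dev != NULL);

	while (written < count) {
		if (!dev->present)
			return -ENODEV;

		if (dev->urb_status != 0) {
			/* Proposed addition: give up once a signal is pending. */
			if (model_signal_pending(dev))
				return written ? (long)written : -EINTR;

			dev->retries++; /* stands in for schedule() */
			continue;
		}

		/* Healthy URB: pretend the whole buffer was transferred. */
		written = count;
	}
	return (long)written;
}
```

With 'signal_after' set to a negative value and a non-zero 'urb_status', the model spins forever, which is the hang described above. With the check in place, the loop exits with -EINTR as soon as a signal is pending, so a SIGKILL sent to 'cat' would be honoured.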