Hello.

Today I ran into the following situation.

While a print filter script was writing data to a USB printer (actually
'cat data > /dev/usb/lp0' was running), an unexpected USB disconnect
happened (our printer is somewhat buggy and sometimes behaves as if it had
been disconnected; in such cases we usually restart it and it works again).

As a result, both the 'cat' and 'khubd' processes hung in a busy loop
(top showed each in state R using 100% CPU, and keventd/2 was at 20%; this
was on a dual-Xeon system). It was not possible to kill the hung 'cat'
process.

The system log received several hundred messages 'usb0: error -19 reading
printer status' (-19 is -ENODEV); it looks like the printk buffer then
overflowed and the messages stopped; restarting klogd produced some more
copies of the message.

The error message helped me identify the loop in the kernel where it was
stuck. It is in usblp_write(), in the following code:

while (writecount < count) {
  if (!usblp->wcomplete) {
    ...
  }
  down (&usblp->sem);
  if (!usblp->present) {
    up (&usblp->sem);
    return -ENODEV;
  }
  if (usblp->writeurb->status != 0) {
    if (usblp->quirks & USBLP_QUIRK_BIDIR) {
      if (!usblp->wcomplete)
        err("usblp%d: error %d writing to printer",
            usblp->minor, usblp->writeurb->status);
      err = usblp->writeurb->status;
    } else
      err = usblp_check_status(usblp, err);
    up (&usblp->sem);
    /* if the fault was due to disconnect, let khubd's
     * call to usblp_disconnect() grab usblp->sem ...
     */
    schedule ();
    continue;
  }
...
}

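It looks like (!usblp->wcomplete) was false and (!usblp->present) was
false, while (usblp->writeurb->status != 0) was true, so the code took the
'continue' on every pass. schedule() only yields the CPU here, the task
stays runnable, and nothing in this path checks for pending signals, so it
looped there forever, ignoring any signals.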

Since this was a production server running several user X sessions, I
tried to 'fix' the situation without a reboot by writing a tiny kernel
module that locates the 'usblp' object used by that code and sets
'usblp->present' to false. When I insmoded it, the busy loop was indeed
broken and the 'cat' process at last got its SIGKILL (which more or less
confirms my guess about where it was stuck), but khubd got an oops. Later
attempts to recover from the situation failed (rmmod of the usb modules
hung on semaphores; I started to force the semaphores up by insmoding
code, but at some point I probably mistyped a binary address and the whole
system crashed).

Anyway, this looks like a bug in the code above? It is clear that a busy
loop is possible there. Maybe it should at least check for signals after
returning from schedule()?
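
To make the suggestion concrete, here is a minimal userspace sketch of the
retry path with a signal check added after the point where schedule() is
called. It is only a model, not a patch: struct fake_usblp,
fake_signal_pending() and model_write() are hypothetical names standing in
for struct usblp, signal_pending(current) and usblp_write(), and returning
-EINTR (or the partial count) on a pending signal is my assumption about
what the driver should do.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the driver state (not struct usblp). */
struct fake_usblp {
    bool present;      /* still true: the disconnect was not processed yet */
    int  urb_status;   /* non-zero after the unplug, e.g. -ENODEV */
    int  signal_after; /* model: a signal becomes pending on this pass */
    int  passes;       /* loop iterations executed so far */
};

/* Stand-in for the kernel's signal_pending(current). */
bool fake_signal_pending(const struct fake_usblp *u)
{
    return u->passes >= u->signal_after;
}

/* Model of the error path in the write loop, with the proposed check. */
long model_write(struct fake_usblp *u, size_t count)
{
    size_t writecount = 0;

    while (writecount < count) {
        u->passes++;
        if (!u->present)
            return -ENODEV;
        if (u->urb_status != 0) {
            /* The driver calls schedule() here; the task stays runnable. */
            if (fake_signal_pending(u))  /* proposed addition */
                return writecount ? (long)writecount : -EINTR;
            continue;  /* without the check above this spins forever */
        }
        writecount = count;  /* model: the rest of the data goes out at once */
    }
    return (long)writecount;
}
```

With the check in place the loop ends as soon as a signal is pending, so
the process becomes killable again; it still burns CPU until then, so a
real fix would probably also need to sleep or give up on a persistent
error.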

Kernel 2.6.10 from debian package kernel-image-2.6.10-1-686-smp, version 
2.6.10-6.
