Hello Shachar.

I suppose great minds think alike. ;) This is more or less what I've already done, and with some correspondence on LKML, I've managed to nail the exact place where the kernel goes to sleep for 30 seconds, taking the lock with it. Sorry for not updating you, but my plan was to summarize the issue once the patch is out.


So a kernel bug it is indeed. It wasn't the simple thing I thought, though. It turns out that under some conditions, a close() operation on a serial port may have to wait until all written data is flushed. In these cases, it's normal behavior for the close() call to take up to 30 seconds (the timeout) to return. This happens in particular when modem-manager probes serial ports that don't really exist.


Now, the TTY locking scheme was reworked between kernels 2.6.35 and 2.6.36, and the maintainer overlooked this possibility, and hence saw no problem with holding the lock at the crucial point where the data is being drained. As a result, all system calls related to TTYs and PTYs (that is, serial ports and *cough* virtual terminals, and I suppose other keyboard input) were frozen waiting for the big TTY mutex every time modem-manager closed a port, which it does a few times after boot.
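
For anyone who wants to see the effect without touching the kernel, here is a minimal userspace sketch of the same pattern. It is not the kernel code: big_lock, slow_close() and blocked_open_ms() are hypothetical names, and a pthread mutex merely stands in for the big TTY mutex. One thread sleeps while holding the lock (the "draining" close()), and the caller measures how long an unrelated lock attempt is stalled.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-in for the big TTY mutex (userspace only). */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Used only to tell the caller that the "closer" already holds big_lock. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t state_cond = PTHREAD_COND_INITIALIZER;
static int lock_taken;

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Simulates a close() that waits for output to drain with the lock held. */
static void *slow_close(void *arg)
{
    long drain_ms = *(long *)arg;

    pthread_mutex_lock(&big_lock);

    pthread_mutex_lock(&state_lock);
    lock_taken = 1;
    pthread_cond_signal(&state_cond);
    pthread_mutex_unlock(&state_lock);

    sleep_ms(drain_ms);              /* the long sleep, lock still held */
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Returns how many milliseconds an unrelated lock attempt was stalled,
 * or -1 if the helper thread could not be created. */
long blocked_open_ms(long drain_ms)
{
    pthread_t closer;
    long start, waited;

    lock_taken = 0;
    if (pthread_create(&closer, NULL, slow_close, &drain_ms) != 0)
        return -1;

    /* Wait until the closer really owns big_lock. */
    pthread_mutex_lock(&state_lock);
    while (!lock_taken)
        pthread_cond_wait(&state_cond, &state_lock);
    pthread_mutex_unlock(&state_lock);

    start = now_ms();
    pthread_mutex_lock(&big_lock);   /* an unrelated open() would stall here */
    waited = now_ms() - start;
    pthread_mutex_unlock(&big_lock);

    pthread_join(closer, NULL);
    return waited;
}
```

Calling blocked_open_ms(300) should report a stall of roughly the full drain time, even though the caller has nothing to do with the "port" being closed. Scale that up to 30 seconds and you get the freeze described above.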


The LKML thread is at http://lkml.org/lkml/2010/11/2/314 (for some reason, my postings break the threading every time, even though I reply-to-all).


This is not a trivial thing to solve, since it seems like nobody really knows what assumptions have been made about the two mutexes involved, so it's not clear whether they can be released just before the possibly long sleep. I suppose the guy who manipulated the locks will come up with a fix sooner or later. I've reverted to 2.6.35 anyhow.


Ah, and by the way, what started this thing was an effort to stop using the big kernel lock. As a matter of fact, from 2.6.36, the big kernel lock is no longer used in core kernel code. (Hurray...?)


So that's the way things stand. I can't say I was very encouraged by this little trip to kernel-land, and I can only hope that those who maintain the software controlling my car's airbag are doing so with a deeper understanding of what each software component does. Don't tell me. They probably don't. Only they don't discuss their issues over a public mailing list.

  Eli



Shachar Shemesh wrote:

On 29/10/10 17:04, Eli Billauer wrote:

    /* find a device that is not in use. */
    printk(KERN_ALERT  "34: pty_open to lock\n");
    tty_lock();
    printk(KERN_ALERT  "35: pty_open locked\n");
<snip>

Set a global variable right before the tty_lock call, and clear it immediately after. Inside tty_lock (and probably tty_unlock too), set up many printks conditional on this global variable being set. Print any relevant identifier you can find (such as the device ID). This should help you find out WHY the device takes so long to lock and, hopefully, who the contention is with.

Also, in tty_lock, save to a global variable who is holding the lock, and print that variable from the code above.
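
The holder-tracking idea above can be sketched in userspace as follows. This is only an illustration under stated assumptions: the kernel's tty_lock() does not look like this, a pthread mutex stands in for it, fprintf() stands in for printk(), and traced_lock_acquire(), traced_lock_release() and traced_holder() are hypothetical names.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the lock being debugged. */
static pthread_mutex_t traced_lock = PTHREAD_MUTEX_INITIALIZER;

/* Name of the current holder, or NULL; guarded by holder_guard. */
static pthread_mutex_t holder_guard = PTHREAD_MUTEX_INITIALIZER;
static const char *holder;

static void set_holder(const char *who)
{
    pthread_mutex_lock(&holder_guard);
    holder = who;
    pthread_mutex_unlock(&holder_guard);
}

/* Returns the recorded holder, or NULL if nobody holds the lock. */
const char *traced_holder(void)
{
    const char *who;

    pthread_mutex_lock(&holder_guard);
    who = holder;
    pthread_mutex_unlock(&holder_guard);
    return who;
}

/* Takes the lock on behalf of `who`. If somebody else already has it,
 * report the recorded holder before blocking.
 * Returns 1 if the caller had to wait, 0 otherwise. */
int traced_lock_acquire(const char *who)
{
    int contended = 0;

    if (pthread_mutex_trylock(&traced_lock) != 0) {
        const char *other = traced_holder();

        contended = 1;
        fprintf(stderr, "%s: waiting for lock held by %s\n",
                who, other ? other : "(unknown)");
        pthread_mutex_lock(&traced_lock);
    }
    set_holder(who);
    return contended;
}

/* Clears the recorded holder, then releases the lock. */
void traced_lock_release(void)
{
    set_holder(NULL);
    pthread_mutex_unlock(&traced_lock);
}
```

A caller would pass its own name, e.g. traced_lock_acquire("pty_open"), so that whoever stalls later gets a message naming the culprit. The recorded holder is only a debugging hint: it may be briefly out of date between taking the lock and recording the name.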

Shachar
--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com


--
Web: http://www.billauer.co.il

