"Chad N. Tindel" <[EMAIL PROTECTED]> wrote:
>
>  We have hit a defect where an exiting xterm process will hang.  This is
>  running on a 2-CPU IA-64 box.  We have a multithreaded application, where
>  one thread is SCHED_FIFO and is running with priority 98, and the other
>  thread is just a normal SCHED_OTHER thread.  The SCHED_FIFO thread is in a
>  CPU-bound tight loop, but I wouldn't expect that to cause problems since
>  there are 2 CPUs.
> 
>  However, it does seem to cause some problems.  For example, if you ssh into
>  the system and run an Xterm using X11 forwarding, when you type "exit" in
>  the xterm window, the window hangs and doesn't close.  Killing the CPU-bound
>  app causes the window to exit immediately.  The sysrq output shows the 
>  following:
> 
>  xterm         D a0000001000bef60     0  2905   2876                     (NOTLB)
> 
>  Call Trace:
>   [<a0000001004ac480>] schedule+0xca0/0x1300
>                                  sp=e000000012257d20 bsp=e000000012251080
>   [<a0000001000bef60>] flush_cpu_workqueue+0x1a0/0x4a0
>                                  sp=e000000012257d30 bsp=e000000012251020
>   [<a0000001000bf360>] flush_workqueue+0x100/0x160
>                                  sp=e000000012257d90 bsp=e000000012250fe8
>   [<a0000001000bfd60>] flush_scheduled_work+0x20/0x40
>                                  sp=e000000012257d90 bsp=e000000012250fd0
>   [<a0000001002e2060>] release_dev+0x8e0/0x1100
>                                  sp=e000000012257d90 bsp=e000000012250f20
>   [<a0000001002e3350>] tty_release+0x30/0x60
>                                  sp=e000000012257e30 bsp=e000000012250ef8
>   [<a00000010012d430>] __fput+0x330/0x340
>                                  sp=e000000012257e30 bsp=e000000012250ea8
>   [<a00000010012d0e0>] fput+0x40/0x60
>                                  sp=e000000012257e30 bsp=e000000012250e88
>   [<a00000010012a1b0>] filp_close+0xd0/0x160
>                                  sp=e000000012257e30 bsp=e000000012250e58
>   [<a00000010012a380>] sys_close+0x140/0x1a0
>                                  sp=e000000012257e30 bsp=e000000012250dd8
>   [<a00000010000aba0>] ia64_ret_from_syscall+0x0/0x20
>                                  sp=e000000012257e30 bsp=e000000012250dd8
> 
>  So it would appear that xterm is hung in close() trying to shut down a tty.
>  The comment says that it is calling flush_scheduled_work() to
>  "Wait for ->hangup_work and ->flip.work handlers to terminate".  Perhaps
>  there is some locking issue that is causing these to not run and complete?

`xterm' is waiting for the other CPU to schedule a kernel thread (which is
bound to that CPU).  Once that kernel thread has done a little bit of work,
`xterm' can terminate.

But kernel threads don't run with realtime policy, so your userspace app
has permanently starved that kernel thread.
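
For reference, the userspace side of this can be sketched roughly as below.
This is only an illustration of the setup described in the report, not the
reporter's actual code: the helper names fifo_priority(), spin() and
start_fifo_spinner() are made up, and the loop is deliberately bounded so
that running the sketch does not itself wedge a CPU.  The real application
spins forever, which is what starves the per-CPU kernel thread.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: clamp a requested priority into the range the
 * system reports for SCHED_FIFO. */
static int fifo_priority(int wanted)
{
	int lo = sched_get_priority_min(SCHED_FIFO);
	int hi = sched_get_priority_max(SCHED_FIFO);

	if (wanted < lo)
		return lo;
	if (wanted > hi)
		return hi;
	return wanted;
}

/* Stand-in for the CPU-bound loop.  Bounded on purpose (assumption):
 * the loop in the report never sleeps and never returns. */
static void *spin(void *arg)
{
	volatile unsigned long n = *(unsigned long *)arg;

	while (n--)
		;
	return NULL;
}

/* Start spin() as a SCHED_FIFO thread.  Returns 0 or an errno value,
 * typically EPERM when the caller may not use realtime policies. */
static int start_fifo_spinner(pthread_t *tid, int prio, unsigned long *iters)
{
	pthread_attr_t attr;
	struct sched_param sp = { .sched_priority = fifo_priority(prio) };
	int err;

	pthread_attr_init(&attr);
	/* Without EXPLICIT_SCHED the attributes below are ignored and the
	 * new thread inherits the creator's policy. */
	pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
	pthread_attr_setschedparam(&attr, &sp);
	err = pthread_create(tid, &attr, spin, iters);
	pthread_attr_destroy(&attr);
	return err;
}

#ifdef FIFO_SPINNER_MAIN
int main(void)
{
	unsigned long iters = 100000000UL;
	pthread_t tid;
	int err = start_fifo_spinner(&tid, 98, &iters);

	if (err) {
		fprintf(stderr, "pthread_create: %s\n", strerror(err));
		return 1;
	}
	pthread_join(tid, NULL);
	return 0;
}
#endif
```

Built with -pthread -DFIFO_SPINNER_MAIN and run with sufficient privilege,
the thread runs at priority 98 and nothing at SCHED_OTHER gets onto that CPU
until the loop finishes.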

It's potentially quite a problem, really.  For example, it could prevent
various tty operations from completing, and it will prevent kjournald from
ever writing back anything (on uniprocessor, etc).  I've been waiting for
someone to complain ;)

But the other side of the coin is that a SCHED_FIFO userspace task
presumably has extreme latency requirements, so it doesn't *want* to be
preempted by some routine kernel operation.  People would get irritated if
we were to do that.

So what to do?