On Wed, 26 Nov 2014, Vincent Pelletier wrote:

> Hi.
> 
> (please keep me cc'ed in replies, I'm not subscribed)
> 
> I'm trying OpenWRT on a DSL modem[1] which has an out-of-tree HCI
> driver. I'm getting one particular process to hang in D 100% of time
> when talking to a 3G USB modem. I'm on current default OpenWRT kernel,
> 3.14.18 .
> 
> Here is the hung process:
> 
> root@OpenWrt:/# echo w > /proc/sysrq-trigger
> [   98.160000] SysRq : Show Blocked State
> [   98.160000]   task                PC stack   pid father
> [   98.160000] gcom            D 8030861c     0  2228   2188 0x08100020
> [   98.160000] Stack : 81bde880 00000001 00000000 80307b18 81bde880 80435698 
> 81bde88c 80414c60
>         81d213c0 00010000 80410000 81bb7180 00000001 8030861c 81bcd480 
> 81452e00
>         81452e00 803a0000 00000000 814bf070 8007fc04 80435698 80435698 
> 81452e00
>         803a0000 00000004 81d21380 00000010 81be2800 81828900 00000000 
> 81d08c28
>         00000000 00000000 81be2804 81be2884 81452e00 803a0000 00000000 
> 81d08c28
>         ...
> [   98.160000] Call Trace:
> [   98.160000] [<8000d954>] __schedule+0x4cc/0x578
> [   98.160000] [<8030861c>] usb_kill_urb+0xe0/0x164
> [   98.160000] [<81828900>] usb_wwan_close+0x100/0x2e4 [usb_wwan]
> [   98.160000] [<802b6e8c>] tty_port_shutdown+0xc0/0xdc
> [   98.160000] [<802b64ac>] tty_port_close+0x30/0xa4
> [   98.160000] [<802b7918>] tty_release+0x168/0x524
> [   98.160000] [<80041038>] __fput+0xf8/0x274
> [   98.160000] [<80289560>] task_work_run+0xf0/0x128
> [   98.160000] [<800ef3d0>] do_exit+0x3fc/0x858
> [   98.160000] [<800f154c>] do_group_exit+0x78/0xb4
> [   98.160000] [<800198a8>] SyS_faccessat+0x0/0x250

Broadly speaking, this means the kernel is waiting for a completion 
interrupt which never arrives.  The interrupt would indicate that the 
controller hardware has finished accessing the data structures 
belonging to the URB being killed.

> As can be seen, it is while cleaning up process resources on exit (at
> kernel level) that the hang happens. If I unplug the modem, process
> exits and I get the following output:
> [ 2862.228000] option1 ttyUSB0: usb_wwan_indat_callback: resubmit read urb 
> failed. (-19)
> [ 2862.236000] option1 ttyUSB0: usb_wwan_indat_callback: resubmit read urb 
> failed. (-19)
> [ 2862.244000] option1 ttyUSB0: usb_wwan_indat_callback: resubmit read urb 
> failed. (-19)
> [ 2862.252000] option1 ttyUSB0: usb_wwan_indat_callback: resubmit read urb 
> failed. (-19)
> [ 2862.260000] option1 ttyUSB0: option_instat_callback: error -150
> 
> Here is the HCD driver source code:
>   
> http://git.openwrt.org/?p=openwrt.git;a=tree;f=package/kernel/lantiq/ltq-hcd/src;h=926980738d4eb39d78fdf116fe4f2f787040fa1d;hb=fee2d7ec702586850e0eafbd82f43c851d1c45b3
> (to the revision I'm at locally, which is 2 days old as I'm writing
> this, and no change since touch this path)

Since this driver seems to be based on work by Synopsys, maybe someone
from that company can help.

(By the way, the license information in the ifxhcd.h header file 
doesn't say anything about the GPL.  That doesn't seem kosher.)

> Poking around with printk debugging (*cough*), I found that the URB
> __schedule is waiting upon has a use_count of 1. Adding printks after
> the 2 atomic_dec and the only atomic_inc on it, I see this URB's history
> as:
> 
> [   99.284000] 81ae2900->use_count 1 drivers/usb/core/hcd.c:1553
> [   99.916000] 81ae2900->use_count 2 drivers/usb/core/hcd.c:1553
> [   99.916000] 81ae2900->use_count 1 drivers/usb/core/hcd.c:1691
> [  100.332000] 81ae2900->use_count 2 drivers/usb/core/hcd.c:1553
> [  100.332000] 81ae2900->use_count 1 drivers/usb/core/hcd.c:1691
> [  101.436000] IFXUSB: 81ae2900 urbd->phase 0
> [  101.440000] 81ae2900->use_count 2 drivers/usb/core/hcd.c:1553
> [  101.440000] 81ae2900->use_count 1 drivers/usb/core/hcd.c:1691
> [  101.512000] wait_event 81ae2900
> 
> 1553 is right after usb_hcd_submit_urb's atomic_inc
> 1691 is right after __usb_hcd_giveback_urb's atomic_dec
> usb_hcd_submit_urb's atomic_dec is not hit.
> 
> The IFXUSB line is the HCD driver's ifxhcd_urb_dequeue telling it is
> dequeueing the URB, currently in given phase (here 0, so URBD_IDLE).
> wait_event line is usb_kill_urb calling wait_event or given URB.
> This URB does not appear at any time before 99.284, when I started gcom.

So at time 100.332 the URB completes and the completion routine
resubmits it.  At time 101.436 the driver tries to kill the URB.  At
101.440 the URB completes (presumably with a status code indicating
that it was killed) and the completion routine resubmits it.  
Evidently the resubmission succeeds, because usb_hcd_submit_urb's
atomic_dec is not hit.

Resubmission of a killed URB is not supposed to succeed.  If it does, 
it means there's a bug in the HCD: It isn't calling 
usb_hcd_link_urb_to_ep() properly.

That's the problem with using out-of-tree drivers -- they don't track 
changes to the rest of the kernel.  :-(

> Is it legal for use_count to exceed 1 ?

Yes.

> What would it mean ?

It would mean that the completion routine has resubmitted the URB.  
use_count remains elevated until the completion routine returns, and 
the resubmission increments the counter again.

> Eyeballing usb_wwan I could not see any obvious issue. I saw a large
> patch set for this driver since 3.14.18 and copied it over (as of
> linux-stable b0a9aa6da8088b722326a858ab572a13b5b6f9cb), with the only
> visible improvement of not getting errors when unplugging the modem on
> stuck process.

The problem is unlikely to be in the usb_wwan driver.

> Is my debugging approach even reliable ? I'm reading use_count right
> after an atomic_inc, which looks like a bad idea...
> Any idea on what I should search for/look at next ?

Maybe you can adapt an in-tree HCD to work with your host controller 
hardware.  The ifxhcd driver doesn't look so hot, at first glance.  
Maybe it can be fixed up...

> I tried calling dump_stack after 1553, but kernel fails booting (maybe
> URBs timing out just because of printks going through 115200 bauds
> serial link ?). After ~260 seconds the router reboots (no error
> message, maybe out of memory because of message buffer growing faster
> than serial empties it ?).
> If I only call it when count reaches 2, kernel boots and gives
> stack traces I do not understand, such as (extracted from above run):
> 
> [   99.916000] CPU: 0 PID: 1413 Comm: logd Not tainted 3.14.18 #38
> [   99.916000] Stack : 00000000 00000000 00000000 00000000 8095cdbe 00000033 
> 838bd7e8 803f0000
>           803c1644 8042136f 00000585 80953b00 838bd7e8 803f0000 81d395a8 
> 00000020
>           0000000a 802f9998 00000000 80211878 00000000 00000000 803c49a4 
> 838b5ccc
>           00000000 00000000 00000000 00000000 00000000 00000000 00000000 
> 00000000
>           00000000 00000000 00000000 00000000 00000000 00000000 00000000 
> 838b5c58
>           ...
> [   99.916000] Call Trace:
> [   99.916000] [<80260c70>] show_stack+0x48/0x70
> [   99.916000] [<803071d0>] usb_hcd_submit_urb+0xdc/0xa24
> [   99.916000] [<8194aac0>] usb_wwan_close+0x298/0x2e8 [usb_wwan]
> [   99.916000] [<80066610>] __wake_up+0x28/0x48
> [   99.916000] [<81d3467c>] ifxhcd_hc_cleanup+0xae8/0xba4 [ltq_hcd_vr9]
...

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to