#18424: gcom hanging on on exit on TP-Link W8970
--------------------------------+------------------------
 Reporter:  vpelletier          |      Owner:  developers
     Type:  defect              |     Status:  new
 Priority:  normal              |  Milestone:
Component:  kernel              |    Version:  Trunk
 Keywords:  ltq-hcd usb ifxhcd  |
--------------------------------+------------------------
 Trying to get a 3G USB modem to work on my TP-Link, I got gcom to hang
 with 100% reproductibility when running:
 {{{
 root@OpenWrt:/# gcom -d "/dev/ttyUSB0" -s /etc/gcom/getcardinfo.gcom
 ATI


 Manufacturer: huawei

 Model: E160X

 Revision: 11.604.21.05.314

 IMEI: [REDACTED]

 +GCAP: +CGSM,+FCLASS,+DS



 OK


 }}}
 Command line taken from 3g.sh .

 I investigated a bit and reported to linux-usb ML for advice:
   http://marc.info/?l=linux-usb&m=141703314320321&w=2
 And mails extracts ("[...]" added by me to reduce duplications from
 quoting previous mail) so this report is stand-alone:
 {{{
 From: Vincent Pelletier
 Date: Wed, 26 Nov 2014 21:18:55 +0100


 I'm trying OpenWRT on a DSL modem[1] which has an out-of-tree HCI
 driver. I'm getting one particular process to hang in D 100% of time
 when talking to a 3G USB modem. I'm on current default OpenWRT kernel,
 3.14.18 .

 Here is the hung process:

 root@OpenWrt:/# echo w > /proc/sysrq-trigger
 [   98.160000] SysRq : Show Blocked State
 [   98.160000]   task                PC stack   pid father
 [   98.160000] gcom            D 8030861c     0  2228   2188 0x08100020
 [   98.160000] Stack : 81bde880 00000001 00000000 80307b18 81bde880
 80435698 81bde88c 80414c60
           81d213c0 00010000 80410000 81bb7180 00000001 8030861c 81bcd480
 81452e00
           81452e00 803a0000 00000000 814bf070 8007fc04 80435698 80435698
 81452e00
           803a0000 00000004 81d21380 00000010 81be2800 81828900 00000000
 81d08c28
           00000000 00000000 81be2804 81be2884 81452e00 803a0000 00000000
 81d08c28
           ...
 [   98.160000] Call Trace:
 [   98.160000] [<8000d954>] __schedule+0x4cc/0x578
 [   98.160000] [<8030861c>] usb_kill_urb+0xe0/0x164
 [   98.160000] [<81828900>] usb_wwan_close+0x100/0x2e4 [usb_wwan]
 [   98.160000] [<802b6e8c>] tty_port_shutdown+0xc0/0xdc
 [   98.160000] [<802b64ac>] tty_port_close+0x30/0xa4
 [   98.160000] [<802b7918>] tty_release+0x168/0x524
 [   98.160000] [<80041038>] __fput+0xf8/0x274
 [   98.160000] [<80289560>] task_work_run+0xf0/0x128
 [   98.160000] [<800ef3d0>] do_exit+0x3fc/0x858
 [   98.160000] [<800f154c>] do_group_exit+0x78/0xb4
 [   98.160000] [<800198a8>] SyS_faccessat+0x0/0x250

 As can be seen, it is while cleaning up process resources on exit (at
 kernel level) that the hang happens. If I unplug the modem, process
 exits and I get the following output:
 [ 2862.228000] option1 ttyUSB0: usb_wwan_indat_callback: resubmit read urb
 failed. (-19)
 [ 2862.236000] option1 ttyUSB0: usb_wwan_indat_callback: resubmit read urb
 failed. (-19)
 [ 2862.244000] option1 ttyUSB0: usb_wwan_indat_callback: resubmit read urb
 failed. (-19)
 [ 2862.252000] option1 ttyUSB0: usb_wwan_indat_callback: resubmit read urb
 failed. (-19)
 [ 2862.260000] option1 ttyUSB0: option_instat_callback: error -150

 Here is the HCD driver source code:
   http://git.openwrt.org/?p=openwrt.git;a=tree;f=package/kernel/lantiq
 /ltq-
 
hcd/src;h=926980738d4eb39d78fdf116fe4f2f787040fa1d;hb=fee2d7ec702586850e0eafbd82f43c851d1c45b3
 (to the revision I'm at locally, which is 2 days old as I'm writing
 this, and no change since touch this path)

 Poking around with printk debugging (*cough*), I found that the URB
 __schedule is waiting upon has a use_count of 1. Adding printks after
 the 2 atomic_dec and the only atomic_inc on it, I see this URB's history
 as:

 [   99.284000] 81ae2900->use_count 1 drivers/usb/core/hcd.c:1553
 [   99.916000] 81ae2900->use_count 2 drivers/usb/core/hcd.c:1553
 [   99.916000] 81ae2900->use_count 1 drivers/usb/core/hcd.c:1691
 [  100.332000] 81ae2900->use_count 2 drivers/usb/core/hcd.c:1553
 [  100.332000] 81ae2900->use_count 1 drivers/usb/core/hcd.c:1691
 [  101.436000] IFXUSB: 81ae2900 urbd->phase 0
 [  101.440000] 81ae2900->use_count 2 drivers/usb/core/hcd.c:1553
 [  101.440000] 81ae2900->use_count 1 drivers/usb/core/hcd.c:1691
 [  101.512000] wait_event 81ae2900

 1553 is right after usb_hcd_submit_urb's atomic_inc
 1691 is right after __usb_hcd_giveback_urb's atomic_dec
 usb_hcd_submit_urb's atomic_dec is not hit.

 The IFXUSB line is the HCD driver's ifxhcd_urb_dequeue telling it is
 dequeueing the URB, currently in given phase (here 0, so URBD_IDLE).
 wait_event line is usb_kill_urb calling wait_event or given URB.
 This URB does not appear at any time before 99.284, when I started gcom.

 Is it legal for use_count to exceed 1 ?
 What would it mean ?
 Eyeballing usb_wwan I could not see any obvious issue. I saw a large
 patch set for this driver since 3.14.18 and copied it over (as of
 linux-stable b0a9aa6da8088b722326a858ab572a13b5b6f9cb), with the only
 visible improvement of not getting errors when unplugging the modem on
 stuck process.
 Is my debugging approach even reliable ? I'm reading use_count right
 after an atomic_inc, which looks like a bad idea...
 Any idea on what I should search for/look at next ?

 I tried calling dump_stack after 1553, but kernel fails booting (maybe
 URBs timing out just because of printks going through 115200 bauds
 serial link ?). After ~260 seconds the router reboots (no error
 message, maybe out of memory because of message buffer growing faster
 than serial empties it ?).
 If I only call it when count reaches 2, kernel boots and gives
 stack traces I do not understand, such as (extracted from above run):

 [   99.916000] CPU: 0 PID: 1413 Comm: logd Not tainted 3.14.18 #38
 [   99.916000] Stack : 00000000 00000000 00000000 00000000 8095cdbe
 00000033 838bd7e8 803f0000
           803c1644 8042136f 00000585 80953b00 838bd7e8 803f0000 81d395a8
 00000020
           0000000a 802f9998 00000000 80211878 00000000 00000000 803c49a4
 838b5ccc
           00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000
           00000000 00000000 00000000 00000000 00000000 00000000 00000000
 838b5c58
           ...
 [   99.916000] Call Trace:
 [   99.916000] [<80260c70>] show_stack+0x48/0x70
 [   99.916000] [<803071d0>] usb_hcd_submit_urb+0xdc/0xa24
 [   99.916000] [<8194aac0>] usb_wwan_close+0x298/0x2e8 [usb_wwan]
 [   99.916000] [<80066610>] __wake_up+0x28/0x48
 [   99.916000] [<81d3467c>] ifxhcd_hc_cleanup+0xae8/0xba4 [ltq_hcd_vr9]

 [  100.332000] CPU: 0 PID: 973 Comm: kworker/0:2 Not tainted 3.14.18 #38
 [  100.332000] Workqueue: events usb_serial_suspend [usbserial]
 [  100.332000] Stack : 818c8680 803c1644 80414c60 81ae2908 803f0000
 803f0000 81d395a8 00000020
           0000000a 802f9998 000003cd 80211ffc 00000000 00000000 803e2e00
 818bdb34
           818bdb34 802f9998 00000000 80211aa4 00000000 000005f2 000005f2
 8094d7e0
           00000000 00000000 00000000 00000000 00000000 00000000 6576656e
 74730000
           00000000 00000000 00000000 00000000 8381bb80 81085500 818507d4
 818bdae0
           ...
 [  100.332000] Call Trace:
 [  100.332000] [<80260c70>] show_stack+0x48/0x70
 [  100.332000] [<803071d0>] usb_hcd_submit_urb+0xdc/0xa24
 [  100.332000] [<8194aac0>] usb_wwan_close+0x298/0x2e8 [usb_wwan]

 [  101.440000] CPU: 0 PID: 2254 Comm: gcom Not tainted 3.14.18 #38
 [  101.440000] Stack : 00000000 00000000 00000000 00000000 8095cdbe
 00000033 81507748 803f0000
           803c1644 8042136f 000008ce 80953b00 81507748 803f0000 80410000
 00000020
           7fc48cf0 802f9998 00000000 80211878 00000000 00000000 803c49a4
 814c3b14
           00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000
           00000000 00000000 00000000 00000000 00000000 00000000 00000000
 814c3aa0
           ...
 [  101.440000] Call Trace:
 [  101.440000] [<80260c70>] show_stack+0x48/0x70
 [  101.440000] [<803071d0>] usb_hcd_submit_urb+0xdc/0xa24
 [  101.440000] [<8194aac0>] usb_wwan_close+0x298/0x2e8 [usb_wwan]

 [1] http://wiki.openwrt.org/toh/tp-link/td-w8970

 }}}

 {{{
 From: Alan Stern
 Date: Wed, 26 Nov 2014 16:25:08 -0500 (EST)

 On Wed, 26 Nov 2014, Vincent Pelletier wrote:

 > Hi.
 [...]
 > [   98.160000] [<800198a8>] SyS_faccessat+0x0/0x250

 Broadly speaking, this means the kernel is waiting for a completion
 interrupt which never arrives.  The interrupt would indicate that the
 controller hardware has finished accessing the data structures
 belonging to the URB being killed.

 > As can be seen, it is while cleaning up process resources on exit (at
 [...]
 > this, and no change since touch this path)

 Since this driver seems to be based on work by Synopsys, maybe someone
 from that company can help.

 (By the way, the license information in the ifxhcd.h header file
 doesn't say anything about the GPL.  That doesn't seem kosher.)

 > Poking around with printk debugging (*cough*), I found that the URB
 [...]
 > This URB does not appear at any time before 99.284, when I started gcom.

 So at time 100.332 the URB completes and the completion routine
 resubmits it.  At time 101.436 the driver tries to kill the URB.  At
 101.440 the URB completes (presumably with a status code indicating
 that it was killed) and the completion routine resubmits it.
 Evidently the resubmission succeeds, because usb_hcd_submit_urb's
 atomic_dec is not hit.

 Resubmission of a killed URB is not supposed to succeed.  If it does,
 it means there's a bug in the HCD: It isn't calling
 usb_hcd_link_urb_to_ep() properly.

 That's the problem with using out-of-tree drivers -- they don't track
 changes to the rest of the kernel.  :-(

 > Is it legal for use_count to exceed 1 ?

 Yes.

 > What would it mean ?

 It would mean that the completion routine has resubmitted the URB.
 use_count remains elevated until the completion routine returns, and
 the resubmission increments the counter again.

 > Eyeballing usb_wwan I could not see any obvious issue. I saw a large
 > patch set for this driver since 3.14.18 and copied it over (as of
 > linux-stable b0a9aa6da8088b722326a858ab572a13b5b6f9cb), with the only
 > visible improvement of not getting errors when unplugging the modem on
 > stuck process.

 The problem is unlikely to be in the usb_wwan driver.

 > Is my debugging approach even reliable ? I'm reading use_count right
 > after an atomic_inc, which looks like a bad idea...
 > Any idea on what I should search for/look at next ?

 Maybe you can adapt an in-tree HCD to work with your host controller
 hardware.  The ifxhcd driver doesn't look so hot, at first glance.
 Maybe it can be fixed up...
 }}}

 {{{
 From: Vincent Pelletier
 To: Alan Stern
 Date: Thu, 27 Nov 2014 08:47:35 +0100

 On Wed, 26 Nov 2014 16:25:08 -0500 (EST), Alan Stern
 <[email protected]> wrote:
 > Since this driver seems to be based on work by Synopsys, maybe someone
 > from that company can help.
 >
 > (By the way, the license information in the ifxhcd.h header file
 > doesn't say anything about the GPL.  That doesn't seem kosher.)

 Actually the topmost chunk, mentioning Lantiq copyright, does. But it
 also says parts (which ?) are copyright Synopsys with a "redistribute
 but keep this notice" licence.

 > So at time 100.332 the URB completes and the completion routine
 > resubmits it.  At time 101.436 the driver tries to kill the URB.  At
 > 101.440 the URB completes (presumably with a status code indicating
 > that it was killed) and the completion routine resubmits it.
 > Evidently the resubmission succeeds, because usb_hcd_submit_urb's
 > atomic_dec is not hit.
 >
 > Resubmission of a killed URB is not supposed to succeed.

 Aha, this may just be what I was missing. I was aiming at getting
 use_count down by finding some missing completion call, rather than
 questioning that last submission's success.

 > If it does, it means there's a bug in the HCD: It isn't calling
 > usb_hcd_link_urb_to_ep() properly.

 Indeed, no trace of this function being called.

 > Maybe you can adapt an in-tree HCD to work with your host controller
 > hardware.  The ifxhcd driver doesn't look so hot, at first glance.
 > Maybe it can be fixed up...

 Looking at drivers/usb/host I see quite short files which I understand
 as hardware-specific (ex: 551 lines for ehci-terga.c). Such length
 should be in my reach, but the 6k lines of some non-[eo]hci are very
 likely not. I'll try to understand how this controller would fit in.
 }}}

 {{{
 From: Alan Stern
 Subject: Re: Suspected (out of tree) HCI issue

 On Thu, 27 Nov 2014, Vincent Pelletier wrote:

 > > If it does, it means there's a bug in the HCD: It isn't calling
 > > usb_hcd_link_urb_to_ep() properly.
 >
 > Indeed, no trace of this function being called.

 If the Synopsys driver hasn't been updated since before that function
 was added to the USB stack, it must be quite out of date.

 > > Maybe you can adapt an in-tree HCD to work with your host controller
 > > hardware.  The ifxhcd driver doesn't look so hot, at first glance.
 > > Maybe it can be fixed up...
 >
 > Looking at drivers/usb/host I see quite short files which I understand
 > as hardware-specific (ex: 551 lines for ehci-terga.c). Such length

 Those are just bus glue; they provide the necessary information for
 hooking up the platform's EHCI implementation to the main ehci-hcd
 driver.

 > should be in my reach, but the 6k lines of some non-[eo]hci are very
 > likely not. I'll try to understand how this controller would fit in.

 If your hardware resembles, say, the DWC2 hardware then maybe the DWC2
 driver (drivers/usb/dwc2/) can be made to work with it.
 }}}

 I'll continue working on this, but I am likely to get swamped in kernel
 code.

 Any other ideas from an OpenWRT point of view ? For example, an updated
 driver known to be somewhere but not merged, or some contact with
 Lantiq/Synopsys, ...

 Or from a pure kernel point of view, any idea of what in-tree driver would
 likely be close to ltq-hcd, or what would be the signs I should look for
 myself ?
 I downloaded the ehci spec to at least see if by any chance this hcd would
 conform to that. I am rather familiar with USB, but either userspace (I
 maintain a python wrapper for libusb) or on the wire (I also maintain a -
 userspace - driver for an USB protocol analyzer).

--
Ticket URL: <https://dev.openwrt.org/ticket/18424>
OpenWrt <http://openwrt.org>
Opensource Wireless Router Technology
_______________________________________________
openwrt-tickets mailing list
[email protected]
https://lists.openwrt.org/cgi-bin/mailman/listinfo/openwrt-tickets

Reply via email to