On Tue, 16 Sep 2014 10:44:45 -0400
Alan Stern <[email protected]> wrote:
> On Tue, 16 Sep 2014, Joe Lawrence wrote:
>
> > > You can check for this. Sprinkle ehci_info messages throughout
> > > ehci_stop, printing the value of ehci->async->qh_next.qh. It should be
> > > NULL the entire time.
> >
> > I re-ran the tests with the following changes:
> >
> > - moved num_async debug up above struct ehci_hcd.priv[0] :)
> > - added verbose debug msg in check_async_ring to display the new
> > num_async value
> > - added ehci_info to every other line in ehci_stop
> >
> > % grep '0000:66:00.1' vmcore-dmesg.txt
> >
> > [ 4936.956739] pci 0000:66:00.1: [8086:1d2d] type 00 class 0x0c0320
> > [ 4936.956789] pci 0000:66:00.1: reg 0x10: [mem 0xa0000400-0xa00007ff]
> > [ 4936.957019] pci 0000:66:00.1: PME# supported from D0 D3hot D3cold
> > [ 4939.564933] ehci-pci 0000:66:00.1: EHCI Host Controller
> > [ 4939.565462] ehci-pci 0000:66:00.1: new USB bus registered, assigned bus
> > number 4
> > [ 4939.565498] ehci-pci 0000:66:00.1: debug port 2
> > [ 4939.569466] ehci-pci 0000:66:00.1: cache line size of 64 is not supported
> > [ 4939.569533] ehci-pci 0000:66:00.1: irq 13, io mem 0xa0000400
> > [ 4939.574794] ehci-pci 0000:66:00.1: USB 2.0 started, EHCI 1.00
> > [ 4939.575410] usb usb4: SerialNumber: 0000:66:00.1
> > [ 4940.089044] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4941.795700] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 0, qh= (null)
> > [ 4941.895713] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4941.943761] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 0, qh= (null)
> > [ 4942.002715] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4942.064346] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 0, qh= (null)
> > [ 4942.156646] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4942.226028] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 0, qh= (null)
> > [ 4942.274185] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4942.329196] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 0, qh= (null)
> > [ 4942.391055] ehci-pci 0000:66:00.1: check_async_ring:970
> > ehci(ffff8808419839d0)->num_async = 1, qh=ffff88083d6b4558
> > [ 4942.629653] ehci-pci 0000:66:00.1: HC died; cleaning up
> > [ 4945.663153] ehci-pci 0000:66:00.1: remove, state 1
> > [ 4945.733934] ehci-pci 0000:66:00.1: ehci_stop:420
> > ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.774242] ehci-pci 0000:66:00.1: ehci_stop:427
> > ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.814791] ehci-pci 0000:66:00.1: ehci_stop:429
> > ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.855938] ehci-pci 0000:66:00.1: ehci_stop:431
> > ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.897145] ehci-pci 0000:66:00.1: ehci_stop:433
> > ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.938791] ehci-pci 0000:66:00.1: ehci_stop:436
> > ehci(ffff8808419839d0)->num_async = 1
> > [ 4945.980491] ehci-pci 0000:66:00.1: ehci_stop:438
> > ehci(ffff8808419839d0)->num_async = 1
> > [ 4946.022221] ehci-pci 0000:66:00.1: ehci_stop:440
> > ehci(ffff8808419839d0)->num_async = 1
> > [ 4946.063936] ehci-pci 0000:66:00.1: ehci_stop:444
> > ehci(ffff8808419839d0)->num_async = 1
> > [ 4946.105631] ehci-pci 0000:66:00.1: ehci_stop:446
> > ehci(ffff8808419839d0)->num_async = 1
> > [ 4946.241802] ehci-pci 0000:66:00.1: ehci_stop:448
> > ehci(ffff8808419839d0)->num_async = 1
> >
> > where ehci_stop line 420 is function entry and line 448 function
> > return. Notice that check_async_ring incremented num_async just before
> > the HC died/removed and then num_async was 1 throughout ehci_stop.
> >
> > qh_destroy then BUGs with a struct ehci_hcd *ehci of ffff8808419839d0.
>
> This says that ehci->num_async ended up equal to 1. But in your
> previous email, you said the final printout indicated it was equal to
> 0. Did something change in the meantime?
I just double checked the logs, and that is correct. However, the
earlier reports of a dozen or so crashes included num_async at the end
of struct ehci_hcd. Perhaps the priv[] array clobbered the value?
Sorry for the confusion.
> Anyway, the log above means that a QH was linked before the HC died,
> but then it was never unlinked. Please add a line at the start of
> ehci_endpoint_disable, printing out the value of ep->hcpriv for
> comparison with the values listed above.
Will do.
> Also, let's see the contents of *qh and *(qh->hw) for the left-over QH
> at *(ehci->async->qh_next.qh). You can get those right away from your
> crash dump data, right?
Yep, interestingly enough, *(ehci->async->qh_next.qh) appears to be
mostly full of zeros:
------------[ cut here ]------------
kernel BUG at drivers/usb/host/ehci-mem.c:72!
invalid opcode: 0000 [#1] SMP
CPU: 16 PID: 17572 Comm: kworker/u66:1 Tainted: PF W O--------------
3.10.0-123.6.3.el7.bigphysarea_expedient1_usb3.x86_64 #1
Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
Workqueue: ftworker io_poll [ftmod]
task: ffff881051320000 ti: ffff8810440aa000 task.ti: ffff8810440aa000
RIP: 0010:[<ffffffff81412136>] [<ffffffff81412136>] qh_destroy+0x66/0x90
RSP: 0018:ffff8810440ab870 EFLAGS: 00010246
RAX: ffff88085202b588 RBX: ffff88085202b570 RCX: 0000000000000000
RDX: ffff88107fc10040 RSI: ffff88085202b570 RDI: ffff8808419839d0
RBP: ffff8810440ab880 R08: 0000000000000082 R09: 00000000000061c8
R10: 000000000000000f R11: 0000000007070707 R12: ffff8808419839d0
R13: ffff880841983abc R14: ffff88084e425700 R15: 0000000000000048
FS: 0000000000000000(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7727dfe010 CR3: 000000104198c000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff8808419839d0 ffff8808419839d0 ffff8810440ab8c0 ffffffff81413388
0000000000000000 0000000000000046 ffff8808419837b0 ffff8808419839d0
ffff880841983abc ffff88084e425700 ffff8810440ab948 ffffffff81417d0f
Call Trace:
[<ffffffff81413388>] ehci_mem_cleanup+0x28/0x180
[<ffffffff81417d0f>] ehci_stop+0x29f/0x410
[<ffffffff8106e35b>] ? lock_timer_base.isra.35+0x2b/0x50
[<ffffffff8106f04e>] ? try_to_del_timer_sync+0x5e/0x90
[<ffffffff813f8d38>] usb_remove_hcd+0x108/0x1d0
[<ffffffff8140a15f>] usb_hcd_pci_remove+0x12f/0x150
[<ffffffff812ec5bb>] pci_device_remove+0x3b/0xb0
[<ffffffff813b6b9f>] __device_release_driver+0x7f/0xf0
[<ffffffff813b6c33>] device_release_driver+0x23/0x30
[<ffffffff813b63a8>] bus_remove_device+0x108/0x180
[<ffffffff813b2955>] device_del+0x135/0x1d0
[<ffffffff812e54b4>] pci_stop_bus_device+0x94/0xa0
[<ffffffff812e545b>] pci_stop_bus_device+0x3b/0xa0
[<ffffffff812e545b>] pci_stop_bus_device+0x3b/0xa0
[<ffffffff812e55a2>] pci_stop_and_remove_bus_device+0x12/0x20
[<ffffffffa02ea917>] ioboard_bringdown+0xa7/0xe0 [ftmod]
[<ffffffffa02eac07>] ftmod_ioboard_event+0xd7/0x400 [ftmod]
[<ffffffffa0538a00>] ? OS_READ_PORT_UINT16+0x20/0x20 [ccmod]
[<ffffffffa02e87b2>] io_state_change+0x92/0x130 [ftmod]
Quick double-check that assembly still left registers in expected places:
drivers/usb/host/ehci-mem.c: 68
<qh_destroy+0xa>: mov %rsp,%rbp
<qh_destroy+0xd>: push %r12
<qh_destroy+0xf>: mov %rdi,%r12
...
drivers/usb/host/ehci-mem.c: 68
<qh_destroy+0x17>: mov %rsi,%rbx
so:
R12: struct ehci_hcd ffff8808419839d0 << ehci
RBX: struct ehci_qh ffff88085202b570 << qh
crash> struct ehci_hcd ffff8808419839d0 | grep async
...
async = 0xffff88085202b570, << ehci->async
num_async = 0x1,
...
crash> struct ehci_qh 0xffff88085202b570
struct ehci_qh {
hw = 0xffff88007ed0b000,
qh_dma = 0x7ed0b000,
qh_next = {
qh = 0xffff88083d6b4558, << ehci->async->qh_next.qh
itd = 0xffff88083d6b4558,
sitd = 0xffff88083d6b4558,
fstn = 0xffff88083d6b4558,
hw_next = 0xffff88083d6b4558,
ptr = 0xffff88083d6b4558
},
qtd_list = {
next = 0xffff88085202b588,
prev = 0xffff88085202b588
},
intr_node = {
next = 0x0,
prev = 0x0
},
dummy = 0xffff88007895f000,
unlink_node = {
next = 0xffff88085202b5b0,
prev = 0xffff88085202b5b0
},
unlink_cycle = 0x0,
qh_state = 0x1,
xacterrs = 0x0,
usecs = 0x0,
gap_uf = 0x0,
c_usecs = 0x0,
tt_usecs = 0x0,
period = 0x0,
start = 0x0,
dev = 0x0,
is_out = 0x0,
clearing_tt = 0x0,
dequeue_during_giveback = 0x0,
exception = 0x0
}
crash> struct ehci_qh 0xffff88083d6b4558 << *(ehci->async->qh_next.qh)
struct ehci_qh {
hw = 0x0,
qh_dma = 0x0,
qh_next = {
qh = 0x0,
itd = 0x0,
sitd = 0x0,
fstn = 0x0,
hw_next = 0x0,
ptr = 0x0
},
qtd_list = {
next = 0x0,
prev = 0x0
},
intr_node = {
next = 0x0,
prev = 0x0
},
dummy = 0x0,
unlink_node = {
next = 0x0,
prev = 0x0
},
unlink_cycle = 0x6,
qh_state = 0x0,
xacterrs = 0x0,
usecs = 0x0,
gap_uf = 0x0,
c_usecs = 0x0,
tt_usecs = 0x0,
period = 0x0,
start = 0x0,
dev = 0x6,
is_out = 0x0,
clearing_tt = 0x0,
dequeue_during_giveback = 0x0,
exception = 0x0
}
Regards,
-- Joe
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html