Hi,
I just tried the 3.9 kernel, once with pcie_aspm=off and once with
pcie_aspm=native.
I found that the "HW died" message appears only in the former case.
I believe this is a bug. If I unplug an express card with a NEC-based USB3
host, it should be properly terminated, and xhci_hcd should unbind *even* when
"HW died" happened. That is not the case now, so I have to do:
echo 1 > /sys/bus/pci/devices/0000:11:00.0/remove
to get rid of the stale 11:00 device from my system (sysfs entries):
/proc/iomem:
f1104000-f1104fff : r8169
f6800000-f6bfffff : 0000:00:02.0
f6c00000-f7cfffff : PCI Bus 0000:11
- f6c00000-f6c01fff : 0000:11:00.0
- f6c00000-f6c01fff : xhci_hcd
f7d00000-f7dfffff : PCI Bus 0000:0b
f7d00000-f7d0ffff : 0000:0b:00.0
f7d00000-f7d0ffff : xhci_hcd
/proc/interrupts:
- 45: 1 0 PCI-MSI-edge xhci_hcd
- 46: 0 0 PCI-MSI-edge xhci_hcd
- 47: 0 0 PCI-MSI-edge xhci_hcd
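Whether such leftovers are still present can be checked with a few lines of
script. The following is only a rough sketch: the function name
stale_iomem_entries is made up for illustration, and it simply assumes the
usual "start-end : name" layout of /proc/iomem, treating any entry that covers
exactly the same range as the PCI address (typically the driver that claimed
it) as belonging to the device.

```python
import re

# Matches one /proc/iomem line, e.g. "  f6c00000-f6c01fff : 0000:11:00.0"
LINE_RE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+)\s*:\s*(.+?)\s*$")


def stale_iomem_entries(iomem_text, pci_addr):
    """Return (start, end, owner) tuples for ranges tied to pci_addr.

    A range is tied to the device if it is named after the PCI address
    itself, or if it is another entry covering exactly the same range
    (assumed here to be the driver that claimed it, e.g. xhci_hcd).
    """
    entries = []
    for line in iomem_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            entries.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))

    # Ranges that are named after the device itself.
    device_ranges = {(s, e) for s, e, name in entries if name == pci_addr}

    # Everything that sits on exactly those ranges, in file order.
    return [(s, e, name) for s, e, name in entries if (s, e) in device_ranges]
```

Feeding it the contents of /proc/iomem after the eject, together with
"0000:11:00.0", should return an empty list once the device is really gone.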
With pcie_aspm=off, the first hot eject of the express card with the USB3.0
controller does not result in "HW died" but in "HC error bitmask = 0x4",
whatever that means. That is because pciehp is broken under pcie_aspm=off
(unlike under pcie_aspm=native), but that is not a story for linux-usb.
[ 62.960729] xhci_hcd 0000:0b:00.0: Poll event ring: 4294943584
[ 62.960732] xhci_hcd 0000:11:00.0: Poll event ring: 4294943584
[ 62.960757] xhci_hcd 0000:11:00.0: op reg status = 0x0
[ 62.960763] xhci_hcd 0000:11:00.0: ir_set 0 pending = 0x2
[ 62.960764] xhci_hcd 0000:11:00.0: HC error bitmask = 0x4
[ 62.960765] xhci_hcd 0000:11:00.0: Event ring:
[ 62.960768] xhci_hcd 0000:11:00.0: @00000000d6020400 d6020000 00000000 01003028 0000c001
[ 62.960769] xhci_hcd 0000:0b:00.0: op reg status = 0x0
[ 62.960771] xhci_hcd 0000:11:00.0: @00000000d6020410 00000000 00000000 00000000 00000000
[ 62.960772] xhci_hcd 0000:11:00.0: @00000000d6020420 00000000 00000000 00000000 00000000
[ 62.960773] xhci_hcd 0000:0b:00.0: ir_set 0 pending = 0x2
[ 62.960775] xhci_hcd 0000:11:00.0: @00000000d6020430 00000000 00000000 00000000 00000000
[ 62.960776] xhci_hcd 0000:0b:00.0: HC error bitmask = 0x0
[ 62.960777] xhci_hcd 0000:11:00.0: @00000000d6020440 00000000 00000000 00000000 00000000
The kernel is still looking for the device, which is silly, because the device
has already been ejected from the express card slot:
+[ 62.961160] xhci_hcd 0000:11:00.0: // xHC command ring deq ptr low bits + flags = @00000008
+[ 62.961161] xhci_hcd 0000:11:00.0: // xHC command ring deq ptr high bits = @00000000
A subsequent hot re-insert of the card goes unnoticed by pciehp (due to a bug
caused by pcie_aspm=off), and therefore xhci_hcd is puzzled and spits out:
+[ 123.191537] xhci_hcd 0000:0b:00.0: Poll event ring: 4294949600
+[ 123.191547] xhci_hcd 0000:11:00.0: Poll event ring: 4294949600
+[ 123.191557] xhci_hcd 0000:11:00.0: op reg status = 0xffffffff
+[ 123.191563] xhci_hcd 0000:0b:00.0: op reg status = 0x0
+[ 123.191570] xhci_hcd 0000:0b:00.0: ir_set 0 pending = 0x2
+[ 123.191574] xhci_hcd 0000:11:00.0: HW died, polling stopped.
+[ 123.191580] xhci_hcd 0000:0b:00.0: HC error bitmask = 0x0
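The "op reg status = 0xffffffff" line is the telling part: a read from a PCI
device that is no longer present normally comes back as all ones. As a rough
sketch (the function name dead_controllers is made up for illustration, and
the regular expression assumes the debug message format shown above), such
controllers could be picked out of a saved dmesg like this:

```python
import re

# Assumes the xhci_hcd debug format seen in the log above:
#   "xhci_hcd 0000:11:00.0: op reg status = 0xffffffff"
STATUS_RE = re.compile(r"xhci_hcd (\S+): op reg status = (0x[0-9a-fA-F]+)")


def dead_controllers(dmesg_text):
    """Return PCI addresses whose status register read back as all ones."""
    dead = []
    for m in STATUS_RE.finditer(dmesg_text):
        addr = m.group(1)
        status = int(m.group(2), 16)
        # All ones is what a read from a vanished device looks like.
        if status == 0xFFFFFFFF and addr not in dead:
            dead.append(addr)
    return dead
```

On the excerpt above this would report only 0000:11:00.0, while 0000:0b:00.0
(status 0x0) is left alone.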
At this step xhci_hcd should unbind the dead device so that its sysfs entries
(both iomem and interrupts) can be removed. If that does not happen, or is not
done manually, a subsequent hot insert has no chance to succeed: it silently
proceeds, but the device is left unconfigured and the sysfs entries show just
stale cached values. This can be demonstrated when a desperate user inserts a
different express card (lspci shows a mixture of both cards, while the sysfs
entries show only the old data). Let's clean up the mess and ensure xhci_hcd
releases the resources allocated by the dead device.
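Until that is fixed, the manual workaround can be scripted. This is just a
minimal sketch of the same "echo 1 > .../remove" step from above; the function
name remove_pci_device and the sysfs_root parameter are my own invention (the
latter only so the function can be tried against a scratch directory), and on
a real system it has to run as root.

```python
import os


def remove_pci_device(pci_addr, sysfs_root="/sys"):
    """Ask the kernel to remove a PCI device, like `echo 1 > .../remove`.

    Returns True if the request was written, False if the device has no
    "remove" attribute under sysfs_root (nothing to clean up).
    """
    path = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "remove")
    if not os.path.exists(path):
        return False
    with open(path, "w") as f:
        f.write("1")
    return True
```

For the stale device here that would be remove_pci_device("0000:11:00.0").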
I speculate that "HC error bitmask = 0x4" should result in a "HW died" case as
well.
Thank you,
Martin
P.S.: The collected dmesg/lspci/iomem/interrupts data are at:
http://195.113.57.32/~mmokrejs/tmp/20130430.tar.bz2
in the 3.9/off subdirectory (the pcie_aspm=off case). The working
pcie_aspm=native behavior is documented under the 3.9/native subdirectory.