Hi Bob,

On 07/09/13(Sat) 08:14, RD Thrush wrote:
> Since appx. November, 2012, I've had 2 amd64 systems hang while
> spewing "ehci_idone: ex=0xffff800000...... is done!" messages to the
> serial console.  The hangs are intermittent.  The system is
> unresponsive to the keyboard and doesn't respond to network ping.  A
> hardware reset is necessary to regain control.
> 
> In order to help troubleshoot, I patched /usr/src/sys/dev/usb/ehci.c
> to panic when the forementioned message had occurred 9 times and then
> built a custom kernel with EHCI_DEBUG defined.  In the past day, the
> new panic has occurred on the same machine with both an mp and sp
> kernel and I have collected basic ddb information as well as crash
> dumps.
> 
> Will the ddb results from my patch [below] help troubleshoot the hang?
> If so, the largish console logs, usbdevs, pcidump and acpidump are
> located at <http://arp.thrush.com/openbsd/ehci_idone/x4/>.
> 
> NB: ehcidebug=0 in the sp session, while ehcidebug=3 or 2 in the mp session.
> Setting ehcidebug=3 seemed to hang but I was able to interrupt ddb, set
> ehcidebug=2 and continue the ddb session.
> 
> I appreciate any help diagnosing this problem.

Thanks for this detailed bug report.

You're saying that you have 2 amd64 systems with the same problem but
I see only the dmesg for one machine, does the other has the same ehci
controller?

The problem you are seeing is related to the way ehci transfers are
aborted.  The abortion process is subtly broken.

For the archives what happens in your case is that the timeout for
one of the transfers fires and enqueue an abort task (ehci_timeout
in your log).  This abort task get scheduled tries to deactivate
the qTDs, asks for an Interrupt on Async Advance Doorbell and goes
to sleep (ehci_sync_hc in your log).
Then the interrupt happens (ehci_intr1: door bell), wakeups the
task and goes into the softinterrupt path to process the finished
transfers.  Here the driver discovers that the transfer that timed
out is finished (whoa!) and tries to handles it.  But since this 
transfer has been marked as TIMEOUT (ehci_idone: aborted in your
log), it does nothing and bails.  

Apparently the abort task never get rescheduled and your transfer
is never removed from the list, certainly because the hardware 
keeps interrupting your systems, so you're livelock ;)

But all of that happens because a timeout fires for one of your
transfers, apparently some ATI controllers needs one more quirk,
as your problem looks like a dropped interrupt.  Does the diff
below helps? 

Index: pci/ehci_pci.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/pci/ehci_pci.c,v
retrieving revision 1.26
diff -u -p -r1.26 ehci_pci.c
--- pci/ehci_pci.c      15 Apr 2013 09:23:01 -0000      1.26
+++ pci/ehci_pci.c      10 Sep 2013 08:19:28 -0000
@@ -204,10 +204,16 @@ ehci_pci_attach(struct device *parent, s
        else
                snprintf(sc->sc.sc_vendor, sizeof(sc->sc.sc_vendor),
                    "vendor 0x%04x", PCI_VENDOR(pa->pa_id));
-       
+
        /* Enable workaround for dropped interrupts as required */
-       if (sc->sc.sc_id_vendor == PCI_VENDOR_VIATECH)
+       switch (sc->sc.sc_id_vendor) {
+       case PCI_VENDOR_ATI:
+       case PCI_VENDOR_VIATECH:
                sc->sc.sc_flags |= EHCIF_DROPPED_INTR_WORKAROUND;
+               break;
+       default:
+               break;
+       }
 
        ehci_pci_takecontroller(sc, 0);
        r = ehci_init(&sc->sc);
Index: usb/ehci.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/usb/ehci.c,v
retrieving revision 1.134
diff -u -p -r1.134 ehci.c
--- usb/ehci.c  12 Jun 2013 11:42:01 -0000      1.134
+++ usb/ehci.c  10 Sep 2013 08:19:28 -0000
@@ -2943,11 +2943,11 @@ ehci_timeout_task(void *addr)
 }
 
 /*
- * Some EHCI chips from VIA seem to trigger interrupts before writing back the
- * qTD status, or miss signalling occasionally under heavy load.  If the host
- * machine is too fast, we we can miss transaction completion - when we scan
- * the active list the transaction still seems to be active.  This generally
- * exhibits itself as a umass stall that never recovers.
+ * Some EHCI chips from VIA / ATI seem to trigger interrupts before writing
+ * back the qTD status, or miss signalling occasionally under heavy load.
+ * If the host machine is too fast, we can miss transaction completion - when
+ * we scan the active list the transaction still seems to be active. This
+ * generally exhibits itself as a umass stall that never recovers.
  *
  * We work around this behaviour by setting up this callback after any softintr
  * that completes with transactions still pending, giving us another chance to

Reply via email to