On Sat, 13 Sep 2014, Joe Lawrence wrote:
> Hi Alan,
>
> I've collected 16 crashes since kicking off automated tests a little
> over 24 hrs ago.
>
> Each crash hit the BUG in qh_destroy() and the only unique debugging
> printk from ehci_stop() was: "ehci_stop: ehci->num_async = 0".
What about error messages from the check_async_ring routine?
> I can include (or upload) a full (or filtered) vmcore-dmesg.txt if that
> would be more helpful.
Only if it includes one of those messages.
> The debug code I added as you suggested is provided below...
>
> Thanks,
>
> -- Joe
>
>
> diff -Nupr before/drivers/usb/host/ehci.h after/drivers/usb/host/ehci.h
> --- before/drivers/usb/host/ehci.h 2014-07-16 14:25:31.000000000 -0400
> +++ after/drivers/usb/host/ehci.h 2014-09-12 15:47:18.943478397 -0400
> @@ -231,6 +231,8 @@ struct ehci_hcd { /* one per controlle
>
> /* platform-specific data -- must come last */
> unsigned long priv[0] __aligned(sizeof(s64));
> +
> + int num_async;
> };
Not a good idea to put your new field _after_ a field that is
documented as coming last, but in this case it probably won't hurt.
> /* convert between an HCD pointer and the corresponding EHCI_HCD */
> diff -Nupr before/drivers/usb/host/ehci-hcd.c
> after/drivers/usb/host/ehci-hcd.c
> --- before/drivers/usb/host/ehci-hcd.c 2014-07-16 14:25:31.000000000
> -0400
> +++ after/drivers/usb/host/ehci-hcd.c 2014-09-12 15:53:16.863163627 -0400
> @@ -440,6 +440,8 @@ static void ehci_stop (struct usb_hcd *h
> if (ehci->amd_pll_fix == 1)
> usb_amd_dev_put();
>
> + pr_err("%s: ehci->num_async = %d\n", __func__, ehci->num_async);
> +
> dbg_status (ehci, "ehci_stop completed",
> ehci_readl(ehci, &ehci->regs->status));
> }
Always 0, as it should be.
> diff -Nupr before/drivers/usb/host/ehci-q.c after/drivers/usb/host/ehci-q.c
> --- before/drivers/usb/host/ehci-q.c 2014-07-16 14:25:31.000000000 -0400
> +++ after/drivers/usb/host/ehci-q.c 2014-09-12 15:52:08.023292291 -0400
> @@ -959,6 +959,24 @@ static void disable_async(struct ehci_hc
> ehci_poll_ASS(ehci);
> }
>
> +static void check_async_ring(struct ehci_hcd *ehci, int add)
> +{
> + struct ehci_qh *qh;
> + int n;
> +
> + qh = ehci->async->qh_next.qh;
> + n = ehci->num_async += add;
> + while (qh && n > 0) {
> + qh = qh->qh_next.qh;
> + --n;
> + }
> + if (qh || n != 0) {
> + ehci_err(ehci, "EHCI async list corrupted: num %d n %d qh %p\n",
> + ehci->num_async, n, qh);
> + BUG();
> + }
> +}
In the last call to this function, ehci->num_async must get set to 0
(because that is the final value as printed out in ehci_stop). This
means n = 0, which means ehci->async->qh_next.qh must be NULL. And
this routine is the only place where ehci->num_async gets changed.
But then ehci->async->qh_next.qh isn't NULL when the final qh_destroy
call is made. This suggests that something changes it in the meantime.
You can check for this. Sprinkle ehci_info messages throughout
ehci_stop, printing the value of ehci->async->qh_next.qh. It should be
NULL the entire time.
Alan Stern
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html