Kevin Brosius wrote:
After some testing with the ohci driver, I'm starting to suspect this is
a documented problem in the AMD766 controller on the motherboard.
Referring to
http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_873_1130,00.html
and their "AMD-766 Peripheral Bus Controller Data Sheet", I am
suspicious of Erratum Number 20. To my understanding this describes a
case where the HccaDoneHead may not be updated at the end of a frame.
The Erratum fails to mention if the WDH (OHCI_INTR_WDH) interrupt will
still occur or not.
You mean the "Revision Guide", which is where the errata are listed.
Right, erratum 18 doesn't look to be much trouble, unlike erratum 20.
Isn't this the MP chipset where most mobo vendors said "USB doesn't
work", and shipped add-in cards (like NEC EHCI) so customers would
have a working solution, when they needed it?
I've spent some time instrumenting ohci-hcd.c and ohci-q.c in
drivers/usb/host, and it looks like I do not get a WDH interrupt when I
see the failure. I'm not sure I've proved this conclusively, as it's
possible the WDH interrupt occurs, but the ohci->hcca->done_head
register is 0. The Erratum suggests the pointer will be lost and never
written into DoneHead. I'd guess that means the register will be 0 from
the last time the driver cleared it, and the HC never updated it.
Sounds about right. Except that the HCCA is main memory, and WDH will
never happen unless there was a non-null donelist head to write. (There
is a real register, which HCCA doesn't exactly shadow.)
Basically what happens is that all recent work the HC did, and tracked
in ohci->regs->donehead, gets lost forever when the controller bungles
writing that into ohci->hcca->done_head (sometimes, if SOF is pending).
I don't see any straightforward way to work around this problem. Am I
correct in my understanding that the ohci driver will lock in the case
of a missed WDH interrupt w/lost DoneHead pointer? It looks to me like
the driver relies on the DoneHead coming back from the HC successfully,
and cannot recover without it (dl_reverse_done_list() in ohci-hcd.c).
Since hcca->done_head is documented as the primary way completed TDs
get returned to the driver, and it's the only one the OHCI driver
currently uses, yes that'll cause much trouble.
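To make the dependency concrete: the HC hands back completed TDs as a singly
linked list, most recently completed first, so the driver has to reverse it
before processing completions in order. A minimal user-space sketch of that
reversal follows. The struct and function names here (done_td, next_done,
reverse_done_list) are hypothetical stand-ins, not the driver's own
dl_reverse_done_list() code. If the head pointer never arrives, there is
nothing to walk.

```c
#include <stddef.h>

/* Hypothetical stand-in for a TD on the done list; not the driver's struct. */
struct done_td {
	struct done_td *next_done;	/* link written by the controller */
	int id;				/* illustration only */
};

/*
 * Reverse a done list in place. The controller pushes each completed TD
 * onto the head, so the list arrives newest-first; reversing it yields
 * completion order. Returns the new head (the oldest completed TD).
 */
struct done_td *reverse_done_list(struct done_td *head)
{
	struct done_td *rev = NULL;

	while (head) {
		struct done_td *next = head->next_done;

		head->next_done = rev;
		rev = head;
		head = next;
	}
	return rev;
}
```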
Given the details in erratum 20, I can imagine trying to work around
(and detect!) it, using a quirk flag to enable logic in the irq handler
so it goes something like:
  - nothing extra to do if SOF isn't enabled, else:

  - read ohci->regs->donehead before clearing WDH status;
    nothing to do if it's zero. (There'd still be a race
    when a TD completes quickly enough, though it wouldn't
    be all that common at USB 1.1 speeds.)

  - process the donelist. (gives the hardware time to
    bungle the write, if it's going to do so).

  - use that saved donehead value to determine if the
    write got bungled: current ohci->regs->donehead is
    zero, but hcca->done_head is too. (there'd be a few
    other cases too, but that's the obvious one.)

  - if it bungled, recover. maybe just by treating that
    saved value as a new donelist, reversing it etc;
Though I'm not sure how this erratum would cause one
of your symptoms: no more WDH interrupts. Maybe it
ties in to erratum 18, and you'd need to write the
normally read-only regs->donehead. Or maybe that's
just because your device wasn't resubmitting; does
something like "lsusb" fail too?
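The detection steps listed above might look roughly like this. It is only a
user-space sketch under stated assumptions: sim_regs, sim_hcca,
snapshot_donehead() and lost_donelist() are hypothetical names, the structs
are plain memory rather than real registers, and only the "obvious" case is
checked. The race mentioned above (a TD completing between the snapshot and
the check) is not handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the controller state. */
struct sim_regs {
	uint32_t intrenable;	/* enabled interrupt bits */
	uint32_t intrstatus;	/* pending interrupt bits */
	uint32_t donehead;	/* HC's working done list head */
};

struct sim_hcca {
	uint32_t done_head;	/* main-memory copy; driver zeroes it after use */
};

#define SIM_INTR_WDH	(1u << 1)	/* writeback of done head */
#define SIM_INTR_SF	(1u << 2)	/* start of frame */

/*
 * Step 1 and 2: call before clearing WDH status. Returns the value to
 * remember, or 0 when there is nothing extra to do (SOF not enabled, or
 * the HC has no pending done list).
 */
uint32_t snapshot_donehead(const struct sim_regs *regs)
{
	assert(regs);
	if (!(regs->intrenable & SIM_INTR_SF))
		return 0;
	return regs->donehead;
}

/*
 * Step 4: call after the normal donelist processing. If the register is
 * now zero but nothing showed up in the HCCA either, the write was lost;
 * return the saved head so the caller can treat it as a new donelist
 * (step 5). Returns 0 when no lost write is detected.
 */
uint32_t lost_donelist(const struct sim_regs *regs,
		       const struct sim_hcca *hcca, uint32_t saved)
{
	assert(regs && hcca);
	if (saved == 0)
		return 0;
	if (regs->donehead == 0 && hcca->done_head == 0)
		return saved;
	return 0;
}
```

In a real driver this would sit behind a quirk flag, so controllers without
the erratum never pay for the extra register reads.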
With my limited knowledge of the driver, it seems possible to enable the
SOF interrupts and then maybe keep the done list up to date at that
time. That seems like an excessive penalty to pay in additional
interrupts though.
SOF is what causes the trouble ... you want to avoid using it more.
Any thoughts or suggestions on the best way to proceed? I'm presently
pursuing being able to recover any way possible, probably by maintaining
the TD/URB list in the driver, so that I can check the list on SOF
interrupts and catch the missing DoneHead entry. Is that possible?
This is mostly to prove the problem, rather than as an optimal fix.
See above ... that should help confirm whether this really is what you're
seeing. Which is certainly a good idea. If it isn't, then it'd seem you
are finding some new bug (might still be hardware).
The OHCI driver already maintains a TD list (ed->td_list) in the 2.5 code.
I've even thought about changing the OHCI driver so it scans the schedule,
much like how ehci-hcd works (or even uhci-hcd). That schedule is the ED
lists in ohci->ed_{control,bulk}tail and ohci->periodic[]. Basically we
know that at most one td in ed->td_list is at ed->hwHeadP, and also that
any TD before that has been retired by the hardware ... regardless of
what the donelist tells us (or doesn't).
That is, the donelist mechanism is just letting us avoid scanning the
active EDs: it's more efficient. But in cases where the donelist
support touches hardware bugs (I've heard other reports of this, for
non-PCI hardware), it's possible to find completed TDs another way.
(Mixing the two modes would be unhealthy.)
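As a rough illustration of the schedule-scanning idea, here is a user-space
sketch that counts the TDs in one ED's list that sit before the TD the
hardware head pointer currently names. The types sim_td and sim_ed and the
function count_retired() are hypothetical, not the driver's. It assumes TD
addresses are 16-byte aligned, so the low four bits of the head pointer
(which carry the halted and toggle-carry flags) are masked off, and that the
list holds only real TDs in queue order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real driver structures differ. */
struct sim_td {
	uint32_t dma;		/* bus address of this TD, 16-byte aligned */
	struct sim_td *next;	/* next TD queued on the same ED */
};

struct sim_ed {
	uint32_t hw_head;	/* hardware head pointer plus low flag bits */
	struct sim_td *td_list;	/* driver's own list of queued TDs, in order */
};

/*
 * Count TDs the hardware has already moved past. Every TD ahead of the
 * one named by the head pointer has been retired, whatever the donelist
 * did or did not report. If the head pointer names no TD in the list,
 * all of them are counted as retired.
 */
size_t count_retired(const struct sim_ed *ed)
{
	uint32_t head;
	size_t n = 0;

	assert(ed);
	head = ed->hw_head & ~(uint32_t)0x0f;	/* strip flag bits */

	for (const struct sim_td *td = ed->td_list; td; td = td->next) {
		if (td->dma == head)
			break;
		n++;
	}
	return n;
}
```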
I've avoided that change since it would destabilize things, as well
as slow things down a bit. Plus, unlike most of the other 2.5 changes,
it hasn't (so far) been necessary to fix bugs.
- Dave