>>>But I believe the backtrace points to the bad context path, not the root 
>>>cause which triggered the HC halt.
>>
>>Likely.  If you can get me more information on that, let me know.
> 
> 
> I think it's somehow related to the td processing. I'm getting a lot
> of "bad entry" messages from ohci-q with 2.5.42 and usbtest. With and 
> without sglist. I've another piece of code which streams both in and 
> out using a pool of urbs. It works 100% stable with 2.4.20-pre8 but 
> usually fails with 2.5.42 within seconds reporting "bad entry".

Hmm ... I saw this a couple of times in 2.5.40, but not reproducibly, and
only with a kernel where some other really weird stuff was being seen.

That included some memory trashing that was clearly not caused by OHCI,
since it also showed up without OHCI having been loaded.  Do you have
any memory trashing symptoms like that?


> It looks to me like some kind of donelist corruption but might be ...

Given what you sent, I'd suspect someone's trashing a TD that the HC
is using, so it then appears on the donelist and the controller halts
because giving it bad data confuses its little silicon brain.  A good
thing to try would be assigning a magic word after the hw_* fields during
allocation, checking it in places like free and donelist processing,
and printing error diagnostics (td contents) if it's ever wrong.
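
A rough userspace sketch of that magic-word check follows.  The struct
layout, the field names, TD_MAGIC and the td_* helpers are all made up
for illustration; they are not the actual ohci-hcd definitions, and a
real TD would also have to keep the alignment the controller requires.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TD_MAGIC 0x7d0c0de5u    /* arbitrary guard value (hypothetical) */

struct td {
    /* fields the controller reads and writes via DMA (names assumed) */
    uint32_t hw_info;
    uint32_t hw_cbp;
    uint32_t hw_next_td;
    uint32_t hw_be;
    /* driver-private guard word, placed right after the hw_* fields */
    uint32_t magic;
};

/* Allocate a zeroed TD and stamp the guard word. */
static struct td *td_alloc(void)
{
    struct td *td = calloc(1, sizeof(*td));

    if (td)
        td->magic = TD_MAGIC;
    return td;
}

/* Return 1 if the guard is intact; otherwise dump the TD and return 0. */
static int td_check(const struct td *td, const char *where)
{
    if (td->magic == TD_MAGIC)
        return 1;
    fprintf(stderr,
            "%s: td %p bad magic %08x (info %08x cbp %08x next %08x be %08x)\n",
            where, (const void *)td, (unsigned)td->magic,
            (unsigned)td->hw_info, (unsigned)td->hw_cbp,
            (unsigned)td->hw_next_td, (unsigned)td->hw_be);
    return 0;
}

/* Check the guard on free, then clear it so a double free also trips. */
static void td_free(struct td *td)
{
    td_check(td, "td_free");
    td->magic = 0;
    free(td);
}
```

Calling td_check() from the donelist walk as well as from td_free()
would show whether the TD was already trashed by the time the HC
handed it back.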

Alternatively, and not dissimilar, a TD is getting freed a bit early,
and then when it's poisoned on free (this can't happen if you're not
running with memory debug enabled) the HC is fetching from a7a7a7a0
and using that data as a TD.  I think if it were an ED in this boat
the symptoms would be different, but that's also a possibility.
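
To see where a7a7a7a0 would come from, here is a minimal sketch.  It
assumes the free-poison byte is 0xa7 and that the controller ignores
the low four bits of a TD pointer because TDs are 16-byte aligned;
POISON_BYTE, TD_ALIGN_MASK and the helper name are illustrative only.

```c
#include <stdint.h>
#include <string.h>

#define POISON_BYTE   0xa7   /* assumed fill byte for freed pool memory */
#define TD_ALIGN_MASK 0xfu   /* assumed: low 4 bits of a TD pointer are ignored */

/*
 * Model what the controller would use as its next TD address if it
 * read the "next TD" word out of a TD that had been freed and poisoned.
 */
static uint32_t hc_next_td_from_poison(void)
{
    uint32_t next_td;

    memset(&next_td, POISON_BYTE, sizeof(next_td));  /* 0xa7a7a7a7 */
    return next_td & ~TD_ALIGN_MASK;                 /* 0xa7a7a7a0 */
}
```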

A while back there was a similar bug that was caused by freeing the
dummy TD, which would be bad if the HC was still using it.  That
particular bug is now gone (and the patch to 2.4 seems to have fixed
a lot of random OHCI issues).


> drivers/usb/core/hub.c: new USB device 00:0c.0-3.2, assigned address 4
> drivers/usb/core/message.c: usb_control/bulk_msg: timeout
> drivers/usb/host/ohci-dbg.c: UNLINK cb3a96dc dev:4,ep=0-I,CTRL,flags:0,len:0/8,stat:-2
> drivers/usb/core/hcd.c: 00:0c.0: wait for giveback urb cb3a96dc
> drivers/usb/host/ohci-q.c: 00:0c.0 bad entry  3080000

That means it found 0308 0000 on the donelist, which was "bad" since
there was no record of that DMA address.  Given that value (more on
that issue later) it's not surprising that the HC reported some kind
of fatal error before much longer.
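
The "no record of that DMA address" test amounts to a lookup like the
one sketched below: the driver remembers the bus address of every TD
it hands to the controller, and anything on the donelist that fails
the lookup is a bad entry.  The table size, the hash function and all
names here are assumptions for illustration, not the ohci-q.c code.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_SIZE 64    /* arbitrary table size for this sketch */

struct td_rec {
    uint32_t dma;               /* bus address given to the controller */
    struct td_rec *hash_next;   /* next record in the same bucket */
};

static struct td_rec *td_hash[HASH_SIZE];

/* TDs are 16-byte aligned, so drop the low four bits before hashing. */
static unsigned hash_dma(uint32_t dma)
{
    return (dma >> 4) % HASH_SIZE;
}

/* Record a TD when it is allocated and handed to the hardware. */
static void td_remember(struct td_rec *td)
{
    unsigned h = hash_dma(td->dma);

    td->hash_next = td_hash[h];
    td_hash[h] = td;
}

/* Map a donelist entry back to its TD; NULL means "bad entry". */
static struct td_rec *dma_to_td(uint32_t dma)
{
    struct td_rec *td = td_hash[hash_dma(dma)];

    while (td && td->dma != dma)
        td = td->hash_next;
    return td;
}
```

With only addresses such as 0b39b040 recorded, looking up 03080000
returns NULL, which is the condition behind the "bad entry" message.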


> drivers/usb/host/ohci-hcd.c: OHCI Unrecoverable Error, 00:0c.0 disabled
> drivers/usb/host/ohci-dbg.c: OHCI controller 00:0c.0 state
> drivers/usb/host/ohci-dbg.c: OHCI 1.0, with legacy support registers
> drivers/usb/host/ohci-dbg.c: control: 0x0000009f HCFS=operational CLE IE PLE CBSR=3
> drivers/usb/host/ohci-dbg.c: cmdstatus: 0x00000000 SOC=0
> drivers/usb/host/ohci-dbg.c: intrstatus: 0x00000076 RHSC FNO UE SF WDH
> drivers/usb/host/ohci-dbg.c: intrenable: 0x80000012 MIE UE WDH
> drivers/usb/host/ohci-dbg.c: ed_controlhead 0b39c080

So there was a control ED with that DMA address and that was either the
only active ED, or all the others were also control EDs.  Notice how that
DMA address starts 0b39 not 0308 ...  the CPU would use cb39.
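
The 0b39/cb39 relationship can be sketched as below, assuming a 32-bit
x86 box with the kernel mapped at 0xc0000000 and no IOMMU, so that bus
and CPU addresses differ by a fixed offset.  PAGE_OFFSET's value and
the two helper names are assumptions for this example.

```c
#include <stdint.h>

#define PAGE_OFFSET 0xc0000000u   /* assumed kernel mapping base on 32-bit x86 */

/* Kernel virtual address -> address the controller sees on the bus. */
static uint32_t cpu_to_bus(uint32_t vaddr)
{
    return vaddr - PAGE_OFFSET;
}

/* Bus address from a register dump -> kernel virtual address. */
static uint32_t bus_to_cpu(uint32_t bus)
{
    return bus + PAGE_OFFSET;
}
```

Under those assumptions bus_to_cpu(0x0b39c080) gives 0xcb39c080, and
the bogus donelist value 03080000 would correspond to c3080000.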


> drivers/usb/host/ohci-dbg.c: hcca frame #3d46
> drivers/usb/host/ohci-dbg.c: roothub.a: ff000203 POTPGT=255 NPS NDP=3
> drivers/usb/host/ohci-dbg.c: roothub.b: 00000000 PPCM=0000 DR=0000
> drivers/usb/host/ohci-dbg.c: roothub.status: 00000000
> drivers/usb/host/ohci-dbg.c: 00:0c.0:  roothub.portstatus [0] = 0x00000100 PPS
> drivers/usb/host/ohci-dbg.c: 00:0c.0:  roothub.portstatus [1] = 0x00000100 PPS
> drivers/usb/host/ohci-dbg.c: 00:0c.0:  roothub.portstatus [2] = 0x00000103 PPS PES CCS
> drivers/usb/host/ohci-hcd.c: USB HC reset_hc 00:0c.0: ctrl = 0x9f ;
> drivers/usb/core/hcd.c: shutdown 00:0c.0 urb cb3a96dc pipe 80000480, current status -115
> drivers/usb/core/hcd.c: shutdown 00:0c.0 urb cb3a9674 pipe 40408280, current status -115
> drivers/usb/core/hcd.c: shutdown 00:0c.0 urb cb3a9264 pipe 40408180, current status -115

Three IN endpoints, control and two interrupt, three different devices,
were all doing I/O.  You weren't trying to break anything by sending
past two unpowered hubs, or anything electrically sadistic like that,
were you?  :)
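
For anyone wanting to read those pipe values by hand, here is a small
decoding sketch.  It assumes the pipe word layout of that era (type in
bits 30-31, endpoint in bits 15-18, device address in bits 8-14,
direction in bit 7); the helper names are invented for this example
and are not the kernel's usb_pipe* macros.

```c
#include <stdint.h>

/* Assumed encoding of the two type bits. */
enum { PIPE_ISO = 0, PIPE_INT = 1, PIPE_CTRL = 2, PIPE_BULK = 3 };

static unsigned pipe_type(uint32_t pipe)     { return (pipe >> 30) & 0x3; }
static unsigned pipe_endpoint(uint32_t pipe) { return (pipe >> 15) & 0xf; }
static unsigned pipe_device(uint32_t pipe)   { return (pipe >> 8) & 0x7f; }
static int      pipe_is_in(uint32_t pipe)    { return (pipe & 0x80) != 0; }
```

With that layout 80000480 is control IN on device 4 endpoint 0, while
40408280 and 40408180 are interrupt IN on endpoint 1 of devices 2 and 1.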


> pci_pool_destroy 00:0c.0/ohci_td, cb39b000 busy
> pci_pool_destroy 00:0c.0/ohci_ed, cb39c000 busy

Also interesting.  It _should_ have cleaned up.  Testing "cleanup
after disaster" code is still on my 2.5 list; awkward to reproduce!

But more important, those addresses also didn't start with 0308 (or
more like c308).  Which strongly suggests that the 0308 "dma address"
likely came from overwriting a TD with some other data.


> usb: raced timeout, pipe 0x80000480 status -108 time left 0

You know, every time I look at that synchronous control/bulk code
I have to ask if some line isn't a bug.  In this case something
certainly is a bug!

- Dave




