>>>But I believe the backtrace points to the bad context path, not the root
>>>cause which triggered the HC halt.
>>
>>Likely. If you can get me more information on that, let me know.
>
> I think it's somehow related to the td processing. I'm getting a lot
> of "bad entry" messages from ohci-q with 2.5.42 and usbtest. With and
> without sglist. I've another piece of code which streams both in and
> out using a pool of urbs. It works 100% stable with 2.4.20-pre8 but
> usually fails with 2.5.42 within seconds reporting "bad entry".

Hmm ... I saw this a couple of times in 2.5.40, but not reproducibly,
and only with a kernel where some other really weird stuff was being
seen. That included some memory trashing that was clearly not caused
by OHCI, since it also showed up without OHCI having been loaded. Do
you have any memory trashing symptoms like that?

> It looks to me like some kind of donelist corruption but might be ...

Given what you sent, I'd suspect someone's trashing a TD that the HC
is using, so it then appears on the donelist and the controller halts
because giving it bad data confuses its little silicon brain.

A good thing to try would be assigning a magic word after the hw_*
fields during allocation, checking it in places like free and donelist
processing, and printing error diagnostics (td contents) if it's ever
wrong.

Alternatively, and not dissimilar, a TD is getting freed a bit early,
and then when it's poisoned on free (this can't happen if you're not
running with memory debug enabled) the HC is fetching from a7a7a7a0
and using that data as a TD. I think if it were an ED in this boat the
symptoms would be different, but that's also a possibility.

A while back there was a similar bug that was caused by freeing the
dummy TD, which would be bad if the HC was still using it. That
particular bug is now gone (and the patch to 2.4 seems to have fixed
a lot of random OHCI issues).

> drivers/usb/core/hub.c: new USB device 00:0c.0-3.2, assigned address 4
> drivers/usb/core/message.c: usb_control/bulk_msg: timeout
> drivers/usb/host/ohci-dbg.c: UNLINK cb3a96dc dev:4,ep=0-I,CTRL,flags:0,len:0/8,stat:-2
> drivers/usb/core/hcd.c: 00:0c.0: wait for giveback urb cb3a96dc
> drivers/usb/host/ohci-q.c: 00:0c.0 bad entry 3080000

That means it found 0308 0000 on the donelist, which was "bad" since
there was no record of that DMA address. Given that value (more on
that issue later) it's not surprising that the HC reported some kind
of fatal error before much longer.

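To make the magic-word idea above a bit more concrete, here's roughly
what I have in mind. This is a userspace sketch, not a patch: the
struct is a simplified stand-in for the driver's TD, and TD_MAGIC,
demo_td_init() and demo_td_check() are names made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical magic value (ASCII "TDOK"); any unlikely pattern works. */
#define TD_MAGIC 0x54444f4bu

/* Simplified stand-in for a TD; NOT the real driver layout. */
struct demo_td {
	/* the first four words are what the HC reads and writes */
	uint32_t hw_info;
	uint32_t hw_cbp;
	uint32_t hw_next_td;
	uint32_t hw_be;
	/* driver-private part starts here, right after the hw_* fields */
	uint32_t magic;
	uint32_t td_dma;
};

/* Call at allocation time: clear the TD and stamp the magic word. */
static void demo_td_init(struct demo_td *td, uint32_t dma)
{
	memset(td, 0, sizeof *td);
	td->magic = TD_MAGIC;
	td->td_dma = dma;
}

/*
 * Call from free and donelist processing.
 * Returns 1 if the TD looks sane; otherwise dumps it and returns 0.
 */
static int demo_td_check(const struct demo_td *td, const char *where)
{
	if (td->magic == TD_MAGIC)
		return 1;

	fprintf(stderr,
		"%s: bad td magic %08x (dma %08x): %08x %08x %08x %08x\n",
		where, (unsigned)td->magic, (unsigned)td->td_dma,
		(unsigned)td->hw_info, (unsigned)td->hw_cbp,
		(unsigned)td->hw_next_td, (unsigned)td->hw_be);
	return 0;
}
```

If the check ever fails, the dump should show whether the TD was
overwritten with unrelated data or with the free-time poison pattern,
which would help tell the two theories apart.
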
> drivers/usb/host/ohci-hcd.c: OHCI Unrecoverable Error, 00:0c.0 disabled
> drivers/usb/host/ohci-dbg.c: OHCI controller 00:0c.0 state
> drivers/usb/host/ohci-dbg.c: OHCI 1.0, with legacy support registers
> drivers/usb/host/ohci-dbg.c: control: 0x0000009f HCFS=operational CLE IE PLE CBSR=3
> drivers/usb/host/ohci-dbg.c: cmdstatus: 0x00000000 SOC=0
> drivers/usb/host/ohci-dbg.c: intrstatus: 0x00000076 RHSC FNO UE SF WDH
> drivers/usb/host/ohci-dbg.c: intrenable: 0x80000012 MIE UE WDH
> drivers/usb/host/ohci-dbg.c: ed_controlhead 0b39c080

So there was a control ED with that DMA address and that was either
the only active ED, or all the others were also control EDs. Notice
how that DMA address starts 0b39 not 0308 ... the CPU would use cb39.

> drivers/usb/host/ohci-dbg.c: hcca frame #3d46
> drivers/usb/host/ohci-dbg.c: roothub.a: ff000203 POTPGT=255 NPS NDP=3
> drivers/usb/host/ohci-dbg.c: roothub.b: 00000000 PPCM=0000 DR=0000
> drivers/usb/host/ohci-dbg.c: roothub.status: 00000000
> drivers/usb/host/ohci-dbg.c: 00:0c.0: roothub.portstatus [0] = 0x00000100 PPS
> drivers/usb/host/ohci-dbg.c: 00:0c.0: roothub.portstatus [1] = 0x00000100 PPS
> drivers/usb/host/ohci-dbg.c: 00:0c.0: roothub.portstatus [2] = 0x00000103 PPS PES CCS
> drivers/usb/host/ohci-hcd.c: USB HC reset_hc 00:0c.0: ctrl = 0x9f ;
> drivers/usb/core/hcd.c: shutdown 00:0c.0 urb cb3a96dc pipe 80000480, current status -115
> drivers/usb/core/hcd.c: shutdown 00:0c.0 urb cb3a9674 pipe 40408280, current status -115
> drivers/usb/core/hcd.c: shutdown 00:0c.0 urb cb3a9264 pipe 40408180, current status -115

Three IN endpoints, control and two interrupt, three different
devices, were all doing I/O. You weren't trying to break anything
by sending past two unpowered hubs, or anything electrically sadistic
like that, were you? :)

> pci_pool_destroy 00:0c.0/ohci_td, cb39b000 busy
> pci_pool_destroy 00:0c.0/ohci_ed, cb39c000 busy

Also interesting. It _should_ have cleaned up.
Testing "cleanup after disaster" code is still on my 2.5 list; awkward
to reproduce! But more important, those addresses also didn't start
with 0308 (or more like c308). Which strongly suggests that the 0308
"dma address" likely came from overwriting a TD with some other data.

> usb: raced timeout, pipe 0x80000480 status -108 time left 0

You know, every time I look at that synchronous control/bulk code I
have to ask if some line isn't a bug. In this case something certainly
is a bug!

- Dave