Hi,
I can see in the archives that a problem with the USB storage driver has been discussed recently. I just got a USB 2.0 hard drive enclosure (made by a company called Anydisk), and it works out of the box with the usb-storage driver on all USB hosts I have tried so far (EHCI on a VIA chipset, UHCI on an Intel PIIX4). Power supply may be a problem, but I use a dedicated external supply for the disk.
I do notice, however, that heavy use (copying files, calculating MD5 checksums, even just reading files in chunks of 60MB with pauses of several minutes) sometimes causes a timeout, followed directly by an oops. This is the relevant section of the log:
usb_control/bulk_msg: timeout
usb_control/bulk_msg: timeout
usb_control/bulk_msg: timeout
The device timed out a control or bulk message ... which could be real, but I also don't trust that particular code (since it can/does give the "raced timeout" messages with quick EHCI turnaround, as well as just looking dubious).
Are you sure nothing interesting happened earlier? I recognize that usb-storage doesn't normally tell you when things go strange ... and storage debug messages give so much data that they change i/o timings in significant ways, hiding bugs.
I read this sequence as a fault recovery problem because, the last few times I investigated, usb-storage did not use that odd usbcore code otherwise.
Clearly the fault recovery code should not oops. It's not clear from what you've said what the fault is; or whether it's avoidable.
hub.c: port 1, portstatus 503, change 10, 480 Mb/s
ehci-hcd 00:09.2: devpath 2.1 ep0out 3strikes
hub.c: USB device not accepting new address (error=-71)
Portstatus 0x0503 == high speed, powered, enabled, connection
change 0x0010 == reset completed
... then the set_address (on ep0out) failed. All of that is fault recovery code failing.
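For anyone following along, the decode of those two hex words can be sketched in a few lines of Python. The masks are the hub class port status and change bits as I understand them from the USB 2.0 spec; the table and function names are made up for illustration and are not kernel identifiers.

```python
# Bit masks for the hub port status and change words (USB 2.0 hub class).
# The dictionary and function names here are illustrative, not kernel symbols.
PORT_STATUS_BITS = {
    0x0001: "connection",
    0x0002: "enabled",
    0x0004: "suspended",
    0x0008: "over-current",
    0x0010: "reset in progress",
    0x0100: "powered",
    0x0200: "low speed",
    0x0400: "high speed",
}
PORT_CHANGE_BITS = {
    0x0001: "connection changed",
    0x0002: "enable changed",
    0x0004: "suspend changed",
    0x0008: "over-current changed",
    0x0010: "reset completed",
}


def decode(word, table):
    """Return the names of all bits from `table` that are set in `word`."""
    return [name for mask, name in sorted(table.items()) if word & mask]


# Values from the hub.c log line, which prints them in hex without a 0x prefix.
status = decode(0x0503, PORT_STATUS_BITS)
change = decode(0x0010, PORT_CHANGE_BITS)
print(status)
print(change)
```

So the port itself looks healthy after the reset; it is the following set_address that goes wrong.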
As a rule, I think the 2.5 fault recovery logic is more robust than the 2.4 stuff, but it tends not to get a lot of testing. So secondary (and tertiary, etc) failures can sometimes get messy:
usb-storage: host_reset() requested but not implemented
scsi: device set offline - command error recover failed: host 2 channel 0 id 0 lun 0
SCSI disk error : host 2 channel 0 id 0 lun 0 return code = 6070000
I/O error: dev 08:01, sector 11917048
I/O error: dev 08:01, sector 11917056
I/O error: dev 08:01, sector 11917296
I/O error: dev 08:01, sector 37888
journal-601, buffer write failed
kernel BUG at prints.c:334!
invalid operand: 0000
CPU: 0
EIP: 0010:[<f0972879>] Tainted: PF
EFLAGS: 00010286
eax: 00000024 ebx: f09867a0 ecx: ec77c000 edx: 00000001
esi: e8a3c400 edi: 00000000 ebp: e8a3c400 esp: c185ded8
ds: 0018 es: 0018 ss: 0018
Process kupdated (pid: 6, stackpage=c185d000)
Stack: f0984c3a f0988920 f09867a0 c185defc f5064dd4 00000003 f097ce5f e8a3c400
       f09867a0 0000002c 00000012 00000010 00000000 f5064e08 f5064dfc 00000004
       00000000 0000002d d12ec6c0 f09807ce e8a3c400 f5064dd4 00000001 c185df98
Call Trace: [<f0984c3a>] [<f0988920>] [<f09867a0>] [<f097ce5f>] [<f09867a0>]
[<f09807ce>] [<f097fa1f>] [<f098776f>] [<f096feb5>] [<c0137151>] [<c01366be>]
[<c0136945>] [<c01055e8>]
Code: 0f 0b 4e 01 40 4c 98 f0 68 20 89 98 f0 85 f6 74 16 0f b7 46
I/O error: dev 08:01, sector 11917048
Ksymoops info would help. Rule of thumb, don't ever send a stack trace without the relevant symbols already decoded.
I see this with kernels 2.4.21-rc2 and rc6 just the same. 2.5.70 is even worse: it just stalls the access (very early on, not after several hundred MB) without any log messages, and the CPU load keeps climbing without anything useful showing up in "top". It happens only with the EHCI driver; in full-speed mode I haven't yet been able to reproduce this error (maybe due to the relaxed timing).
Well that 2.5.70 failure mode is curious...
Checking the ehci "async" and "registers" files (in sysfs) could be useful. The last time I saw a failure anything like that, the issue was a deadlock inside storage+scsi, since the EHCI driver had handed all requests back ("async" was empty) and Alt-SysRq-T showed usb-storageN and scsi-ehN wedged.
But I've also seen failures suggesting that some EHCI silicon is reading I/O descriptors (qTDs) after they're marked as done (and the HCD freed them). If that's what you're seeing, and you enable CONFIG_SLAB_DEBUG, you'll see 0xa7a7a7a7 style bogus addresses in the head of a hardware queue.
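If you end up copying pointer values out of a debug dump by hand, a quick check for that pattern could look like the sketch below. It assumes the poison byte is 0xa7 (taken from the pattern mentioned above) and that the values are 32 bits wide; the helper names and the sample values are hypothetical, not taken from any real dump.

```python
POOL_POISON = 0xA7  # assumed poison byte, from the 0xa7a7a7a7 pattern above


def looks_poisoned(value, poison=POOL_POISON, width=4):
    """True if every byte of a `width`-byte value equals the poison byte."""
    return value.to_bytes(width, "little") == bytes([poison]) * width


def scan(pointers):
    """Return the entries of a name->value mapping that look like poison."""
    return {name: hex(v) for name, v in pointers.items() if looks_poisoned(v)}


# Hypothetical queue-head fields, typed in by hand for illustration only.
sample = {"hw_next": 0x1F3A4002, "hw_qtd_next": 0xA7A7A7A7}
suspects = scan(sample)
print(suspects)
```

Any hit would suggest the hardware is still looking at memory that was already freed, which is the failure described above.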
I can write to the disk for ages without any problems, it's the reading that causes me headaches. I do notice, however, that when I mount the disk read-only (such that the access time isn't updated after every read), the problem occurs much less often. It seems to me that a simultaneous read and write can provoke this fault with relatively high probability.
Curious. This is just one disk? Whose EHCI silicon? Some of the VIA hardware seems to really dislike the code paths which remove idle entries from the EHCI async schedule ... both VT8235 (southbridge) and VT6202 (discrete) have had failure modes in those areas. (Hence IAA and I/O watchdogs.)
Turning on usb-storage logging isn't much use: I haven't seen any timeout/oops with it enabled, probably because it changes the timings.
I don't think it's a hardware or thermal problem, given the symptoms. Has anyone seen anything like this? Any help would be appreciated, and a quick fix too!
I'd not be so quick to rule out hardware or thermal issues, given that you're driving the hardware so much faster.
- Dave
Andras