Re: [linux-usb-devel] USB storage timeout and oops

David Brownell Mon, 02 Jun 2003 04:26:11 -0700

Major A wrote:

Hi,

I can see in the archives that a problem with the USB storage driver
has been discussed recently. I just got a USB 2.0 hard drive enclosure
(made by a company called Anydisk), and it works out of the box with
the usb-storage driver on all USB hosts I have tried so far (EHCI on
a VIA chipset, UHCI on an Intel PIIX4). Power supply may be a problem,
but I use a dedicated external supply for the disk.

I do notice, however, that heavy use (copying files, calculating MD5
checksums, even just reading files in chunks of 60MB with several
minutes of pauses) sometimes causes a timeout, and directly afterwards
an oops. This is the relevant section of the log:

usb_control/bulk_msg: timeout
usb_control/bulk_msg: timeout
usb_control/bulk_msg: timeout


The device timed out a control or bulk message ... which
could be real, but I also don't trust that particular code
(since it can/does give the "raced timeout" messages with
quick EHCI turnaround, as well as just looking dubious).

Are you sure nothing interesting happened earlier?  I
recognize that usb-storage doesn't normally tell you
when things go strange ... and storage debug messages
give so much data that they change i/o timings in
significant ways, hiding bugs.

I read this sequence as a fault recovery problem because,
the last few times I investigated, usb-storage did not
use that odd usbcore code otherwise.

Clearly the fault recovery code should not oops.  It's
not clear from what you've said what the fault is; or
whether it's avoidable.

hub.c: port 1, portstatus 503, change 10, 480 Mb/s
ehci-hcd 00:09.2: devpath 2.1 ep0out 3strikes
hub.c: USB device not accepting new address (error=-71)


Portstatus 0x0503 == high speed, powered, enabled, connection
change 0x0010 == reset completed

... then the set_address (on ep0out) failed.  All of
that is fault recovery code failing.
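
For anyone wanting to decode such values themselves, here is a minimal sketch. It assumes the port status and port change bit layout from the USB 2.0 hub specification; the short names are illustrative, not the identifiers hub.c uses.

```python
# Bit positions follow the USB 2.0 hub port status layout (wPortStatus and
# wPortChange). The short names are illustrative, not hub.c identifiers.
PORT_STATUS_BITS = {
    0: "connection",
    1: "enabled",
    2: "suspended",
    3: "over-current",
    4: "reset",
    8: "powered",
    9: "low speed",
    10: "high speed",
}

PORT_CHANGE_BITS = {
    0: "connection changed",
    1: "enable changed",
    2: "suspend changed",
    3: "over-current changed",
    4: "reset completed",
}


def decode(value, names):
    """Return the names of all known bits set in value, lowest bit first."""
    return [name for bit, name in sorted(names.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # Values as printed by hub.c in the log (hex, shown without the 0x prefix).
    print(decode(0x503, PORT_STATUS_BITS))
    print(decode(0x10, PORT_CHANGE_BITS))
```

Bits not listed in the tables are silently ignored, so an unexpected value would need a closer look at the raw number.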

As a rule, I think the 2.5 fault recovery logic is more
robust than the 2.4 stuff, but it tends not to get a lot
of testing.  So secondary (and tertiary, etc) failures
can sometimes get messy:

usb-storage: host_reset() requested but not implemented
scsi: device set offline - command error recover failed: host 2 channel 0 id 0 lun 0
SCSI disk error : host 2 channel 0 id 0 lun 0 return code = 6070000
I/O error: dev 08:01, sector 11917048
I/O error: dev 08:01, sector 11917056
I/O error: dev 08:01, sector 11917296
I/O error: dev 08:01, sector 37888
journal-601, buffer write failed
kernel BUG at prints.c:334!
invalid operand: 0000
CPU:    0
EIP:    0010:[<f0972879>]    Tainted: PF
EFLAGS: 00010286
eax: 00000024   ebx: f09867a0   ecx: ec77c000   edx: 00000001
esi: e8a3c400   edi: 00000000   ebp: e8a3c400   esp: c185ded8
ds: 0018   es: 0018   ss: 0018
Process kupdated (pid: 6, stackpage=c185d000)
Stack: f0984c3a f0988920 f09867a0 c185defc f5064dd4 00000003 f097ce5f e8a3c400
       f09867a0 0000002c 00000012 00000010 00000000 f5064e08 f5064dfc 00000004
       00000000 0000002d d12ec6c0 f09807ce e8a3c400 f5064dd4 00000001 c185df98
Call Trace: [<f0984c3a>] [<f0988920>] [<f09867a0>] [<f097ce5f>] [<f09867a0>] [<f09807ce>]
   [<f097fa1f>] [<f098776f>] [<f096feb5>] [<c0137151>] [<c01366be>] [<c0136945>] [<c01055e8>]

Code: 0f 0b 4e 01 40 4c 98 f0 68 20 89 98 f0 85 f6 74 16 0f b7 46
I/O error: dev 08:01, sector 11917048


Ksymoops info would help.  Rule of thumb, don't ever send
a stack trace without the relevant symbols already decoded.
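
To illustrate what "decoded" means here: symbol decoding amounts to looking up each raw address in a sorted symbol table and printing the nearest preceding symbol plus an offset. The sketch below is not ksymoops itself, and the table in it is entirely hypothetical; the names and addresses are made up for illustration.

```python
import bisect


def resolve(addr, symbols):
    """Map addr to 'name+0xoffset'.

    symbols is a list of (start_address, name) pairs sorted by address.
    Returns None if addr lies below the first symbol.
    """
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return "%s+0x%x" % (name, addr - start)


if __name__ == "__main__":
    # HYPOTHETICAL symbol table: these names and addresses are invented and
    # do not correspond to any real kernel or to the trace in this thread.
    table = [(0xC0100000, "example_a"), (0xC0100400, "example_b")]
    print(resolve(0xC0100412, table))
```

Since this sketch has no symbol sizes, an address past the end of the last symbol still resolves to that symbol, so a large offset is a hint that the table does not cover the address (for example, when it belongs to a module).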

I see this with kernels 2.4.21-rc2 and rc6 just the same. 2.5.70 is
even worse: it just stalls the access (very early on, not after
several 100MB) without any log messages, and CPU load diverges without
any useful information showing up in "top". It happens only with the
EHCI driver; in full-speed mode I haven't yet been able to reproduce
this error (maybe due to the relaxed timing).

Well that 2.5.70 failure mode is curious...

Checking the ehci "async" and "registers" files (in sysfs)
could be useful.  The last time I saw a failure anything like
that, the issue was a deadlock inside storage+scsi, since the
EHCI driver had handed all requests back ("async" was empty)
and Alt-SysRq-T showed usb-storageN and scsi-ehN wedged.

But I've also seen failures suggesting that some EHCI silicon
is reading I/O descriptors (qTDs) after they're marked as
done (and the HCD freed them).  If that's what you're seeing,
and you enable CONFIG_DEBUG_SLAB, you'll see 0xa7a7a7a7 style
bogus addresses in the head of a hardware queue.
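
When scanning a dump of queue pointers for that pattern, a check like the following may help. It is a minimal sketch that assumes only the 0xa7 fill byte mentioned above and 32-bit words; the function name is made up.

```python
POISON_BYTE = 0xA7  # taken from the 0xa7a7a7a7 pattern mentioned above


def looks_poisoned(word, poison=POISON_BYTE):
    """True if every byte of the 32-bit word equals the poison byte."""
    word &= 0xFFFFFFFF
    # Multiplying by 0x01010101 repeats the byte in all four positions.
    return word == poison * 0x01010101


if __name__ == "__main__":
    for value in (0xA7A7A7A7, 0x1F4A3000):
        print(hex(value), looks_poisoned(value))
```

A word that matches means the memory was freed and then read; it does not by itself say whether the reader was the controller or the driver.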

I can write to the disk for ages without any problems, it's the
reading that causes me headaches. I do notice, however, that when I
mount the disk read-only (such that the access time isn't updated
after every read), the problem occurs much less often. It seems to me
that a simultaneous read and write can provoke this fault with
relatively high probability.


Curious.  This is just one disk?  Whose EHCI silicon?  Some
of the VIA hardware seems to really dislike the code paths
which remove idle entries from the EHCI async schedule ...
both VT8235 (southbridge) and VT6202 (discrete) have had
failure modes in those areas.  (Hence IAA and I/O watchdogs.)

Turning on usb-storage logging isn't much use: I haven't seen any
timeout/oops with it enabled, probably because it changes the timings.

I don't think it's a hardware or thermal problem, given the
symptoms. Has anyone seen anything like this? Any help would be
appreciated, and a quick fix too!


I'd not be so quick to rule out hardware or thermal issues,
given that you're driving the hardware so much faster.

- Dave

Andras
