Re: [linux-usb-devel] Silent corruption of data read from external ATA enclosures, ASUS NForce2 mobo connected to ALi USB chips

David Brownell Wed, 15 Dec 2004 06:00:05 -0800

On Monday 13 December 2004 12:00 am, Dale Manny wrote:
> ... I am now running 2.6.10-bk6 
> without distracting other issues. However my data corruption problem 
> continues to exist.


You may well have multiple problems here.


> >>>> I am fighting an issue of silent corruption of data from certain 
> >>>> USB ATA drive enclosures only on certain motherboards. ...
> >>>>
> >>>> 1) Everything otherwise working well, hotplug auto identification, 
> >>>> etc. Only issue is sporadic corruption of data read from enclosure 
> >>>> when used with any of the USB ports on specific motherboards and 
> >>>> only with specific USB enclosures. Other than the bad data, there 
> >>>> is _no_ indication of failure.

You say "corruption of data read".  How do you know that the
problem is on the read side?   Can you talk to the raw block
device (uncached) and verify that's the issue -- that the data
is right on the disk, and on the wire, but sometimes reads wrong?
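
One way to approach that check is sketched below in Python, assuming a Linux host where O_DIRECT is available on the device node. The helper names (read_direct, differing_sectors) are made up for illustration, and the offset and length must be multiples of the device's block size. Reading the same range twice without the page cache and comparing sector by sector shows whether two reads of unchanged media disagree; it does not by itself show what was on the wire.

```python
import mmap
import os

SECTOR = 512  # bytes; the block size named in the report


def read_direct(path, offset, length):
    """Read `length` bytes at `offset`, bypassing the page cache (Linux O_DIRECT)."""
    # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, length)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        got = os.preadv(fd, [buf], offset)
    finally:
        os.close(fd)
    data = buf[:got]
    buf.close()
    return data


def differing_sectors(a, b, sector=SECTOR):
    """Return indices of the sectors at which two equal-length reads disagree."""
    if len(a) != len(b):
        raise ValueError("reads have different lengths")
    return [i // sector for i in range(0, len(a), sector)
            if a[i:i + sector] != b[i:i + sector]]
```

For example, differing_sectors(read_direct("/dev/sda", 0, 1 << 20), read_direct("/dev/sda", 0, 1 << 20)) returning a non-empty list would mean the two uncached reads disagree even though the disk contents did not change in between.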

> >>>> 2) Only observed on ASUS NForce2 systems mobo USB ports. Have only 
> >>>> two such mobo, both have problem but are different models.

If it's just reads, that could also be a symptom of memory corruption
or failures.  Does memtest86 say your memory is OK?  Does anything
other than USB show similar problems?

> >>>> 3) Using a hub between motherboard and enclosure does not seem to 
> >>>> affect problem.
> >>>>
> >>>> 4) Same cable/enclosure/disk combinations can be moved to Toshiba 
> >>>> laptop without symptom being present. This is the only other USB 
> >>>> 2.0 host interface readily available on site.

Whose EHCI silicon does the laptop have?  ("lspci -v").

> >>>> 5) Present with all ATA disks from my collection that I have tried. 
> >>>> Would estimate 8 different drives used.
> >>>>
> >>>> 6) Only seen on ALi based USB enclosures. I have tried two 
> >>>> essentially identical 5.25 inch units and one 2.5 inch. The 5.25 
> >>>> are based on the ALi 5621 and the 2.5 inch on the ALi 5642. While 
> >>>> these are different, they are probably related designs. These have 
> >>>> the fastest total throughput of any devices I have available.

Interesting.  The NForce2 EHCI has a "park" mode which can often
give it faster results, but which I've been suspecting may also
make trouble for some peripherals that can't handle data as fast
as the host tries to pass it.  In ehci-hcd.c::ehci_start() there's
a mask "0x0fff" near a comment about irq latency; try changing
that to 0x00ff (disabling park mode).
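
To see why the narrower mask turns park mode off, here is a small sketch of the bit arithmetic in Python. The bit positions are taken from the USBCMD register layout in the EHCI 1.0 spec as I read it (park mode count in bits 9:8, park mode enable in bit 11); the constant and function names are illustrative and are not the driver's own.

```python
# USBCMD bit positions, assumed from the EHCI 1.0 register description.
PARK_COUNT = 0x3 << 8    # bits 9:8, Asynchronous Schedule Park Mode Count
PARK_ENABLE = 0x1 << 11  # bit 11, Asynchronous Schedule Park Mode Enable


def park_bits_kept(mask):
    """Return the park-mode bits of USBCMD that survive `value & mask`."""
    return mask & (PARK_COUNT | PARK_ENABLE)
```

With the 0x0fff mask all three park bits pass through from whatever the register held; with 0x00ff none of them do, so the value written back has park mode cleared.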

> >>>> 7) Not seen when a borrowed Maxtor external firewire/USB drive was 
> >>>> attached via USB to problematic NForce2 system. This unit is based 
> >>>> on an Oxford Semi 991FW chip. This unit's transfer rate is just 
> >>>> slightly lower than the ALi based ones. Also not seen when using 
> >>>> Sandisk USB 2.0 flash drive.

I suppose it's probably too much to expect you to be able to just
capture the USB traffic (say with a CATC) and show what's happening
on the wire at the time the error is detected in memory ... ;)

> >>>> 8) Seen with several 2.6 kernels. Was running 2.6.8.1 when first 
> >>>> encountered. Hi-speed USB did not work on this hardware prior to 
> >>>> 2.6.8 due to interrupt issue. Have tried latest stable 2.6.9 in 
> >>>> hopes of clean operation but no change. (Now tried 2.6.10-rc3, no 
> >>>> change.)

Or rc3-bk6, hmm.

> >>>> 9) First observed as lack of repeatability of md5sum of entire 
> >>>> drive (/dev/sda) while archiving off old drive content. In order to 
> >>>> better understand the nature of instability, I have written a 
> >>>> simple program to generate a fixed test pattern and subsequently 
> >>>> verify it.
> >>>>
> >>>> 10) Errors on read data only. Test pattern can be sourced on 
> >>>> problem system and found to be valid on reading.
> >>>>
> >>>> 11) Rate of errors to total is seemingly consistent between runs -- 
> >>>> about 0.002 %

But I'm unclear:  can you verify the problem came on the read
side, once the host processed the data ... rather than appearing
on the wire?  Or on the disk?

> >>>> 12) Errors occur on 512 byte boundaries, widely and randomly 
> >>>> separated. None seen in consecutive blocks - yet - but my feeling 
> >>>> is that this is just due to sparseness of problem.

512 is an interesting number because it's a single high-speed bulk packet.
Is usb-storage reporting errors of any kind?

Are the errors "all 512 bytes are wrong"?

How about "multiples of <cache line size> bytes are wrong?"
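
Both questions can be answered from a saved bad sector and the expected test pattern. A rough Python sketch follows; the 64-byte line size is an assumption and should be set to whatever the CPU in question uses, and the function names are invented for illustration.

```python
def mismatch_runs(expected, actual):
    """Return (start, length) for each maximal run of differing bytes."""
    runs = []
    start = None
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            if start is None:
                start = i          # a new run of bad bytes begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # run extends to the end of the buffer
        runs.append((start, min(len(expected), len(actual)) - start))
    return runs


def cache_line_aligned(run, line=64):
    """True if a run starts on a line boundary and spans whole lines."""
    start, length = run
    return start % line == 0 and length % line == 0
```

A single run of (0, 512) means the whole sector is wrong. Note that stale data which happens to match the expected pattern at a few offsets will split one damaged region into several runs, so short gaps between runs deserve a second look.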


> >>>> 13) Problems occur in seemingly random locations. Between all test 
> >>>> runs where the block numbers of problem data are available, it has 
> >>>> yet to happen twice in the same location.
> >>>>
> >>>> 14) In all observed cases where reading predictable content, the 
> >>>> errors can be described as 'stale' or recently read data being seen 
> >>>> again. Sometimes the total 512 byte content is recognizable as data 
> >>>> that was previously read. Somewhat rarer is 2 different 256 byte 
> >>>> chunks, both previously seen. Rarer still is data previously seen 
> >>>> with a few bytes of garbage.

The "few bytes of garbage" sounds like either some other kernel
code trashing the buffer, or some issue at the level of bus or
cache transactions colliding and behaving wrong.

I think the patterns of the other data errors should be informative.
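
To tally those patterns over many runs, something like the following Python sketch could sort each bad sector into the cases described in item 14. It assumes the test program keeps a list of recently read sectors; the function name and labels are made up for illustration.

```python
def classify_bad_sector(bad, recent, half=256):
    """Label a bad 512-byte sector against a list of recently read sectors."""
    if bad in recent:
        return "stale: whole sector seen before"
    # Collect every 256-byte half of every recently read sector.
    halves = {s[i:i + half] for s in recent for i in range(0, len(s), half)}
    if bad[:half] in halves and bad[half:] in halves:
        return "stale: two previously seen halves"
    return "other: not fully explained by recent reads"
```

The last label also catches the "stale plus a few bytes of garbage" case, since this sketch only does exact matching; telling those apart would need a byte-level comparison against each recent sector.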


> >>>> I have done some searches on mailing list archives for 
> >>>> Linux-usb-users, Linux-usb-devel and lkml archives but not seen 
> >>>> anything that seems to be a good match. I have googled for similar 
> >>>> reports without any luck -- too many false hits.
> >>>>
> >>>>
> >>>> I wish to pursue this before it hurts someone. I am planning these 
> >>>> next steps, in reverse order:
> >>>>
> >>>> 1) Thinking about getting one of the USB-PC to USB-PC active 
> >>>> connection cables and seeing if I can reproduce problem via TCP/IP

The problem addressed by that previous patch was first uncovered
with such a cable, but as a rule the networking layer is a bit too
fault-tolerant to let a corrupted packet get in its way!  ;)


> >>>> 2) I have ordered a PCI USB 2.0 host adapter. I wanted one anyway 
> >>>> for older box but will try on NForce2 boxes.

The southbridge EHCI versions seem to be faster than most of
the add-on products.  If speed is a factor, they might just
dodge the issue that way.

- Dave


