David,
My response is intermingled with your reply below; I began writing it while I was already taking action. I would appreciate it if you scanned through most of this, as I have posed a couple of questions in various places. However, we have major progress. Your suggestion about further restricting the mask seems to have done the trick. I did this in -bk8. I have not (yet) been able to toggle this on and off to demonstrate direct control, as I am remotely logged in to the colleague's machine and doing some bulk work, so I don't want to reboot. However, I have every reason to believe that this is a 'surgical' fix.
I assume that this is disabling what would otherwise be 'a good thing (tm)'. I look forward to any elaboration you can provide. I am willing to continue to work the issue.
I have only two complete passes through the 11.5GB drive. However I would have normally logged about 1000 errors during those passes.
I will go to the point of trying a quick change back and forth of the mask. Is there anything to keep me from stashing a copy of the one module and switching from boot to boot? I seem to remember something about a new feature of module checksum but I think the default was not to use them. For that matter can I rmmod and subsequently modprobe to avoid full reboots?
Also, I have another NForce2 mobo, also ASUS but a different model. It exhibited the symptom too. I have yet to try the change on it. For that matter, I have been working exclusively with just one of the two 5.25 FF enclosures. They are different brands and slightly different in appearance. I will try the other soon, although it has the same ALi chip and, apparently, the same circuit board. I will get the 2.5, which uses a different ALi chip, going soon.
There are a couple of questions at the end of my reply. Also I would love to hear your response to the current state of my original item 7.
Dale
David Brownell wrote:
On Monday 13 December 2004 12:00 am, Dale Manny wrote:
The other problems were strictly of my own doing. I think I have those issues under control. One thing that I will probably be doing is expanding my / filesystem on the main box that I am using for testing. I have not checked for sure, but it appears that make modules_install doesn't really care if you exhaust the available space. The results are 'not good'. If you don't have your modules, or all of them, all hell breaks loose. I have had to use another bootable partition in order to recover.
... I am now running 2.6.10-bk6 without distracting other issues. However my data corruption problem continues to exist.
You may well have multiple problems here.
I have done this more than once since my investigation into this USB issue started. I intend to verify the lack of error detection before all is done.
I am fighting an issue of silent corruption of data from certain USB ATA drive enclosures only on certain motherboards. ...
1) Everything otherwise working well, hotplug auto identification, etc. Only issue is sporadic corruption of data read from enclosure when used with any of the USB ports on specific motherboards and only with specific USB enclosures. Other than the bad data, there is _no_ indication of failure.
You say "corruption of data read". How do you know that the problem is on the read side? Can you talk to the raw block device (uncached) and verify that's the issue -- that the data is right on the disk, and on the wire, but sometimes reads wrong?
I originally placed the pattern on my current test drive, some 11.5GBytes, with the Toshiba notebook. I computed the md5sum of the pattern 'on the fly' and confirmed that reading the raw /dev/sda yielded the same results. I then moved back to the NForce2 machine. By this time the NForce2 was running 2.6.9. My pattern generating program also doubles as its own verifier. First I verified the nature of the problem without ever writing to the drive. Multiple reads did not show the corruption to exist at the same place twice, let alone to contain the same particular unexpected content.
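To make the 'on the fly' md5sum concrete, here is a minimal sketch of hashing a stream in chunks so that a whole drive never has to fit in memory. The function name and chunk size are hypothetical, and an in-memory buffer stands in for the raw /dev/sda so the example is harmless to run.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 of everything readable from stream, read in chunks."""
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Stand-in for open("/dev/sda", "rb"); the buffer content is arbitrary.
    fake_drive = io.BytesIO(b"\x00" * 4096)
    print(md5_of_stream(fake_drive))
```

Running this twice over the same device and comparing the two digests is the repeatability check described above; a mismatch says something changed between reads, but not where.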
Eventually I wanted to know how much corruption would occur if I sourced the pattern onto the drive using the suspect system. I did so and the results were particularly interesting, at least to me. Even though I continue to get about 500 errored blocks over the 11.5Gbyte drive, these error results are never repeated for any one block from scan to scan. Consequently, an argument can be made that the pattern was emitted correctly and that it is just being read incorrectly. The possible fallacy there is that I did not bias the pattern to make it different than the one laid down on the Toshiba. Like many things, explaining is itself enlightening. I have been assuming that writes have been 100% successful on the strength of read results alone. I would not see a write that was missed entirely as the previous content was identical. I will soon correct the oversight.
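For illustration only, a self-verifying pattern with a 'bias' could look something like the sketch below. This is not the actual test program; the block layout, the function names and the bias parameter are all assumptions. Each 512-byte block encodes its own block number plus a bias value, so a block written on one machine can be told apart from the same block written on another.

```python
import struct

BLOCK_SIZE = 512


def make_block(block_no, bias=0):
    """Build one 512-byte block whose content depends only on block_no and bias."""
    word = struct.pack("<QQ", block_no, bias)  # 16 bytes, repeated to fill the block
    return word * (BLOCK_SIZE // len(word))


def find_bad_blocks(data, first_block=0, bias=0):
    """Return the block numbers in data that do not match the expected pattern."""
    bad = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block_no = first_block + offset // BLOCK_SIZE
        if data[offset:offset + BLOCK_SIZE] != make_block(block_no, bias):
            bad.append(block_no)
    return bad


if __name__ == "__main__":
    image = bytearray(b"".join(make_block(n, bias=1) for n in range(8)))
    # Simulate a 'stale data' error: block 5 reads back as a copy of block 2.
    image[5 * BLOCK_SIZE:6 * BLOCK_SIZE] = make_block(2, bias=1)
    print(find_bad_blocks(bytes(image), bias=1))
```

Verifying with a different bias than the one used for writing flags every block, which is how a write that was silently missed would show up.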
As far as "the raw (uncached) block device", do you mean anything other than simply using /dev/sda?
Longer answer and a little history, may skip if necessary:
A further reason why I say that only the read data seems to be affected is as follows:
I was very pleased with my initial use of the first of these enclosures with a 120GByte drive. I began using it as an active ext2 filesystem, in two partitions. The primary use that I put this new space to was the extraction of video data from a Tivo via TCP utility programs. Because I am a suspicious bastard, I make the machines do double duty, extracting twice. I do md5sums 'on the fly' of this data. I have many gigabytes of this data in this filesystem awaiting editing and DVD production.
The weird thing is that this level of processing has been in place prior to the use of the external drives. I never had an issue with the data being processed in this way. This is why I will eventually try operations against a single partition within the drive.
My first indication of a problem was coincident with my purchase of the non-NForce2 system in play, the Toshiba notebook. Of course it came with WinXP in an all-inclusive partition. Because I have had good results with bulk operations using the likes of dd on many different drives (expanding the Tivos, etc.), I sought to back up the Toshiba's image. I used Knoppix to pump out the virgin image of the Toshiba prior to ever booting Windoze.
Most of the utilized space of the image was unfortunately in the midst of the single partition. I decided to use PartitionMagic on a different Windoze box to rearrange this to my satisfaction. I tried to use the external enclosure to land the 60G image but, try as I might, I could not verify the integrity of the copy. I finally gave up and moved the drive to an internal ATA connector. After this experience I curtailed use of the enclosure for my video processing, good results or not.
I have also been working through old drives, intending to mothball them with documentation as to the fact that they had had their contents archived. In this process I discovered that bulk results were unstable.
2) Only observed on ASUS NForce2 systems mobo USB ports. Have only two such mobo, both have problem but are different models.
If it's just reads, that could also be a symptom of memory corruption
or failures. Does memtest86 say your memory is OK? Does anything
other than USB show similar problems?
As you may have gathered, I don't think so but I keep memtest86 in grub and will schedule a run soon. It has been a while since I did it. I wish I had an ECC setup. I very grudgingly went away from having that functionality.
Of course the recent progress makes me think this is a less likely possibility. I would be interested to know if you still think this is of value. I will probably still do this, it is just a question of when.
Of course that is the system that works but here goes. BTW it is a Toshiba Satellite model A65-S126. I went with SuSE here because I found a post where someone else had excellent results with that distro on that exact model. A few rough edges but very usable.
3) Using a hub between motherboard and enclosure does not seem to affect problem.
4) Same cable/enclosure/disk combinations can be moved to Toshiba laptop without symptom being present. This is the only other USB 2.0 host interface readily available on site.
Whose EHCI silicon does the laptop have? ("lspci -v").
lspci -v
0000:00:00.0 Host bridge: ATI Technologies Inc: Unknown device cab3 (rev 05)
Flags: bus master, 66Mhz, medium devsel, latency 64
Memory at b0000000 (32-bit, prefetchable)
Memory at b4000000 (32-bit, prefetchable) [size=4K]
Capabilities: [a0] AGP version 2.0
0000:00:01.0 PCI bridge: ATI Technologies Inc PCI Bridge [IGP 340M] (prog-if 00 [Normal decode])
Flags: bus master, 66Mhz, medium devsel, latency 64
Bus: primary=00, secondary=01, subordinate=01, sec-latency=32
I/O behind bridge: 0000c000-0000dfff
Memory behind bridge: e0000000-efffffff
Prefetchable memory behind bridge: a0000000-afffffff
Expansion ROM at 0000c000 [disabled] [size=8K]
0000:00:13.0 USB Controller: ATI Technologies Inc: Unknown device 4347 (rev 01) (prog-if 10 [OHCI])
Subsystem: Toshiba America Info Systems: Unknown device ff10
Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 5
Memory at f0001000 (32-bit, non-prefetchable)
0000:00:13.1 USB Controller: ATI Technologies Inc: Unknown device 4348 (rev 01) (prog-if 10 [OHCI])
Subsystem: Toshiba America Info Systems: Unknown device ff10
Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 5
Memory at f0002000 (32-bit, non-prefetchable)
0000:00:13.2 USB Controller: ATI Technologies Inc: Unknown device 4345 (rev 01) (prog-if 20 [EHCI])
Subsystem: Toshiba America Info Systems: Unknown device ff10
Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 5
Memory at f0003000 (32-bit, non-prefetchable)
Capabilities: [dc] Power Management version 2
0000:00:14.0 SMBus: ATI Technologies Inc: Unknown device 4353 (rev 18)
Subsystem: Toshiba America Info Systems: Unknown device ff10
Flags: 66Mhz, medium devsel
I/O ports at e000
Memory at f0000000 (32-bit, non-prefetchable) [size=1K]
0000:00:14.1 IDE interface: ATI Technologies Inc: Unknown device 4349 (prog-if 8a [Master SecP PriP])
Subsystem: Toshiba America Info Systems: Unknown device ff10
Flags: bus master, medium devsel, latency 64, IRQ 11
I/O ports at <unassigned>
I/O ports at <unassigned>
I/O ports at <unassigned>
I/O ports at <unassigned>
I/O ports at 8070 [size=16]
0000:00:14.3 ISA bridge: ATI Technologies Inc: Unknown device 434c
Subsystem: Toshiba America Info Systems: Unknown device ff10
Flags: bus master, 66Mhz, medium devsel, latency 0
0000:00:14.4 PCI bridge: ATI Technologies Inc: Unknown device 4342 (prog-if 01 [Subtractive decode])
Flags: bus master, 66Mhz, medium devsel, latency 64
Bus: primary=00, secondary=02, subordinate=02, sec-latency=32
I/O behind bridge: 0000a000-0000bfff
Memory behind bridge: d0000000-dfffffff
Prefetchable memory behind bridge: 90000000-9fffffff
0000:00:14.5 Multimedia audio controller: ATI Technologies Inc: Unknown device 4341
Subsystem: Toshiba America Info Systems: Unknown device ff10
Flags: bus master, 66Mhz, slow devsel, latency 64, IRQ 10
Memory at f0000400 (32-bit, non-prefetchable)
0000:00:14.6 Modem: ATI Technologies Inc: Unknown device 434d (rev 01) (prog-if 00 [Generic])
Subsystem: Toshiba America Info Systems: Unknown device 0001
Flags: 66Mhz, slow devsel, IRQ 10
Memory at f0000500 (32-bit, non-prefetchable)
0000:01:05.0 VGA compatible controller: ATI Technologies Inc: Unknown device 4437 (prog-if 00 [VGA])
Subsystem: Toshiba America Info Systems: Unknown device ff10
Flags: bus master, stepping, 66Mhz, medium devsel, latency 64, IRQ 11
Memory at a0000000 (32-bit, prefetchable)
I/O ports at c000 [size=256]
Memory at e0000000 (32-bit, non-prefetchable) [size=64K]
Capabilities: [58] AGP version 2.0
Capabilities: [50] Power Management version 2
0000:02:04.0 Ethernet controller: Unknown device 168c:0013 (rev 01)
Subsystem: Askey Computer Corp.: Unknown device 7064
Flags: bus master, medium devsel, latency 168, IRQ 10
Memory at d0010000 (32-bit, non-prefetchable)
Capabilities: [44] Power Management version 2
0000:02:06.0 CardBus bridge: Texas Instruments PCI1410 PC card Cardbus Controller (rev 02)
Subsystem: Toshiba America Info Systems: Unknown device ff10
Flags: bus master, medium devsel, latency 168, IRQ 5
Memory at 10000000 (32-bit, non-prefetchable)
Bus: primary=02, secondary=03, subordinate=06, sec-latency=176
Memory window 0: 10400000-107ff000 (prefetchable)
Memory window 1: 10800000-10bff000
I/O window 0: 00004000-000040ff
I/O window 1: 00004400-000044ff
16-bit legacy interface ports at 0001
0000:02:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
Subsystem: Toshiba America Info Systems: Unknown device ff10
Flags: bus master, medium devsel, latency 64, IRQ 11
I/O ports at a000
Memory at d0000000 (32-bit, non-prefetchable) [size=256]
Capabilities: [50] Power Management version 2
5) Present with all ATA disks from my collection that I have tried. Would estimate 8 different drives used.
6) Only seen on ALi based USB enclosures. I have tried two essentially identical 5.25 inch units and one 2.5 inch. The 5.25 are based on the ALi 5621 and the 2.5 inch on the ALi 5642. While these are different, they are probably related designs. These have the fastest total throughput of any devices I have available.
Interesting. The NForce2 EHCI has a "park" mode which can often
give it faster results, but which I've been suspecting may also
make trouble for some peripherals that can't handle data as fast
as the host tries to pass it. In ehci-hcd.c::ehci_start() there's
a mask "0x0fff" near a comment about irq latency; try changing
that to 0x00ff (disabling park mode).
As reported above, this seems effective, but at what cost?
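As a rough illustration of what the two mask values keep, the sketch below applies 0x0fff and 0x00ff to a sample register value. The bit positions are my reading of the EHCI specification's USBCMD register (park mode count in bits 9:8, park mode enable in bit 11) and should be treated as an assumption; the sample value and helper names are hypothetical and are not taken from ehci-hcd.c.

```python
# Bit positions in the EHCI USBCMD register, as read from the EHCI spec (assumption).
PARK_COUNT_MASK = 0x3 << 8   # Asynchronous Schedule Park Mode Count, bits 9:8
PARK_ENABLE = 1 << 11        # Asynchronous Schedule Park Mode Enable, bit 11


def apply_mask(command, mask):
    """Keep only the register bits selected by mask."""
    return command & mask


def park_enabled(command):
    """Report whether the park mode enable bit survives in command."""
    return bool(command & PARK_ENABLE)


if __name__ == "__main__":
    sample = 0x00080B00  # hypothetical register value: park enabled, park count 3
    for mask in (0x0FFF, 0x00FF):
        kept = apply_mask(sample, mask)
        print(f"mask {mask:#06x}: result {kept:#06x}, park enabled: {park_enabled(kept)}")
```

Under that reading, the narrower mask drops bits 8 through 11, so the park settings are cleared before the irq latency is set.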
Love to. I am comfortable with protocol analyzers in general but do not have any experience with or access to any USB diag equipment. If you have any suggestions toward gaining the use of one, I would be willing to consider them. I am in Kansas City and do not travel much. Rental would probably be prohibitive, overshadowing the cost of the hardware I am working with. It wouldn't take long to demonstrate the problem. I am open to suggestions.
7) Not seen when a borrowed Maxtor external firewire/USB drive was attached via USB to problematic NForce2 system. This unit is based on an Oxford Semi 991FW chip. This unit's transfer rate is just slightly lower than the ALi based ones. Also not seen when using Sandisk USB 2.0 flash drive.
I suppose it's probably too much to expect you to be able to just
capture the USB traffic (say with a CATC) and show what's happening
on the wire at the time the error is detected in memory ... ;)
That is always the game --getting independent verification. Is there any possible inexpensive approach roughly analogous to the 'promiscuous mode' of Ethernet adapters? The physical tapping of the signal would not be too hard.
8) Seen with several 2.6 kernels. Was running 2.6.8.1 when first encountered. Hi-speed USB did not work on this hardware prior to 2.6.8 due to interrupt issue. Have tried latest stable 2.6.9 in hopes of clean operation but no change. (Now tried 2.6.10-rc3, no change.)
Or rc3-bk6, hmm.
Saw some activity in the BK8 patch set in the vicinity of drivers/usb/host/ehci_____.c. Got it. Made it. Tried it. Pretty much the same results, but the lowest count to date for a full read of the 11.5GB test drive. It came in at 407 but this is not radically better. The range had previously been 430 to 608.
I missed any BK7.
9) First observed as lack of repeatability of md5sum of entire drive (/dev/sda) while archiving off old drive content. In order to better understand the nature of instability, I have written a simple program to generate a fixed test pattern and subsequently verify it.
10) Errors on read data only. Test pattern can be sourced on problem system and found to be valid on reading.
11) Rate of errors to total is seemingly consistent between runs -- about 0.002 %
But I'm unclear: can you verify the problem came on the read
side, once the host processed the data ... rather than appearing
on the wire? Or on the disk?
I hope the above clarifies this. Let me know if more info is needed.
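The rate quoted in item 11 can be sanity-checked with a little arithmetic. The sketch below assumes the 11.5GB figure means decimal gigabytes and that every error is exactly one 512-byte block; the function name is hypothetical.

```python
DRIVE_BYTES = 11.5e9   # assumes decimal gigabytes
BLOCK_SIZE = 512


def error_rate_percent(bad_blocks, drive_bytes=DRIVE_BYTES, block_size=BLOCK_SIZE):
    """Percentage of blocks in one full-drive read that came back wrong."""
    total_blocks = drive_bytes / block_size
    return 100.0 * bad_blocks / total_blocks


if __name__ == "__main__":
    # Per-pass error counts mentioned in this thread.
    for count in (407, 500, 608):
        print(f"{count} bad blocks -> {error_rate_percent(count):.4f} %")
```

With roughly 22 million blocks on the drive, a few hundred bad blocks per pass lands in the neighborhood of 0.002 %.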
12) Errors occur on 512 byte boundaries, widely and randomly separated. None seen in consecutive blocks - yet - but my feeling is that this is just due to sparseness of problem.
512 is an interesting number because it's a single packet. Is usb-storage reporting errors of any kind?
Are the errors "all 512 bytes are wrong"?
How about "multiples of <cache line size> bytes are wrong?"
I had hoped that each block was in a single packet. But remember, I
13) Problems occur in seemingly random locations. Between all test runs where the block numbers of problem data are available, it has yet to happen twice in the same location.
14) In all observed cases where reading predictable content, the errors can be described as 'stale' or recently read data being seen again. Sometimes the total 512 byte content is recognizable as data that was previously read. Somewhat rarer is 2 different 256 byte chunks, both previously seen. Rarer still is data previously seen with a few bytes of garbage.
The "few bytes of garbage" sounds like either some other kernel code trashing the buffer, or some issue at the level of bus or cache transactions colliding and behaving wrong.
I think the patterns of the other data errors should be informative.
I will provide the results of a full run. Perhaps that might be illuminating. I will be addressing some cosmetics and the aforementioned 'biasing' of the pattern.
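One way the error categories from item 14 might be tallied in such a run is sketched below. It is an assumption-laden illustration, not the existing verifier: it only compares a bad block against aligned 256-byte halves of a caller-supplied list of recently read blocks, and the labels and function name are hypothetical.

```python
BLOCK_SIZE = 512
HALF = BLOCK_SIZE // 2


def classify_bad_block(bad, history):
    """Label a corrupted 512-byte block by comparing it to recently read blocks."""
    if bad in history:
        return "stale-full"
    # Collect both aligned 256-byte halves of every recently read block.
    halves = {blk[i:i + HALF] for blk in history for i in (0, HALF)}
    if bad[:HALF] in halves and bad[HALF:] in halves:
        return "stale-halves"
    return "other"


if __name__ == "__main__":
    history = [bytes([1]) * BLOCK_SIZE, bytes([2]) * BLOCK_SIZE]
    mixed = bytes([1]) * HALF + bytes([2]) * HALF
    for candidate in (history[0], mixed, bytes([9]) * BLOCK_SIZE):
        print(classify_bad_block(candidate, history))
```

The "few bytes of garbage" case would fall under "other" here; telling it apart would need a byte-level comparison against the closest stale block.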
I have done some searches on mailing list archives for Linux-usb-users, Linux-usb-devel and lkml archives but not seen anything that seems to be a good match. I have googled for similar reports without any luck -- too many false hits.
I wish to pursue this before it hurts someone. I am planning these next steps, in reverse order:
1) Thinking about getting one of the USB-PC to USB-PC active connection cables and seeing if I can reproduce problem via TCP/IP
The problem addressed by that previous patch was first uncovered
with such a cable, but as a rule the networking layer is a bit too
fault-tolerant to let a corrupted packet get in its way! ;)
I assume when you refer to 'that previous patch' that you might be referring to something that went by before I started monitoring either usb-users or -devel. I am tracking now. I grok layered network design. My feeling is that it is something that might be accomplished in the Linux kernel. This is just a gut feeling based on experience with other protocols.
My most recent feeling about getting one of these cables is that I don't know if it would prove anything. For starters, I almost certainly would have to have some sort of UDP type program. TCP would probably just deal with it. I do not know how the same error symptom would be expressed.
I fear that the speed issue might indeed mask the symptom. It is suspicious when the fastest devices are the only ones to exhibit the problem. This is why I mention the relative speed of the ALi vs. the Maxtor or the flash drive. I will be willing to continue to pursue this even if adding a PCI USB card allows the unit to work on those ports.
2) I have ordered a PCI USB 2.0 host adapter. I wanted one anyway for an older box but will try it on the NForce2 boxes.
The southbridge EHCI versions seem to be faster than most of the add-on products. If speed is a factor, they might just dodge the issue that way.
New points/questions:
In my thinking about this problem, I have it in my mind that the motherboard BIOS really does not enter into the performance beyond the point of establishing interrupts, etc. Do you agree? The reason that I ask is that I may change CPUs and may need to upgrade the BIOS.
For future reference, I am in the US Central time zone but work at home. My response time is all over the clock but generally out from 1-7 AM.
Dale