Re: [usb-storage] Re: [Linux-usb-users] Data phase error not solved in 2.6.9-final

Pat LaVarre Fri, 29 Oct 2004 13:11:33 -0700

// Everyone:

I still don't have any "good" ideas ... we've got a long haul ahead of us if we only solve this as I know how ... would be goodness to get Genesys to tell us what they think those SK ASC ASCQ mean ...

// Andre S:

Pat, do you need more lines before the error occurred?

I'm happy - you've found for me exactly the lines that interested me - you've got my vote for adding this filter to the readily available configurations of usb-storage debug.

Oct 29 15:17:36 titanic kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Oct 29 15:17:36 titanic kernel: Current sda: sense key Hardware Error
Oct 29 15:17:36 titanic kernel: Additional sense: Data phase error
Oct 29 15:17:36 titanic kernel: end_request: I/O error, dev sda, sector 53190608

Good. Decimal 53,190,608 = x 032B:9FD0 = x 03 2b 9f d0, i.e., the failing sector correlates perfectly with the LBA of the READ_10 that drew Stat 0x1 in:

Oct 29 15:17:36 titanic kernel: usb-storage: 28 00 03 2b 9e 10 00 00 40 00 ...
Oct 29 15:17:36 titanic kernel: usb-storage: 28 00 03 2b 9e 50 00 00 40 00 ...
Oct 29 15:17:36 titanic kernel: usb-storage: 28 00 03 2b 9e 90 00 00 40 00 ...
Oct 29 15:17:36 titanic kernel: usb-storage: 28 00 03 2b 9e d0 00 00 40 00 ...
Oct 29 15:17:36 titanic kernel: usb-storage: 28 00 03 2b 9f 10 00 00 40 00 ...
Oct 29 15:17:36 titanic kernel: usb-storage: 28 00 03 2b 9f 50 00 00 40 00 ...
Oct 29 15:17:36 titanic kernel: usb-storage: 28 00 03 2b 9f 90 00 00 40 00 ...
Oct 29 15:17:36 titanic kernel: usb-storage: Bulk Command S 0x43425355 T 0xcd5f9 L 32768 F 128 Trg 0 LUN 0 CL 10
Oct 29 15:17:36 titanic kernel: usb-storage: Bulk Status S 0x53425355 T 0xcd5f9 R 0 Stat 0x0
Oct 29 15:17:36 titanic kernel: usb-storage: Command READ_10 (10 bytes)
Oct 29 15:17:36 titanic kernel: usb-storage: 28 00 03 2b 9f d0 00 00 40 00
Oct 29 15:17:36 titanic kernel: usb-storage: Bulk Command S 0x43425355 T 0xcd5fa L 32768 F 128 Trg 0 LUN 0 CL 10
Oct 29 15:17:36 titanic kernel: usb-storage: Bulk Status S 0x53425355 T 0xcd5fa R 0 Stat 0x1
Oct 29 15:17:36 titanic kernel: usb-storage: Bulk Command S 0x43425355 T 0x800cd5fa L 18 F 128 Trg 0 LUN 0 CL 6
Oct 29 15:17:36 titanic kernel: usb-storage: Bulk Status S 0x53425355 T 0x800cd5fa R 0 Stat 0x0
Oct 29 15:17:36 titanic kernel: usb-storage: Command READ_10 (10 bytes)
Oct 29 15:17:36 titanic kernel: usb-storage: 28 00 03 2b 9f d8 00 00 38 00
Oct 29 15:17:36 titanic kernel: usb-storage: Bulk Command S 0x43425355 T 0xcd5fb L 28672 F 128 Trg 0 LUN 0 CL 10
Oct 29 15:17:36 titanic kernel: usb-storage: Bulk Status S 0x53425355 T 0xcd5fb R 0 Stat 0x0
...
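For anyone who wants to check the sector arithmetic without a hex calculator, here is a minimal Python sketch. The helper name lba_to_cdb_bytes is hypothetical (it is not part of usb-storage or pldd); it just formats a sector number as the four big-endian LBA bytes that appear in bytes 2..5 of a READ(10) CDB.

```python
def lba_to_cdb_bytes(lba: int) -> str:
    """Format a sector number as the four big-endian LBA bytes of a READ(10) CDB.

    Hypothetical helper for illustration only.
    """
    return " ".join(f"{b:02x}" for b in lba.to_bytes(4, "big"))


# Sector reported by end_request in the log above.
print(lba_to_cdb_bytes(53190608))  # prints "03 2b 9f d0"
```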

Sorry to say, our faith in Linux now looks concretely justified - so we still need a new theory to explain the "Data phase error".

Specifically, these traces appear legit: CBW and CDB in agreement. CDB x40 blocks * implicit 512 bytes per block = 64 * 0.5 Ki = 32 Ki = L 32768 of CBW. Also CBW F 128 = In = CDB x28 "READ". You'll notice Linux broke this pattern after the first CL 6 weirdness: it then switched into CDB x38 blocks = L 28672 of CBW, as if it had chosen to trust the first 8 blocks of data it collected during that first failing read at x32B9FD0.
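The CBW/CDB agreement check above can be sketched in Python. This assumes the standard READ(10) layout (opcode x28, LBA in bytes 2..5, block count in bytes 7..8, all big-endian), the implicit 512 bytes per block, and that CBW flag bit x80 (F 128) means data-in. The function names are hypothetical, chosen only for this illustration.

```python
BLOCK_SIZE = 512    # assumption: the implicit bytes-per-block used above
CBW_FLAG_IN = 0x80  # CBW flags bit 7 set = device-to-host, i.e. F 128


def parse_read10(cdb_hex: str):
    """Return (lba, block_count) from a READ(10) CDB given as spaced hex."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks


def cbw_agrees(cdb_hex: str, cbw_length: int, cbw_flags: int) -> bool:
    """True if the CBW length and direction match what the READ(10) CDB asks for."""
    _, blocks = parse_read10(cdb_hex)
    return blocks * BLOCK_SIZE == cbw_length and bool(cbw_flags & CBW_FLAG_IN)


# The failing read and the retry that followed it, as logged above.
for cdb_hex, length in [
    ("28 00 03 2b 9f d0 00 00 40 00", 32768),
    ("28 00 03 2b 9f d8 00 00 38 00", 28672),
]:
    lba, blocks = parse_read10(cdb_hex)
    print(f"lba=x{lba:08X} blocks=x{blocks:02X} agrees={cbw_agrees(cdb_hex, length, 128)}")
```

Both entries agree, and the second LBA is exactly 8 blocks past the first, with x40 - 8 = x38 blocks left to fetch.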

See the repeating x 1 5 9 D pattern in the x 03 2b 9e 10 .. x 03 2b 9f 90 addresses? That tells us you've persuaded Linux to fetch from a continuous sequence of addresses, x40 at a time. x40 added to x10 modulo x100 is x 10 50 90 D0 10 50 90 D0 forever.
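The same pattern falls out of a few lines of Python. This is only an illustration of the modulo arithmetic, with a hypothetical helper name; the start address and step are the ones from the trace.

```python
def low_bytes(start_lba: int, step: int, count: int):
    """Low byte of each LBA when reading `step` blocks at a time from `start_lba`."""
    return [f"{(start_lba + i * step) & 0xFF:02X}" for i in range(count)]


# Eight consecutive x40-block reads starting at x032B9E10, as in the trace.
print(low_bytes(0x032B9E10, 0x40, 8))
# prints ['10', '50', '90', 'D0', '10', '50', '90', 'D0']
```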

(I trust you'll tell me if I'm not yet making sense to you.)

Working from memory, I believe a similar pldd command is:

pldd -v if=/dev/sg9 bs=32768 sbs=512 skip=x32B9E10 count=9 >/dev/null

This command is slightly more complex than you saw before because in this simulation we have to recreate the rudeness of misalignment. The pldd way of doing that is to introduce the sbs parameter that dd does not define. Patch the pldd source to drop the timeout to 30s and drop the count= from the command line, and you should find you have reproduced this traffic at the lower, more determinate level.

To arrive at our new theory to explain the "Data phase error", I'd like to know if misalignment is significant, i.e., whether we provoke an error or timeout regardless of whether we use sbs= and skip= to misalign. I suspect the misalignment is not significant - Linux just hasn't yet made a practice of valuing alignment - and that we will see a random mix of error and timeout whether we align the I/O or not. But I can't be sure unless we try.

I'd also like to know if repeatedly reading just that region of the disk recreates the trouble, or if rereading a dozen megabytes that contain this spot recreates the trouble, or if we have to read the whole disk repeatedly to see trouble. And I'd like to know how much the location of the trouble varies.
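For the "reread just this region" experiment, a rough Python sketch follows. It is not a substitute for pldd: it reads through the ordinary file interface, so on a block device such as /dev/sda the page cache may satisfy the rereads without touching the drive, whereas pldd on /dev/sg talks to the device directly. The function name, the parameters, and the 64-block chunk size are assumptions made for illustration.

```python
import os


def reread_region(path, start_block, block_count, passes,
                  block_size=512, chunk_blocks=64):
    """Read the same block range `passes` times.

    Returns a list of (pass_number, block, reason) for every chunk that
    failed or came back short. Hypothetical helper, not part of pldd.
    """
    failures = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = start_block + block_count
        for n in range(passes):
            block = start_block
            while block < end:
                want = min(chunk_blocks, end - block) * block_size
                try:
                    got = os.pread(fd, want, block * block_size)
                    if len(got) != want:
                        failures.append((n, block, "short read"))
                except OSError as exc:
                    failures.append((n, block, exc.strerror))
                block += chunk_blocks
    finally:
        os.close(fd)
    return failures


# Hypothetical usage, covering the reads in the trace above:
# reread_region("/dev/sda", 0x032B9E10, 9 * 64, passes=100)
```

Comparing the block numbers in the returned list across passes would show how much the location of the trouble moves.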

We might also want to try looking for explanatory patterns with the pldd options for triggering when any particular read is slow - I trust you'll tell me if I never got around to posting that revision of the code on to the web. :)

Pat LaVarre

