Andres S:

I patched pldd: int minSeconds = 30;

Thank you.

Please remind us how, with the default 28 hour pause, you came to think you had seen a hang and resorted to shutting down and restarting? That's the test we want to try again with a short timeout, to see if Linux will tell us something more useful when we quit with a timeout rather than with a shutdown. Quitting via the SIGINT of Ctrl+C is another alternative, but I've never yet seen that third way be more useful in Linux; I've only seen it crash kernels.

date ; time /root/pldd/pldd if=$1 bs=32768 sbs=512 skip=x32B9E10 >/dev/null
...
x 28 00 08 47 A4 10 00 00 40 00 .. .. .. .. .. .. "(@HG$P@@@@"
x 70 00 04 00 00 00 00 0A 00 00 00 00 4B 00 00 00 "p@D@@@@J@@@@K@@@"
x 00 00 .. .. .. .. .. .. .. .. .. .. .. .. .. .. "@@"
43889983488 = 0xA380C0000 bytes copied, 43890016256 = 0xA380C8000 bytes tried ...
real 63m46.015s

Sense bytes [x2:C:D] & xF:FF:FF here = x 4 4B 00 = that data phase error, again - consistent, thank you.
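
For anyone following along, here is a minimal sketch in Python of the mask I applied above, assuming the 16 bytes pldd printed are fixed-format sense data. The names SENSE and sense_triple are mine, for illustration only; they are not part of pldd.

```python
# The sense bytes as pldd printed them above (fixed format assumed).
SENSE = bytes.fromhex("70 00 04 00 00 00 00 0A 00 00 00 00 4B 00 00 00")

def sense_triple(sense):
    """Return (sense key, ASC, ASCQ), i.e. bytes [x2:C:D] & xF:FF:FF."""
    key = sense[0x2] & 0x0F   # low nibble of byte 2
    asc = sense[0xC]          # additional sense code
    ascq = sense[0xD]         # additional sense code qualifier
    return key, asc, ascq

key, asc, ascq = sense_triple(SENSE)
print(f"x {key:X} {asc:02X} {ascq:02X}")   # prints "x 4 4B 00"
```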


To our list of intermittently failing addresses, we've added yet another apparently inconsistent sample, we now have: x 032B9FD0 0198EE50 05E833D0 0847A410. However, the x0847A410 we found only by starting at x32B9E10. Starting at x10 might have found something else.
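
The arithmetic behind x0847A410 can be checked as below: a small sketch, assuming the failing read begins immediately after the last byte pldd reports as copied. The function name failing_lba is hypothetical.

```python
SKIP_BLOCKS = 0x32B9E10        # skip= from the pldd command line
BYTES_COPIED = 43889983488     # "bytes copied" from the pldd output
SBS = 512                      # sbs= bytes per block

def failing_lba(skip_blocks, bytes_copied, sbs):
    """First block of the read that failed, counted in sbs-byte blocks."""
    assert bytes_copied % sbs == 0, "copied bytes should be whole blocks"
    return skip_blocks + bytes_copied // sbs

lba = failing_lba(SKIP_BLOCKS, BYTES_COPIED, SBS)
print(f"x{lba:08X}")           # prints "x0847A410"
```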

We also see sense byte[0] & x80 Valid clear, and bytes [3:4:5:6] INFO zeroed. That makes me wonder why we saw Linux choose to retry another 8 blocks along. 8 blocks * 0.5 KiB/block = 4 KiB = an x86 physical page, I notice ... but I also notice this I/O is not aligned to page boundaries. (Hey Alan, you still with us? Know anything of where/ when Linux chooses to retry SCSI over USB commands?)
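
As a sketch of that check, again assuming fixed-format sense data: with Valid clear, the INFO field is not claiming to name any block, so I see nothing in the sense data itself that points 8 blocks along. The name valid_and_info is mine.

```python
SENSE = bytes.fromhex("70 00 04 00 00 00 00 0A 00 00 00 00 4B 00 00 00")

def valid_and_info(sense):
    """Return (Valid bit, INFO field) from fixed-format sense data."""
    valid = bool(sense[0] & 0x80)              # byte[0] & x80
    info = int.from_bytes(sense[3:7], "big")   # bytes [3:4:5:6]
    return valid, info

print(valid_and_info(SENSE))   # prints "(False, 0)"
print(8 * 512)                 # 8 blocks of 0.5 KiB, prints "4096"
```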

a) It would be good for us to explicitly retry only the failing CDB (i.e., the failing command):

date ; time /root/pldd/pldd if=$1 bs=32768 sbs=512 skip=0x0847A410 count=1 >/dev/null

b) It would be good to retry that repeatedly, e.g., 1000 times, or even for 186 minutes, i.e., roughly three times as long as this took to trigger. Rather than looping slowly in bash, it would be preferable to patch pldd to retry rather than advance the address.

Retrying will tell us if a failure, once found, appears consistently, or instead needs the setup of a long stream of reads in front of it. Retrying rapidly in the same place would tell us if a long stream of reading the same place causes trouble as readily as moving the heads steadily along.
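
To make (b) concrete, here is the shape of loop I have in mind, sketched in Python rather than as a patch to pldd. Everything here is hypothetical: read_once stands in for "issue the one failing read again", and nothing below is taken from the pldd source.

```python
import time

def retry_same_place(read_once, max_tries=1000, max_seconds=None,
                     clock=time.monotonic):
    """Repeat one read without advancing the address.

    read_once() should return True on success and False on failure.
    Stops after max_tries reads, or once max_seconds have elapsed.
    Returns (tries, failures).
    """
    start = clock()
    tries = failures = 0
    while tries < max_tries:
        if max_seconds is not None and clock() - start >= max_seconds:
            break
        tries += 1
        if not read_once():
            failures += 1
    return tries, failures

# Stand-in for the real read: fails on every third call.
calls = {"n": 0}
def fake_read():
    calls["n"] += 1
    return calls["n"] % 3 != 0

print(retry_same_place(fake_read, max_tries=9))   # prints "(9, 3)"
```

A consistent failure would show up as failures equal to tries; an intermittent one as something in between.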

Mind you, to the usual caveats we should now add the rumour that we can damage a disk by telling it to concentrate in one area for hours, and then you need to decide whether you're willing to take on that risk or not.

Oct 30 15:21:23 titanic kernel: usb-storage: 28 00 08 47 a3 10 00 00 40 00 ...
Oct 30 15:21:23 titanic kernel: usb-storage: 28 00 08 47 a3 50 00 00 40 00 ...
Oct 30 15:21:23 titanic kernel: usb-storage: 28 00 08 47 a3 90 00 00 40 00 ...
Oct 30 15:21:23 titanic kernel: usb-storage: 28 00 08 47 a3 d0 00 00 40 00 ...

Thank you. At this lower, more determinate level, we have now reproduced the fetch of a sequence of x40 block chunks, misaligned.
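
A quick sketch of how I read those four lines, assuming each is a READ(10) CDB (opcode x28, LBA in bytes 2..5, block count in bytes 7..8). The helper name read10 is mine.

```python
CDBS = [
    "28 00 08 47 a3 10 00 00 40 00",
    "28 00 08 47 a3 50 00 00 40 00",
    "28 00 08 47 a3 90 00 00 40 00",
    "28 00 08 47 a3 d0 00 00 40 00",
]

def read10(cdb_hex):
    """Return (LBA, block count) from a READ(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    assert cdb[0] == 0x28, "not a READ(10)"
    lba = int.from_bytes(cdb[2:6], "big")
    count = int.from_bytes(cdb[7:9], "big")
    return lba, count

reads = [read10(c) for c in CDBS]

# Sequential: each read starts where the previous one ended.
sequential = all(lba + n == nxt for (lba, n), (nxt, _) in zip(reads, reads[1:]))

# Misaligned: no LBA is a multiple of the x40 block chunk size.
offsets = {lba % n for lba, n in reads}

print(sequential, [f"x{o:X}" for o in offsets])   # prints "True ['x10']"
```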


Oct 30 15:21:23 titanic kernel: usb-storage: 28 00 08 47 a4 10 00 00 40 00
Oct 30 15:21:23 titanic kernel: usb-storage: Bulk Command S 0x43425355 T 0x14816b L 32768 F 128 Trg 0 LUN 0 CL 10
Oct 30 15:21:23 titanic kernel: usb-storage: Bulk Status S 0x53425355 T 0x14816b R 0 Stat 0x1
Oct 30 15:21:23 titanic kernel: usb-storage: Bulk Command S 0x43425355 T 0x8014816b L 18 F 128 Trg 0 LUN 0 CL 6
Oct 30 15:21:23 titanic kernel: usb-storage: Bulk Status S 0x53425355 T 0x8014816b R 0 Stat 0x0
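
For reference, a sketch of how those Bulk Command/Bulk Status lines decode, assuming the fields are S = signature, T = tag, L = data length, F = flags, R = residue, and Stat = status as in the USB mass storage bulk-only wrappers. The parsing helpers are mine, written against these four lines only.

```python
import re
import struct

LOG = [
    "Bulk Command S 0x43425355 T 0x14816b L 32768 F 128 Trg 0 LUN 0 CL 10",
    "Bulk Status S 0x53425355 T 0x14816b R 0 Stat 0x1",
    "Bulk Command S 0x43425355 T 0x8014816b L 18 F 128 Trg 0 LUN 0 CL 6",
    "Bulk Status S 0x53425355 T 0x8014816b R 0 Stat 0x0",
]

CSW_STATUS = {0: "passed", 1: "failed", 2: "phase error"}

def signature_text(sig):
    """The 32-bit signature as the four ASCII bytes sent on the wire."""
    return struct.pack("<I", sig).decode("ascii")

def fields(line):
    """Split 'S 0x.. T 0x.. L 32768 ...' into a dict of name -> int."""
    pairs = re.findall(r"\b([A-Za-z]+) (0x[0-9a-fA-F]+|\d+)", line)
    return {name: int(value, 0) for name, value in pairs}

for line in LOG:
    f = fields(line)
    kind = signature_text(f["S"])
    if kind == "USBS":
        print(kind, hex(f["T"]), CSW_STATUS[f["Stat"]], "residue", f["R"])
    else:
        direction = "in" if f["F"] & 0x80 else "out"
        print(kind, hex(f["T"]), "length", f["L"], direction, "CDB length", f["CL"])
```

Read that way, the 32768-byte read fails (Stat 0x1), and the 18-byte, 6-byte-CDB command that follows, which I take to be the REQUEST SENSE, passes. In this log its tag is the tag of the failed read with the top bit set.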

Ah good. Life ends here, I presume, and therefore I conclude that by working at this lower, more determinate level, we've turned off the retries.


Do you need more runs of pldd maybe with other options?

I trip over how you chose to say this. Without help from Genesys, we can't be sure of finding a solution any time soon. The work we've done here makes it easier for Genesys folk to help now, but we can't hope for equally rapid success, so long as we work alone.


I stand by my claim: "I can only hope we'll learn more by digging more ... We know we could also try pldd/ sg_dd in Windows/ Linux with shortened/ default timeouts and lengthened/ default delays between commands. I can't promise we'll learn anything, I can only hope we'll hit on what works, or see some more explanatory consistency, like Windows failing as reliably as Linux when we ask that Windows reads the whole disk." Broken out into pieces, that English becomes, for example, the exploratory ideas (a) and (b) above and then also:

c) We could try pldd in Windows next.

d) We could try pldd again in Linux, but skip less initially:

date ; time /root/pldd/pldd if=$1 bs=32768 sbs=512 skip=0x10 >/dev/null

e) ...

I trust you'll tell me if/when I don't immediately appear to be making sense.

Pat LaVarre


