This message is from the T13 list server.
This is a little long and fairly technical...

In an offline email discussion I was told that R/W Long is needed by companies that buy disk drives because these companies don't trust the disk drive manufacturers, and R/W Long is the only way these buyers have to determine "error rates". All that sounded a little strange to me, so I checked with my expert friends, people who actually implement disk drive hardware and firmware (I do implement drive firmware, but mostly on the interface side of a device).

First let's talk about how a disk drive works...

Most drives have a PRML read channel. This channel pumps out a string of bits that is its best guess at what the analog read data for a sector represents. A PRML read channel is able to make guesses at the data because of the data encoding. For example, a PRML read channel might know that the analog data cannot represent three zeroes in a row, so one of those zero bits must be a one bit. No one I have talked to considers the decoding of the analog data by the PRML channel to be error correction. In a high performance, high capacity drive the PRML channel may be making many guesses each time a sector is read, and some of these guesses may be wrong. (One comment that caused me to generate this message was: when a PRML read channel must guess at the data decoding, this is considered a "soft error".)

If you are lucky, a drive implementing Read Long will return to the host the bit string produced by the PRML channel. However, not all drives do that; some drives may apply some form of ECC correction to the data even when the data is sent to the host via Read Long.

A PRML read channel can also detect when there is "missing data", data that just can't be read at all. This can happen for a number of reasons, a physical flaw in the media being one. In these cases the PRML channel will normally have error offset and error span information that can be used by the ECC correction or by the firmware.
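
To make the encoding example above concrete, here is a minimal Python sketch of that kind of constraint check: it flags every place where a decoded bit string contains three zeroes in a row. The three-zero limit is only the illustrative rule used in this message, and the function name is made up for the example. A real PRML channel works on analog samples with a far more elaborate detector.

```python
def find_zero_run_violations(bits, max_zero_run=2):
    """Return the start offset of every run of zeroes longer than max_zero_run.

    bits is a string of '0' and '1' characters.  The default limit of 2
    mirrors the "no three zeroes in a row" example; it is an assumption for
    illustration, not the rule of any particular drive.
    """
    violations = []
    run_start = None  # offset where the current run of zeroes began, if any
    for i, bit in enumerate(bits):
        if bit == "0":
            if run_start is None:
                run_start = i
        else:
            # A one bit ends the current run; record it if it was too long.
            if run_start is not None and i - run_start > max_zero_run:
                violations.append(run_start)
            run_start = None
    # Handle a run of zeroes that extends to the end of the string.
    if run_start is not None and len(bits) - run_start > max_zero_run:
        violations.append(run_start)
    return violations


print(find_zero_run_violations("1001000101"))  # [4]
```

A channel that knows the constraint can treat each flagged position as a place where at least one of the zero bits must really be a one bit.
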
If there is enough missing data and the ECC cannot reconstruct it, then the drive would normally reread the sector in hopes of getting error-free data or data that can be corrected. (OK, a reread that results in good data for a sector is probably considered by most people to be a "soft error".)

If you write some data pattern into a sector and then use Read Long to read it, you may be able to see in the sector's data and ECC where the PRML has guessed wrong or where there is missing data. You should probably do several reads just to make sure the information you are seeing is stable (and not external noise randomly affecting the read channel). In normal disk drive manufacturing such a scheme might be used to evaluate a drive. Of course this is a really slow way to do this evaluation. There are other ways to do this that are faster (and very proprietary).

Next there is the ECC correction hardware...

This hardware is extremely complex. Most drives have 2 or 4 or more correctors running in parallel. The data+ECC bytes of a sector are split into "columns" and "rows", and each corrector works on its own column or row of the data. Each corrector may be able to fix 2, 3, 4, or more bad symbols. Usually a symbol corresponds to a byte of data or ECC. The ECC correction most likely uses the error offset and error span information from the PRML read channel. Frequently the recorded sector also includes a CRC computed over all the data and the ECC. This CRC can be used as a final check that the correction was done correctly. Some drives may, if able and if needed, run the correction sequence over the sector more than once. Like I said, today's ECC algorithms are very complex. (Another reason for this message: someone said in a message here that it was unlikely that a PRML read channel would provide information like the error offset and error span to the ECC.)

This brings us to the question of what is a "soft error"?
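
As an aside on the parallel correctors described above, here is a rough Python sketch of why splitting the data+ECC bytes among several correctors helps. It assumes a plain byte interleave (byte i goes to corrector i mod N), which is only one possible layout; the real row/column arrangement of any given drive is proprietary, and all names here are made up for the example. It also ignores the CRC and any repeated correction passes.

```python
def bad_symbols_per_corrector(burst_offset, burst_span, num_correctors):
    """Count how many bad bytes of one burst land in each interleave.

    Assumes byte i of the sector (data + ECC) belongs to corrector
    i % num_correctors.  burst_offset and burst_span stand in for the
    error offset and error span reported by the read channel.
    """
    counts = [0] * num_correctors
    for i in range(burst_offset, burst_offset + burst_span):
        counts[i % num_correctors] += 1
    return counts


def burst_is_correctable(burst_offset, burst_span, num_correctors, max_symbols):
    """True if no single corrector is asked to fix more than max_symbols bytes."""
    counts = bad_symbols_per_corrector(burst_offset, burst_span, num_correctors)
    return max(counts) <= max_symbols


# A 10-byte burst starting at byte 100, with 4 correctors that can each
# fix up to 3 bad symbols (hypothetical numbers).
print(bad_symbols_per_corrector(100, 10, 4))  # [3, 3, 2, 2]
print(burst_is_correctable(100, 10, 4, 3))    # True
print(burst_is_correctable(100, 13, 4, 3))    # False
```

Under these assumptions a burst of consecutive bad bytes is spread out so that each corrector sees only a fraction of it, which is also why a test needs to know the layout before it can say whether a given burst should be correctable.
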
When does a correction process for a sector become something more than a "soft error"? If the ECC must correct several bits in each sector, not because there is a media flaw but because the PRML read channel made bad guesses, is that a "soft error"? (The common answer is 'no'.)

Now back to testing a drive's ECC with R/W Long. As you can see, if you are going to use R/W Long to test a drive's ECC then you need to know a number of things about the drive, for example:

a) Does Read Long return the raw uncorrected output of the PRML read channel?
b) Does Read Long return all of the ECC data?
c) Does the ECC include a CRC, and is the CRC also returned?
d) And of course you need to understand what kind of error bursts the ECC can correct.

And then when you actually run your ECC test you need to find a sector that has no changing bits, that is, a sector that can be read with Read Long say 10 times with the same data coming back each time. You probably need at least one such sector in each zone of the drive (do we need a discussion of zones too?). This is a sector that might be OK to use for more complex ECC testing.

Let's say you want to test only correctable error conditions... Do you have the necessary information (I'm very sure it will be proprietary information) from the device design engineers to do such a test?

Let's say you want to test only uncorrectable error conditions... Do you know what it takes to produce a valid uncorrectable error? Yeah, I guess you could just corrupt all the data bytes when you do the Write Long.

Now finally... How does R/W Long help anyone determine a drive's "soft error" rate? And, if you are using R/W Long to test a drive's ECC implementation (in a customer environment), what are you really trying to determine? Does your ECC test software understand the ECC algorithms implemented by the drive you are trying to test? If not, how can you do a valid test?

(Let the fun begin...)

*** Hale Landis *** www.ata-atapi.com ***
