This message is from the T13 list server.
For all of these (and many more) reasons, desktop disk drive data sheets typically specify only the uncorrectable error rate, which is the same for all desktop drives (and has been for quite some time): 10 (block) errors in 10^15 bits read. Reading 10^15 bits takes about 73 days of continuous reading at a transfer rate of 20 MB/s. In practice the duty cycle for typical PC usage is a lot lighter (by several orders of magnitude), but that's what you would get if you actually were testing for this.

Note that this is NOT the rate at which you get bad data. It is the rate at which the drive tells you it cannot read the data. Sometimes retries at the system level, or changing the environmental conditions (e.g. cooling the drive), will change this and allow the data to be read. In practice a lot of these issues actually take place during the writing process and are only found out by reading (which is why some people verify writes immediately, especially for sensitive information). It's also a reason why people employ things like RAID technology.

To get to that level a lot of really interesting strategies are employed, many of them trade secrets. Many of our customers do test to this error rate specification, and sometimes want to know more about what is going on under the hood.

Note that the often cited "soft" or "correctable" error rates have no industry standardization, and cannot be independently verified at the ATA interface. The data you get back is always the same (correct) user data, the only difference being an error code the drive provides. So it's a pretty useless test of the drive (since only the drive can detect the problem to begin with).

The only other interesting error rate "specification" is the miscorrection error rate. At some point the ECC is theoretically capable of signaling good data when the user data is actually bad.
However, since this is normally at least 10 orders of magnitude rarer than an uncorrectable error (depending on the ECC employed), no one ever actually sees this in normal usage (the odds are probably better that the HDD will fail due to being struck by a meteor).

Jim

PS: Of course, all of these rates assume use within the specified limits. Try to run a drive at 90 degrees C (or submerged in salt water) and you will get a real mess a lot sooner.

-----Original Message-----
From: Hale Landis [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 18, 2002 3:56 PM
To: T13 List Server
Subject: [t13] R/W Long - technical with questions

This is a little long and fairly technical...

In an offline email discussion I was told R/W Long was needed by companies that buy disk drives because these companies don't trust the disk drive manufacturers and R/W Long is the only way these disk drive buyers have to determine "error rates". All that sounded a little strange to me. So I checked with my expert friends, people who actually implement disk drive hardware and firmware (I do implement drive firmware but mostly on the interface side of a device).

First let's talk about how a disk drive works...

Most drives have a PRML read channel. This channel will pump out a string of bits that is the best guess at what the analog read data for a sector might represent. A PRML read channel is able to make guesses at the data because of the data encoding. For example, a PRML read channel might know that the analog data cannot represent three zeroes in a row... One of those zero bits must be a one bit. No one I have talked to considers the decoding of the analog data by the PRML channel to be an error correction. In a high performance, high capacity drive the PRML channel may be making many guesses each time a sector is read. Some of these guesses may be wrong.
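
To make the "three zeroes in a row" example concrete, here is a minimal Python sketch of a run-length check on an already-decoded bit string. The function name `find_zero_runs` and the limit of two consecutive zeroes are illustrative assumptions taken from the example above, not any particular drive's code; a real channel works on analog samples and is far more involved.

```python
def find_zero_runs(bits, max_zeros=2):
    """Return (start, length) for every run of zeros longer than max_zeros.

    Hypothetical helper: `bits` is a string of '0'/'1' characters standing in
    for a decoded bit stream. max_zeros=2 mirrors the "no three zeroes in a
    row" example and is an assumption, not a real channel code parameter.
    """
    violations = []
    run_start = None
    for i, b in enumerate(bits + "1"):  # sentinel '1' closes a trailing run
        if b == "0":
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start > max_zeros:
                violations.append((run_start, i - run_start))
            run_start = None
    return violations


print(find_zero_runs("1001000101"))  # [(4, 3)]
```

A run flagged this way is a place where the encoding rules say at least one of the zero bits must really have been a one, which is the kind of knowledge that lets the channel guess.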
(One comment that caused me to generate this message was: When a PRML read channel must guess at the data decoding this is considered a "soft error".)

If you are lucky, a drive implementing Read Long will return to the host the bit string produced by the PRML channel. However, not all drives do that; some drives may apply some form of ECC correction to the data even when the data is sent to the host via Read Long.

A PRML read channel can also detect when there is "missing data", data that just can't be read at all. This can happen for a number of reasons, a physical flaw in the media being one. In these cases the PRML channel will normally have error offset and error span information that can be used by the ECC correction or by the firmware. If there is enough missing data and the ECC cannot reconstruct it, then the drive would normally reread the sector in hopes of getting error free data or data that can be corrected. (OK, a reread that results in good data for a sector is probably considered by most people to be a "soft error".)

If you write some data pattern into a sector and then use Read Long to read it, you may be able to see in the sector's data and ECC where the PRML has guessed wrong or where there is missing data. You should probably do several reads just to make sure the information you are seeing is stable (and not external noise randomly affecting the read channel). In normal disk drive manufacturing such a scheme might be used to evaluate a drive. Of course this is a really slow way to do this evaluation. There are other ways to do this that are faster (and very proprietary).

Next there is the ECC correction hardware...

This hardware is extremely complex. Most drives have 2 or 4 or more correctors running in parallel. The data+ECC bytes of the sector are split into "columns" and "rows". Each corrector works on its column or row of the data. Each corrector may be able to fix 2, 3, 4, or more bad symbols.
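
A rough illustration of why splitting the sector across several correctors helps with error bursts: the sketch below assumes a simple modulo interleave (symbol i goes to corrector i mod N) and a fixed per-corrector limit on bad symbols. The interleave scheme, the function names, and the numbers are all assumptions for illustration; actual drive layouts are proprietary and may differ.

```python
from collections import Counter


def errors_per_corrector(bad_positions, num_correctors):
    """Count bad symbols seen by each corrector under a modulo interleave (assumed layout)."""
    return Counter(pos % num_correctors for pos in bad_positions)


def burst_is_correctable(bad_positions, num_correctors, max_fix_per_corrector):
    """True if no corrector is asked to fix more symbols than it can handle."""
    counts = errors_per_corrector(bad_positions, num_correctors)
    return all(n <= max_fix_per_corrector for n in counts.values())


# A 12-symbol burst starting at symbol 100 of the sector (hypothetical numbers).
burst = range(100, 112)
print(burst_is_correctable(burst, num_correctors=4, max_fix_per_corrector=3))  # True
print(burst_is_correctable(burst, num_correctors=1, max_fix_per_corrector=3))  # False
```

With four interleaved correctors the 12-symbol burst lands as 3 bad symbols on each, which is within the assumed limit; a single corrector with the same limit would see all 12.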
Usually a symbol corresponds to a byte of data or ECC. The ECC correction most likely uses the error offset and error span information from the PRML read channel. Frequently the recorded sector also includes a CRC computed over all the data and the ECC. This CRC can be used as a final check that the correction was done correctly. Some drives may, if able and if needed, run the correction sequence over the sector more than once. Like I said, today's ECC algorithms are very complex.

(Another reason for this message: Someone said in a message here that it was unlikely that a PRML read channel would provide information like the error offset and error span to the ECC.)

This brings us to the question of what is a "soft error"? When does a correction process for a sector become something more than a "soft error"? If the ECC must correct several bits in each sector, not because there is a media flaw but because the PRML read channel made bad guesses, is that a "soft error"? (The common answer is 'no'.)

Now back to testing a drive's ECC with R/W Long. As you can see, if you are going to use R/W Long to test a drive's ECC then you need to know a number of things about the drive, for example:

a) Does Read Long return the raw uncorrected output of the PRML read channel?
b) Does Read Long return all of the ECC data?
c) Does the ECC include a CRC, and is the CRC also returned?
d) And of course you need to understand what kind of error bursts the ECC can correct.

And then when you actually run your ECC test you need to find a sector (and that probably needs to be a sector in at least each zone of the drive; do we also need a discussion of zones too?) that has no changing bits, that is, a sector that can be read with Read Long, say, 10 times with the same data coming back each time. This is a sector that might be OK to use for more complex ECC testing.

Let's say you want to test only correctable error conditions...
Do you have the necessary information (I'm very sure it will be proprietary information) from the device design engineers to do such a test?

Let's say you want to test only uncorrectable error conditions... Do you know what it takes to produce a valid uncorrectable error? Yeah, I guess you could just corrupt all the data bytes when you do the Write Long.

Now finally... How does R/W Long help anyone determine a drive's "soft error" rate? And, if you are using R/W Long to test a drive's ECC implementation (in a customer environment), what are you really trying to determine? Does your ECC test software understand the ECC algorithms implemented by the drive you are trying to test? If not, how can you do a valid test?

(Let the fun begin...)

*** Hale Landis *** www.ata-atapi.com ***
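
The "read the sector with Read Long, say, 10 times and compare" step from the message can be sketched in Python as follows. This is only an outline under stated assumptions: `read_long` is a hypothetical callable (LBA in, bytes out) that the caller would have to supply, the fake readers and the 516-byte length are made up for the demo, and nothing here issues a real ATA command.

```python
import itertools


def is_stable_sector(read_long, lba, reads=10):
    """Return True if `reads` consecutive Read Long results for `lba` are identical.

    `read_long` is a hypothetical callable (lba -> bytes) supplied by the
    caller; this sketch does not talk to any real drive.
    """
    first = read_long(lba)
    return all(read_long(lba) == first for _ in range(reads - 1))


# Fake readers standing in for a drive (assumptions for the demo only).
def stable_reader(lba):
    return bytes([lba & 0xFF]) * 516  # made-up length: 512 data + 4 "ECC" bytes


_read_counter = itertools.count()


def noisy_reader(lba):
    # Flips the low bit of the first byte on every other read to mimic
    # a sector whose bits change between reads.
    data = bytearray(stable_reader(lba))
    data[0] ^= next(_read_counter) & 1
    return bytes(data)


print(is_stable_sector(stable_reader, 7))  # True
print(is_stable_sector(noisy_reader, 7))   # False
```

A sector that passes this kind of check is only a candidate for further ECC testing; whether the returned bytes are raw channel output or already corrected still depends on the drive, as questions a) through c) in the message point out.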
