This message is from the T13 list server.
Harlan,

The difficulty is that ECC algorithms today are some of the best simulated and tested portions of a drive. They also have very well described mathematical properties that make modeling them straightforward. By definition they take in a digital input stream, after the pre-amp/read channel has processed the raw analog stream.

You're correct that the incoming distribution of possible error sequences affects the correction capabilities of the ECC in practice. However, the assumptions used to design the ECC are generally quite conservative, and any input stream that can be imagined can easily be checked by just applying the algorithm to it. Indeed, I know that these sorts of questions occasionally come up in design reviews with customers. And the correction codes themselves have the usual specifications for the simpler error sequences (e.g. lengths of 1 or 2 bursts of errors), often backed up with mathematical proof in addition to testing.

I think Hale and I (and I believe you as well) simply propose that of all the elements of drive design today, one of the least prone to failure is the ECC itself, if for no other reason than that its digital and mathematical nature allows for much more reliable design, testing, and simulation than the more analog elements of the drive. We do a good job with analog design as well, but the job is a lot more complicated than in the digital domain.

And the normal uses people put the READ/WRITE LONG commands to (i.e. corrupting bits and seeing if the ECC works) directly check the digital ECC, not really the analog elements of the drive. The commands are just not a very good tool for that purpose. For companies interested in quality products it's usually much more important to review and understand the manufacturing process (as you pointed out). Indeed, many customers do exactly that - and go a step further by qualifying the suppliers to those processes as well.
Jim

PS: Once again, all ATA did was make the command obsolete, so if customers continue to generate enough demand, suppliers can still provide it. All T13 did was make this a topic of discussion between the supplier and the customer, not a matter of standards compliance.

-----Original Message-----
From: Harlan Andrews [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 19, 2002 11:15 AM
To: ata reflector
Subject: [t13] Re: R/W (Long) Commands

On 3/18/02, Jim McGrath and Hale Landis both sent long eMails explaining the theory of how ECC works on hard disk drives. The eMails were very well written and explain the "theory" quite well. However, what was not detailed is that "theory" is not always correctly implemented. That is why we need testing.

The "theory" of ECC correction depends upon the bit error rate of the "Head/Media/Channel". The predictions about the probability of mis-correction depend very much upon the raw error rate of the data being corrected. Drive suppliers have been VERY reluctant to allow the raw error rate of a drive to be measured by their customers. The "theory" is that the customers don't "need to know".

However, the manufacturing divisions of the drive companies are under constant pressure to improve yield, reduce test time and reduce cost. To achieve these ends, the temptation is to push testing back to the component suppliers and to allow looser tolerances on components. Sometimes the design groups do not realize when the actual raw error rate has increased.

The other problem is that there can be environmental contributions to the error rate. For example, if the drive is not well shielded, then EMI can increase the error rate. If the servo is not "stiff" enough, vibration can affect the error rate. Thus, even if the manufacturer "knows" the raw error rate, they may not realize when that rate increases the probability of mis-correction to an unacceptable level.
My R/W Long tests do not attempt to measure the raw error rate. They do not attempt to verify the ECC algorithms scientifically. They only attempt to find out whether there is any margin in the probability of mis-correction. These tests were not developed because of "theory". They were developed because of REAL problems which went undetected by the drive manufacturers and resulted in REAL data corruption.

As media capacity increases, more and more demands are placed upon Error Correcting Codes. We should ALL be working to improve quality, since NO ONE wants data corruption. Please do not take away what limited testing is available at the end user level. We should actually be looking for ways to IMPROVE the "in system" testability.

...Harlan
