This message is from the T13 list server.


Harlan,

The difficulty is that ECC algorithms today are some of the best simulated
and tested portions of a drive.  They also have very well described
mathematical properties that make modeling them straightforward.  By
definition they take in a digital input stream, after the pre-amp/read
channel has processed the raw analog stream.

You're correct that the incoming distribution of possible error sequences
affects the correction capabilities of the ECC in practice.  However, the
assumptions used to design the ECC are generally quite conservative, and any
input stream that can be imagined can be easily checked by just applying the
algorithm.  Indeed, I know that these sorts of questions occasionally come
up in design reviews with customers.  And the correction codes themselves
have the usual specifications for the simpler error sequences (e.g. lengths
of 1 or 2 bursts of errors), often backed up with mathematical proof in
addition to testing.

I think Hale and I (and I believe you as well) simply propose that of all
the elements of drive design today, one of the least prone to failure is the
ECC itself.  If for no other reason than its digital and mathematical nature
allows for much more reliable design, testing, and simulation than the more
analog elements of the drive.

We do a good job with analog design as well, but the job is a lot more
complicated than in the digital domain.  And the normal uses people put the
READ/WRITE LONG commands to (i.e. corrupting bits and seeing if the ECC
works) are directly checking the digital ECC, not really the analog elements
of the drive.  It's just not a very good tool for that purpose.
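
As a toy illustration of the corrupt-and-check style of test described above,
here is a minimal Python sketch using a Hamming(7,4) code.  That code is an
assumption chosen for brevity, not what any drive uses (real drive ECC is
considerably more powerful), and the names encode, decode, and flip are
hypothetical.  The point is only that such a test exercises the digital code:
one flipped bit is corrected, while two flipped bits are silently
mis-corrected.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming(7,4) codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(word):
    """Apply single-error correction and return the 4 data bits."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the suspected bit
    if syndrome:
        w[syndrome - 1] ^= 1  # flip the bit the syndrome points at
    return [w[2], w[4], w[5], w[6]]


def flip(word, positions):
    """Return a copy of word with the bits at the given 0-based indexes flipped."""
    w = list(word)
    for p in positions:
        w[p] ^= 1
    return w


data = [1, 0, 1, 1]
codeword = encode(data)
print(decode(flip(codeword, [2])) == data)     # one corrupted bit
print(decode(flip(codeword, [2, 5])) == data)  # two corrupted bits
```

In this toy code every double-bit error decodes to the wrong data without any
indication of failure, which is the mis-correction case under discussion.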

For companies interested in quality products it's usually much more
important to review and understand the manufacturing process (as you pointed
out).  Indeed, many customers do exactly that - and go a step further in
qualifying the suppliers to those processes as well.

Jim

PS once again, all ATA did was make the command obsolete, so if customers
continue to generate enough demand people can supply it.  All T13 did was to
make this a topic of discussion between the supplier and the customer, not a
matter for standards compliance.


-----Original Message-----
From: Harlan Andrews [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 19, 2002 11:15 AM
To: ata reflector
Subject: [t13] Re: R/W (Long) Commands



On 3/18/02, Jim McGrath and Hale Landis both sent long eMails explaining 
theory about how ECC works on hard disk drives.

The eMails were very well written and explain the "theory" quite well.  
However, what was not detailed is that "theory" is not always correctly 
implemented.  That is why we need testing.  

The "theory" of ECC correction depends upon the bit error rate of the 
"Head/Media/Channel".  The predictions about the probability of 
Mis-Correction depend very much upon the raw error rate of the data being 
corrected.
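
To make the dependence on raw error rate concrete, here is a rough Python
sketch under simplifying assumptions: symbol errors are independent with
probability p, and the decoder always corrects t or fewer symbol errors in a
block of n symbols.  The values of n, t, and p are hypothetical and not taken
from any drive.  In this model a mis-correction can only happen when more
than t symbols are in error, so the tail probability below is an upper bound
on the probability of mis-correction.

```python
from math import comb


def prob_exceeds_capability(n, t, p):
    """P(more than t of n symbols are in error), assuming independent errors."""
    # Sum the binomial tail directly to avoid losing precision in 1 - P(ok).
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(t + 1, n + 1)
    )


# Hypothetical block: 255 symbols, up to 8 correctable symbol errors.
for raw_rate in (1e-4, 1e-3, 1e-2):
    print(raw_rate, prob_exceeds_capability(255, 8, raw_rate))
```

Even in this idealized model the bound rises very steeply as the raw rate
rises, which is why an unnoticed increase in raw error rate matters.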

Drive suppliers have been VERY reluctant to allow the raw error rate of a 
drive to be measured by their customers.  The "theory" is that the 
customers don't "need to know".   However, the manufacturing divisions of 
the drive companies are under constant pressure to improve yield, reduce 
test time and reduce cost.   To achieve these ends the temptation is to 
push testing back to the component suppliers and to allow looser 
tolerances on components.   Sometimes the design groups do not realize 
when the actual raw error rate has been increased.

The other problem is that there can be environmental contributions to the 
error rate.  For example, if the drive is not well shielded then EMI can 
increase error rate.  If the servo is not "stiff" enough, vibration can 
affect error rate.   Thus, even if the manufacturer "knows" the raw error 
rate, they may not realize when that rate increases the probability of 
mis-correction to an unacceptable level.

My R/W Long tests do not attempt to measure the raw error rate.   They do 
not attempt to verify scientifically the ECC algorithms.   They only 
attempt to find out if there is any margin in the probability of 
mis-correction.

These tests were not developed because of "theory".   They were developed 
because of REAL problems which were undetected by the drive manufacturers 
and resulted in REAL data corruption.   

As the media capacity increases, more and more demands are placed upon 
Error Correcting Codes.   We should ALL be working to improve quality 
since NO ONE wants data corruption.  Please do not take away what limited 
testing is available at the end user level.  We should actually be 
looking for ways to IMPROVE the "in system" testability.

...Harlan


