This message is from the T13 list server.


For all of these (and many more) reasons desktop disk drive data sheets
typically only specify the uncorrectable error rate, which is the same for
all desktop drives (and has been for quite some time): 10 (block) errors in
10^15 bits read.  Reading 10^15 bits takes a little over 72 days of
continuous reading at a transfer rate of 20 MB/s.  In practice the duty
cycle for typical PC usage
is a lot lighter (by several orders of magnitude), but that's what you would
get if you actually were testing for this.
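
A quick sanity check on that arithmetic, as a minimal Python sketch.  It
assumes 1 MB = 10^6 bytes, 8 bits per byte, and a drive that never idles;
the variable names are illustrative only and come from no data sheet.

```python
# Time to read 10^15 bits at a sustained 20 MB/s.
# Assumptions (not from any data sheet): 1 MB = 10**6 bytes,
# 8 bits per byte, and the drive is reading 100% of the time.
BITS_READ = 10**15
TRANSFER_BYTES_PER_S = 20 * 10**6
SECONDS_PER_DAY = 86_400

seconds = BITS_READ / (TRANSFER_BYTES_PER_S * 8)
days = seconds / SECONDS_PER_DAY

# Expected number of unreadable blocks in that window at the quoted
# rate of 10 (block) errors per 10^15 bits read.
ERRORS_PER_1E15_BITS = 10
expected_errors = ERRORS_PER_1E15_BITS * BITS_READ / 10**15

print(f"{days:.1f} days, about {expected_errors:.0f} uncorrectable blocks expected")
```

That works out to a bit over 72 days of non-stop reading, in line with
the figure quoted above.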

Note that this is NOT the rate at which you get bad data.  It is the rate at
which the drive tells you it cannot read the data.  Sometimes retries at the
system level, or changing the environmental conditions (e.g. cooling the
drive) will change this and allow the data to be read.  In practice a lot of
these issues actually take place during the writing process, and are just
being found out by reading (which is why some people verify writes
immediately, especially for sensitive information).  It's also a reason why
people employ things like RAID technology.
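
The "verify writes immediately" idea can be sketched roughly as follows.
This is only an illustration: `write_and_verify`, `SECTOR_SIZE` and the
in-memory `fake_drive` are hypothetical names, and a real tool would also
have to make sure the read-back comes from the media and not from a cache.

```python
import io

SECTOR_SIZE = 512  # bytes; an assumption for this sketch


def write_and_verify(dev, lba, data):
    """Write one sector, read it back, and report whether it matches.

    `dev` is any seekable binary file-like object standing in for a
    drive (hypothetical; a real tool would also have to bypass caches).
    """
    if len(data) != SECTOR_SIZE:
        raise ValueError("data must be exactly one sector")
    dev.seek(lba * SECTOR_SIZE)
    dev.write(data)
    dev.seek(lba * SECTOR_SIZE)
    return dev.read(SECTOR_SIZE) == data


# Demo against an in-memory "drive" of 8 zero-filled sectors.
fake_drive = io.BytesIO(bytes(8 * SECTOR_SIZE))
ok = write_and_verify(fake_drive, 3, b"\xa5" * SECTOR_SIZE)
print("verified" if ok else "mismatch")
```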

To get to that level a lot of really interesting strategies are employed,
many of them trade secrets.  Many of our customers do test to this error
rate specification, and sometimes want to know more about what is going on
under the hood.  

Note that the often cited "soft" or "correctable" error rates have no
industry standardization, and cannot be independently verified at the ATA
interface.  The data you get back is always the same (correct) user data,
the only difference being an error code the drive provides.  So it's a
pretty useless test of the drive (since only the drive can detect the
problem to begin with).

The only other interesting error rate "specification" is the miscorrection
error rate.  At some point the ECC is theoretically capable of signaling
good data when the user data is actually bad.  However, since this is
normally at least 10 orders of magnitude rarer than an uncorrectable error
(depending on the ECC employed), no one actually ever sees this in normal
usage (the odds are probably better that the HDD will fail due to being
struck by a meteor).
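
To put "10 orders of magnitude rarer" in perspective, here is a rough
back-of-the-envelope sketch.  The assumptions are loud ones: exactly 10
orders of magnitude, the 10-in-10^15 uncorrectable rate quoted earlier,
continuous reading at 20 MB/s, and 1 MB = 10^6 bytes.

```python
# Rough time between miscorrections under the assumptions stated above.
bits_per_uncorrectable = 10**15 // 10                     # 10^14 bits per error
bits_per_miscorrection = bits_per_uncorrectable * 10**10  # 10^24 bits
bits_per_second = 20 * 10**6 * 8                          # 20 MB/s sustained
seconds_per_year = 365.25 * 86_400

years = bits_per_miscorrection / bits_per_second / seconds_per_year
print(f"about {years:.1e} years of continuous reading per miscorrection")
```

Under those assumptions the answer is on the order of a couple of hundred
million years of non-stop reading.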

Jim

PS of course, all of these rates assume use within the specified limits.
Try and run a drive at 90 degrees C (or submerged in salt water) and you
will get a real mess a lot sooner.



-----Original Message-----
From: Hale Landis [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 18, 2002 3:56 PM
To: T13 List Server
Subject: [t13] R/W Long - technical with questions



This is a little long and fairly technical...

In an offline email discussion I was told R/W Long was needed by
companies that buy disk drives because these companies don't
trust the disk drive manufacturers and R/W Long is the only way
these disk drive buyers have to determine "error rates".  All
that sounded a little strange to me.  So I checked with my expert
friends, people that actually implement disk drive hardware and
firmware (I do implement drive firmware but mostly on the
interface side of a device).

First let's talk about how a disk drive works...

Most drives have a PRML read channel.  This channel will pump out
a string of bits that is the best guess at what the analog read
data for a sector might represent.  A PRML read channel is able
to make guesses at the data because of the data encoding.  For
example, a PRML read channel might know that the analog data can
not represent three zeroes in a row...  One of those zero bits
must be a one bit.  No one I have talked to considers the
decoding of the analog data by the PRML channel to be error
correction.  In a high-performance, high-capacity drive the PRML
channel may be making many guesses each time a sector is read.
Some of these guesses may be wrong.
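
The "no three zeroes in a row" example can be illustrated with a toy
check like the one below.  This is only a sketch of the idea of a code
constraint: a real PRML channel works on analog samples with a sequence
(Viterbi-style) detector, and the function name and the limit of two
zeroes are assumptions made for the illustration.

```python
def find_zero_run_violations(bits, max_zero_run=2):
    """Return (start, length) for every run of zeros longer than allowed.

    `bits` is a string of '0'/'1' characters standing in for the
    channel's detected bit stream (a toy model, not real PRML output).
    """
    violations = []
    run_start = None
    for i, bit in enumerate(bits + "1"):  # sentinel '1' closes a trailing run
        if bit == "0":
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start > max_zero_run:
                violations.append((run_start, i - run_start))
            run_start = None
    return violations


# The run of three zeros starting at index 4 breaks the constraint, so
# at least one of those bits must have been detected wrongly.
print(find_zero_run_violations("1001000101"))  # [(4, 3)]
```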

(One comment that caused me to generate this message was:  When a
PRML read channel must guess at the data decoding this is
considered a "soft error".)

If you are lucky a drive implementing Read Long will return to
the host the bit string produced by the PRML channel.  However
not all drives do that, some drives may apply some form of ECC
correction to the data even when the data is sent to the host via
Read Long.

A PRML read channel can also detect when there is "missing data",
data that just can't be read at all.  This can happen for a
number of reasons:  a physical flaw in the media being one.  In
these cases the PRML channel will normally have error offset and
error span information that can be used by the ECC correction or
by the firmware.  If there is enough missing data and the ECC can
not reconstruct it then the drive would normally reread the
sector in hopes of getting error free data or data that can be
corrected.

(OK, a reread that results in good data for a sector is probably
considered by most people to be a "soft error".)

If you write some data pattern into a sector and then use Read
Long to read it you may be able to see in the sector's data and
ECC where the PRML has guessed wrong or where there is missing
data.  You should probably do several reads just to make sure the
information you are seeing is stable (and not external noise
randomly affecting the read channel).  In normal disk drive
manufacturing such a scheme might be used to evaluate a drive.
Of course this is a really slow way to do this evaluation.  There
are other ways to do this that are faster (and very proprietary).

Next there is the ECC correction hardware...  This hardware is
extremely complex.  Most drives have 2 or 4 or more correctors
running in parallel.  The data+ECC bytes of the sectors are split
into "columns" and "rows".  Each corrector works on its column or
row of the data.  Each corrector may be able to fix 2, 3, 4, or
more, bad symbols.  Usually a symbol corresponds to a byte of
data or ECC.  The ECC correction most likely uses the error
offset and error span information from the PRML read channel.
Frequently the recorded sector also includes a CRC computed over
all the data and the ECC.  This CRC can be used as a final check
that the correction was done correctly.  Some drives may, if able
and if needed, run the correction sequence over the sector more
than once.  Like I said, today's ECC algorithms are very complex.
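
To show why splitting a sector across several correctors helps, here is
a minimal sketch.  It assumes the simplest possible layout, where symbol
i belongs to interleave i mod N; real drives use proprietary layouts, and
the function names and the numbers below are illustrative only.

```python
def symbols_hit_per_interleave(burst_start, burst_len, interleaves):
    """Count corrupted symbols that land in each interleave.

    Assumes symbol i belongs to interleave i % interleaves, which is a
    simplification; actual drive layouts are proprietary.
    """
    hits = [0] * interleaves
    for i in range(burst_start, burst_start + burst_len):
        hits[i % interleaves] += 1
    return hits


def burst_correctable(burst_start, burst_len, interleaves, t_per_corrector):
    """True if no single corrector sees more bad symbols than it can fix."""
    hits = symbols_hit_per_interleave(burst_start, burst_len, interleaves)
    return max(hits) <= t_per_corrector


# 4 interleaves, each able to fix 3 symbols (illustrative numbers only).
print(symbols_hit_per_interleave(100, 10, 4))  # [3, 3, 2, 2]
print(burst_correctable(100, 12, 4, 3))        # True
print(burst_correctable(100, 13, 4, 3))        # False
```

With these made-up numbers a contiguous burst of 12 bad symbols is
spread so that each corrector sees only 3, while a 13th bad symbol
pushes one corrector past its limit.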

(Another reason for this message:  Someone said in a message here
that it was unlikely that a PRML read channel would provide
information like the error offset and error span to the ECC.)

This brings us to the question of what is a "soft error"?  When
does a correction process for a sector become something more than
a "soft error"?  If the ECC must correct several bits in each
sector, not because there is a media flaw but because the PRML
read channel made bad guesses, is that a "soft error"?

(The common answer is 'no'.)

Now back to testing a drive's ECC with R/W Long.  As you can see,
if you are going to use R/W Long to test a drive's ECC then you
need to know a number of things about the drive, for example:  a)
does Read Long return the raw uncorrected output of the PRML read
channel? b) does Read Long return all of the ECC data? c) does the
ECC include a CRC and is the CRC also returned? d) And of course
you need to understand what kind of error bursts the ECC can
correct.

And then when you actually run your ECC test you need to find a
sector, and that probably needs to be a sector in at least each
zone of the drive (do we also need a discussion of zones too?),
that has no changing bits, that is a sector that can be read with
Read Long say 10 times and you get the same data back each time.
This is a sector that might be OK to use for more complex ECC
testing.
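
The "read it 10 times and compare" step might look roughly like this.
The `read_long` callable is hypothetical: it stands in for whatever
mechanism issues Read Long to the drive and returns the raw bytes.  The
fake reader and its 516-byte buffers are made up purely so the sketch
can run on its own.

```python
import itertools


def is_stable_sector(read_long, lba, reads=10):
    """True if `reads` consecutive Read Long results for `lba` are identical.

    `read_long` is a hypothetical callable taking an LBA and returning
    the bytes the drive sent back.
    """
    first = read_long(lba)
    return all(read_long(lba) == first for _ in range(reads - 1))


def make_fake_reader():
    """Build a stand-in reader: sector 7 has a bit that keeps changing."""
    counter = itertools.count()

    def read_long(lba):
        if lba == 7:
            # First byte alternates between 0 and 1 on successive reads.
            return bytes([next(counter) % 2]) + bytes(515)
        return bytes([lba % 256]) * 516  # arbitrary but repeatable content

    return read_long


reader = make_fake_reader()
print(is_stable_sector(reader, 3))  # True
print(is_stable_sector(reader, 7))  # False
```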

Let's say you want to test only correctable error conditions...
Do you have the necessary information (I'm very sure it will be
proprietary information) from the device design engineers to do
such a test?

Let's say you want to test only uncorrectable error conditions...
Do you know what it takes to produce a valid uncorrectable error?
Yea, I guess you could just corrupt all the data bytes when you
do the Write Long.

Now finally...

Now how does R/W Long help anyone determine a drive's "soft
error" rate?

And, if you are using R/W Long to test a drive's ECC
implementation (in a customer environment) what are you really
trying to determine?  Does your ECC test software understand the
ECC algorithms implemented by the drive you are trying to test?
If not, how can you do a valid test?

(Let the fun begin...)



*** Hale Landis *** www.ata-atapi.com ***

