This message is from the T13 list server.

This is a little long and fairly technical...

In an offline email discussion I was told R/W Long was needed by
companies that buy disk drives because these companies don't
trust the disk drive manufacturers and R/W Long is the only way
these disk drive buyers have to determine "error rates".  All
that sounded a little strange to me.  So I checked with my expert
friends, people that actually implement disk drive hardware and
firmware (I do implement drive firmware but mostly on the
interface side of a device).

First, let's talk about how a disk drive works...

Most drives have a PRML read channel.  This channel will pump out
a string of bits that is the best guess at what the analog read
data for a sector might represent.  A PRML read channel is able
to make guesses at the data because of the data encoding.  For
example, a PRML read channel might know that the analog data can
not represent three zeroes in a row...  One of those zero bits
must be a one bit.  No one I have talked to considers the
decoding of the analog data by the PRML channel to be an error
correction.  In a high performance, high capacity drive the PRML
channel may be making many guesses each time a sector is read.
Some of these guesses may be wrong.
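
As a toy illustration of the kind of encoding constraint I mean,
here is a small Python sketch.  The limit of two zeroes in a row
is only the example from the paragraph above, not the code used
by any real read channel, and the function names are made up for
this message.

```python
def longest_zero_run(bits: str) -> int:
    """Return the length of the longest run of '0' characters in bits."""
    longest = current = 0
    for b in bits:
        # Extend the current run on a zero, reset it on a one.
        current = current + 1 if b == "0" else 0
        longest = max(longest, current)
    return longest


def violates_constraint(bits: str, max_zero_run: int = 2) -> bool:
    """True if bits holds more consecutive zeroes than the code allows."""
    return longest_zero_run(bits) > max_zero_run
```

A decoded string that violates the constraint tells the channel
that at least one bit in the offending run was detected wrong,
which is what lets it make a guess.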

(One comment that caused me to generate this message was:  When a
PRML read channel must guess at the data decoding this is
considered a "soft error".)

If you are lucky, a drive implementing Read Long will return to
the host the bit string produced by the PRML channel.  However,
not all drives do that; some drives may apply some form of ECC
correction to the data even when the data is sent to the host via
Read Long.

A PRML read channel can also detect when there is "missing data",
data that just can't be read at all.  This can happen for a
number of reasons:  a physical flaw in the media being one.  In
these cases the PRML channel will normally have error offset and
error span information that can be used by the ECC correction or
by the firmware.  If there is enough missing data and the ECC can
not reconstruct it then the drive would normally reread the
sector in hopes of getting error free data or data that can be
corrected.

(OK, a reread that results in good data for a sector is probably
considered by most people to be a "soft error".)

If you write some data pattern into a sector and then use Read
Long to read it you may be able to see in the sector's data and
ECC where the PRML has guessed wrong or where there is missing
data.  You should probably do several reads just to make sure the
information you are seeing is stable (and not external noise
randomly affecting the read channel).  In normal disk drive
manufacturing such a scheme might be used to evaluate a drive.
Of course this is a really slow way to do this evaluation.  There
are other ways to do this that are faster (and very proprietary).
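
For what it is worth, the host side of such a comparison is
simple.  The sketch below (Python, with a made-up function name)
lists the bit offsets where two buffers differ.  It assumes you
already have both buffers in memory.  How you get them from the
drive is up to your own tools, and only the data bytes can be
compared against the pattern you wrote unless you also know how
the drive computes its ECC bytes.

```python
def differing_bits(expected: bytes, actual: bytes) -> list[int]:
    """Return bit offsets (MSB of byte 0 is bit 0) where the buffers differ."""
    if len(expected) != len(actual):
        raise ValueError("buffers must be the same length")
    offsets = []
    for i, (e, a) in enumerate(zip(expected, actual)):
        diff = e ^ a  # set bits mark the positions that differ
        for bit in range(8):
            if diff & (0x80 >> bit):
                offsets.append(i * 8 + bit)
    return offsets
```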

Next there is the ECC correction hardware...  This hardware is
extremely complex.  Most drives have 2 or 4 or more correctors
running in parallel.  The data+ECC bytes of the sectors are split
into "columns" and "rows".  Each corrector works on its column or
row of the data.  Each corrector may be able to fix 2, 3, 4, or
more bad symbols.  Usually a symbol corresponds to a byte of
data or ECC.  The ECC correction most likely uses the error
offset and error span information from the PRML read channel.
Frequently the recorded sector also includes a CRC computed over
all the data and the ECC.  This CRC can be used as a final check
that the correction was done correctly.  Some drives may, if able
and if needed, run the correction sequence over the sector more
than once.  Like I said, today's ECC algorithms are very complex.
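
To show why the split matters, here is a deliberately simplified
Python sketch.  It assumes the symbols are dealt out round-robin
to n_interleaves correctors and that each corrector can fix at
most t bad symbols.  Both the layout and the default numbers are
my assumptions for illustration only; the real arrangement in
any given drive is proprietary and is likely more involved.

```python
from collections import Counter


def is_correctable(bad_symbols, n_interleaves: int = 4, t: int = 3) -> bool:
    """True if no single corrector sees more than t bad symbols.

    bad_symbols holds symbol offsets within the data+ECC bytes.
    Symbol k is assumed to belong to corrector k % n_interleaves.
    """
    per_corrector = Counter(off % n_interleaves for off in set(bad_symbols))
    return all(count <= t for count in per_corrector.values())
```

Under these assumptions a burst of 12 consecutive bad symbols is
spread evenly, 3 per corrector, and can be fixed, while just 4
bad symbols that all land on the same corrector can not.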

(Another reason for this message:  Someone said in a message here
that it was unlikely that a PRML read channel would provide
information like the error offset and error span to the ECC.)

This brings us to the question of what is a "soft error"?  When
does a correction process for a sector become something more than
a "soft error"?  If the ECC must correct several bits in each
sector, not because there is a media flaw but because the PRML
read channel made bad guesses, is that a "soft error"?

(The common answer is 'no'.)

Now back to testing a drive's ECC with R/W Long.  As you can see,
if you are going to use R/W Long to test a drive's ECC then you
need to know a number of things about the drive, for example:  a)
does Read Long return the raw uncorrected output of the PRML read
channel? b) does Read Long return all of the ECC data? c) does
the ECC include a CRC and is the CRC also returned? d) And of
course you need to understand what kind of error bursts the ECC
can correct.

And then when you actually run your ECC test you need to find a
sector that has no changing bits, that is, a sector that can be
read with Read Long say 10 times and returns the same data each
time.  You probably need at least one such sector in each zone
of the drive (do we also need a discussion of zones too?).  This
is a sector that might be OK to use for more complex ECC
testing.
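
The search for such a sector might look something like the
following Python sketch.  The read_long argument is a
hypothetical callable you would supply yourself (it takes an LBA
and returns the bytes from one Read Long); nothing here is a real
driver API.  The candidate LBAs would be ones you picked from the
zone you are testing.

```python
from typing import Callable, Iterable, Optional


def is_stable(read_long: Callable[[int], bytes], lba: int,
              reads: int = 10) -> bool:
    """True if `reads` consecutive Read Long results are identical."""
    first = read_long(lba)
    return all(read_long(lba) == first for _ in range(reads - 1))


def first_stable_sector(read_long: Callable[[int], bytes],
                        candidates: Iterable[int],
                        reads: int = 10) -> Optional[int]:
    """Return the first candidate LBA that reads back stable, else None."""
    for lba in candidates:
        if is_stable(read_long, lba, reads):
            return lba
    return None
```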

Let's say you want to test only correctable error conditions...
Do you have the necessary information (I'm very sure it will be
proprietary information) from the device design engineers to do
such a test?

Let's say you want to test only uncorrectable error conditions...
Do you know what it takes to produce a valid uncorrectable error?
Yeah, I guess you could just corrupt all the data bytes when you
do the Write Long.
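
Corrupting the buffer itself is the easy part.  A Python sketch
(the function name is made up) that flips a burst of consecutive
bits in a data+ECC buffer before it is handed to Write Long:

```python
def inject_burst(sector: bytes, start_bit: int, length_bits: int) -> bytes:
    """Return a copy of sector with length_bits consecutive bits flipped.

    Bit 0 is the most significant bit of byte 0.
    """
    if start_bit < 0 or length_bits < 0:
        raise ValueError("start and length must be non-negative")
    if start_bit + length_bits > len(sector) * 8:
        raise ValueError("burst falls outside the buffer")
    out = bytearray(sector)
    for bit in range(start_bit, start_bit + length_bits):
        out[bit // 8] ^= 0x80 >> (bit % 8)
    return bytes(out)
```

The hard part is knowing which burst lengths and positions the
drive's ECC is supposed to correct, and this code can not tell
you that.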

Now finally...

Now how does R/W Long help anyone determine a drive's "soft
error" rate?

And, if you are using R/W Long to test a drive's ECC
implementation (in a customer environment) what are you really
trying to determine?  Does your ECC test software understand the
ECC algorithms implemented by the drive you are trying to test?
If not, how can you do a valid test?

(Let the fun begin...)



*** Hale Landis *** www.ata-atapi.com ***


