[t13] SMART Selective Self-test

Bruce Allen Tue, 06 Apr 2004 17:42:57 -0700

This message is from the T13 list server.


Dear T13,

Below are some comments concerning the 'Selective Self-Test' part of
ATA-7 Rev 4.  James Hatfield suggested that I circulate these to the
T13 reflector.

Context: I recently implemented Selective Self-Test features in an
open-source Linux/Solaris/FreeBSD/NetBSD/Windows SMART toolset that I
maintain. While working on this, I came across a number of points
where I thought that the Selective Self-test spec needed
clarification.

I realize that it's too late to incorporate any of these suggestions
into ATA-7, but wanted to circulate them for consideration for ATA-8.

Cheers,
        Bruce Allen

-------------------------------------------------------------------

6.53.6.8.5.2 Test span definition
           The Selective self-test log provides for the definition of
           up to five test spans. The starting LBA for each test span
           is the LBA of the first sector tested in the test span and
           the ending LBA for each test span is the last LBA tested in
           the test span. If the starting and ending LBA values for a
           test span are both zero, a test span is not defined and not
           tested. These values shall be written by the host and
           returned unmodified by the device.

* The spec should say if the LBA needs to be <= MAX LBA.  If not, what
  happens?

For example: Any LBA values within a test span which are greater than
the maximum value are not tested.

* The spec should also say what happens if the spans overlap.

Either: Each test span will be tested completely, even if the spans
overlap (I favor this)

Or alternatively: If the test spans overlap, then LBAs that are common
to the test spans may only be tested a single time.

Or alternatively, the spec could forbid overlapping spans.

* The spec should say what happens if START > END.
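
Until the spec settles these three points, a host tool can check for them
before writing the log. The following is only a minimal sketch: the
function name, the (start, end) tuple representation and the max_lba
parameter are illustrative assumptions, not anything defined by ATA-7.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (starting LBA, ending LBA), both inclusive


def check_spans(spans: List[Span], max_lba: int) -> List[str]:
    """Return a list of problems found; an empty list means none."""
    problems = []
    if len(spans) > 5:
        problems.append("more than five spans")
    defined = []
    for i, (start, end) in enumerate(spans, start=1):
        if start == 0 and end == 0:
            continue  # both zero: span is not defined and not tested
        if start > end:
            problems.append(f"span {i}: START > END")
            continue
        if end > max_lba:
            problems.append(f"span {i}: extends past MAX LBA")
        defined.append((i, start, end))
    # Pairwise overlap check on the well-formed spans.
    for a in range(len(defined)):
        for b in range(a + 1, len(defined)):
            i, s1, e1 = defined[a]
            j, s2, e2 = defined[b]
            if s1 <= e2 and s2 <= e1:
                problems.append(f"spans {i} and {j} overlap")
    return problems
```

Whether an overlap or an LBA past MAX LBA is actually an error is exactly
what the spec leaves open; the sketch only reports the condition.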

6.53.6.8.5.3 Current LBA under test
           The Current LBA under test field shall be written with a
           value of zero by the host. As the self-test progresses, the
           device shall modify this value to contain the beginning LBA
           of the 65,536 sector block currently being tested. When the
           self-test including the off-line scan between test spans
           has been completed, a zero value is placed in this field.

* The spec should say what happens to Current LBA under test in the
  following cases where a selective self-test is underway and then:

Self-test fails
Offline scan fails
User aborts self-test
Drive or host reset

For example: If the self-test fails for any reason (test failure,
host reset or SMART Execute Offline Immediate Abort) then the Current
LBA under test field will contain the last value it held prior to the
failure, reset or abort.

* The spec should say if or how Current LBA under test gets reset by
  the device.  For example:

The Current LBA under test field shall be reset to zero by 'Execute
Immediate Short/Extended/Conveyance [Captive] commands.'

* The spec says what happens AFTER the off-line scan between test
  spans has completed, but doesn't say what happens DURING the
  off-line scan.  It would be more useful to have some feedback about
  how the off-line scan is progressing, for example:

As the self-test OR READ SCAN progresses, the device shall modify this
value to contain the beginning LBA of the 65,536 sector block
currently being tested or read.

* The spec should say what happens if Current LBA under test is NOT
  written with zero.  (I suggest below that the WRITE LOG should
  command abort).
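
As an illustration of how a host might use this field for progress
feedback, here is a minimal sketch. It assumes that the 65,536 sector
blocks are counted from the starting LBA of the span, which the spec text
quoted above does not actually state; the function names are hypothetical.

```python
BLOCK_SECTORS = 65536  # block size named in the spec text


def block_start(lba: int, span_start: int) -> int:
    """Beginning LBA of the block containing lba.

    Assumption: blocks are aligned to the start of the span.
    """
    return span_start + ((lba - span_start) // BLOCK_SECTORS) * BLOCK_SECTORS


def span_progress(current_lba: int, span_start: int, span_end: int) -> float:
    """Fraction of the span that lies before current_lba (0.0 to 1.0)."""
    total = span_end - span_start + 1
    done = min(max(current_lba - span_start, 0), total)
    return done / total
```

For example, span_progress(65536, 0, 131071) gives 0.5 under this
assumption. After a failure or abort the value is only meaningful if the
field keeps its last value, as suggested above.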

6.53.6.8.5.4 Current span under test
           The Current span under test field shall be written with a
           value of zero by the host. As the self-test progresses, the
           device shall modify this value to contain the test span
           number of the current span being tested. If an offline scan
           between test spans is selected, a value greater then five
           is placed in this field during the off-line scan. When the
           self-test including the off-line scan between test spans
           has been completed, a zero value is placed in this field.

* The spec should again say what happens in the following cases:

Self-test fails
Offline scan fails
User aborts self-test with 'SMART Execute Immediate Offline ABORT'
Drive or host reset

For example: If the self-test or offline scan fails or is aborted,
then the Current span under test field will contain the last value it
held prior to the failure or abort.

* The spec should say if or how this value gets reset by the device.
  For example:

The Current Span under test field will be reset to zero by 'Execute
Immediate Offline/Short/Extended/Conveyance [Captive] commands.'

* The spec should say what happens if Current span under test is NOT
  written with zero.  (I suggest below that the WRITE LOG should
  command abort).
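
The three value ranges described in the spec text can be decoded as in
the following minimal sketch (the function name is hypothetical). Note
that a zero value is ambiguous between "never started" and "completed",
and that a nonzero value may be stale if the device leaves it behind
after a failure or abort.

```python
def describe_current_span(value: int) -> str:
    """Interpret the Current span under test field as read from the log."""
    if value == 0:
        return "no test in progress (not started, or completed)"
    if 1 <= value <= 5:
        return f"testing span {value}"
    # The spec text only says "a value greater than five".
    return "off-line scan between test spans"
```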

6.53.6.8.5.5 Feature flags and TABLE 63
      Bit (1) shall be written by the host and returned unmodified by
      the device. Bits (4:3) shall be written as zeros by the host and
      the device shall modify them as the test progresses.

* Here bits 3 and 4 are the active and pending flags.  The spec
  explains what happens to the selective self test in the case of a
  hardware or software abort, but DOESN'T explain what happens to the
  flags either in those or the following cases:

Self-test fails
Offline scan fails
User aborts self-test with 'SMART Execute Immediate Offline ABORT'

For example the spec could say: If the self-test or offline scan fails
or is aborted, then bits 3 and 4 will contain the last values they
held immediately prior to the failure or abort.  This obviously makes
sense for the pending flag; perhaps the active flag should be zeroed?

* The spec doesn't explain what happens if the user tries to write
INVALID data into the selective self-test log (invalid checksum,
invalid data structure revision number, nonzero bits 3 and 4, spans
with START > END, etc.)

My preference would be that the device should NOT allow the host to
write invalid data to the selective self-test log.  [I recently
"bricked" ALL self-test features of a test drive by doing this -- at
least until I figured out how to write valid data to the Selective
self-test log.]

I would argue that the spec should say something like "If the host
attempts to write invalid data to the selective self-test log
(examples: invalid checksum, data structure revision number not equal
to one, illegal values in the span, nonzero values of bits 3/4 in the
flags) then the SMART WRITE LOG command will command abort and the
selective self-test log will remain unchanged."
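
The checks proposed above can also be run on the host side before issuing
SMART WRITE LOG. The sketch below is illustrative only. The byte offsets
(revision number at 0, five start/end pairs from byte 2, Current LBA
under test at 492, Current span under test at 500, flags at 502, checksum
at 511) and the rule that all 512 bytes sum to zero modulo 256 are stated
here as assumptions about the log layout and should be checked against
the ATA-7 tables. The function names are hypothetical.

```python
import struct

LOG_SIZE = 512
FLAG_PENDING = 1 << 3  # assumed meaning of bit 3
FLAG_ACTIVE = 1 << 4   # assumed meaning of bit 4


def build_log(spans, flags=0):
    """Build a log image with revision 1, zeroed progress fields, checksum."""
    buf = bytearray(LOG_SIZE)
    struct.pack_into("<H", buf, 0, 1)                 # revision number
    for i, (start, end) in enumerate(spans):
        struct.pack_into("<QQ", buf, 2 + 16 * i, start, end)
    struct.pack_into("<H", buf, 502, flags)           # feature flags
    buf[511] = (-sum(buf[:511])) % 256                # checksum byte
    return bytes(buf)


def validate_log(log):
    """Return the list of reasons a device might command abort the write."""
    if len(log) != LOG_SIZE:
        return ["wrong length"]
    problems = []
    if sum(log) % 256 != 0:
        problems.append("invalid checksum")
    (revision,) = struct.unpack_from("<H", log, 0)
    if revision != 1:
        problems.append("data structure revision number not equal to one")
    for i in range(5):
        start, end = struct.unpack_from("<QQ", log, 2 + 16 * i)
        if start > end:
            problems.append(f"span {i + 1}: START > END")
    (current_lba,) = struct.unpack_from("<Q", log, 492)
    (current_span,) = struct.unpack_from("<H", log, 500)
    (flags,) = struct.unpack_from("<H", log, 502)
    if current_lba != 0:
        problems.append("Current LBA under test is nonzero")
    if current_span != 0:
        problems.append("Current span under test is nonzero")
    if flags & (FLAG_PENDING | FLAG_ACTIVE):
        problems.append("bits 3/4 of the flags are nonzero")
    return problems
```

A tool would call validate_log() on the buffer it is about to write and
refuse to send it if the returned list is not empty.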

* The spec should say what happens if the host does NOT zero bits 3
and 4 in the selective self-test log but issues a SMART Execute
Offline Immediate command with whatever values are left behind in the
selective self-test log by the host (perhaps after a failed or aborted
selective self-test).  The two possibilities are:

If a host issues a SMART Execute Offline Immediate ABORT command
during a selective self-test, then it may issue a SMART Execute
Offline Immediate Selective Self-test command without zeroing
bits 3/4 in the selective self-test log.  The device will then pick up
the selective self-test where it was interrupted.

OR

If a host issues a "SMART Execute Offline Immediate command" without
zeroing bits 3 and 4 of the selective self-test log, the device will
command abort.

I'm not sure which of these is the better choice.


6.53.4.8.7 SMART Selective self-test routine says:
            The host shall not write the Selective self-test log
            while the execution of a Selective self-test command is in
            progress.

* This leaves undefined the case in which the host tries to write
  DURING a selective self test.  It would be better if it read:

The host shall not write the Selective self-test log while the
execution of a Selective self-test command is in progress.  If the
host attempts to write the Selective self-test log while a Selective
self-test command is in progress, then SMART WRITE LOG will command
abort.  The host must first issue a SMART Execute Offline Immediate
ABORT command to terminate the Selective self-test, before attempting
to write the Selective self-test log.
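
The proposed behaviour can be modelled with a toy device, as in the
minimal sketch below. ToyDrive, CommandAbort and the method names are
purely hypothetical stand-ins for the drive and the ATA commands; this is
not real device or driver code.

```python
class CommandAbort(Exception):
    """Stands in for the device returning command aborted."""


class ToyDrive:
    def __init__(self):
        self.log = None            # last accepted Selective self-test log
        self.test_running = False  # True while a selective self-test runs

    def write_selective_log(self, data: bytes) -> None:
        # Proposed rule: WRITE LOG aborts while the test is in progress,
        # and the stored log is left unchanged.
        if self.test_running:
            raise CommandAbort("Selective self-test in progress")
        self.log = data

    def start_selective_test(self) -> None:
        self.test_running = True

    def abort_self_test(self) -> None:
        # Models SMART Execute Offline Immediate ABORT.
        self.test_running = False
```

With this model the host sequence is: abort the running test first, then
write the log; a write attempted in between fails and changes nothing.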

6.53.4.8.7 SMART Selective self-test routine says:
             When all specified test spans have been completed, the
             test is terminated and the appropriate self-test
             execution status is reported in the SMART READ DATA
             response depending on the occurrence of errors.

* The issue with this, from the application point of view, is that one
can't tell (just by looking at the selective self-test log) if a
selective self-test has terminated or not.  This is because (in the
experiments I have done with some drives that implement selective
self-tests) aborting the selective self-test leaves the selective
self-test log "as it was" with the active bit SET and NONZERO values
in the Current LBA under test and Current Span under test.

For this reason, I believe it would be better to have:

"DURING the execution of the SMART Selective Self-test routine, AND
AFTER its completion, the appropriate self-test execution status will
be reported in the self-test execution status byte of the SMART READ
DATA response.  Any errors that occur during the Selective self-test
will be reported in the self-test execution status byte."

Then, if (as suggested above) running any other SMART self-test
command clears the Selective self-test log active and pending bits and
current LBA and span values, one can conclusively learn the state of
the drive by looking at the Selective Self-test log and the self-test
execution status byte.
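
A minimal sketch of how an application might combine the two sources is
given below. It assumes the usual encoding of the self-test execution
status byte (status code in the upper four bits, with 15 meaning "in
progress", and percent remaining in tens in the lower four bits), and it
assumes the clearing behaviour suggested above. The function names are
hypothetical.

```python
IN_PROGRESS = 15  # assumed status code for "self-test in progress"


def decode_status(status_byte: int):
    """Split the status byte into (status code, percent remaining)."""
    code = (status_byte >> 4) & 0x0F
    percent_remaining = (status_byte & 0x0F) * 10
    return code, percent_remaining


def selective_test_running(status_byte: int, current_span: int) -> bool:
    """True only if both the status byte and the log agree a test is running.

    A nonzero Current span under test alone is not enough, because some
    drives leave it set after an abort.
    """
    code, _ = decode_status(status_byte)
    return code == IN_PROGRESS and current_span != 0
```

For example, a status byte of 0xF3 decodes to (15, 30): a test is in
progress with 30 percent remaining.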

6.53.6.8.5.1 Data structure revision number
           The value of the data structure revision number filed
                                                           ^^^^^
* Minor comment: should read "field".
