Re: [SUMMARY] libata EH

Tejun Heo Sat, 20 Aug 2005 10:34:14 -0700


 Hello, guys.


Jeff Garzik wrote:

(added Alan, Bart and Jens to CC)

Tejun Heo wrote:
 Hello, Jeff & libata developers.

 I've been spending some time on libata EH and am trying to make a
list of things which can be improved.  If you have something, please
add to the list.

1. Errors are handled in multiple paths.

    * ATA errors are handled directly in intr context
Simple stuff like "command aborted" (invalid command) can be handled
immediately, no need to kick in the error handling.
But as long as the right hardware interrupts are acknowledged, I don't
mind if all error handling is moved to the thread.

My preference is toward unifying into a single path for the sake of
simplicity, as long as the performance penalty is acceptable.

    * timeouts are handled by ->eng_timeout via SCSI EH

    * ATAPI errors are forwarded to ->eng_timeout in somewhat hackish way
->eng_timeout is somewhat misnamed.  A more correct name would be
"error_handler".  ATAPI errors are intentionally forwarded to SCSI EH
via the normal check-condition handling paths, which cause the SCSI EH
to be invoked.

2. Synchronization

    * SCSI EH entrance is not synchronized with normal processing.
      ATAPI error handling/timeout handling can run concurrently
      with normal command processing.  Albert, I think it's the
      same problem you're trying to solve by moving ATA_QCFLAG_ACTIVE
      clearing.

      http://marc.theaimsgroup.com/?l=linux-ide&m=112417360223374&w=2
The SCSI layer stops all command processing before calling
->eh_strategy_handler().  Where do you see that it runs concurrently
with normal command processing?  That should definitely -not- be
happening.


 There are currently two problems.

 * As we don't grab host_set lock on entry to ata_scsi_error(), we can
   run concurrently with latter part of ata_qc_complete().  This race is
   addressed by the following patches I've just posted.

   http://marc.theaimsgroup.com/?l=linux-ide&m=112454734102242&w=2

 * After entering EH, normal command completion or spurious interrupt
   can occur.  We currently don't peg those interrupts, so interrupt
   handling can interfere with EH.
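
To make the two problems above concrete, here is a minimal user-space
sketch of the idea, not libata code.  Every name in it (sketch_port,
sketch_eh_enter, sketch_interrupt and so on) is hypothetical, and a
tiny spinlock stands in for the host_set lock.  EH entry takes the lock
before freezing the port, so it cannot overlap the tail of a
completion, and the interrupt path only acknowledges interrupts while
the port is frozen.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a port; NOT the real struct ata_port. */
struct sketch_port {
    atomic_flag lock;      /* stands in for the host_set lock */
    bool frozen;           /* true while EH owns the port */
    bool cmd_active;       /* a command is outstanding */
    unsigned int ignored;  /* interrupts acknowledged but not processed */
};

static void sketch_lock(struct sketch_port *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
        ;  /* spin until the lock is free */
}

static void sketch_unlock(struct sketch_port *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

static void sketch_port_init(struct sketch_port *p)
{
    assert(p != NULL);
    atomic_flag_clear(&p->lock);
    p->frozen = false;
    p->cmd_active = false;
    p->ignored = 0;
}

/* EH entry: take the lock first, so a completion that is already in
 * progress finishes before the port is frozen. */
static void sketch_eh_enter(struct sketch_port *p)
{
    sketch_lock(p);
    p->frozen = true;
    sketch_unlock(p);
}

/* EH exit: hand the port back to normal command processing. */
static void sketch_eh_exit(struct sketch_port *p)
{
    sketch_lock(p);
    p->frozen = false;
    sketch_unlock(p);
}

/* Interrupt path: returns true if a command was completed, false if
 * the interrupt was only acknowledged (frozen port or spurious). */
static bool sketch_interrupt(struct sketch_port *p)
{
    bool completed = false;

    sketch_lock(p);
    if (p->frozen || !p->cmd_active) {
        p->ignored++;           /* ack only, leave the command to EH */
    } else {
        p->cmd_active = false;  /* normal completion */
        completed = true;
    }
    sketch_unlock(p);
    return completed;
}
```

With this shape, a late completion that arrives after sketch_eh_enter()
leaves the command untouched for EH to deal with, which is the property
the real code currently lacks.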

    * SCSI EH entrance is not synchronized with polling tasks.
Yes, this definitely needs fixing.

Luckily, the polling task is very rarely used by normal users.

3. Error handling too weak

    * We need to check the device responds to commands (say, w/
      IDENTIFY or CHK_POWER) after an error, and then reset if it
      doesn't.  To do this, we need to handle all errors in EH.
First you need to classify the error, then handle based on that
classification.
- DMA and PCI bus errors are usually indicated via status bits in
  controller registers, such as ATA_DMA_ERR on standard PCI IDE
  controllers.
- Device errors will be indicated via the ATA Status and Error
  registers.
- PCI bus errors should be handled by resetting the host controller (if
  possible), and then retrying the command [NOTE: better suggestions
  welcome]
- DMA errors should be handled by heuristics: If more than $N (3?) DMA
  errors happen in 15 minutes,
  * decrease SATA PHY speed.  if speed cannot be decreased,
  * decrease UDMA xfer speed.  if at UDMA0, switch to PIO4
  * decrease PIO xfer speed.  if at PIO3, complain, but continue

  Commands should be retried after DMA errors.
- Device errors should be handled as per ATA specs.  Usually these
  errors do not cause a retry, but they may require additional inquiry
  such as READ LOG EXT if it's a media error.
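
The DMA error heuristic above could be sketched roughly as follows.
This is a stand-alone illustration under stated assumptions, not libata
code: all names (sketch_dev, sketch_dma_error, sketch_speed_down) are
hypothetical, $N is taken as 3, time is plain seconds, and the counter
is assumed to restart after each step down.  The caller would retry the
command whatever action comes back.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_DMA_ERR_LIMIT   3          /* the "$N (3?)" above */
#define SKETCH_DMA_ERR_WINDOW  (15 * 60)  /* 15 minutes, in seconds */

enum sketch_xfer_class { SKETCH_XFER_UDMA, SKETCH_XFER_PIO };

enum sketch_action {
    SKETCH_NONE,       /* threshold not reached, just retry */
    SKETCH_PHY_DOWN,   /* SATA PHY speed decreased */
    SKETCH_UDMA_DOWN,  /* UDMA mode decreased */
    SKETCH_TO_PIO4,    /* was at UDMA0, switched to PIO4 */
    SKETCH_PIO_DOWN,   /* PIO mode decreased */
    SKETCH_COMPLAIN    /* already at PIO3: complain, but continue */
};

/* Hypothetical per-device state; NOT a real libata structure. */
struct sketch_dev {
    int phy_gen;                        /* 1 = slowest PHY speed; 0 or 1
                                           means it cannot go lower */
    enum sketch_xfer_class xfer_class;
    int xfer_mode;                      /* UDMA or PIO mode number */
    long err_time[SKETCH_DMA_ERR_LIMIT + 1];  /* ring of error times */
    size_t err_count;                   /* errors since last step down */
};

/* Walk one step down the ladder: PHY speed, then UDMA, then PIO. */
static enum sketch_action sketch_speed_down(struct sketch_dev *d)
{
    if (d->phy_gen > 1) {
        d->phy_gen--;
        return SKETCH_PHY_DOWN;
    }
    if (d->xfer_class == SKETCH_XFER_UDMA) {
        if (d->xfer_mode > 0) {
            d->xfer_mode--;
            return SKETCH_UDMA_DOWN;
        }
        d->xfer_class = SKETCH_XFER_PIO;
        d->xfer_mode = 4;
        return SKETCH_TO_PIO4;
    }
    if (d->xfer_mode > 3) {
        d->xfer_mode--;
        return SKETCH_PIO_DOWN;
    }
    return SKETCH_COMPLAIN;
}

/* Record one DMA error at time 'now'.  Steps down only when more than
 * SKETCH_DMA_ERR_LIMIT errors fall inside the window. */
static enum sketch_action sketch_dma_error(struct sketch_dev *d, long now)
{
    const size_t slots = SKETCH_DMA_ERR_LIMIT + 1;
    long oldest;

    assert(d != NULL);
    d->err_time[d->err_count % slots] = now;
    d->err_count++;
    if (d->err_count < slots)
        return SKETCH_NONE;

    /* After the increment, this slot holds the oldest of the last
     * 'slots' errors. */
    oldest = d->err_time[d->err_count % slots];
    if (now - oldest > SKETCH_DMA_ERR_WINDOW)
        return SKETCH_NONE;

    d->err_count = 0;  /* assumed: start counting afresh at the new speed */
    return sketch_speed_down(d);
}
```

Keeping the last $N + 1 timestamps in a ring means old errors age out
on their own, so a device that sees one DMA error a day never gets
slowed down.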
4. Better error reporting

    * We currently depend on ATA_STAT and ATA_ERR register values
      to check for and report errors.  As Jeff said, this is way
      too crude.  We need better error reporting.  I think having
      unified code path for error handling will help implementing
      this.
This will fall naturally out of better error handling (stuff I described
above).

5. EH is currently holding off other improvements

    * NCQ controllers (or any other non-legacy ATA interface based
      ones) don't fit nicely into current ATA error handling in
      interrupt scheme.  As NCQ errors require issuing read log
      command, we need to be in EH context to handle these errors.

    * To properly implement hotplug, we need to have solid error
      handling.  IMHO, implementing hotplug with current EH will
      be quite fragile.
Correct. Both NCQ and hotplug really want decent error handling --
particularly hotplug. Hotplug will likely travel the error handling
paths, though at that point we prefer the English word "exception"
rather than "error" :)

 Whether we choose to stay with ->eh_strategy_handler or move over to
fine-grained SCSI EH, IMHO, libata EH needs some work.  So... how
should we proceed?

Well, moving over to the fine-grained hooks should make the remaining
problems Mark sees go away, as well as guaranteeing that we get command
completion right.

Staying with ->eh_strategy_handler() means auditing the entire SCSI
layer to check and see what gets set on error, which libata must then
manually reset. Obviously there are still some bugs in this area, as
Mark has demonstrated. The big reason why I don't like
->eh_strategy_handler() is that it continues to be an unknown quantity,
in that we still don't have a complete list of things-to-do during
error handling.

As I wrote previously, I'm not very sure if Mark and I are looking at
the same problem, but I think that once we know how the system is locked
up, the debugging shouldn't be that difficult.

As there are concerns regarding semantics of ->eh_strategy_handler and
it's a less-used and less-charted territory, I'm gonna try to write a
document describing the following.


 * How SCSI EH works and commands flow through it with the default
   fine-grained hooks.
 * From above, extract what ->eh_strategy_handler() should do.
 * What libata error conditions are there and how qc's should be
   handled.
 * How to integrate libata EH into SCSI EH without losing commands.

I don't know how good the doc will turn out (don't expect too much), but
I hope it could serve as a basis for discussion if nothing else.

If you are interested in tackling the work, you're more than welcome to
choose either path. Either way, the work won't go to waste.
    Jeff

After writing the above-mentioned doc, I'll try to improve/revise and
break down my previously posted EH patchset and explain how the patches
conform to that yet-to-be-written document, so that they are easier to
understand and review/debug.


 Thanks a lot.

--
tejun