Remco [re...@d-compu.dyndns.org] wrote:
> Chris Cappuccio wrote:
>
> > here is the key error message. it means your whole ahci disk has
> > disappeared (and anything you can still run is happening from cache.)
> >
> > --
> > ahci0: stopping the port, softreset slot 31 was still active.
> > ahci0: failed to reset port during timeout handling, disabling it
> > --
> >
> > likely a reboot will fix it. this is a known problem with the ahci
> > driver and intel ahci controllers.
>
> I am not so sure this is a driver problem.
>
> I think I accidentally "emulated" this problem the other day on my
> desktop system (not a 6501):
>
> Nov 28 16:38:44 ws0001 /bsd: ahci1: stopping the port, softreset slot 31
> was still active.
> Nov 28 16:38:44 ws0001 /bsd: ahci1: failed to reset port during timeout
> handling, disabling it
>
> I have this external drive bay connected through e-SATA. After unmounting
> the drive I switched off the external drive's power. Running disklabel on
> the drive resulted in the above failures, which I guess makes sense;
> after all, I made the drive "disappear".
i "emulated" it with softraid on intel ahci, plus a ridiculously heavy disk
load with load averages above 20 for 24 hours per day, and got virtually
the same error. it went away with softraid in IDE mode. (sounds like my
problem and yours were totally different.)

the online git history for DragonFly's ahci driver has some interesting
things that we might want to pay attention to:

http://gitweb.dragonflybsd.org/dragonfly.git/history/HEAD:/sys/dev/disk/ahci/ahci.c

here are a few of dfly's interesting commits (in the context of softreset
errors and various other bug fixes):

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/5f8c1efd092f1039f63ce166e87fce744882a390

"* The softreset code did not properly initialize ccb_xa.flags, causing
   the softreset FIS's to sometimes get queued as an NCQ command instead
   of as a non-NCQ command.

 * Make ahci_poll() a bit more robust. Properly set ccb_xa.state on
   timeout, check for unexpected completions, and check to see if the
   ccb was put on a queue (though the latter should never happen since
   active/sactive is cleared by ahci_get_err_ccb())."

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b089d0bfa66d195e24630869579f0cbc1a48e459

"* Add a small delay after sending the RESET FIS in softreset before
   sending the second FIS.

 * Add a small delay after the device successfully unbusies before
   starting normal commands."

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/cf5f3a81b0df75cd4844c3c89f008e209d59b218

"* Change the reset sequence. If the first hardreset fails, do a second
   hardreset. If that fails, then try doing a softreset. This seems to
   catch all the cases. It is unclear why the reset sequence fails at
   random points, but it seems to be a combination of the port command
   processor state and the device state. COMRESET does not actually
   reset everything like it's supposed to.
 * Temporarily set ap_state to AP_S_NORMAL when starting a reset sequence
   so commands do not just fail due to a previously failed condition on
   the port.

 * Restoration of command register state now depends on whether the
   reset succeeded or failed.

 * Note that only SERR_DIAG_X needs to be cleared to allow for the next
   TFD update. These updates are serialized by the controller and there
   may be more than one. Add a function ahci_flush_tfd() which flushes
   all of them.

 * Add ahci_port_hardstop() for dealing with failed ports and device
   removals, instead of using ahci_port_hardreset(). This function tries
   to do multiple transitions via section 10.10.1. These transitions are
   not well documented by the standard.

 * Fix ahci_poll() to not queue a command if the port is in a failed
   state, as this really messes up our port processing state machine."

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/8bf6a3ffff3f591ca1c66a0a44d51c5535c5cb1b

"Not only does the DHRS interrupt not stop command processing (which
 means that ahci_pm_read() needs to be single-threaded, by the way,
 which we only do by happenstance atm), but we were stopping and
 starting the port without reloading commands in-progress."

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/dbef6246a8d336160c284860369a8a39d4902440

"* The IS register was not being properly masked for the fall-through."

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/4c339a5f388e695f1464ef583b1d0f20b03c5233

"* Stopping a port with ahci_port_stop() is problematic if the port has
   not already been stopped by the command processor (CR is inactive),
   because command completions can race our saved CI register, leading
   to double-issues. This creates issues with both NCQ and FBSS support.
   Change the timeout code to idle the port by allowing commands to
   complete normally until the only commands remaining are expired. Then
   the port can be safely stopped. The timeout code also no longer
   performs a softreset.
   It used to under certain conditions.

 * With the changes to the timeout code, softreset is no longer being
   called from the timeout code; remove hacks from the softreset code
   that attempted to restore the command processing state on failure."

http://gitweb.dragonflybsd.org/dragonfly.git/commit/f2dba7003b2add226b3999a41a99fd278cc5a26f

"kernel - Add AHCI workaround for Intel mobo / Intel SSD probing bug

 * On cold boot, Intel SSDs for some reason seem to fail to initialize
   on the first attempt. The AHCI port winds up getting stuck in BSY
   mode. Adjusting timeouts fails to solve the problem. Ignoring the BSY
   state does solve the problem but is undesirable.

 * Retry the initialization sequence once if a stuck BSY is detected as
   a workaround. This appears to properly detect the SSD on the second
   attempt.

 * Add a delay after clearing the power control state before starting
   the COMINIT sequence. This solves no known issues but is probably a
   good idea."

> > the "failed to reset port" and "softreset slot was still active"
> > problems become really obvious once you start maxing out disks on an
> > ahci controller with a softraid array. they rarely present problems
> > in normal use! but the SSD sata drive may evoke different behavior
> > for some reason. i think continuous runs of iogen over a RAID1 array
> > might bring out similar issues all by itself, even with regular hard
> > disks.
>
> Maxing out disks sounds like having more activity on the disks,
> possibly making them draw more power. Could these errors relate to bad
> power cabling or an insufficient power supply?

In this guy's case, it's certainly possible. It wasn't in mine.