Remco [re...@d-compu.dyndns.org] wrote:
> Chris Cappuccio wrote:
>
> > here is the key error message. it means your whole ahci disk has
> > disappeared (and anything you can still run is happening from cache.)
> >
> > --
> > ahci0: stopping the port, softreset slot 31 was still active.
> > ahci0: failed to reset port during timeout handling, disabling it
> > --
> >
> > likely a reboot will fix it. this is a known problem with the ahci
> > driver and intel ahci controllers.
>
> I am not so sure this is a driver problem.
>
> I think I accidentally "emulated" this problem the other day on my
> desktop system (not a 6501):
>
> Nov 28 16:38:44 ws0001 /bsd: ahci1: stopping the port, softreset slot 31
> was still active.
> Nov 28 16:38:44 ws0001 /bsd: ahci1: failed to reset port during timeout
> handling, disabling it
>
> I have this external drive bay connected through e-SATA. After unmounting
> the drive I switched off the external drive's power. Running disklabel on
> the drive resulted in the above failures, which I guess makes sense;
> after all, I made the drive "disappear".
i "emulated" it with softraid on intel ahci, plus a ridiculously heavy disk
load with load averages above 20 for 24 hours per day, and got virtually
the same error. it went away with softraid in IDE mode. (sounds like my
problem and yours were totally different.)

the online git history for DragonFly's ahci driver has some interesting
things that we might want to pay attention to:

http://gitweb.dragonflybsd.org/dragonfly.git/history/HEAD:/sys/dev/disk/ahci/ahci.c

here are a few of dfly's interesting commits (in the context of softreset
errors and various other bug fixes):

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/5f8c1efd092f1039f63ce166e87fce744882a390

"* The softreset code did not properly initialize ccb_xa.flags, causing
   the softreset FIS's to sometimes get queued as an NCQ command instead
   of as a non-NCQ command.

 * Make ahci_poll() a bit more robust. Properly set ccb_xa.state on
   timeout, check for unexpected completions, and check to see if the
   ccb was put on a queue (though the latter should never happen since
   active/sactive is cleared by ahci_get_err_ccb())."

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b089d0bfa66d195e24630869579f0cbc1a48e459

"* Add a small delay after sending the RESET FIS in softreset before
   sending the second FIS.

 * Add a small delay after the device successfully unbusies before
   starting normal commands."

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/cf5f3a81b0df75cd4844c3c89f008e209d59b218

"* Change the reset sequence. If the first hardreset fails, do a second
   hardreset. If that fails, then try doing a softreset. This seems to
   catch all the cases. It is unclear why the reset sequence fails at
   random points, but it seems to be a combination of the port command
   processor state and the device state. COMRESET does not actually
   reset everything like it's supposed to.
 * Temporarily set ap_state to AP_S_NORMAL when starting a reset sequence
   so commands do not just fail due to a previously failed condition on
   the port.

 * Restoration of command register state now depends on whether the
   reset succeeded or failed.

 * Note that only SERR_DIAG_X needs to be cleared to allow for the next
   TFD update. These updates are serialized by the controller and there
   may be more than one. Add a function ahci_flush_tfd() which flushes
   all of them.

 * Add ahci_port_hardstop() for dealing with failed ports and device
   removals, instead of using ahci_port_hardreset(). This function tries
   to do multiple transitions via section 10.10.1. These transitions are
   not well documented by the standard.

 * Fix ahci_poll() to not queue a command if the port is in a failed
   state, as this really messes up our port processing state machine."

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/8bf6a3ffff3f591ca1c66a0a44d51c5535c5cb1b

"Not only does the DHRS interrupt not stop command processing (which
 means that ahci_pm_read() needs to be single-threaded, by the way,
 which we only do by happenstance atm), but we were stopping and
 starting the port without reloading commands in-progress."

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/dbef6246a8d336160c284860369a8a39d4902440

"* The IS register was not being properly masked for the fall-through."

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/4c339a5f388e695f1464ef583b1d0f20b03c5233

"* Stopping a port with ahci_port_stop() is problematic if the port has
   not already been stopped by the command processor (CR is inactive),
   because command completions can race our saved CI register, leading
   to double-issues. This creates issues with both NCQ and FBSS support.
   Change the timeout code to idle the port by allowing commands to
   complete normally until the only commands remaining are expired. Then
   the port can be safely stopped. The timeout code also no longer
   performs a softreset.
   It used to under certain conditions.

 * With the changes to the timeout code, softreset is no longer being
   called from the timeout code; remove hacks from the softreset code
   that attempted to restore the command processing state on failure."

http://gitweb.dragonflybsd.org/dragonfly.git/commit/f2dba7003b2add226b3999a41a99fd278cc5a26f

"kernel - Add AHCI workaround for Intel mobo / Intel SSD probing bug

 * On cold boot, Intel SSDs for some reason seem to fail to initialize
   on the first attempt. The AHCI port winds up getting stuck in BSY
   mode. Adjusting timeouts fails to solve the problem. Ignoring the BSY
   state does solve the problem but is undesirable.

 * Retry the initialization sequence once if a stuck BSY is detected as
   a workaround. This appears to properly detect the SSD on the second
   attempt.

 * Add a delay after clearing the power control state before starting
   the COMINIT sequence. This solves no known issues but is probably a
   good idea."

> > the "failed to reset port" and "softreset slot was still active"
> > problems become really obvious once you start maxing out disks on an
> > ahci controller with a softraid array. they rarely present problems
> > in normal use! but the SSD sata drive may evoke different behavior
> > for some reason. i think continuous runs of iogen over a RAID1 array
> > might bring out similar issues all by itself, even with regular hard
> > disks.
>
> Maxing out disks sounds like having more activity on the disks,
> possibly making them draw more power. Could these errors relate to bad
> power cabling or an insufficient power supply?

In this guy's case, it's certainly possible. It wasn't in mine.