On Wed, Feb 01, 2017 at 06:51:00PM -0500, Sanka Coffie wrote:
> On Wed, Feb 1, 2017 at 6:28 AM, Jonathan Matthew <[email protected]> wrote:
> 
> > On Wed, Feb 01, 2017 at 04:16:34AM -0500, Sanka Coffie wrote:
> > > Here is the output of dmesg with debugging enabled on ahci, running with
> > > -current that I pulled down a few hours ago. The line matches up with the
> > > driver which is an improvement.
> >
> > OK, that looks like the mechanism for determining which NCQ command failed
> > is failing, so let's see what happens if we turn NCQ off.  Does this diff
> > change anything?
> >
> >
> > Index: ahci_pci.c
> > ===================================================================
> > RCS file: /cvs/src/sys/dev/pci/ahci_pci.c,v
> > retrieving revision 1.12
> > diff -u -p -u -p -r1.12 ahci_pci.c
> > --- ahci_pci.c  14 Jan 2016 04:06:53 -0000      1.12
> > +++ ahci_pci.c  1 Feb 2017 11:26:01 -0000
> > @@ -281,7 +281,7 @@ ahci_amd_hudson2_attach(struct ahci_soft
> >  {
> >         ahci_ati_sb_idetoahci(sc, pa);
> >
> > -       sc->sc_flags |= AHCI_F_IPMS_PROBE;
> > +       sc->sc_flags |= AHCI_F_IPMS_PROBE | AHCI_F_NO_NCQ;
> >
> >         return (0);
> >  }
> >
> >
> Yep, just installed with that change and no more kernel panic. I was also
> able to reboot, as well as run df and ls on the 2nd drive without issue.

Every other ahci driver I've looked at doesn't consider this a fatal error,
so maybe we shouldn't either.  The diff below adds a bit of dmesg spam during
NCQ error recovery, and also attempts to handle the condition we're seeing
here by failing all outstanding commands rather than panicking.  I'd be really
interested to see what happens if you remove the AHCI_F_NO_NCQ change and try
this instead.


Index: ahci.c
===================================================================
RCS file: /cvs/src/sys/dev/ic/ahci.c,v
retrieving revision 1.28
diff -u -p -u -p -r1.28 ahci.c
--- ahci.c      2 Oct 2016 18:56:05 -0000       1.28
+++ ahci.c      21 Feb 2017 03:51:09 -0000
@@ -2158,6 +2158,12 @@ ahci_port_intr(struct ahci_port *ap, u_i
                                PORTNAME(ap), err_slot);
 
                        ccb = &ap->ap_ccbs[err_slot];
+                       if (ccb->ccb_xa.state != ATA_S_ONCHIP) {
+                               printf("%s: NCQ errored slot %d is idle"
+                                   " (%08x active)\n", PORTNAME(ap), err_slot,
+                                   ci_saved);
+                               goto failall;
+                       }
                } else {
                        /* Didn't reset, could gather extended info from log. */
                }
@@ -2572,9 +2578,21 @@ err:
        /* Extract failed register set and tags from the scratch space. */
        if (rc == 0) {
                struct ata_log_page_10h         *log;
-               int                             err_slot;
+               int                             err_slot, i;
+               uint8_t                         sum;
 
                log = (struct ata_log_page_10h *)ap->ap_err_scratch;
+               sum = 0;
+               for (i = 0; i < sizeof(*log); i++)
+                       sum += ap->ap_err_scratch[i];
+               if (sum != 0)
+                       printf("%s: NCQ error log checksum mismatch\n",
+                           PORTNAME(ap));
+
+               printf("%s: ncq error: %x %x %x %x\n", PORTNAME(ap),
+                   log->err_regs.type, log->err_regs.flags,
+                   log->err_regs.status, log->err_regs.error);
+
                if (ISSET(log->err_regs.type, ATA_LOG_10H_TYPE_NOTQUEUED)) {
                        /* Not queued bit was set - wasn't an NCQ error? */
                        printf("%s: read NCQ error page, but not an NCQ "

Reply via email to