I started with -current from 20170217 (i.e. no AHCI_F_NO_NCQ change) and
applied this patch. No issues during installation or boot. The full output
of dmesg is below, though I'm not seeing any of the prints from the patch
(I verified that ahci.c had the patch applied).
# dmesg
OpenBSD 6.0-current (GENERIC.MP) #1: Tue Feb 21 17:16:58 EST 2017
root@buildvm:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 4261076992 (4063MB)
avail mem = 4127272960 (3936MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xdffb7020 (7 entries)
bios0: vendor coreboot version "88a4f96" date 03/11/2016
bios0: PC Engines apu2
acpi0 at bios0: rev 2
acpi0: sleep states S0 S1 S2 S3 S4 S5
acpi0: tables DSDT FACP SSDT APIC HEST SSDT SSDT HPET
acpi0: wakeup devices PWRB(S4) PBR4(S4) PBR5(S4) PBR6(S4) PBR7(S4) PBR8(S4)
UOH1(S3) UOH3(S3) UOH5(S3) XHC0(S4)
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD GX-412TC SOC, 998.26 MHz
cpu0:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,NXE,MMXX,FFXSR,PAGE1GB,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,SKINIT,TOPEXT,ITSC,BMI1
cpu0: 32KB 64b/line 2-way I-cache, 32KB 64b/line 8-way D-cache, 2MB
64b/line 16-way L2 cache
cpu0: ITLB 32 4KB entries fully associative, 8 4MB entries fully associative
cpu0: DTLB 40 4KB entries fully associative, 8 4MB entries fully associative
cpu0: TSC frequency 998261040 Hz
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: AMD GX-412TC SOC, 998.13 MHz
cpu1:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,NXE,MMXX,FFXSR,PAGE1GB,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,SKINIT,TOPEXT,ITSC,BMI1
cpu1: 32KB 64b/line 2-way I-cache, 32KB 64b/line 8-way D-cache, 2MB
64b/line 16-way L2 cache
cpu1: ITLB 32 4KB entries fully associative, 8 4MB entries fully associative
cpu1: DTLB 40 4KB entries fully associative, 8 4MB entries fully associative
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 2 (application processor)
cpu2: AMD GX-412TC SOC, 998.13 MHz
cpu2:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,NXE,MMXX,FFXSR,PAGE1GB,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,SKINIT,TOPEXT,ITSC,BMI1
cpu2: 32KB 64b/line 2-way I-cache, 32KB 64b/line 8-way D-cache, 2MB
64b/line 16-way L2 cache
cpu2: ITLB 32 4KB entries fully associative, 8 4MB entries fully associative
cpu2: DTLB 40 4KB entries fully associative, 8 4MB entries fully associative
cpu2: smt 0, core 2, package 0
cpu3 at mainbus0: apid 3 (application processor)
cpu3: AMD GX-412TC SOC, 998.32 MHz
cpu3:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,NXE,MMXX,FFXSR,PAGE1GB,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,SKINIT,TOPEXT,ITSC,BMI1
cpu3: 32KB 64b/line 2-way I-cache, 32KB 64b/line 8-way D-cache, 2MB
64b/line 16-way L2 cache
cpu3: ITLB 32 4KB entries fully associative, 8 4MB entries fully associative
cpu3: DTLB 40 4KB entries fully associative, 8 4MB entries fully associative
cpu3: smt 0, core 3, package 0
ioapic0 at mainbus0: apid 4 pa 0xfec00000, version 21, 24 pins
ioapic1 at mainbus0: apid 5 pa 0xfec20000, version 21, 32 pins
acpihpet0 at acpi0: 14318180 Hz
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus -1 (PBR4)
acpiprt2 at acpi0: bus 1 (PBR5)
acpiprt3 at acpi0: bus 2 (PBR6)
acpiprt4 at acpi0: bus 3 (PBR7)
acpiprt5 at acpi0: bus -1 (PBR8)
acpicpu0 at acpi0: C2(0@400 io@0x1771), C1(@1 halt!), PSS
acpicpu1 at acpi0: C2(0@400 io@0x1771), C1(@1 halt!), PSS
acpicpu2 at acpi0: C2(0@400 io@0x1771), C1(@1 halt!), PSS
acpicpu3 at acpi0: C2(0@400 io@0x1771), C1(@1 halt!), PSS
acpibtn0 at acpi0: PWRB
"PNP0501" at acpi0 not configured
cpu0: 998 MHz: speeds: 1000 800 600 MHz
pci0 at mainbus0 bus 0
pchb0 at pci0 dev 0 function 0 "AMD AMD64 16h Root Complex" rev 0x00
pchb1 at pci0 dev 2 function 0 "AMD AMD64 16h Host" rev 0x00
ppb0 at pci0 dev 2 function 2 "AMD AMD64 16h PCIE" rev 0x00: msi
pci1 at ppb0 bus 1
em0 at pci1 dev 0 function 0 "Intel I210" rev 0x03: msi, address
00:0d:b9:42:75:94
ppb1 at pci0 dev 2 function 3 "AMD AMD64 16h PCIE" rev 0x00: msi
pci2 at ppb1 bus 2
em1 at pci2 dev 0 function 0 "Intel I210" rev 0x03: msi, address
00:0d:b9:42:75:95
ppb2 at pci0 dev 2 function 4 "AMD AMD64 16h PCIE" rev 0x00: msi
pci3 at ppb2 bus 3
em2 at pci3 dev 0 function 0 "Intel I210" rev 0x03: msi, address
00:0d:b9:42:75:96
"AMD CCP" rev 0x00 at pci0 dev 8 function 0 not configured
xhci0 at pci0 dev 16 function 0 "AMD Bolton xHCI" rev 0x11: msi
usb0 at xhci0: USB revision 3.0
uhub0 at usb0 configuration 1 interface 0 "AMD xHCI root hub" rev 3.00/1.00
addr 1
ahci0 at pci0 dev 17 function 0 "AMD Hudson-2 SATA" rev 0x40: apic 4 int
19, AHCI 1.3
ahci0: port 0: 6.0Gb/s
ahci0: port 1: 6.0Gb/s
scsibus1 at ahci0: 32 targets
sd0 at scsibus1 targ 0 lun 0: <ATA, SATA SSD, S9FM> SCSI3 0/direct fixed
t10.ATA_SATA_SSD_8EC60766164804905620
sd0: 15272MB, 512 bytes/sector, 31277232 sectors, thin
sd1 at scsibus1 targ 1 lun 0: <ATA, Crucial_CT250MX2, MU04> SCSI3 0/direct
fixed naa.500a0751136e86ac
sd1: 238475MB, 512 bytes/sector, 488397168 sectors, thin
ehci0 at pci0 dev 19 function 0 "AMD Hudson-2 USB2" rev 0x39: apic 4 int 18
usb1 at ehci0: USB revision 2.0
uhub1 at usb1 configuration 1 interface 0 "AMD EHCI root hub" rev 2.00/1.00
addr 1
piixpm0 at pci0 dev 20 function 0 "AMD Hudson-2 SMBus" rev 0x42: SMBus
disabled
pcib0 at pci0 dev 20 function 3 "AMD Hudson-2 LPC" rev 0x11
sdhc0 at pci0 dev 20 function 7 "AMD Bolton SD/MMC" rev 0x01: apic 4 int 16
sdhc0: SDHC 2.0, 63 MHz base clock
sdmmc0 at sdhc0: 4-bit, sd high-speed, mmc high-speed, dma
pchb2 at pci0 dev 24 function 0 "AMD AMD64 16h Link Cfg" rev 0x00
pchb3 at pci0 dev 24 function 1 "AMD AMD64 16h Address Map" rev 0x00
pchb4 at pci0 dev 24 function 2 "AMD AMD64 16h DRAM Cfg" rev 0x00
km0 at pci0 dev 24 function 3 "AMD AMD64 16h Misc Cfg" rev 0x00
pchb5 at pci0 dev 24 function 4 "AMD AMD64 16h CPU Power" rev 0x00
pchb6 at pci0 dev 24 function 5 "AMD AMD64 16h Misc Cfg" rev 0x00
isa0 at pcib0
isadma0 at isa0
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
com0: console
com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
lpt0 at isa0 port 0x378/4 irq 7
wbsio0 at isa0 port 0x2e/2: NCT5104D rev 0x52
vmm at mainbus0 not configured
uhub2 at uhub1 port 1 configuration 1 interface 0 "Advanced Micro Devices
product 0x7900" rev 2.00/0.18 addr 2
vscsi0 at root
scsibus2 at vscsi0: 256 targets
softraid0 at root
scsibus3 at softraid0: 256 targets
root on sd0a (402af328685601ff.a) swap on sd0b dump on sd0b
Process (pid 1) got signal 31
On Mon, Feb 20, 2017 at 11:20 PM, Jonathan Matthew <[email protected]>
wrote:
> On Wed, Feb 01, 2017 at 06:51:00PM -0500, Sanka Coffie wrote:
> > On Wed, Feb 1, 2017 at 6:28 AM, Jonathan Matthew <[email protected]>
> wrote:
> >
> > > On Wed, Feb 01, 2017 at 04:16:34AM -0500, Sanka Coffie wrote:
> > > > Here is the output of dmesg with debugging enabled on ahci, running
> with
> > > > -current that I pulled down a few hours ago. The line matches up
> with the
> > > > driver which is an improvement.
> > >
> > > OK, that looks like the mechanism for determining which NCQ command
> failed
> > > is failing, so let's see what happens if we turn NCQ off. Does this
> diff
> > > change anything?
> > >
> > >
> > > Index: ahci_pci.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/dev/pci/ahci_pci.c,v
> > > retrieving revision 1.12
> > > diff -u -p -u -p -r1.12 ahci_pci.c
> > > --- ahci_pci.c 14 Jan 2016 04:06:53 -0000 1.12
> > > +++ ahci_pci.c 1 Feb 2017 11:26:01 -0000
> > > @@ -281,7 +281,7 @@ ahci_amd_hudson2_attach(struct ahci_soft
> > > {
> > > ahci_ati_sb_idetoahci(sc, pa);
> > >
> > > - sc->sc_flags |= AHCI_F_IPMS_PROBE;
> > > + sc->sc_flags |= AHCI_F_IPMS_PROBE | AHCI_F_NO_NCQ;
> > >
> > > return (0);
> > > }
> > >
> > >
> > Yep, just installed with that change and no more kernel panic. I was also
> > able to reboot, as well as run df and ls on the 2nd drive without issue.
>
> Every other ahci driver I've looked at doesn't consider this a fatal error,
> so maybe we shouldn't either. The diff below adds a bit of dmesg spam
> during
> NCQ error recovery, and also attempts to handle the condition we're seeing
> here by failing all outstanding commands rather than panicking. I'd be
> really
> interested to see what happens if you remove the AHCI_F_NO_NCQ change and
> try
> this instead.
>
>
> Index: ahci.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/ic/ahci.c,v
> retrieving revision 1.28
> diff -u -p -u -p -r1.28 ahci.c
> --- ahci.c 2 Oct 2016 18:56:05 -0000 1.28
> +++ ahci.c 21 Feb 2017 03:51:09 -0000
> @@ -2158,6 +2158,12 @@ ahci_port_intr(struct ahci_port *ap, u_i
> PORTNAME(ap), err_slot);
>
> ccb = &ap->ap_ccbs[err_slot];
> + if (ccb->ccb_xa.state != ATA_S_ONCHIP) {
> + printf("%s: NCQ errored slot %d is idle"
> + " (%08x active)\n", PORTNAME(ap),
> err_slot,
> + ci_saved);
> + goto failall;
> + }
> } else {
> /* Didn't reset, could gather extended info from
> log. */
> }
> @@ -2572,9 +2578,21 @@ err:
> /* Extract failed register set and tags from the scratch space. */
> if (rc == 0) {
> struct ata_log_page_10h *log;
> - int err_slot;
> + int err_slot, i;
> + uint8_t sum;
>
> log = (struct ata_log_page_10h *)ap->ap_err_scratch;
> + sum = 0;
> + for (i = 0; i < sizeof(*log); i++)
> + sum += ap->ap_err_scratch[i];
> + if (sum != 0)
> + printf("%s: NCQ error log checksum mismatch\n",
> + PORTNAME(ap));
> +
> + printf("%s: ncq error: %x %x %x %x\n", PORTNAME(ap),
> + log->err_regs.type, log->err_regs.flags,
> + log->err_regs.status, log->err_regs.error);
> +
> if (ISSET(log->err_regs.type, ATA_LOG_10H_TYPE_NOTQUEUED))
> {
> /* Not queued bit was set - wasn't an NCQ error? */
> printf("%s: read NCQ error page, but not an NCQ "
>