On Sun, 2019-10-06 at 09:20 +1000, Jonathan Matthew wrote:
> On Fri, Oct 04, 2019 at 06:24:16PM -0400, [email protected] wrote:
> > 
> > > 
> > > Synopsis: panic after ahci0: log page read failed, slot 31 was still active
> > > Category: kernel
> > > Environment:
> >     System      : OpenBSD 6.6
> >     Details     : OpenBSD 6.6-beta (GENERIC.MP) #245: Sat Sep 28 20:43:51 MDT 2019
> >                      [email protected]:/usr/src/sys/arch/arm64/compile/GENERIC.MP
> > 
> >     Architecture: OpenBSD.arm64
> >     Machine     : arm64
> > > 
> > > Description:
> >     While building lang/rust received some ahci0 messages then a panic.
> >     System is a RockPro64 with 4G memory. filesystem root is on uSD, the
> >     rest of the partitions are on SSD (including swap).
> > > 
> > > How-To-Repeat:
> >     Not sure if reproducible yet. I was building lang/rust with
> >     ulimit -Sd 4194304.
> > > 
> > > Fix:
> >     Unknown
> > 
> > 
> > ahci0: log page read failed, slot 31 was still active.
> > ahci0: device didn't come ready after reset, TFD: 0x84c1<BSY,ERR>
> > panic: uvm_fault failed: ffffff80003475b8
> So, the 'log page read' happens when there are multiple commands active and
> the port reports an error.  The log page read is supposed to tell us which
> command
> failed.  If that fails, we fail all active commands, and if the device won't
> reset after that, we shut the device off and all further io will fail.  If
> that's your swap device, and the system is swapping, you're kind of screwed.
> 
> Perhaps disabling command queueing (which means there can only be one command
> in flight, so no need to read the log page on errors) might help?  The diff
> below should do that.  Checking the SSD out with smartctl is probably also a
> good idea at this point.
> 
> 
> diff --git sys/dev/pci/ahci_pci.c sys/dev/pci/ahci_pci.c
> index 79044b52dd5..f61ae96b0cf 100644
> --- sys/dev/pci/ahci_pci.c
> +++ sys/dev/pci/ahci_pci.c
> @@ -108,7 +108,8 @@ static const struct ahci_device ahci_devices[] = {
>       { PCI_VENDOR_ATI,       PCI_PRODUCT_ATI_SBX00_SATA_6,
>           NULL,               ahci_ati_sb700_attach },
>  
> -     { PCI_VENDOR_ASMEDIA,   PCI_PRODUCT_ASMEDIA_ASM1061_SATA },
> +     { PCI_VENDOR_ASMEDIA,   PCI_PRODUCT_ASMEDIA_ASM1061_SATA,
> +         NULL,               ahci_vt8251_attach },
>  
>       { PCI_VENDOR_INTEL,     PCI_PRODUCT_INTEL_6SERIES_AHCI_1,
>           NULL,               ahci_intel_attach },
> 

Thank you for the diff to test.

I noticed that I can easily reproduce the problem using sysupgrade:
more than 50% of the time the problem trips there. My /home partition
is on the SSD, so it is being read from and written to while the sets
are installed. I have also reproduced the problem by booting bsd.rd and
fetching the sets over http, though that trips the error less often.
It seems the ramdisk environment increases the chances of tripping
the issue. In multiuser under bsd.mp it is much more difficult to
reproduce: I have only been able to trip it one more time, using fio
with a random read/write test, and lots of heavy ports building has
not tripped it.

I applied your diff and built RAMDISK to test it. However, the
boot failed with this (full boot message later on):

root on rd0a swap on rd0b dump on rd0b
panic: cannot open disk, 0x1100/0x2f02, error 2
syncing disks... done

smartmontools seems to indicate the SSD isn't failing but
I'm not that familiar with the output:

smartctl -a /dev/sd0c 
smartctl 7.0 2018-12-30 r4883 [aarch64-unknown-openbsd6.6] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 EVO 500GB
Serial Number:    S1DHNSAD921017A
LU WWN Device Id: 5 002538 8a003c28f
Firmware Version: EXT0BB0Q
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Oct  6 18:31:59 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  32) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline 
data collection:                ( 6600) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 110) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21750
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       399
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       16
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   075   054   000    Old_age   Always       -       25
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       13
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       154
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       21761092635

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Full boot messages with diff applied to bsd.rd:

U-Boot SPL 2019.10-rc4 (Oct 03 2019 - 19:40:31 -0400)
Trying to boot from MMC1
NOTICE:  BL31: v2.1(debug):2.1
NOTICE:  BL31: Built : 10:16:34, Sep 27 2019
INFO:    GICv3 with legacy support detected. ARM GICV3 driver initialized in EL3
INFO:    plat_rockchip_pmu_init(1596): pd status 3e
INFO:    BL31: Initializing runtime services
WARNING: BL31: cortex_a53: CPU workaround for 819472 was missing!
WARNING: BL31: cortex_a53: CPU workaround for 824069 was missing!
WARNING: BL31: cortex_a53: CPU workaround for 827319 was missing!
INFO:    BL31: cortex_a53: CPU workaround for 855873 was applied
INFO:    BL31: Preparing for EL3 exit to normal world
INFO:    Entry point address = 0x200000
INFO:    SPSR = 0x3c9


U-Boot 2019.10-rc4 (Oct 03 2019 - 19:40:31 -0400)

Model: Pine64 RockPro64
DRAM:  3.9 GiB
MMC:   dwmmc@fe320000: 1, sdhci@fe330000: 0
Loading Environment from MMC... Card did not respond to voltage select!
*** Warning - No block device, using default environment

In:    serial@ff1a0000
Out:   serial@ff1a0000
Err:   serial@ff1a0000
Model: Pine64 RockPro64
rockchip_dnl_key_pressed: adc_channel_single_shot fail!
Net:   eth0: ethernet@fe300000
Hit any key to stop autoboot:  0 
Card did not respond to voltage select!
switch to partitions #0, OK
mmc1 is current device
Scanning mmc 1:1...
Found EFI removable media binary efi/boot/bootaa64.efi
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
Scanning disk [email protected]...
Card did not respond to voltage select!
Scanning disk [email protected]...
Disk [email protected] not ready
Found 3 disks
BootOrder not defined
EFI boot manager: Cannot load any image
161090 bytes read in 46 ms (3.3 MiB/s)
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
disks: sd0*
>> OpenBSD/arm64 BOOTAA64 0.19
boot> bsd.rd
booting sd0a:bsd.rd: 2224220+622316+8769504+739568 [221760+109+519552+200640]=0xff7c80
type 0x2 pa 0x200000 va 0x200000 pages 0x4000 attr 0x8
type 0x7 pa 0x4200000 va 0x4200000 pages 0x3eec attr 0x8
type 0x4 pa 0x80ec000 va 0x80ec000 pages 0x28 attr 0x8
type 0x7 pa 0x8114000 va 0x8114000 pages 0xec099 attr 0x8
type 0x2 pa 0xf41ad000 va 0xf41ad000 pages 0xc3b attr 0x8
type 0x4 pa 0xf4de8000 va 0xf4de8000 pages 0x1 attr 0x8
type 0x2 pa 0xf4de9000 va 0xf4de9000 pages 0x3 attr 0x8
type 0x7 pa 0xf4dec000 va 0xf4dec000 pages 0x1 attr 0x8
type 0x2 pa 0xf4ded000 va 0xf4ded000 pages 0x100 attr 0x8
type 0x1 pa 0xf4eed000 va 0xf4eed000 pages 0x28 attr 0x8
type 0x0 pa 0xf4f15000 va 0xf4f15000 pages 0x7 attr 0x8
type 0x4 pa 0xf4f1c000 va 0xf4f1c000 pages 0x1 attr 0x8
type 0x6 pa 0xf4f1d000 va 0x8cb8de000 pages 0x1 attr 0x8000000000000008
type 0x4 pa 0xf4f1e000 va 0xf4f1e000 pages 0x2 attr 0x8
type 0x0 pa 0xf4f20000 va 0xf4f20000 pages 0x4 attr 0x8
type 0x4 pa 0xf4f24000 va 0xf4f24000 pages 0x2 attr 0x8
type 0x6 pa 0xf4f26000 va 0x8cb8e7000 pages 0x1 attr 0x8000000000000008
type 0x2 pa 0xf4f27000 va 0xf4f27000 pages 0x3019 attr 0x8
type 0x5 pa 0xf7f40000 va 0x8ce901000 pages 0x10 attr 0x8000000000000008
type 0x2 pa 0xf7f50000 va 0xf7f50000 pages 0xb0 attr 0x8
Copyright (c) 1982, 1986, 1989, 1991, 1993
        The Regents of the University of California.  All rights reserved.
Copyright (c) 1995-2019 OpenBSD. All rights reserved.  https://www.OpenBSD.org

OpenBSD 6.6 (RAMDISK) #1: Sun Oct  6 19:02:29 EDT 2019
    [email protected]:/sys/arch/arm64/compile/RAMDISK
real mem  = 4093136896 (3903MB)
avail mem = 3891367936 (3711MB)
mainbus0 at root: Pine64 RockPro64
cpu0 at mainbus0 mpidr 0: ARM Cortex-A53 r0p4
cpu0: 32KB 64b/line 2-way L1 VIPT I-cache, 32KB 64b/line 4-way L1 D-cache
cpu0: 512KB 64b/line 16-way L2 cache
efi0 at mainbus0: UEFI 2.8
efi0: Das U-Boot rev 0x20191000
psci0 at mainbus0: PSCI 1.1, SMCCC 1.1
agintc0 at mainbus0 sec shift 3:3 nirq 288 nredist 6: "interrupt-controller"
agintcmsi0 at agintc0
syscon0 at mainbus0: "qos"
syscon1 at mainbus0: "qos"
syscon2 at mainbus0: "qos"
syscon3 at mainbus0: "qos"
syscon4 at mainbus0: "qos"
syscon5 at mainbus0: "qos"
syscon6 at mainbus0: "qos"
syscon7 at mainbus0: "qos"
syscon8 at mainbus0: "qos"
syscon9 at mainbus0: "qos"
syscon10 at mainbus0: "qos"
syscon11 at mainbus0: "qos"
syscon12 at mainbus0: "qos"
syscon13 at mainbus0: "qos"
syscon14 at mainbus0: "qos"
syscon15 at mainbus0: "qos"
syscon16 at mainbus0: "qos"
syscon17 at mainbus0: "qos"
syscon18 at mainbus0: "qos"
syscon19 at mainbus0: "qos"
syscon20 at mainbus0: "qos"
syscon21 at mainbus0: "qos"
syscon22 at mainbus0: "qos"
syscon23 at mainbus0: "qos"
syscon24 at mainbus0: "qos"
syscon25 at mainbus0: "power-management"
"power-controller" at syscon25 not configured
syscon26 at mainbus0: "syscon"
"io-domains" at syscon26 not configured
syscon27 at mainbus0: "syscon"
syscon28 at mainbus0: "syscon"
rkclock0 at mainbus0
rkclock1 at mainbus0
syscon29 at mainbus0: "syscon"
"io-domains" at syscon29 not configured
"usb2-phy" at syscon29 not configured
"usb2-phy" at syscon29 not configured
"phy" at syscon29 not configured
"pcie-phy" at syscon29 not configured
rkpinctrl0 at mainbus0: "pinctrl"
rkgpio0 at rkpinctrl0
rkgpio1 at rkpinctrl0
rkgpio2 at rkpinctrl0
rkgpio3 at rkpinctrl0
rkgpio4 at rkpinctrl0
"fit-images" at mainbus0 not configured
"pmu_a53" at mainbus0 not configured
"pmu_a72" at mainbus0 not configured
agtimer0 at mainbus0: tick rate 24000 KHz
"xin24m" at mainbus0 not configured
simplebus0 at mainbus0: "amba"
"dma-controller" at simplebus0 not configured
"dma-controller" at simplebus0 not configured
rkpcie0 at mainbus0
pci0 at rkpcie0
ppb0 at pci0 dev 0 function 0 "Rockchip RK3399 Root Complex" rev 0x00: msi
pci1 at ppb0 bus 1
ahci0 at pci1 dev 0 function 0 "ASMedia ASM1061 AHCI" rev 0x01: msi, AHCI 1.2
ahci0: port 1: 6.0Gb/s
scsibus0 at ahci0: 32 targets
sd0 at scsibus0 targ 1 lun 0: <ATA, Samsung SSD 840, EXT0> naa.50025388a003c28f
sd0: 476940MB, 512 bytes/sector, 976773168 sectors, thin
dwge0 at mainbus0: address 12:e7:22:42:f7:96
rgephy0 at dwge0 phy 0: RTL8169S/8110S/8211 PHY, rev. 6
dwmmc0 at mainbus0: 50 MHz base clock
sdmmc0 at dwmmc0: 4-bit, sd high-speed, mmc high-speed, dma
sdhc0 at mainbus0
sdhc0: SDHC 3.0, 200 MHz base clock
sdmmc1 at sdhc0: 8-bit, sd high-speed, mmc high-speed, dma
ehci0 at mainbus0
usb0 at ehci0: USB revision 2.0
uhub0 at usb0 configuration 1 interface 0 "Generic EHCI root hub" rev 2.00/1.00 addr 1
ohci0 at mainbus0: version 1.0
ehci1 at mainbus0
usb1 at ehci1: USB revision 2.0
uhub1 at usb1 configuration 1 interface 0 "Generic EHCI root hub" rev 2.00/1.00 addr 1
ohci1 at mainbus0: version 1.0
rkdwusb0 at mainbus0: "usb"
xhci0 at rkdwusb0, xHCI 1.10
usb2 at xhci0: USB revision 3.0
uhub2 at usb2 configuration 1 interface 0 "Generic xHCI root hub" rev 3.00/1.00 addr 1
rkdwusb1 at mainbus0: "usb"
xhci1 at rkdwusb1, xHCI 1.10
usb3 at xhci1: USB revision 3.0
uhub3 at usb3 configuration 1 interface 0 "Generic xHCI root hub" rev 3.00/1.00 addr 1
"saradc" at mainbus0 not configured
rkiic0 at mainbus0
iic0 at rkiic0
rkiic1 at mainbus0
iic1 at rkiic1
com0 at mainbus0: ns16550, no working fifo
com1 at mainbus0: ns16550, no working fifo
com1: console
"thermal-zones" at mainbus0 not configured
"tsadc" at mainbus0 not configured
rkiic2 at mainbus0
iic2 at rkiic2
rkpmic0 at iic2 addr 0x1b: RK808
"silergy,syr827" at iic2 addr 0x40 not configured
"silergy,syr828" at iic2 addr 0x41 not configured
rkiic3 at mainbus0
iic3 at rkiic3
fusbtc0 at iic3 addr 0x22
"pwm" at mainbus0 not configured
"pwm" at mainbus0 not configured
"dmc" at mainbus0 not configured
"efuse" at mainbus0 not configured
"phy" at mainbus0 not configured
"phy" at mainbus0 not configured
"watchdog" at mainbus0 not configured
"rktimer" at mainbus0 not configured
"i2s" at mainbus0 not configured
"i2s" at mainbus0 not configured
"i2s" at mainbus0 not configured
"vop" at mainbus0 not configured
"iommu" at mainbus0 not configured
"vop" at mainbus0 not configured
"iommu" at mainbus0 not configured
"hdmi-sound" at mainbus0 not configured
"hdmi" at mainbus0 not configured
"gpu" at mainbus0 not configured
"opp-table0" at mainbus0 not configured
"opp-table1" at mainbus0 not configured
"opp-table2" at mainbus0 not configured
"external-gmac-clock" at mainbus0 not configured
"gpio-keys" at mainbus0 not configured
"leds" at mainbus0 not configured
"sdio-pwrseq" at mainbus0 not configured
"vcc12v-dcin" at mainbus0 not configured
"vcc1v8-s3" at mainbus0 not configured
"vcc3v3-pcie-regulator" at mainbus0 not configured
"vcc3v3-sys" at mainbus0 not configured
"vcc5v0-host-regulator" at mainbus0 not configured
"vcc5v0-typec-regulator" at mainbus0 not configured
"vcc5v0-sys" at mainbus0 not configured
"vcc5v0-usb" at mainbus0 not configured
"vdd-log" at mainbus0 not configured
usb4 at ohci0: USB revision 1.0
uhub4 at usb4 configuration 1 interface 0 "Generic OHCI root hub" rev 1.00/1.00 addr 1
usb5 at ohci1: USB revision 1.0
uhub5 at usb5 configuration 1 interface 0 "Generic OHCI root hub" rev 1.00/1.00 addr 1
scsibus1 at sdmmc0: 2 targets, initiator 0
sd1 at scsibus1 targ 1 lun 0: <SD/MMC, SD16G, 0020> removable
sd1: 29862MB, 512 bytes/sector, 61157376 sectors
sdmmc1: can't enable card
softraid0 at root
scsibus2 at softraid0: 256 targets
bootfile: sd0a:bsd.rd
boot device: sd0
root on rd0a swap on rd0b dump on rd0b
panic: cannot open disk, 0x1100/0x2f02, error 2
syncing disks... done

dump to dev 17,1 not possible
rebooting...
