On Sun, 2019-10-06 at 09:20 +1000, Jonathan Matthew wrote:
> On Fri, Oct 04, 2019 at 06:24:16PM -0400, [email protected] wrote:
> > > Synopsis:     panic after ahci0: log page read failed, slot 31 was still active
> > > Category:     kernel
> > > Environment:
> >     System      : OpenBSD 6.6
> >     Details     : OpenBSD 6.6-beta (GENERIC.MP) #245: Sat Sep 28 20:43:51 MDT 2019
> >                   [email protected]:/usr/src/sys/arch/arm64/compile/GENERIC.MP
> >
> >     Architecture: OpenBSD.arm64
> >     Machine     : arm64
> > > Description:
> >     While building lang/rust received some ahci0 messages then a panic.
> >     System is a RockPro64 with 4G memory. filesystem root is on uSD, the
> >     rest of the partitions are on SSD (including swap).
> > > How-To-Repeat:
> >     Not sure if reproducible yet. I was building lang/rust with
> >     ulimit -Sd 4194304.
> > > Fix:
> >     Unknown
> >
> > ahci0: log page read failed, slot 31 was still active.
> > ahci0: device didn't come ready after reset, TFD: 0x84c1<BSY,ERR>
> > panic: uvm_fault failed: ffffff80003475b8
>
> So, the 'log page read' happens when there are multiple commands active and
> the port reports an error. The log page read is supposed to tell us which
> command failed. If that fails, we fail all active commands, and if the device
> won't reset after that, we shut the device off and all further io will fail.
> If that's your swap device, and the system is swapping, you're kind of screwed.
>
> Perhaps disabling command queueing (which means there can only be one command
> in flight, so no need to read the log page on errors) might help? The diff
> below should do that. Checking the SSD out with smartctl is probably also a
> good idea at this point.
>
>
> diff --git sys/dev/pci/ahci_pci.c sys/dev/pci/ahci_pci.c
> index 79044b52dd5..f61ae96b0cf 100644
> --- sys/dev/pci/ahci_pci.c
> +++ sys/dev/pci/ahci_pci.c
> @@ -108,7 +108,8 @@ static const struct ahci_device ahci_devices[] = {
>  	{ PCI_VENDOR_ATI,	PCI_PRODUCT_ATI_SBX00_SATA_6,
>  	    NULL,		ahci_ati_sb700_attach },
>  
> -	{ PCI_VENDOR_ASMEDIA,	PCI_PRODUCT_ASMEDIA_ASM1061_SATA },
> +	{ PCI_VENDOR_ASMEDIA,	PCI_PRODUCT_ASMEDIA_ASM1061_SATA,
> +	    NULL,		ahci_vt8251_attach },
>  
>  	{ PCI_VENDOR_INTEL,	PCI_PRODUCT_INTEL_6SERIES_AHCI_1,
>  	    NULL,		ahci_intel_attach },
>
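
To check my understanding of the recovery path described above, here is
how I read it, written as a small C model. This is only a sketch of the
decision logic from your explanation, not the actual ahci(4) code: the
enum, the function name and its parameters are all hypothetical, and
failing just the one command in the unqueued case is my assumption.

```c
#include <stdbool.h>

/* Hypothetical outcome labels; these are not identifiers from ahci(4). */
enum recovery {
	FAIL_ONE_COMMAND,	/* the failed command could be identified */
	FAIL_ALL_COMMANDS,	/* not identified, but the device did reset */
	DEVICE_OFFLINE		/* device would not reset; all further io fails */
};

/*
 * Simplified model of what happens when the port reports an error.
 *   active_cmds: commands in flight when the error was reported
 *   log_read_ok: whether the log page read succeeded
 *   reset_ok:    whether the device came ready after the reset
 */
enum recovery
recover(int active_cmds, bool log_read_ok, bool reset_ok)
{
	/* Without queueing only one command is in flight: no log page read. */
	if (active_cmds <= 1)
		return FAIL_ONE_COMMAND;

	/* The log page read tells us which of the queued commands failed. */
	if (log_read_ok)
		return FAIL_ONE_COMMAND;

	/* Log page read failed: every active command is failed. */
	if (reset_ok)
		return FAIL_ALL_COMMANDS;

	/* The device did not come ready after reset, so it is shut off. */
	return DEVICE_OFFLINE;
}
```

If that reading is right, the two ahci0 messages in my original report
match recover(n, false, false) with n > 1, which ends in DEVICE_OFFLINE.
With the diff applied the first branch should always be taken.
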
Thank you for the diff to test.

I noticed that I can easily reproduce the problem using sysupgrade; it
trips there more than 50% of the time. My /home partition is on the SSD,
so it is being read from and written to while the sets are installed. I
have also reproduced the problem using bsd.rd with http as the source of
the sets, although the error trips less frequently that way. It seems
like the ramdisk environment increases the chances of tripping the issue.

While running multiuser bsd.mp it is much more difficult to reproduce.
I have only been able to trip it one more time, using fio with a random
read/write test; lots of heavy ports building has not tripped the issue.

I applied your diff and built RAMDISK to test it. However, the boot
failed with this (full boot messages later on):

root on rd0a swap on rd0b dump on rd0b
panic: cannot open disk, 0x1100/0x2f02, error 2
syncing disks... done

smartmontools seems to indicate the SSD isn't failing, but I'm not that
familiar with the output:

smartctl -a /dev/sd0c
smartctl 7.0 2018-12-30 r4883 [aarch64-unknown-openbsd6.6] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 EVO 500GB
Serial Number:    S1DHNSAD921017A
LU WWN Device Id: 5 002538 8a003c28f
Firmware Version: EXT0BB0Q
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Oct 6 18:31:59 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  32) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                 (6600) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 110) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21750
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       399
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       16
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   075   054   000    Old_age   Always       -       25
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       13
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       154
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       21761092635

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                     Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      00%            0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Full boot messages with diff applied to bsd.rd:

U-Boot SPL 2019.10-rc4 (Oct 03 2019 - 19:40:31 -0400)
Trying to boot from MMC1
NOTICE:  BL31: v2.1(debug):2.1
NOTICE:  BL31: Built : 10:16:34, Sep 27 2019
INFO:    GICv3 with legacy support detected.
ARM GICV3 driver initialized in EL3
INFO:    plat_rockchip_pmu_init(1596): pd status 3e
INFO:    BL31: Initializing runtime services
WARNING: BL31: cortex_a53: CPU workaround for 819472 was missing!
WARNING: BL31: cortex_a53: CPU workaround for 824069 was missing!
WARNING: BL31: cortex_a53: CPU workaround for 827319 was missing!
INFO:    BL31: cortex_a53: CPU workaround for 855873 was applied
INFO:    BL31: Preparing for EL3 exit to normal world
INFO:    Entry point address = 0x200000
INFO:    SPSR = 0x3c9


U-Boot 2019.10-rc4 (Oct 03 2019 - 19:40:31 -0400)

Model: Pine64 RockPro64
DRAM:  3.9 GiB
MMC:   dwmmc@fe320000: 1, sdhci@fe330000: 0
Loading Environment from MMC... Card did not respond to voltage select!
*** Warning - No block device, using default environment

In:    serial@ff1a0000
Out:   serial@ff1a0000
Err:   serial@ff1a0000
Model: Pine64 RockPro64
rockchip_dnl_key_pressed: adc_channel_single_shot fail!
Net:   eth0: ethernet@fe300000
Hit any key to stop autoboot:  0
Card did not respond to voltage select!
switch to partitions #0, OK
mmc1 is current device
Scanning mmc 1:1...
Found EFI removable media binary efi/boot/bootaa64.efi
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
Scanning disk [email protected]...
Card did not respond to voltage select!
Scanning disk [email protected]...
Disk [email protected] not ready
Found 3 disks
BootOrder not defined
EFI boot manager: Cannot load any image
161090 bytes read in 46 ms (3.3 MiB/s)
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
disks: sd0*
>> OpenBSD/arm64 BOOTAA64 0.19
boot> bsd.rd
booting sd0a:bsd.rd: 2224220+622316+8769504+739568 [221760+109+519552+200640]=0xff7c80
type 0x2 pa 0x200000 va 0x200000 pages 0x4000 attr 0x8
type 0x7 pa 0x4200000 va 0x4200000 pages 0x3eec attr 0x8
type 0x4 pa 0x80ec000 va 0x80ec000 pages 0x28 attr 0x8
type 0x7 pa 0x8114000 va 0x8114000 pages 0xec099 attr 0x8
type 0x2 pa 0xf41ad000 va 0xf41ad000 pages 0xc3b attr 0x8
type 0x4 pa 0xf4de8000 va 0xf4de8000 pages 0x1 attr 0x8
type 0x2 pa 0xf4de9000 va 0xf4de9000 pages 0x3 attr 0x8
type 0x7 pa 0xf4dec000 va 0xf4dec000 pages 0x1 attr 0x8
type 0x2 pa 0xf4ded000 va 0xf4ded000 pages 0x100 attr 0x8
type 0x1 pa 0xf4eed000 va 0xf4eed000 pages 0x28 attr 0x8
type 0x0 pa 0xf4f15000 va 0xf4f15000 pages 0x7 attr 0x8
type 0x4 pa 0xf4f1c000 va 0xf4f1c000 pages 0x1 attr 0x8
type 0x6 pa 0xf4f1d000 va 0x8cb8de000 pages 0x1 attr 0x8000000000000008
type 0x4 pa 0xf4f1e000 va 0xf4f1e000 pages 0x2 attr 0x8
type 0x0 pa 0xf4f20000 va 0xf4f20000 pages 0x4 attr 0x8
type 0x4 pa 0xf4f24000 va 0xf4f24000 pages 0x2 attr 0x8
type 0x6 pa 0xf4f26000 va 0x8cb8e7000 pages 0x1 attr 0x8000000000000008
type 0x2 pa 0xf4f27000 va 0xf4f27000 pages 0x3019 attr 0x8
type 0x5 pa 0xf7f40000 va 0x8ce901000 pages 0x10 attr 0x8000000000000008
type 0x2 pa 0xf7f50000 va 0xf7f50000 pages 0xb0 attr 0x8
Copyright (c) 1982, 1986, 1989, 1991, 1993
        The Regents of the University of California.  All rights reserved.
Copyright (c) 1995-2019 OpenBSD. All rights reserved.  https://www.OpenBSD.org

OpenBSD 6.6 (RAMDISK) #1: Sun Oct  6 19:02:29 EDT 2019
    [email protected]:/sys/arch/arm64/compile/RAMDISK
real mem  = 4093136896 (3903MB)
avail mem = 3891367936 (3711MB)
mainbus0 at root: Pine64 RockPro64
cpu0 at mainbus0 mpidr 0: ARM Cortex-A53 r0p4
cpu0: 32KB 64b/line 2-way L1 VIPT I-cache, 32KB 64b/line 4-way L1 D-cache
cpu0: 512KB 64b/line 16-way L2 cache
efi0 at mainbus0: UEFI 2.8
efi0: Das U-Boot rev 0x20191000
psci0 at mainbus0: PSCI 1.1, SMCCC 1.1
agintc0 at mainbus0 sec shift 3:3 nirq 288 nredist 6: "interrupt-controller"
agintcmsi0 at agintc0
syscon0 at mainbus0: "qos"
syscon1 at mainbus0: "qos"
syscon2 at mainbus0: "qos"
syscon3 at mainbus0: "qos"
syscon4 at mainbus0: "qos"
syscon5 at mainbus0: "qos"
syscon6 at mainbus0: "qos"
syscon7 at mainbus0: "qos"
syscon8 at mainbus0: "qos"
syscon9 at mainbus0: "qos"
syscon10 at mainbus0: "qos"
syscon11 at mainbus0: "qos"
syscon12 at mainbus0: "qos"
syscon13 at mainbus0: "qos"
syscon14 at mainbus0: "qos"
syscon15 at mainbus0: "qos"
syscon16 at mainbus0: "qos"
syscon17 at mainbus0: "qos"
syscon18 at mainbus0: "qos"
syscon19 at mainbus0: "qos"
syscon20 at mainbus0: "qos"
syscon21 at mainbus0: "qos"
syscon22 at mainbus0: "qos"
syscon23 at mainbus0: "qos"
syscon24 at mainbus0: "qos"
syscon25 at mainbus0: "power-management"
"power-controller" at syscon25 not configured
syscon26 at mainbus0: "syscon"
"io-domains" at syscon26 not configured
syscon27 at mainbus0: "syscon"
syscon28 at mainbus0: "syscon"
rkclock0 at mainbus0
rkclock1 at mainbus0
syscon29 at mainbus0: "syscon"
"io-domains" at syscon29 not configured
"usb2-phy" at syscon29 not configured
"usb2-phy" at syscon29 not configured
"phy" at syscon29 not configured
"pcie-phy" at syscon29 not configured
rkpinctrl0 at mainbus0: "pinctrl"
rkgpio0 at rkpinctrl0
rkgpio1 at rkpinctrl0
rkgpio2 at rkpinctrl0
rkgpio3 at rkpinctrl0
rkgpio4 at rkpinctrl0
"fit-images" at mainbus0 not configured
"pmu_a53" at mainbus0 not configured
"pmu_a72" at mainbus0 not configured
agtimer0 at mainbus0: tick rate 24000 KHz
"xin24m" at mainbus0 not configured
simplebus0 at mainbus0: "amba"
"dma-controller" at simplebus0 not configured
"dma-controller" at simplebus0 not configured
rkpcie0 at mainbus0
pci0 at rkpcie0
ppb0 at pci0 dev 0 function 0 "Rockchip RK3399 Root Complex" rev 0x00: msi
pci1 at ppb0 bus 1
ahci0 at pci1 dev 0 function 0 "ASMedia ASM1061 AHCI" rev 0x01: msi, AHCI 1.2
ahci0: port 1: 6.0Gb/s
scsibus0 at ahci0: 32 targets
sd0 at scsibus0 targ 1 lun 0: <ATA, Samsung SSD 840, EXT0> naa.50025388a003c28f
sd0: 476940MB, 512 bytes/sector, 976773168 sectors, thin
dwge0 at mainbus0: address 12:e7:22:42:f7:96
rgephy0 at dwge0 phy 0: RTL8169S/8110S/8211 PHY, rev. 6
dwmmc0 at mainbus0: 50 MHz base clock
sdmmc0 at dwmmc0: 4-bit, sd high-speed, mmc high-speed, dma
sdhc0 at mainbus0
sdhc0: SDHC 3.0, 200 MHz base clock
sdmmc1 at sdhc0: 8-bit, sd high-speed, mmc high-speed, dma
ehci0 at mainbus0
usb0 at ehci0: USB revision 2.0
uhub0 at usb0 configuration 1 interface 0 "Generic EHCI root hub" rev 2.00/1.00 addr 1
ohci0 at mainbus0: version 1.0
ehci1 at mainbus0
usb1 at ehci1: USB revision 2.0
uhub1 at usb1 configuration 1 interface 0 "Generic EHCI root hub" rev 2.00/1.00 addr 1
ohci1 at mainbus0: version 1.0
rkdwusb0 at mainbus0: "usb"
xhci0 at rkdwusb0, xHCI 1.10
usb2 at xhci0: USB revision 3.0
uhub2 at usb2 configuration 1 interface 0 "Generic xHCI root hub" rev 3.00/1.00 addr 1
rkdwusb1 at mainbus0: "usb"
xhci1 at rkdwusb1, xHCI 1.10
usb3 at xhci1: USB revision 3.0
uhub3 at usb3 configuration 1 interface 0 "Generic xHCI root hub" rev 3.00/1.00 addr 1
"saradc" at mainbus0 not configured
rkiic0 at mainbus0
iic0 at rkiic0
rkiic1 at mainbus0
iic1 at rkiic1
com0 at mainbus0: ns16550, no working fifo
com1 at mainbus0: ns16550, no working fifo
com1: console
"thermal-zones" at mainbus0 not configured
"tsadc" at mainbus0 not configured
rkiic2 at mainbus0
iic2 at rkiic2
rkpmic0 at iic2 addr 0x1b: RK808
"silergy,syr827" at iic2 addr 0x40 not configured
"silergy,syr828" at iic2 addr 0x41 not configured
rkiic3 at mainbus0
iic3 at rkiic3
fusbtc0 at iic3 addr 0x22
"pwm" at mainbus0 not configured
"pwm" at mainbus0 not configured
"dmc" at mainbus0 not configured
"efuse" at mainbus0 not configured
"phy" at mainbus0 not configured
"phy" at mainbus0 not configured
"watchdog" at mainbus0 not configured
"rktimer" at mainbus0 not configured
"i2s" at mainbus0 not configured
"i2s" at mainbus0 not configured
"i2s" at mainbus0 not configured
"vop" at mainbus0 not configured
"iommu" at mainbus0 not configured
"vop" at mainbus0 not configured
"iommu" at mainbus0 not configured
"hdmi-sound" at mainbus0 not configured
"hdmi" at mainbus0 not configured
"gpu" at mainbus0 not configured
"opp-table0" at mainbus0 not configured
"opp-table1" at mainbus0 not configured
"opp-table2" at mainbus0 not configured
"external-gmac-clock" at mainbus0 not configured
"gpio-keys" at mainbus0 not configured
"leds" at mainbus0 not configured
"sdio-pwrseq" at mainbus0 not configured
"vcc12v-dcin" at mainbus0 not configured
"vcc1v8-s3" at mainbus0 not configured
"vcc3v3-pcie-regulator" at mainbus0 not configured
"vcc3v3-sys" at mainbus0 not configured
"vcc5v0-host-regulator" at mainbus0 not configured
"vcc5v0-typec-regulator" at mainbus0 not configured
"vcc5v0-sys" at mainbus0 not configured
"vcc5v0-usb" at mainbus0 not configured
"vdd-log" at mainbus0 not configured
usb4 at ohci0: USB revision 1.0
uhub4 at usb4 configuration 1 interface 0 "Generic OHCI root hub" rev 1.00/1.00 addr 1
usb5 at ohci1: USB revision 1.0
uhub5 at usb5 configuration 1 interface 0 "Generic OHCI root hub" rev 1.00/1.00 addr 1
scsibus1 at sdmmc0: 2 targets, initiator 0
sd1 at scsibus1 targ 1 lun 0: <SD/MMC, SD16G, 0020> removable
sd1: 29862MB, 512 bytes/sector, 61157376 sectors
sdmmc1: can't enable card
softraid0 at root
scsibus2 at softraid0: 256 targets
bootfile: sd0a:bsd.rd
boot device: sd0
root on rd0a swap on rd0b dump on rd0b
panic: cannot open disk, 0x1100/0x2f02, error 2
syncing disks... done
dump to dev 17,1 not possible
rebooting...
