ata1.00: failed command: WRITE FPDMA QUEUED on new AMD AM4 MSI B350 Motherboard

2017-07-07 Thread Mark Hounschell
With both 4.11 and 4.12 kernels I get the following errors when doing heavy 
disk I/O, such as a kernel build with "make -j 15" or even just copying the 
kernel source tree from one place to another. The hardware is an MSI B350 
Tomahawk Arctic motherboard with 16GB of memory and a Ryzen 1700 processor. 
The disk being used is a 160GB Seagate ST3160815AS whose media is error free 
according to "badblocks -w".

Jul  6 13:34:43 cpu0 kernel: ata1.00: exception Emask 0x11 SAct 0x7ffb SErr 0x40 action 0x6 frozen
Jul  6 13:34:43 cpu0 kernel: ata1.00: irq_stat 0x4808, interface fatal error
Jul  6 13:34:43 cpu0 kernel: ata1: SError: { Handshk }
Jul  6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul  6 13:34:43 cpu0 kernel: ata1.00: cmd 61/08:00:57:89:90/00:00:03:00:00/40 tag 0 ncq dma 4096 out
         res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Jul  6 13:34:43 cpu0 kernel: ata1.00: status: { DRDY }
Jul  6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul  6 13:34:43 cpu0 kernel: ata1.00: cmd 61/08:08:87:89:90/00:00:03:00:00/40 tag 1 ncq dma 4096 out
         res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Jul  6 13:34:43 cpu0 kernel: ata1.00: status: { DRDY }
Jul  6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul  6 13:34:43 cpu0 kernel: ata1.00: cmd 61/20:10:97:89:90/00:00:03:00:00/40 tag 2 ncq dma 16384 out
         res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
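
For reference, the "cmd" lines above can be decoded by hand to see which 
sectors each failed write was aimed at. The following is only a sketch in 
Python, assuming the taskfile field order that libata's error handler prints 
(command/feature:nsect:lbal:lbam:lbah/hob_*.../device) and the usual NCQ 
layout where the sector count sits in the feature bytes and the tag in bits 
7:3 of nsect. The function names are made up for illustration, and the 
512-byte sector size is taken from the smartctl output for this drive.

```python
import re

# Field order as printed on libata "cmd"/"res" lines (assumed from libata-eh).
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device")

SECTOR_SIZE = 512  # bytes, logical sector size reported by smartctl


def parse_taskfile(text):
    """Split 'xx/xx:xx:.../xx' into a dict of named register values."""
    values = [int(part, 16) for part in re.split(r"[/:]", text.strip())]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d bytes, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def decode_ncq_write(text):
    """Decode a WRITE FPDMA QUEUED (0x61) taskfile (hypothetical helper)."""
    tf = parse_taskfile(text)
    if tf["command"] != 0x61:
        raise ValueError("not a WRITE FPDMA QUEUED taskfile")
    # 48-bit LBA: the three hob_* bytes are the high half.
    lba = (tf["hob_lbah"] << 40 | tf["hob_lbam"] << 32 | tf["hob_lbal"] << 24
           | tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"])
    # For NCQ commands the sector count lives in the feature registers;
    # a count of 0 means 65536 sectors.
    sectors = (tf["hob_feature"] << 8 | tf["feature"]) or 65536
    tag = tf["nsect"] >> 3  # NCQ tag is in bits 7:3 of the count register
    return {"tag": tag, "lba": lba, "sectors": sectors,
            "bytes": sectors * SECTOR_SIZE}


if __name__ == "__main__":
    for line in ("61/08:00:57:89:90/00:00:03:00:00/40",
                 "61/08:08:87:89:90/00:00:03:00:00/40",
                 "61/20:10:97:89:90/00:00:03:00:00/40"):
        print(line, decode_ncq_write(line))
```

Decoded this way, the tag and byte count of each command agree with the 
"tag N ncq dma NNNN out" text the kernel prints next to it, which is a quick 
sanity check that the field order assumed here is right.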

When I set the kernel cmdline option libata.force=noncq, the messages change 
into:

[ 1724.372101] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x40 action 0x6 frozen
[ 1724.375888] ata1.00: irq_stat 0x4801, interface fatal error
[ 1724.379721] ata1: SError: { Handshk }
[ 1724.383691] ata1.00: failed command: WRITE DMA EXT
[ 1724.383695] ata1.00: cmd 35/00:50:67:0d:e4/00:09:02:00:00/e0 tag 10 dma 1220608 out
         res 51/84:50:67:0d:e4/00:09:02:00:00/e0 Emask 0x10 (ATA bus error)
[ 1724.383699] ata1.00: status: { DRDY ERR }
[ 1724.383700] ata1.00: error: { ICRC ABRT }
[ 1724.383706] ata1: hard resetting link
[ 1724.850060] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 1724.959883] ata1.00: configured for UDMA/133
[ 1724.959910] ata1: EH complete
[ 1921.704356] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x40 action 0x6 frozen
[ 1921.708292] ata1.00: irq_stat 0x4801, interface fatal error
[ 1921.712210] ata1: SError: { Handshk }
[ 1921.716294] ata1.00: failed command: WRITE DMA EXT
[ 1921.716297] ata1.00: cmd 35/00:90:ef:93:86/00:03:02:00:00/e0 tag 18 dma 466944 out
         res 51/84:90:ef:93:86/00:03:02:00:00/e0 Emask 0x10 (ATA bus error)
[ 1921.716298] ata1.00: status: { DRDY ERR }
[ 1921.716298] ata1.00: error: { ICRC ABRT }
[ 1921.716303] ata1: hard resetting link
[ 1922.175312] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 1922.284165] ata1.00: configured for UDMA/133
[ 1922.288602] ata1: EH complete
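
The "res" lines in the non-NCQ log carry the drive's status and error 
registers as their first two bytes (51/84 here), which is where the 
"{ DRDY ERR }" and "{ ICRC ABRT }" text comes from. As a rough sketch in 
Python, assuming the standard ATA bit assignments and only the bit names that 
appear in these logs plus a few common ones, the bytes can be decoded as 
shown. The helper names are hypothetical, and the length calculation assumes 
the 512-byte sectors reported by smartctl.

```python
# Standard ATA status register bits (only bits with a conventional name).
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ",
                   0x01: "ERR"}

# Standard ATA error register bits relevant to DMA transfers.
ATA_ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

SECTOR_SIZE = 512  # bytes, logical sector size reported by smartctl


def split_bytes(text):
    """Turn 'xx/xx:xx:.../xx' into a flat list of integers."""
    return [int(part, 16) for part in text.replace("/", ":").split(":")]


def names(value, table):
    """Return the names of all bits from 'table' that are set in 'value'."""
    return [name for bit, name in table.items() if value & bit]


def decode_res(text):
    """Decode status (byte 0) and error (byte 1) of a libata 'res' line."""
    regs = split_bytes(text)
    return {"status": names(regs[0], ATA_STATUS_BITS),
            "error": names(regs[1], ATA_ERROR_BITS)}


def write_dma_ext_length(text):
    """Transfer length in bytes of a WRITE DMA EXT (0x35) 'cmd' taskfile."""
    regs = split_bytes(text)
    if regs[0] != 0x35:
        raise ValueError("not a WRITE DMA EXT taskfile")
    # 16-bit sector count: low byte is nsect (index 2), high byte is
    # hob_nsect (index 7); a count of 0 means 65536 sectors.
    sectors = (regs[7] << 8 | regs[2]) or 65536
    return sectors * SECTOR_SIZE


if __name__ == "__main__":
    print(decode_res("51/84:50:67:0d:e4/00:09:02:00:00/e0"))
    print(write_dma_ext_length("35/00:50:67:0d:e4/00:09:02:00:00/e0"))
    print(write_dma_ext_length("35/00:90:ef:93:86/00:03:02:00:00/e0"))
```

ICRC is the interface CRC error bit, so both failures are being reported by 
the drive as corruption on the SATA link rather than as a media problem.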


smartctl shows no issues with the drive. In fact, I can take this very drive 
and install it in an AM3 machine and everything works just fine. I have 
also installed a PCIe SATA card and connected the drive to that, and that
works just fine as well.

So I have either a Linux kernel problem or a hardware problem on 
this brand new AM4 motherboard. I don't really know what it 
is, other than that it is somehow related to the AMD B350 chipset.

It is a fairly new chipset, so I am suspicious of the kernel.

# smartctl -a  /dev/sda
smartctl 6.2 2013-11-07 r3856 [x86_64-linux-4.11.6-lcrs] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3160815AS
Serial Number:    6RACD737
Firmware Version: 4.AAB
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Fri Jul  7 13:50:50 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection
