Source: linux
Followup-For: Bug #1141183
X-Debbugs-Cc: [email protected]

I've worked with Chris in the Debian Matrix room over the last two days
to diagnose this issue. The initial report was I/O errors after a kernel
upgrade from v6.12.90 to v7.0.10, with:

[22396.952764] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[22396.952791] ata2.00: failed command: WRITE DMA EXT
[22396.952799] ata2.00: cmd 35/00:b8:02:30:0b/00:1d:00:00:00/e0 tag 26 dma 3895296 out
                        res 50/00:00:01:20:0b/00:00:00:00:00/e0 Emask 0x40 (internal error)
[22396.952822] ata2.00: status: { DRDY }
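
As a cross-check on the log above, the "cmd" taskfile dump can be decoded
by hand. The following is only a minimal Python sketch: the function name
is made up for illustration, the field order is my understanding of how
libata prints the taskfile (command/feature:nsect:lbal:lbam:lbah, then the
five "hob" high-order bytes, then device), and 512-byte logical sectors
are assumed.

```python
import re

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def decode_cmd(line):
    """Decode the 'cmd' taskfile dump from a libata error line (sketch)."""
    m = re.search(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})", line)
    if m is None:
        raise ValueError("no taskfile dump found")
    fields = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device) = fields

    # 48-bit commands carry a 16-bit sector count; 0 means 65536 sectors.
    sectors = ((hob_nsect << 8) | nsect) or 65536
    # The 48-bit LBA is split across the low and "hob" (high-order) bytes.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "command": command,
        "sectors": sectors,
        "bytes": sectors * SECTOR_SIZE,
        "lba": lba,
    }


if __name__ == "__main__":
    log = ("ata2.00: cmd 35/00:b8:02:30:0b/00:1d:00:00:00/e0 "
           "tag 26 dma 3895296 out")
    print(decode_cmd(log))
```

For the failing command this gives opcode 0x35 (WRITE DMA EXT, matching
the "failed command" line) and 7608 sectors, i.e. 7608 * 512 = 3895296
bytes, which agrees with the "dma 3895296 out" in the log.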

We initially focused on the C220 series chipset and the SATA links, and
determined that the specific chipset in this HP ProLiant ML10 v2 (J10)
server is the C222. This means that of the 6 SATA ports, 2 are 6Gbps and
4 are 3Gbps.

After checking sysfs attributes for the links, and Chris trying alternate
ports with no improvement, we next considered the several BIOS firmware
bugs reported by the kernel, but discounted those since they exist in the
good kernel versions too. The BIOS was upgraded to the latest version
with no difference.
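
For anyone wanting to repeat the link check, here is a rough Python sketch
of reading the per-link speed attributes. It assumes the attributes are
exposed under /sys/class/ata_link/ as sata_spd, sata_spd_limit and
hw_sata_spd_limit; the function name is hypothetical and this is not
necessarily how we inspected them during the session.

```python
from pathlib import Path

SPEED_ATTRS = ("sata_spd", "sata_spd_limit", "hw_sata_spd_limit")


def read_link_speeds(root="/sys/class/ata_link"):
    """Return {link_name: {attribute: value}} for each ATA link found."""
    speeds = {}
    base = Path(root)
    if not base.is_dir():
        return speeds  # no libata links, or sysfs not mounted here
    for link in sorted(base.iterdir()):
        attrs = {}
        for name in SPEED_ATTRS:
            attr_file = link / name
            if attr_file.is_file():
                attrs[name] = attr_file.read_text().strip()
        speeds[link.name] = attrs
    return speeds


if __name__ == "__main__":
    for link, attrs in read_link_speeds().items():
        print(link, attrs)
```

Comparing sata_spd against the limit values per link shows whether a disk
has negotiated down from what the port supports.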

I found b.k.o #220693 "SATA bus goes offline after a while" [0], which
covers several intermingled issues that exhibit very similar symptoms.
It discusses several possible workarounds:

1. ATA LPM ( libata.force=nolpm )
2. maximum transfer size ( libata.force=maxsec1024 )
3. IOMMU ( iommu=off )

The first two didn't show improvements on the failing kernel versions,
but disabling the IOMMU did. Comment 44 [1] of the bug report considers
that this may indicate an issue with the readahead code.
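
When comparing test boots it is easy to lose track of which workaround
was actually in effect. A small Python sketch that pulls the relevant
settings out of a kernel command line follows; the function name is made
up, and the parameter values are simply the ones quoted in this report.

```python
def active_workarounds(cmdline):
    """Extract libata.force values and the iommu= setting from a cmdline."""
    force = []
    iommu = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # bare flag such as "quiet"
        if key == "libata.force":
            # libata.force takes a comma-separated list of settings
            force.extend(v for v in value.split(",") if v)
        elif key == "iommu":
            iommu = value
    return {"libata.force": force, "iommu": iommu}


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print(active_workarounds(f.read()))
```

Recording this output alongside each test result makes it clear which
boots ran with the IOMMU disabled.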

Using this as a clue and looking through v6.12..v7.0 commits I found a
group of related iomap commits introduced at the start of the v6.19
cycle.

That spurred tests of v6.18, which doesn't exhibit the issue, and v6.19,
which does.
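
To list the candidate commits between the good and bad tags, something
like the following Python sketch could be used. It assumes a kernel git
checkout containing both tags and restricts the log to fs/iomap, which
may miss related changes elsewhere in the tree; the helper names are
hypothetical.

```python
import subprocess


def iomap_log_cmd(good="v6.18", bad="v6.19", path="fs/iomap"):
    """Build a git command listing non-merge commits in good..bad on path."""
    return ["git", "log", "--oneline", "--no-merges",
            f"{good}..{bad}", "--", path]


def list_commits(repo, **kwargs):
    """Run the command in the given kernel checkout; one line per commit."""
    result = subprocess.run(iomap_log_cmd(**kwargs), cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout.splitlines()
```

Calling list_commits() with the path to a kernel tree returns the
one-line summaries, which gives a rough idea of how large a bisect over
just this series would be.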

I've built and shared v6.18.37 with Chris since this does not contain
the suspect iomap commits and he'll report back on that later.

Bisecting the iomap commits in v6.19 might be a challenge because
several later changes build on the series, and the iomap commits might
be a red herring.

So this needs more eyeballs to consider alternative triggers of the bug
and, once we have the v6.18.37 result, forwarding upstream.

Tj.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=220693

[1] https://bugzilla.kernel.org/show_bug.cgi?id=220693#c44
