Disco currently has the mpt3sas fix in disco-proposed (version 5.0.0-30.32);
it is also available as the linux-hwe kernel in bionic-proposed.
** Changed in: linux (Ubuntu Disco)
Status: In Progress => Fix Committed
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1841132
Title:
mpt3sas - storage controller resets under heavy disk io
Status in linux package in Ubuntu:
Fix Released
Status in linux source package in Bionic:
Fix Released
Status in linux source package in Disco:
Fix Committed
Status in linux source package in Eoan:
Fix Released
Bug description:
[summary]
When a server running Ubuntu 18.04 with an LSI SAS controller experiences
high disk IO, there is a chance the storage controller will reset.
This can take weeks or months to happen, but once the controller resets it
will keep resetting every few seconds or minutes, dramatically degrading disk IO.
The server must be rebooted to restore the controller to a normal state.
[hardware configuration]
server: dell poweredge r7415, purchased 2019-02
cpu/chipset: amd epyc naples
storage controller: "dell hba330 mini" with chipset "lsi sas3008"
drives: 4x samsung 860 pro 2TB ssd
[software configuration]
ubuntu 18.04 server
mdadm raid6
all firmware is fully updated (bios 1.9.3) (hba330 16.17.00.03) (ssd rvm01b6q)
[what happened]
The server had been operating as a VM host for months without issue.
One day the syslog was flooded with messages like "mpt3sas_cm0: sending diag
reset !!" and "Power-on or device reset occurred", along with unusably slow
disk IO.
The server was removed from production and I looked for a way to reproduce
the issue.
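A quick way to check whether a machine has entered this state is to grep the
kernel log for the reset messages quoted above. This is only a sketch under
assumptions not in the report: the exact log source varies by setup, and
dmesg may require root on some systems.

```shell
#!/bin/sh
# Search the kernel ring buffer for the mpt3sas reset symptoms described
# in the report. The message patterns are taken verbatim from the report.
set -eu

PATTERN='sending diag reset|Power-on or device reset occurred'
if dmesg 2>/dev/null | grep -qE "$PATTERN"; then
    echo "mpt3sas reset messages found - controller may be in the broken state"
else
    echo "no mpt3sas reset messages found"
fi
```

On systemd machines, `journalctl -k` can be substituted for `dmesg` to search
older boots as well.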
[how to reproduce the issue]
There are probably many ways to produce this issue; the hackish way I found
to reliably reproduce it was:
1. Have the four SSDs in an mdadm RAID6 with an ext4 filesystem.
2. Create three 500GB files containing random data.
3. Open three terminals: one calculates the md5sum of file1 in a loop, another
does the same for file2, and the third copies file3 to file3-temp in a loop.
(The number of files is arbitrary; the goal is just to generate a lot of disk
IO on files too large to be cached in memory.)
4. Initiate an array check with "/usr/share/mdadm/checkarray -a" to cause
even more drive thrashing.
Within 1-15 minutes the controller will enter the broken state; the longest
I ever saw it take was 30 minutes. I reproduced this several times.
Rebooting the server restores the controller to a normal state.
If the server is not rebooted and the controller is left in this broken state,
eventually drives will fall out of the array, and sometimes array/filesystem
corruption will occur.
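The reproduction steps above can be sketched as a POSIX shell script. The
directory path, default sizes, and md device name here are assumptions, not
part of the report; the tiny defaults make a harmless dry run possible, and
on real hardware you would point DIR at the RAID6 ext4 mount and scale
SIZE_MB to 500000 (500 GB files, too large to cache in RAM) and DURATION to
900 or more seconds, as in the report.

```shell
#!/bin/sh
# Sketch of the stress-test recipe from the report. Defaults are tiny for a
# dry run; scale SIZE_MB/DURATION up on real hardware to trigger the bug.
set -eu

DIR=${DIR:-/tmp/mpt3sas-repro}   # on real hardware: a dir on the md RAID6 ext4 fs
SIZE_MB=${SIZE_MB:-1}            # report used 500 GB per file (SIZE_MB=500000)
DURATION=${DURATION:-2}          # report saw resets within 1-15 min (900 s)
mkdir -p "$DIR"

# 1. create three files of random data
for i in 1 2 3; do
    dd if=/dev/urandom of="$DIR/file$i" bs=1M count="$SIZE_MB" 2>/dev/null
done

# 2. parallel load: two md5sum loops plus a copy loop, as in the report
loop() {
    end=$(( $(date +%s) + DURATION ))
    while [ "$(date +%s)" -lt "$end" ]; do "$@"; done
}
loop md5sum "$DIR/file1" >/dev/null &
loop md5sum "$DIR/file2" >/dev/null &
loop cp "$DIR/file3" "$DIR/file3-temp" &

# 3. on a real md array, add even more thrashing with a consistency check
#    (/dev/md0 is an assumed device name; skipped when absent)
[ -b /dev/md0 ] && /usr/share/mdadm/checkarray -a || true

wait
```

Run it as root against the real array only on a machine you can afford to
reboot, since the report notes the broken state can corrupt the array if
left unattended.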
[why this is being reported here]
It's unlikely I am exceeding the limits of the hardware, since this server
chassis can hold 24 drives and I am only using 4. The controller specs
indicate I should not hit PCIe bandwidth limits until at least 16 drives.
My first thought was that the LSI controller firmware was at fault, since it
has historically been buggy; however, I reproduced this with both the newest
firmware "16.17.00.03" and the previous version "15.17.09.06" (versions may
be Dell-specific).
I then tried the most recent motherboard BIOS "1.9.3", and downgraded to
"1.9.2"; no change.
I then wanted to eliminate the possibility of a bad drive, so I swapped out
all 4 drives for different ones of the same model; no change.
I then upgraded from the standard 18.04 kernel to the newer backported HWE
kernel, which also came with a newer mpt3sas driver; no change.
I then ran the same stress test on the same array under RHEL 8 and, to my
surprise, could no longer reproduce the issue.
-
tl;dr version:
ubuntu 18.04 (kernel 4.15.0, mpt3sas driver 17.100.00.00): storage controller
breaks in 1-10 min
ubuntu 18.04 hwe (kernel 5.0.0, mpt3sas driver 27.101.00.00): storage
controller breaks in 1-15 min, max 30 min
rhel 8 (kernel 4.18.0, mpt3sas driver 27.101.00.00): same stress test on the
same array for 19 h, no errors
[caveats]
Server OS misconfiguration is possible; however, this is a rather basic VM
host running KVM with no 3rd-party packages.
I can't conclusively prove this isn't a hardware fault, since I don't have a
second unused identical server to test on right now; however, the fact that
the problem can be easily reproduced under Ubuntu but not under RHEL seems
noteworthy.
There is another similar bug (LP: #1810781), but I didn't post there because
it's already marked as fixed.
There is also a Debian bug (Debian #926202) that encountered this on kernel
4.19.0, but I'm unable to tell whether it's the same issue.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1841132/+subscriptions
--
Mailing list: https://launchpad.net/~kernel-packages
Post to : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp