Hi Kai-Heng, thanks for your response.

I followed your advice and installed these packages from the ppa:
linux-headers-5.3.0-050300rc5_5.3.0-050300rc5.201908182231_all.deb
linux-headers-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb
linux-image-unsigned-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb
linux-modules-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb
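Installing them was roughly just one dpkg run over all four .debs:

  # installing all four packages in one dpkg run lets dpkg sort out their inter-dependencies
  sudo dpkg -i linux-*5.3.0-050300rc5*.deb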

I then rebooted the system and checked versions with "uname -a" and "modinfo mpt3sas":
(kernel 5.3.0) (mpt3sas driver 29.100.00.00)
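For reference, the checks were along these lines:

  uname -a                               # reports kernel 5.3.0-050300rc5-generic
  modinfo mpt3sas | grep -i '^version'   # reports 29.100.00.00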

The stress test has now been running for 4 hours with no errors, which is 8x longer than the longest it ever ran on 18.04 before breaking.
I will leave the stress test running overnight in case the bug still exists but simply occurs less frequently.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1841132

Title:
  mpt3sas - storage controller resets under heavy disk io

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  [summary]
  when a server running ubuntu 18.04 with an lsi sas controller experiences high disk io, there is a chance the storage controller will reset
  this can take weeks or months to first occur, but once the controller resets it will keep resetting every few seconds or minutes, dramatically degrading disk io
  the server must be rebooted to restore the controller to a normal state

  [hardware configuration]
  server: dell poweredge r7415, purchased 2019-02
  cpu/chipset: amd epyc naples
  storage controller: "dell hba330 mini" with chipset "lsi sas3008"
  drives: 4x samsung 860 pro 2TB ssd

  [software configuration]
  ubuntu 18.04 server
  mdadm raid6
  all firmware is fully updated (bios 1.9.3) (hba330 16.17.00.03) (ssd rvm01b6q)
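  (for completeness, an illustrative equivalent of that layout; device names are placeholders:)
    # a plain 4-disk raid6 with ext4 directly on top
    mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mkfs.ext4 /dev/md0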

  [what happened]
  server was operating as a vm host for months without issue
  one day the syslog was flooded with messages like "mpt3sas_cm0: sending diag reset !!" and "Power-on or device reset occurred", along with unusably slow disk io
  the server was removed from production and I looked for a way to reproduce the issue
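  a quick way to spot the broken state is to watch the kernel log for those messages, e.g.:
    # follow kernel messages and flag controller resets as they occur
    journalctl -kf | grep -E "sending diag reset|Power-on or device reset occurred"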

  [how to reproduce the issue]
  there are probably many ways to produce this issue; the hackish way I found to reliably reproduce it was as follows (a rough command sketch follows this list):
  have the four ssds in an mdadm raid6 with an ext4 filesystem
  create three 500GB files containing random data
  open three terminals: one calculates the md5sum of file1 in a loop, another does the same for file2, and the third copies file3 to file3-temp in a loop
  the number of files is arbitrary; the goal is just to generate a lot of disk io on files too large to be cached in memory
  then initiate an array check with "/usr/share/mdadm/checkarray -a" to cause even more drive thrashing
  within 1-15min the controller will enter the broken state. the longest I ever saw it take was 30min. I reproduced this several times
  rebooting the server restores the controller to a normal state
  if the server is not rebooted and the controller is left in this broken state, drives will eventually fall out of the array, and sometimes array/filesystem corruption will occur
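  a rough sketch of that workload (the /mnt/array mount point and file names are illustrative):
    cd /mnt/array
    # three ~500GB files of random data, too large to be cached in ram
    for i in 1 2 3; do dd if=/dev/urandom of=file$i bs=1M count=512000; done
    # terminal 1 and terminal 2: checksum loops
    while true; do md5sum file1; done
    while true; do md5sum file2; done
    # terminal 3: copy loop
    while true; do cp file3 file3-temp; done
    # then kick off an array check to add even more thrashing
    /usr/share/mdadm/checkarray -a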

  [why this is being reported here]
  It's unlikely I am exceeding the limits of the hardware, since this server chassis can hold 24 drives and I am only using 4. The controller specs indicate I should not hit pcie bandwidth limits until at least 16 drives.
  My first thought was that the lsi controller firmware was at fault, since it has historically been buggy; however, I reproduced this with the newest firmware "16.17.00.03" and the previous version "15.17.09.06" (versions may be dell-specific).
  I then tried the most recent motherboard bios "1.9.3" and downgraded to "1.9.2"; no change.
  I then wanted to eliminate the possibility of a bad drive, so I swapped out all 4 drives for different ones of the same model; no change.
  I then upgraded from the standard 18.04 kernel to the newer backported hwe kernel, which also came with a newer mpt3sas driver; no change.
  I then ran the same test on the same array but with rhel 8; to my surprise, I could no longer reproduce the issue.
  -
  tl;dr version:
  ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) storage controller 
breaks in 1-10min
  ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) storage 
controller breaks in 1-15min, max 30min
  rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) same stress test on same 
array for 19h, no errors

  [caveats]
  Server os misconfiguration is possible; however, this is a rather basic vm host running kvm with no 3rd-party packages.
  I can't conclusively prove this isn't a hardware fault, since I don't have a second unused identical server to test on right now; however, the fact that the problem can be easily reproduced under ubuntu but not under rhel seems noteworthy.
  There is another bug (LP: #1810781) similar to this one; I didn't post there because it's already marked as fixed.
  There is also a debian bug (Debian #926202) reporting this on kernel 4.19.0, but I'm unable to tell if it's the same issue.

