Hi Thimo,

The kernel team have built all of the kernels for this SRU cycle, and
have placed them into -proposed for verification.

We now need to do some thorough testing to make sure that RAID10 arrays
function with good performance, ensure data integrity, and make sure we
won't be introducing any regressions when these kernels are released in
two weeks' time.

I would really appreciate it if you could help test and verify these
kernels function as intended.
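
Once you have one of the -proposed kernels below installed, a quick
performance sanity check could look something like this (just a sketch;
the mount point /mnt/raid10 and the fio job parameters are illustrative,
so adjust them for your array):

sudo apt install fio
fio --name=raid10-randwrite --directory=/mnt/raid10 --rw=randwrite \
    --bs=4k --size=1G --numjobs=4 --iodepth=32 --ioengine=libaio \
    --direct=1 --group_reporting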

Instructions to Install:

1) cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed main universe
EOF
2) sudo apt update   (optional: see the apt pinning sketch just after step 5 if you want to limit what -proposed can install)

For 21.04 / Hirsute:

3) sudo apt install linux-image-5.11.0-20-generic linux-modules-5.11.0-20-generic \
linux-modules-extra-5.11.0-20-generic linux-headers-5.11.0-20-generic

For 20.10 / Groovy:

3) sudo apt install linux-image-5.8.0-56-generic linux-modules-5.8.0-56-generic \
linux-modules-extra-5.8.0-56-generic linux-headers-5.8.0-56-generic

For 20.04 / Focal:

3) sudo apt install linux-image-5.4.0-75-generic linux-modules-5.4.0-75-generic \
linux-modules-extra-5.4.0-75-generic linux-headers-5.4.0-75-generic

For 18.04 / Bionic:

For the 5.4 Bionic HWE Kernel:

3) sudo apt install linux-image-5.4.0-75-generic linux-modules-5.4.0-75-generic \
linux-modules-extra-5.4.0-75-generic linux-headers-5.4.0-75-generic

For the 4.15 Bionic GA Kernel:

3) sudo apt install linux-image-4.15.0-145-generic linux-modules-4.15.0-145-generic \
linux-modules-extra-4.15.0-145-generic linux-headers-4.15.0-145-generic


4) sudo reboot
5) uname -rv
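
A note on steps 1 and 2: enabling -proposed makes updates for many other
packages visible as well. If you want apt to take only the explicitly
requested kernel packages from -proposed, you can pin the pocket to a low
priority first. A rough sketch (the file name is arbitrary):

cat << EOF | sudo tee /etc/apt/preferences.d/proposed-updates
# Keep -proposed packages from being installed unless explicitly requested
Package: *
Pin: release a=$(lsb_release -cs)-proposed
Pin-Priority: 400
EOF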

You may need to modify your grub configuration to boot the correct
kernel. If you need help, read these instructions:
https://paste.ubuntu.com/p/XrTzWPPnWJ/
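
In case that paste is no longer available, the usual approach is roughly
the following (a sketch; the exact menu entry title depends on your
release and the kernel version you installed, so check /boot/grub/grub.cfg
first):

# list the GRUB menu entry titles
grep -E "^[[:space:]]*(menuentry|submenu)" /boot/grub/grub.cfg | cut -d"'" -f2

# in /etc/default/grub, set GRUB_DEFAULT to the entry you want, e.g.
#   GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 5.11.0-20-generic"
# then regenerate the configuration and reboot
sudo update-grub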

I am running the -proposed kernel on my cloud instance with my /home directory
on a RAID10 array made up of 4x NVMe devices, and things are looking okay.
I will be performing my detailed regression testing against these kernels
tomorrow, and I will write back with the results then.
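
If you would like a quick way to exercise the discard path without touching
real data, something along these lines should work (just a sketch using loop
devices; the file names, sizes, /dev/md99 and the /mnt mount point are
placeholders, and the loop driver needs a backing filesystem that supports
hole punching for the discards to do anything):

# build a small throwaway RAID10 array from four sparse files
LOOPS=""
for i in 0 1 2 3; do
    truncate -s 512M /var/tmp/r10-$i.img
    LOOPS="$LOOPS $(sudo losetup -f --show /var/tmp/r10-$i.img)"
done
sudo mdadm -C /dev/md99 -l10 -n4 -R $LOOPS
sudo mkfs.ext4 /dev/md99
sudo mount /dev/md99 /mnt

# wait for the initial resync to complete (watch /proc/mdstat), then
# write, delete and trim, and confirm the mirror halves still agree
sudo dd if=/dev/zero of=/mnt/data.raw bs=4K count=64K
sync
sudo rm /mnt/data.raw
sudo fstrim /mnt
echo check | sudo tee /sys/block/md99/md/sync_action
# wait for the check to finish (watch /proc/mdstat), then:
cat /sys/block/md99/md/mismatch_cnt    # should be 0

# tear down
sudo umount /mnt
sudo mdadm --stop /dev/md99
for dev in $LOOPS; do sudo losetup -d $dev; done
rm /var/tmp/r10-*.img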

Please help test these kernels in -proposed, and let me know how they
go.

Thanks,
Matthew

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Invalid
Status in linux source package in Xenial:
  Invalid
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Focal:
  Fix Released
Status in linux source package in Groovy:
  Fix Released

Bug description:
  Seems to be closely related to
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578

  After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126,
  the fstrim command triggered by fstrim.timer causes a large number of
  mismatches between two RAID10 component devices.

  This bug affects several machines in our company with different HW
  configurations (all using ECC RAM). Both NVMe and SATA SSDs are
  affected.

  How to reproduce:
   - Create a RAID10 LVM and filesystem on two SSDs
      mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
      pvcreate -ff -y /dev/md0
      vgcreate -f -y VolGroup /dev/md0
      lvcreate -n root    -L 100G -ay -y VolGroup
      mkfs.ext4 /dev/VolGroup/root
      mount /dev/VolGroup/root /mnt
   - Write some data, sync and delete it
      dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
      sync
      rm /mnt/data.raw
   - Check the RAID device
      echo check >/sys/block/md0/md/sync_action
   - After finishing (see /proc/mdstat), check the mismatch_cnt (should be 0):
      cat /sys/block/md0/md/mismatch_cnt
   - Trigger the bug
      fstrim /mnt
   - Re-Check the RAID device
      echo check >/sys/block/md0/md/sync_action
   - After finishing (see /proc/mdstat), check the mismatch_cnt (probably in the range of N*10000):
      cat /sys/block/md0/md/mismatch_cnt

  After investigating this issue on several machines it *seems* that the
  first drive does the trim correctly while the second one goes wild. At
  least the number and severity of errors found by a USB stick live
  session fsck.ext4 suggest this.

  To perform the single-drive evaluation, the RAID10 was assembled with one
  drive at a time:
    mdadm --assemble /dev/md127 /dev/nvme0n1p2
    mdadm --run /dev/md127
    fsck.ext4 -n -f /dev/VolGroup/root

    vgchange -a n /dev/VolGroup
    mdadm --stop /dev/md127

    mdadm --assemble /dev/md127 /dev/nvme1n1p2
    mdadm --run /dev/md127
    fsck.ext4 -n -f /dev/VolGroup/root

  When running these fscks without -n, the directory structure on the first
  device seems to be OK, while on the second device only the lost+found
  folder is left.

  Side-note: Another machine using HWE kernel 5.4.0-56 (after using -53
  before) seems to have a quite similar issue.

  Unfortunately, the risk/regression assessment in the aforementioned bug
  is not complete: the workaround only mitigates the issue during FS
  creation. This bug, on the other hand, is triggered by a weekly service
  (fstrim) and causes severe file system corruption.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions
