I am no longer able to reproduce the failure after applying Ted's patch.
I was able to run my unit test 20 times without a failure on upstream +
patch on d05-6. I then switched over to the Ubuntu kernel + patch, and
it has now passed 64 times (and counting).

I then looked to see why Ike is still observing a failure. My theory is
that the filesystem Ike was testing was already corrupted by a previous
*unpatched* run, so the kernel is finding pre-existing corruption.
Evidence follows.

The last record of "sudo mkfs.ext4" running in /var/log/auth.log:

Jul  5 06:43:39 d05-4 sudo:   ubuntu : TTY=ttyAMA0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/sbin/mkfs.ext4 /dev/sda2

However, the kernel with the fix wasn't built until Jul 9:
[    0.000000] Linux version 4.15.0-25-generic (root@recht) (gcc version 7.3.0 (Ubuntu/Linaro 7.3.0-16ubuntu3)) #27+ext4msg61578.1 SMP Mon Jul 9 08:28:49 UTC 2018 (Ubuntu 4.15.0-25.27+ext4msg61578.1-generic 4.15.18)

The first time /dev/sda2 was mounted after booting this kernel, it reported 
known errors:
Jul  9 05:43:33 d05-4 kernel: [  138.522140] EXT4-fs (sda2): warning: mounting fs with errors, running e2fsck is recommended
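
To repeat this comparison on another machine, the two lookups can be sketched as a pair of shell helpers. This is only a rough sketch: the function names are made up for illustration, and the default paths (/var/log/auth.log and /var/log/kern.log) are assumptions based on a stock Ubuntu rsyslog setup.

```shell
# Hypothetical helpers (names are illustrative, not from any package).

# Print the most recent sudo record of mkfs.ext4 found in an auth log.
# $1: log file to search (assumed default: /var/log/auth.log)
last_mkfs() {
    grep -F 'COMMAND=/sbin/mkfs.ext4' "${1:-/var/log/auth.log}" | tail -n 1
}

# Print the first kernel warning about mounting a filesystem that was
# already flagged as having errors.
# $1: log file to search (assumed default: /var/log/kern.log)
# $2: device name as it appears in the message (default: sda2)
first_errors_mount() {
    grep -F "EXT4-fs (${2:-sda2}): warning: mounting fs with errors" \
        "${1:-/var/log/kern.log}" | head -n 1
}
```

If the timestamp printed by last_mkfs is older than the build date of the patched kernel, any errors reported by first_errors_mount may well predate the fix.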

Looking at the conserver log (which records all console activity on this
system), it looks like the test did not reformat the disk between iterations:
root@d05-4:~# while true; do sudo /usr/lib/plainbox-provider-checkbox/bin/disk_stress_ng sda --base-time 240 --really-run; done

Finally, I manually ran mkfs.ext4 on /dev/sda2. Afterwards, both my unit
test (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137/comments/5)
and the full disk_stress_ng cert test Ike was running passed without
error.
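
For repeated runs, one way to avoid carrying corruption over from an earlier iteration is to recreate the filesystem before every run. The following is a minimal sketch of that idea, not the script anyone actually used: the helper names and the DRY_RUN switch are hypothetical, the device names are taken from this report, and it assumes the partition is not mounted when mkfs.ext4 runs.

```shell
# Sketch only. WARNING: mkfs.ext4 destroys all data on the target
# partition, so never point this at a disk you care about.

# Run one command, or just print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One iteration: recreate the filesystem, then run the cert test.
# $1: partition to reformat, e.g. /dev/sda2
# $2: disk name passed to disk_stress_ng, e.g. sda
fresh_iteration() {
    dev=$1
    disk=$2
    run sudo mkfs.ext4 "$dev" &&
    run sudo /usr/lib/plainbox-provider-checkbox/bin/disk_stress_ng \
        "$disk" --base-time 240 --really-run
}
```

With these defined, "while fresh_iteration /dev/sda2 sda; do :; done" loops until the first failing run, and setting DRY_RUN=1 beforehand prints the commands instead of executing them.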

-- 
https://bugs.launchpad.net/bugs/1780137

Title:
  [Regression] EXT4-fs error (device sda1):
  ext4_validate_inode_bitmap:99: comm stress-ng: Corrupt inode bitmap

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Bionic:
  Triaged

Bug description:
  We're seeing a very reproducible regression in the bionic kernel
  triggered by the stress-ng chdir test performed by the Ubuntu
  certification suite. We see this on both the HiSilicon D05 arm64
  server and the HiSilicon D06 arm64 server. We have been unable to
  reproduce on other servers so far.

  [Test Case]
  $ sudo apt-add-repository -y ppa:hardware-certification/public
  $ sudo apt install -y canonical-certification-server
  $ sudo mkfs.ext4 /dev/sda1 (Obviously, this should not be your root disk!!)
  $ sudo /usr/lib/plainbox-provider-checkbox/bin/disk_stress_ng sda --base-time 240 --really-run

  This test runs a series of stress-ng tests against /dev/sda, and fails
  on the "chdir" test. To speed up reproduction, reduce the test list to
  just "chdir" in the disk_stress_ng script. Attempts to reproduce this
  directly with stress-ng have failed - presumably because of other
  environment setup that this script performs (e.g. setting aio-max-nr
  to 524288).
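
When trying to reproduce with stress-ng directly, it may help to confirm that the environment matches at least the one setting named above. A small sketch, assuming the value is exposed at /proc/sys/fs/aio-max-nr; the function name and its optional path argument are hypothetical and exist only for illustration:

```shell
# Hypothetical helper: report whether fs.aio-max-nr equals the value
# mentioned in the bug description (524288).
# $1: file to read the value from (default: /proc/sys/fs/aio-max-nr)
# Returns 0 on a match, 1 on a mismatch, 2 if the file cannot be read.
aio_max_nr_matches() {
    want=524288
    have=$(cat "${1:-/proc/sys/fs/aio-max-nr}") || return 2
    echo "fs.aio-max-nr=$have (want $want)"
    [ "$have" -eq "$want" ]
}
```

If the check fails, "sudo sysctl -w fs.aio-max-nr=524288" sets the value for the running system. Whether this setting alone is enough to make a bare stress-ng run reproduce the failure is not established here.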

  For reproduction, we use a non-root disk because the test can lead to
  corruption, and we run mkfs.ext4 on the partition just before running
  the test so that it starts from a pristine fs state.

  I bisected this down to the following commit:

  commit 555bc9b1421f10d94a1192c7eea4a59faca3e711
  Author: Theodore Ts'o <ty...@mit.edu>
  Date:   Mon Feb 19 14:16:47 2018 -0500

      ext4: don't update checksum of new initialized bitmaps

      BugLink: http://bugs.launchpad.net/bugs/1773233

      commit 044e6e3d74a3d7103a0c8a9305dfd94d64000660 upstream.
