The patch can be found in the Bionic-gcp master branch and in Eoan, so marking
this one as Fix Released.

** Changed in: linux (Ubuntu)
       Status: Incomplete => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1842271

Title:
  Infinite loop in __blkdev_issue_discard() consumes the entire system
  memory when formatting a raid array

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/1842271

  [Impact]

  There is a regression in the block layer in 4.15.0-56 which causes the
  machine to consume all system memory, bringing out the OOM reaper, when
  attempting to format a newly created raid array on a set of NVMe drives
  with an xfs filesystem.

  The symptom is all system memory ending up in the kmalloc-256 slab, with
  only a nondescript call trace being printed.

  The problem is caused by the below two commits:

  commit 3c2f83d8bcbedeb89efcaf55ae64a99dce9d7e34
  Author: Ming Lei <ming....@redhat.com>
  Date:   Fri Oct 12 15:53:10 2018 +0800
  Subject: block: don't deal with discard limit in blkdev_issue_discard()
  BugLink: https://bugs.launchpad.net/bugs/1836802
  Upstream-commit: 744889b7cbb56a64f957e65ade7cb65fe3f35714
  Pastebin: https://paste.ubuntu.com/p/WKsmwKgCMc/

  commit b515257f186e532e0668f7deabcb04b5d27505cf
  Author: Ming Lei <ming....@redhat.com>
  Date:   Mon Oct 29 20:57:17 2018 +0800
  Subject: block: make sure discard bio is aligned with logical block size
  BugLink: https://bugs.launchpad.net/bugs/1836802
  Upstream-commit: 1adfc5e4136f5967d591c399aff95b3b035f16b7
  Pastebin: https://paste.ubuntu.com/p/TcnTck64PJ/

  Now, the fault is triggered in two stages. First, in "block: don't
  deal with discard limit in blkdev_issue_discard()" a while loop was
  changed such that an infinite loop is possible if
  __blkdev_issue_discard() is called with nr_sects > 0 and req_sects
  somehow becomes 0:

  int __blkdev_issue_discard(..., sector_t nr_sects, ...)
  {
  ...
  while (nr_sects) {
      unsigned int req_sects = nr_sects;
      sector_t end_sect;

      end_sect = sector + req_sects;
  ...
      nr_sects -= req_sects;
      sector = end_sect;
  ...
  }

  If req_sects is 0, then end_sect is always equal to sector, and, most
  importantly, nr_sects is decremented in only one place, by req_sects;
  so if req_sects is 0, nr_sects never shrinks and the loop never
  terminates.

  Now, since req_sects is initially equal to nr_sects, the loop would
  never be entered in the first place if nr_sects is 0.
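  The termination argument above can be sketched as a userspace model
  (hypothetical code, not the kernel's; max_sects stands in for whatever
  per-bio limit later commits introduce):

```c
#include <stdint.h>

/* Hypothetical userspace model of the splitting loop in
 * __blkdev_issue_discard(). The real function builds a bio per
 * iteration; here we just count them.
 *
 * Returns the number of bios that would be issued, or -1 if the
 * loop could never terminate (req_sects clamped to 0 while
 * nr_sects > 0). */
long count_discard_bios(uint64_t nr_sects, unsigned int max_sects)
{
    long bios = 0;

    while (nr_sects) {
        uint64_t req_sects = nr_sects;

        if (req_sects > max_sects)
            req_sects = max_sects;   /* the min() in the kernel loop */

        if (req_sects == 0)
            return -1;               /* nr_sects would never shrink */

        nr_sects -= req_sects;       /* sole place nr_sects decreases */
        bios++;
    }
    return bios;
}
```

  With a sane per-bio limit the model splits the range into a finite
  number of bios; with the limit clamped to 0 it can only bail out,
  where the real pre-fix kernel code spins forever.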

  This is where the second commit, "block: make sure discard bio is
  aligned with logical block size" comes in.

  This commit adds a line to the above loop, to allow req_sects to be
  set to a new value:

  int __blkdev_issue_discard(..., sector_t nr_sects, ...)
  {
  ...
  while (nr_sects) {
      unsigned int req_sects = nr_sects;
      sector_t end_sect;

      req_sects = min(req_sects, bio_allowed_max_sectors(q));

      end_sect = sector + req_sects;
  ...
      nr_sects -= req_sects;
      sector = end_sect;
  ...
  }

  We see that req_sects will now be the minimum of itself and
  bio_allowed_max_sectors(q), a new function introduced by the same
  commit.

  static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
  {
         return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
  }

  queue_logical_block_size(q) returns the logical block size recorded in
  the request queue limits for the underlying device.

  static inline unsigned short queue_logical_block_size(struct request_queue *q)
  {
      int retval = 512;

      if (q && q->limits.logical_block_size)
          retval = q->limits.logical_block_size;

      return retval;
  }

  Note that queue_logical_block_size() returns an unsigned short, so a
  strange q->limits.logical_block_size value can end up truncated to 0 on
  return; round_down(UINT_MAX, 0) is 0, and 0 shifted right by 9 is
  still 0.

  bio_allowed_max_sectors() then returns 0, and the min() with
  req_sects == nr_sects favours the new 0.

  From then on nr_sects is never decremented, since req_sects is 0, and
  req_sects never changes, since the min() keeps producing 0 on every
  iteration.
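  To see when this 0 can actually arise, the helper arithmetic can be
  replayed in userspace. This is a hypothetical model, not kernel code;
  the unsigned short return type of queue_logical_block_size() is the
  detail being exercised, and 65536 is used purely as an illustration of
  a "strange" logical_block_size that truncates to 0:

```c
#include <limits.h>

/* round_down() as the kernel defines it for power-of-two alignment */
static unsigned int round_down_u(unsigned int x, unsigned int y)
{
    return x & ~(y - 1);
}

/* Model of queue_logical_block_size(): note the unsigned short
 * return type, so an oversized value such as 65536 truncates to 0. */
static unsigned short logical_block_size_sim(unsigned int limits_lbs)
{
    int retval = 512;

    if (limits_lbs)
        retval = limits_lbs;

    return (unsigned short)retval;
}

/* Model of bio_allowed_max_sectors() */
static unsigned int bio_allowed_max_sectors_sim(unsigned int limits_lbs)
{
    return round_down_u(UINT_MAX, logical_block_size_sim(limits_lbs)) >> 9;
}
```

  With a normal 512-byte logical block size the helper returns a large,
  harmless limit; only when the logical block size collapses to 0 does
  the round_down() go to 0 and the limit with it.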

  From there the infinite loop iterates, filling the kmalloc-256 slab
  with newly created bio entries, until all memory is exhausted and the
  OOM reaper comes out and starts killing processes, which is
  ineffective since the memory is leaked in the kernel.

  [Fix]

  The fix comes in the form of:

  commit a55264933f12c2fdc28a66841c4724021e8c1caf
  Author: Mikulas Patocka <mpato...@redhat.com>
  Date:   Tue Jul 3 13:34:22 2018 -0400
  Subject: block: fix infinite loop if the device loses discard capability
  BugLink: https://bugs.launchpad.net/bugs/1837257
  Upstream-commit: b88aef36b87c9787a4db724923ec4f57dfd513f3
  Pastebin: https://paste.ubuntu.com/p/bgC9vQnB8x/

  This adds a check right after the req_sects = min(req_sects,
  bio_allowed_max_sectors(q)); assignment to test whether req_sects has
  been clamped to 0, and if it has, to exit the loop and move into
  failure handling:

  ...
  req_sects = min(req_sects, bio_allowed_max_sectors(q));
  if (!req_sects)
      goto fail;
  ...
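  Applied to a userspace model of the splitting loop, the guard turns the
  endless spin into an error return (a sketch, not the kernel code; the
  kernel's fail label hands back -EOPNOTSUPP to the caller):

```c
#include <errno.h>
#include <stdint.h>

/* Userspace sketch of the patched loop: identical to the broken
 * version, except that a req_sects clamped to 0 now aborts with
 * -EOPNOTSUPP instead of spinning forever allocating bios. */
long count_discard_bios_fixed(uint64_t nr_sects, unsigned int max_sects)
{
    long bios = 0;

    while (nr_sects) {
        uint64_t req_sects = nr_sects;

        if (req_sects > max_sects)
            req_sects = max_sects;

        if (!req_sects)
            return -EOPNOTSUPP;  /* the new "goto fail" path */

        nr_sects -= req_sects;
        bios++;
    }
    return bios;
}
```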

  From there things work as normal. As "block: fix infinite loop if the
  device loses discard capability" points out, all of this is triggered
  by a race: if the underlying device is reloaded with a metadata table
  that doesn't support the discard operation, q->limits.max_discard_sectors
  is set to 0, which has the knock-on effect of leaving
  q->limits.logical_block_size with strange values, leading to the
  infinite loop and out-of-memory condition.

  Now, as mentioned before, the fix is a part of 4.15.0-59, or linux-gcp
  4.15.0-1041, which is currently sitting in -proposed.

  [Testcase]

  We need a machine which can have a series of NVMe drives attached to
  it, and I used a default instance on Google Cloud Platform.

  Select the 18.04 image, and add a disk to the system. From the
  dropdown, select local scratch storage, then NVMe-based storage, and
  then 8 disks.

  Start the instance. Make sure it is running linux-gcp 4.15.0-1040 or
  linux-generic 4.15.0-56.

  Create a raid array with:

  # mdadm --create /dev/md0 --level=0 --raid-devices=8 \
      /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4 \
      /dev/nvme0n5 /dev/nvme0n6 /dev/nvme0n7 /dev/nvme0n8
  # mkfs.xfs -f /dev/md0

  The call to mkfs.xfs will consume all memory, and the ssh session will
  be disconnected since sshd gets killed by the oom reaper.

  If you reconnect and look at dmesg, the OOM dump will show all system
  memory sitting in kmalloc-256.

  This is fixed in linux-generic 4.15.0-59 and linux-gcp 4.15.0-1041,
  which is currently in -proposed.

  If you install the above kernel and retest, things go along as
  expected.

  [Regression Potential]

  The fix was accepted in upstream -stable and is a direct fix to the
  two commits which caused the issues, meaning there is a low
  probability of any new regressions being added.

  All changes are limited to discarding on block devices, and while a
  fairly core part of the kernel, the changes are small and are focused
  on the 0 condition.

  In case of regression, simply revert all three offending commits.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1842271/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
