This was released in 4.15.0-59-generic and works as intended. Marking
Fix Released. Note, this still needs to land in linux-gcp for Bionic due
to some technical reasons.
** Changed in: linux (Ubuntu Bionic)
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1842271
Title:
Infinite loop in __blkdev_issue_discard() consumes the entire system
memory when formatting a raid array
Status in linux package in Ubuntu:
Incomplete
Status in linux source package in Bionic:
Fix Released
Bug description:
BugLink: https://bugs.launchpad.net/bugs/1842271
[Impact]
There is a regression present in the block layer in 4.15.0-56, which causes the
machine to consume all system memory and bring out the OOM reaper
when attempting to format a newly created raid array, built from a
series of NVMe disks, with an xfs filesystem.
The symptom is all system memory ending up in the kmalloc-256 slab,
with a nondescript call trace being printed.
The problem is caused by the two commits below:
commit 3c2f83d8bcbedeb89efcaf55ae64a99dce9d7e34
Author: Ming Lei <[email protected]>
Date: Fri Oct 12 15:53:10 2018 +0800
Subject: block: don't deal with discard limit in blkdev_issue_discard()
BugLink: https://bugs.launchpad.net/bugs/1836802
Upstream-commit: 744889b7cbb56a64f957e65ade7cb65fe3f35714
Pastebin: https://paste.ubuntu.com/p/WKsmwKgCMc/
commit b515257f186e532e0668f7deabcb04b5d27505cf
Author: Ming Lei <[email protected]>
Date: Mon Oct 29 20:57:17 2018 +0800
Subject: block: make sure discard bio is aligned with logical block size
BugLink: https://bugs.launchpad.net/bugs/1836802
Upstream-commit: 1adfc5e4136f5967d591c399aff95b3b035f16b7
Pastebin: https://paste.ubuntu.com/p/TcnTck64PJ/
Now, the fault was triggered in two stages. Firstly, in "block: don't
deal with discard limit in blkdev_issue_discard()" a while loop was
changed such that there is a possibility of an infinite loop if
__blkdev_issue_discard() is called with nr_sects > 0 and req_sects
somehow becomes 0:
int __blkdev_issue_discard(..., sector_t nr_sects, ...)
{
...
while (nr_sects) {
unsigned int req_sects = nr_sects;
sector_t end_sect;
end_sect = sector + req_sects;
...
nr_sects -= req_sects;
sector = end_sect;
...
}
If req_sects is 0, then end_sect is always equal to sector. More
importantly, nr_sects is decremented in only one place, by req_sects,
so a req_sects of 0 means nr_sects never decreases and the loop never
terminates.
Now, since req_sects is initially set equal to nr_sects, the loop would
never be entered in the first place if nr_sects were 0.
This is where the second commit, "block: make sure discard bio is
aligned with logical block size" comes in.
This commit adds a line to the above loop, to allow req_sects to be
set to a new value:
int __blkdev_issue_discard(..., sector_t nr_sects, ...)
{
...
while (nr_sects) {
unsigned int req_sects = nr_sects;
sector_t end_sect;
req_sects = min(req_sects, bio_allowed_max_sectors(q));
end_sect = sector + req_sects;
...
nr_sects -= req_sects;
sector = end_sect;
...
}
We see that req_sects will now be the minimum of itself and
bio_allowed_max_sectors(q), a new function introduced by the same
commit.
static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
{
return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
}
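The arithmetic can be checked in userspace. The sketch below copies the kernel's round_down() macro (valid only for power-of-two alignments) and mirrors bio_allowed_max_sectors() as a hypothetical allowed_max_sectors() that takes the logical block size directly instead of a struct request_queue:

```c
#include <assert.h>
#include <limits.h>

/* The kernel's round_down(), valid for power-of-two alignments. */
#define round_down(x, y) ((x) & ~((__typeof__(x))((y) - 1)))

/* Illustrative stand-in for bio_allowed_max_sectors(): takes the
 * logical block size directly rather than a request queue. */
static unsigned int allowed_max_sectors(unsigned int lbs)
{
    return round_down(UINT_MAX, lbs) >> 9;
}
```

For a healthy 512-byte logical block size this yields a large nonzero cap (0x7fffff sectors); only a degenerate block size that drives the round_down() result below 512 collapses the cap to 0.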
queue_logical_block_size(q) queries the hardware for the logical block
size of the underlying device.
static inline unsigned short queue_logical_block_size(struct request_queue *q)
{
int retval = 512;
if (q && q->limits.logical_block_size)
retval = q->limits.logical_block_size;
return retval;
}
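The fallback behaviour is easy to confirm with minimal stand-ins for the kernel structures (the struct definitions here are illustrative, trimmed to just the field this helper reads):

```c
#include <assert.h>

/* Trimmed, illustrative stand-ins for the kernel structures. */
struct queue_limits { unsigned short logical_block_size; };
struct request_queue { struct queue_limits limits; };

/* Same logic as the kernel helper: report the device's logical block
 * size, defaulting to 512 when it is unset or the queue is missing. */
static unsigned short queue_logical_block_size(struct request_queue *q)
{
    int retval = 512;
    if (q && q->limits.logical_block_size)
        retval = q->limits.logical_block_size;
    return retval;
}
```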
For bio_allowed_max_sectors to return 0, the round_down(UINT_MAX,
queue_logical_block_size(q)) result must fall below 512, since any
value below 512 shifted right by 9 yields 0. That happens when the
queue's limits are left in an inconsistent state, which is exactly what
the device-reload race described under [Fix] below produces.
bio_allowed_max_sectors then returns 0, and the min with req_sects ==
nr_sects will favour the new 0.
As a result nr_sects is never decremented, since req_sects is 0, and
req_sects never changes, since the min() keeps returning the 0 on
every iteration.
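The stall can be reproduced with a userspace model of the loop. The names below are illustrative (the real code is in block/blk-lib.c): max_sects stands in for bio_allowed_max_sectors(q), and an iteration cap is added so the stall can be observed instead of hanging:

```c
#include <stdint.h>

/* Userspace model of the loop after both commits. Returns the sectors
 * still outstanding when the loop gives up after `cap` iterations. */
static uint64_t discard_loop_model(uint64_t nr_sects, unsigned int max_sects,
                                   int cap)
{
    uint64_t sector = 0;
    while (nr_sects && cap-- > 0) {
        unsigned int req_sects = nr_sects;
        /* req_sects = min(req_sects, bio_allowed_max_sectors(q)); */
        if (max_sects < req_sects)
            req_sects = max_sects;
        sector += req_sects;   /* sector = end_sect; */
        nr_sects -= req_sects; /* never decreases once req_sects == 0 */
    }
    (void)sector;
    return nr_sects;
}
```

With a sane cap of 256 sectors, 1024 sectors are discarded in four passes; with a cap of 0, every pass is a no-op and all 1024 sectors remain outstanding no matter how many iterations run.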
From there the infinite loop iterates and fills up the kmalloc-256
slab with newly created bio entries, until all memory is exhausted and
the OOM reaper comes out and starts killing processes, which is
ineffective since this is a kernel memory leak.
[Fix]
The fix comes in the form of:
commit a55264933f12c2fdc28a66841c4724021e8c1caf
Author: Mikulas Patocka <[email protected]>
Date: Tue Jul 3 13:34:22 2018 -0400
Subject: block: fix infinite loop if the device loses discard capability
BugLink: https://bugs.launchpad.net/bugs/1837257
Upstream-commit: b88aef36b87c9787a4db724923ec4f57dfd513f3
Pastebin: https://paste.ubuntu.com/p/bgC9vQnB8x/
This adds a check right after the min(req_sects,
bio_allowed_max_sectors(q)) call to test whether req_sects has been set
to 0, and if it has, to exit the loop and move into failure handling:
...
req_sects = min(req_sects, bio_allowed_max_sectors(q));
if (!req_sects)
goto fail;
...
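The effect of the guard can be seen by extending the same kind of userspace model. Again the names are illustrative; in the kernel the failure path behind `goto fail` ends up returning -EOPNOTSUPP:

```c
#include <errno.h>
#include <stdint.h>

/* Model of the loop with the fix applied: a zero req_sects now aborts
 * the loop instead of spinning forever. */
static int discard_loop_fixed(uint64_t nr_sects, unsigned int max_sects)
{
    uint64_t sector = 0;
    while (nr_sects) {
        unsigned int req_sects = nr_sects;
        if (max_sects < req_sects)
            req_sects = max_sects;
        if (!req_sects)
            return -EOPNOTSUPP; /* goto fail; */
        sector += req_sects;
        nr_sects -= req_sects;
    }
    (void)sector;
    return 0;
}
```

The degenerate cap now produces an immediate, bounded error instead of an unbounded allocation loop.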
From there things work as normal. As "block: fix infinite loop if the
device loses discard capability" points out, all of this is triggered
by a race: if the underlying device is reloaded with a metadata table
that doesn't support the discard operation, q->limits.max_discard_sectors
is set to 0, which has the knock-on effect of leaving
q->limits.logical_block_size with strange values, leading to the
infinite loop and out-of-memory condition.
Now, as mentioned before, the fix is a part of 4.15.0-59, or linux-gcp
4.15.0-1041, which is currently sitting in -proposed.
[Testcase]
We need a machine which can have a series of NVMe drives attached to
it, and I used a default instance on Google Cloud Platform.
Select the 18.04 image, and add a disk to the system. From the
dropdown, select local scratch storage, then NVMe based storage, and
then 8 disks.
Start the instance. Make sure it is running linux-gcp 4.15.0-1040 or
linux-generic 4.15.0-56.
Create a raid array with:
# mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme0n1
/dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4 /dev/nvme0n5 /dev/nvme0n6 /dev/nvme0n7
/dev/nvme0n8
# mkfs.xfs -f /dev/md0
The call to mkfs.xfs will consume all memory, and the ssh session will
be disconnected since sshd gets killed by the OOM reaper.
If you reconnect, and look at dmesg, all system memory will be in
kmalloc-256.
This is fixed in linux-generic 4.15.0-59 and linux-gcp 4.15.0-1041,
which is currently in -proposed.
If you install the above kernel and retest, things go along as
expected.
[Regression Potential]
The fix was accepted in upstream -stable and is a direct fix to the
two commits which caused the issues, meaning there is a low
probability of any new regressions being added.
All changes are limited to discarding on block devices, and while a
fairly core part of the kernel, the changes are small and are focused
on the 0 condition.
In case of regression, simply revert the fix together with the two
offending commits.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1842271/+subscriptions
--
Mailing list: https://launchpad.net/~kernel-packages
Post to : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp