Checking if this snippet is properly spaced in an LP comment text
(instead of the Description.)
Task 1 / CPU 1                           Task 2 / CPU 2

do_huge_pmd_numa_page()                  do_huge_pmd_numa_page()
- pmd_lock()                             .
- trylock_page() // PageLocked = true    .
.                                        .
- spin_unlock()                          .
.                                        - pmd_lock()
.                                        - pmd_trans_migrating() // PageLocked == true
.                                        - spin_unlock()
- migrate_misplaced_transhuge_page()     .
- pmd_lock()                             .
- pmdp_clear_flush() // PMD = NULL       .
.                                        - wait_migrate_huge_page()
.                                        - page = pmd_page() // PMD == NULL ... page = <bogus>
.                                        - wait_on_page_locked(page) // BUG()
.                                        < pagefault handler in bad state >
.                                        < that userspace process is hung >
- set_pmd_at() // PMD = non-NULL
- spin_unlock()
** Description changed:
[Impact]
- * Users on NUMA systems (mostly servers) with
- NUMA balancing enabled (which is by default)
- might hit a crash/BUG() on a race condition
- if two simultaneous page faults of the same
- transparent hugepage go into the path for
- migration to another NUMA node.
-
- * The symptom is BUG() for 0xffffeaffffffffc0,
- which happens if the PMD is set to zero/NULL.
-
- BUG: unable to handle kernel paging request at ffffeaffffffffc0
- IP: [<ffffffff811b3d31>] wait_migrate_huge_page+0x51/0x70
-
- * NUMA balancing periodically unmaps pages so as
- to force page faults to occur, and later uses
- those page faults to find out where the NUMA
- memory accesses come from - if often from another
- NUMA node, it attempts to migrate the page contents
- to that NUMA node (for more local access.)
-
- * The race condition is related to these 3 functions
- in the pagefault handling of transparent hugepages:
-
- do_huge_pmd_numa_page() -> wait_migrate_huge_page()
- do_huge_pmd_numa_page() -> migrate_misplaced_transhuge_page()
-
- The first task to hit the pagefault / migration path
- calls migrate_misplaced_transhuge_page(), which does:
-
- - ptl = pmd_lock(mm, pmd) // calls spin_lock(ptl)
- - pmdp_clear_flush(..., pmd); // set PMD to zero
- - set_pmd_at(..., pmd, ...); // set PMD to non-zero
- - spin_unlock(ptl);
-
- The second task to hit that path finds that the page
- is already being migrated (page is locked) and waits
- for that to finish (i.e., until page is unlocked), doing:
-
- - spin_unlock(ptl)
- - page = pmd_page(*pmd)
- - wait_on_page_locked(page)
-
- *BUT* it reads the PMD value *after* releasing the lock.
-
- So, if the tasks/CPUs manage to run in the sequence
- below, the PMD can be set to zero/NULL by the first
- task and read by the second task before it's set to
- non-NULL again. The second task thus miscalculates
- the page pointer from the PMD, and hits BUG() for
- address 0xffffeaffffffffc0.
-
- Task 1 / CPU 1                           Task 2 / CPU 2
-
- do_huge_pmd_numa_page()                  do_huge_pmd_numa_page()
- - pmd_lock()                             .
- - trylock_page() // PageLocked = true    .
- .                                        .
- - spin_unlock()                          .
- .                                        - pmd_lock()
- .                                        - pmd_trans_migrating() // PageLocked == true
- .                                        - spin_unlock()
- - migrate_misplaced_transhuge_page()     .
- - pmd_lock()                             .
- - pmdp_clear_flush() // PMD = NULL       .
- .                                        - wait_migrate_huge_page()
- .                                        - page = pmd_page() // PMD == NULL ... page = <bogus>
- .                                        - wait_on_page_locked(page) // BUG()
- .                                        < pagefault handler in bad state >
- .                                        < that userspace process is hung >
- - set_pmd_at() // PMD = non-NULL
- - spin_unlock()
-
- * The fix just moves pmd_page() before spin_unlock(),
- and now the change performed in the other function
- (done within the spin_lock()/spin_unlock() region)
- can no longer run concurrently with this PMD read.
-
- * So, when the other function releases the spin lock,
- the PMD has already been set to non-NULL/valid PMD,
- and wait_on_page_locked() receives a valid address.
-
- * Fix commit 5d833062139d ("mm: numa: do not dereference
- pmd outside of the lock during NUMA hinting fault") [1]
-
- * Applied in v4.0 upstream; only Trusty/3.13 needs it.
-
- $ git describe --contains 5d833062139d
- v4.0-rc1~98^2~103
-
- <PENDING>
-
- [1]
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d833062139d
-
-
- Kernel oops occurs randomly every now and then, seemingly when running
- memory-intensive processes (so far, it happened to me when using bowtie2
- or STAR).
+ * Users on NUMA systems (mostly servers) with
+ NUMA balancing enabled (which is by default)
+ might hit a crash/BUG() on a race condition
+ if two simultaneous page faults of the same
+ transparent hugepage go into the path for
+ migration to another NUMA node.
+
+ * The symptom is BUG() for 0xffffeaffffffffc0,
+ which happens if the PMD is set to zero/NULL.
+
+ BUG: unable to handle kernel paging request at ffffeaffffffffc0
+ IP: [<ffffffff811b3d31>] wait_migrate_huge_page+0x51/0x70
+
+ * NUMA balancing periodically unmaps pages so as
+ to force page faults to occur, and later uses
+ those page faults to find out where the NUMA
+ memory accesses come from - if often from another
+ NUMA node, it attempts to migrate the page contents
+ to that NUMA node (for more local access.)
+
+ * The race condition is related to these 3 functions
+ in the pagefault handling of transparent hugepages:
+
+ do_huge_pmd_numa_page() -> wait_migrate_huge_page()
+ do_huge_pmd_numa_page() -> migrate_misplaced_transhuge_page()
+
+ The first task to hit the pagefault / migration path
+ calls migrate_misplaced_transhuge_page(), which does:
+
+ - ptl = pmd_lock(mm, pmd) // calls spin_lock(ptl)
+ - pmdp_clear_flush(..., pmd); // set PMD to zero
+ - set_pmd_at(..., pmd, ...); // set PMD to non-zero
+ - spin_unlock(ptl);
+
+ The second task to hit that path finds that the page
+ is already being migrated (page is locked) and waits
+ for that to finish (i.e., until page is unlocked), doing:
+
+ - spin_unlock(ptl)
+ - page = pmd_page(*pmd)
+ - wait_on_page_locked(page)
+
+ *BUT* it reads the PMD value *after* releasing the lock.
+
+ So, if the tasks/CPUs manage to run in the sequence
+ below, the PMD can be set to zero/NULL by the first
+ task and read by the second task before it's set to
+ non-NULL again. The second task thus miscalculates
+ the page pointer from the PMD, and hits BUG() for
+ address 0xffffeaffffffffc0.
+
+ Task 1 / CPU 1                           Task 2 / CPU 2
+
+ do_huge_pmd_numa_page()                  do_huge_pmd_numa_page()
+ - pmd_lock()                             .
+ - trylock_page() // PageLocked = true    .
+ .                                        .
+ - spin_unlock()                          .
+ .                                        - pmd_lock()
+ .                                        - pmd_trans_migrating() // PageLocked == true
+ .                                        - spin_unlock()
+ - migrate_misplaced_transhuge_page()     .
+ - pmd_lock()                             .
+ - pmdp_clear_flush() // PMD = NULL       .
+ .                                        - wait_migrate_huge_page()
+ .                                        - page = pmd_page() // PMD == NULL ... page = <bogus>
+ .                                        - wait_on_page_locked(page) // BUG()
+ .                                        < pagefault handler in bad state >
+ .                                        < that userspace process is hung >
+ - set_pmd_at() // PMD = non-NULL
+ - spin_unlock()
+
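The interleaving above can be modeled in user space. The following is a small Python sketch (an illustration only, not kernel code): a "migrator" thread clears and restores a shared PMD value under a lock, while a buggy reader drops the lock before reading it, just as wait_migrate_huge_page() did. The events are test scaffolding that forces the unlucky scheduling the diagram shows.

```python
import threading

pmd = "huge-page"                # models the PMD entry (non-NULL)
ptl = threading.Lock()           # models the page-table spinlock
t2_unlocked = threading.Event()  # scaffolding: force worst-case timing
pmd_cleared = threading.Event()
read_done = threading.Event()
seen = {}

def task1_migrate():
    """migrate_misplaced_transhuge_page(): clear then set PMD under ptl."""
    global pmd
    t2_unlocked.wait()           # run only after task 2 dropped the lock
    with ptl:
        pmd = None               # pmdp_clear_flush(): PMD = NULL
        pmd_cleared.set()
        read_done.wait()         # keep the PMD NULL until task 2 has read it
        pmd = "huge-page"        # set_pmd_at(): PMD = non-NULL again

def task2_buggy():
    """Buggy path: spin_unlock() first, pmd_page() afterwards."""
    with ptl:
        pass                     # pmd_trans_migrating() found the page locked
    t2_unlocked.set()            # spin_unlock() done
    pmd_cleared.wait()
    seen["page"] = pmd           # pmd_page(*pmd) outside the lock: NULL!
    read_done.set()              # would now wait_on_page_locked(<bogus>)

t1 = threading.Thread(target=task1_migrate)
t2 = threading.Thread(target=task2_buggy)
t1.start(); t2.start()
t1.join(); t2.join()
print(seen["page"])              # None: the reader computed a bogus page
```

The reader observes None (the cleared PMD) even though the writer restores a valid value before unlocking, which is exactly the miscalculated page pointer in the oops.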
+ * The fix just moves pmd_page() before spin_unlock(),
+ and now the change performed in the other function
+ (done within the spin_lock()/spin_unlock() region)
+ can no longer run concurrently with this PMD read.
+
+ * So, when the other function releases the spin lock,
+ the PMD has already been set to non-NULL/valid PMD,
+ and wait_on_page_locked() receives a valid address.
+
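Modeled the same way, the fix is a one-line reordering: the reader computes the page pointer while still holding the lock. A user-space sketch (again plain Python, not the actual kernel change):

```python
import threading

pmd = "huge-page"            # models the PMD entry
ptl = threading.Lock()       # models the page-table spinlock
seen = {}

def task1_migrate():
    """Clear then set the PMD inside one critical section, as the kernel does."""
    global pmd
    with ptl:
        pmd = None           # pmdp_clear_flush()
        pmd = "huge-page"    # set_pmd_at()

def task2_fixed():
    """Fixed path: pmd_page() before spin_unlock()."""
    with ptl:
        page = pmd           # read the PMD while still holding ptl
    # wait_on_page_locked(page) runs after the unlock, but on a valid page
    seen["page"] = page

t1 = threading.Thread(target=task1_migrate)
t2 = threading.Thread(target=task2_fixed)
t1.start(); t2.start()
t1.join(); t2.join()
print(seen["page"])          # always "huge-page": the NULL window is invisible
```

Because the clear/set pair is atomic under the lock, any read taken under the lock can only see a valid value, regardless of which task wins the lock first.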
+ * Fix commit 5d833062139d ("mm: numa: do not dereference
+ pmd outside of the lock during NUMA hinting fault") [1]
+
+ * Applied in v4.0 upstream; only Trusty/3.13 needs it.
+
+ $ git describe --contains 5d833062139d
+ v4.0-rc1~98^2~103
+
+
+ [Test Case]
+
+ * A synthetic reproducer is available for this bug.
+ It consists of two parts:
+
+ 1) a userspace program that allocates one transparent
+ huge page and spawns two threads that write to
+ it periodically - based on the period used by
+ NUMA balancing to unmap pages - so it triggers
+ page faults simultaneously in the two threads.
+
+ 2) a kernel module that inserts kprobes at several
+ points in the involved functions, in order to
+ synchronize the two threads' timing/progress
+ in the way that reproduces the problem.
+
+ Then the user forces the NUMA migration to occur
+ by changing the affinity of the userspace program
+ to another NUMA node, and the kernel module/kprobes
+ force the problem to occur.
+
+ * Steps done with kprobes:
+ 1) Task 1 and Task 2 join at do_huge_pmd_numa_page().
+ 2) Task 1 moves on to migrate_misplaced_transhuge_page(),
+ signals Task 2 to move on, and waits for Task 2's signal.
+ 3) Task 2 moves on to the point in 'if pmd_trans_migrating()'
+ just after spin_unlock() [before pmd_page() originally],
+ signals Task 1 to move on, and waits for Task 1's signal.
+ 4) Task 1 moves on until it sets the PMD to zero/NULL,
+ signals Task 2 to move on, and waits for Task 2's signal.
+ 5) Task 2 moves on to read the PMD and calculate its page,
+ and attempts to wait on it to be unlocked.
+ 5.1) BUG() on the original kernel / without the patch.
+ 5.2) Continues on the modified kernel / with the patch.
+
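The kprobe choreography above is essentially a fixed handshake between the two tasks. The sketch below (plain Python; the names are illustrative, not the reproducer's actual code) reproduces the same signal/wait ordering with events standing in for the kprobes:

```python
import threading

# One Event per hand-off point; in the reproducer, kprobes play this role.
hand_off = [threading.Event() for _ in range(3)]
log = []                     # records the forced order of the steps

def task1():
    log.append("T1: migrate_misplaced_transhuge_page()")   # step 2
    hand_off[0].set()        # signal Task 2, then wait for its signal
    hand_off[1].wait()
    log.append("T1: pmdp_clear_flush() -> PMD = NULL")     # step 4
    hand_off[2].set()        # signal Task 2 to go read the PMD

def task2():
    hand_off[0].wait()
    log.append("T2: past spin_unlock(), before pmd_page()")  # step 3
    hand_off[1].set()
    hand_off[2].wait()
    log.append("T2: pmd_page() reads the cleared PMD")       # step 5

t1 = threading.Thread(target=task1)
t2 = threading.Thread(target=task2)
t1.start(); t2.start()
t1.join(); t2.join()
for line in log:
    print(line)
```

Each signal/wait pair pins one task at a probe point until the other has reached its own, so the racy interleaving happens on every run rather than once in a blue moon.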
+ [Regression Potential]
+
+ * Low. The fix is both really targeted at this particular
+ problem/race condition (see the last paragraph in the
+ commit message) and sufficiently contained to fix/change
+ _just_ that (despite the long commit message, which
+ introduces a series.)
+
+ * The only actual change is moving pmd_page() before
+ spin_unlock() (the rest is removing a wrapper function),
+ and wait_on_page_locked() remains after spin_unlock().
+
+ * The pmd_page() code path does not acquire any locks,
+ so it is not susceptible to deadlock or interplay
+ with another lock.
+
+ [Original Description]
+
+ [1]
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d833062139d
+
+ Kernel oops occurs randomly every now and then, seemingly when running
+ memory-intensive processes (so far, it happened to me when using bowtie2
+ or STAR).
Running Ubuntu 14.04 LTS on AWS EC2 instances (m4.* and c4.* family
classes). After the error occurs, the server stays accessible through
SSH, but the commands w, htop, ps (and maybe others) seem to hang, while
commands like ls, cd, top and others keep working. Whatever process was
running and (probably) caused the crash seems to go into a sleeping
mode.
Rebooting (sudo reboot) makes the instance refuse all connections (more
than an hour after initiating the reboot). Stopping the (AWS EC2)
instance and starting again makes the instance function normally again.
Restarting the task that was running when the instance crashed on the newly
(re)started instance usually works with no more problems.
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Jan 23 12:49 seq
crw-rw---- 1 root audio 116, 33 Jan 23 12:49 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.29
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq',
'/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
Ec2AMI: ami-4473183b
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: m4.16xlarge
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory
Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize
libusb: -99
MachineType: Xen HVM domU
Package: linux (not installed)
PciMultimedia:
ProcEnviron:
TERM=xterm
PATH=(custom, no user)
XDG_RUNTIME_DIR=<set>
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB:
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-164-generic
root=UUID=d4f2aafc-946a-4514-930d-4c45e676f198 ro console=tty1 console=ttyS0
ProcVersionSignature: Ubuntu 3.13.0-164.214-generic 3.13.11-ckt39
RelatedPackageVersions:
linux-restricted-modules-3.13.0-164-generic N/A
linux-backports-modules-3.13.0-164-generic N/A
linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty ec2-images
Uname: Linux 3.13.0-164-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: sudo
WifiSyslog:
_MarkForUpload: True
dmi.bios.date: 08/24/2006
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias:
dmi:bvnXen:bvr4.2.amazon:bd08/24/2006:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1813018
Title:
Kernel Oops - unable to handle kernel paging request; RIP is at
wait_migrate_huge_page+0x51/0x70
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1813018/+subscriptions
--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs