** Changed in: linux (Ubuntu)
       Status: Incomplete => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1680513

Title:
  Migrating KSM page causes the VM lock up as the KSM page merging list
  is too large

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Xenial:
  New
Status in linux source package in Zesty:
  New

Bug description:
  [Impact]
  After numad is enabled and there are several VMs running on the same
  host machine(host kernel version: 4.4.0-72-generic #93), the
  softlockup messages can be observed inside the VMs' dmesg.

  First, the crashdump was captured when the symptom was observed. At
  the first glance, it looks like an IPI lost issue. The numad process
  initiates a migration of memory, and as part of this, needs to flush
  the TLB cache of another CPU. When the crash dump was taken, that
  other CPU has the TLB flush pending, but not executed. 

  The numad kernel task is holding a semaphore lock mmap_sem(for the
  VM's memory) to do the migration, and the tasks that actually end up
  being blocked are other virtual CPUs for the same VM. These tasks need
  to access or make changes to the memory map for the VM because of the
  VM page fault, but cannot acquire the semaphore lock.

  However, the original thoughts on the root cause (unhandled IPI or csd
  lock issue) are incorrect.

  We originally suspected an issue with a lost IPI (inter processor
  interrupt) that performs remote CPU cache flushes during page
  migration, or a known issue with the "csd" lock used to synchronize
  the remote CPU cache flush.  A lost IPI would be a function of the
  system firmware or chipset (it is not a CPU issue), but the known csd
  issue is hardware independent. 

  Gavin created the hotfix kernel with changes in the csd_lock_wait
  function that would time out if the unlock never happens (the end
  result of either cause), and print messages to the console when that
  timeout occurred. The messages look like: 

  csd_lock_wait called %d times

  csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting
  %Ld.%03Ld secs for CPU#%02d

  However, the VMs are still experiencing the hangs, but the
  csd_lock_wait timeout is not happening. This suggests that the csd
  lock / lost IPI is not the actual cause.

  In the crash dump, the numad task has induced a migration, and the
  stack is as follows: 

  #1 [ffff885f8fb4fb78] smp_call_function_many 
  #2 [ffff885f8fb4fbc0] native_flush_tlb_others 
  #3 [ffff885f8fb4fc08] flush_tlb_page 
  #4 [ffff885f8fb4fc30] ptep_clear_flush 
  #5 [ffff885f8fb4fc60] try_to_unmap_one 
  #6 [ffff885f8fb4fcd0] rmap_walk_ksm 
  #7 [ffff885f8fb4fd28] rmap_walk 
  #8 [ffff885f8fb4fd80] try_to_unmap 
  #9 [ffff885f8fb4fdc8] migrate_pages 
  #10 [ffff885f8fb4fe80] do_migrate_pages 

  
  The frame #1 is actually in the csd_lock_wait function mentioned
  above, but the compiler has optimized that call and it does not appear
  in the stack. 

  What happens here is that do_migrate_pages (frame #10) acquires the
  semaphore that everything else is waiting for (and that eventually
  produce the hang warnings), and it holds that semaphore for the
  duration of the page migration.  This strongly suggests that this
  single do_migrate_pages call is taking in excess of 10 seconds, and if
  the csd lock is not stuck, then something else within its call path is
  not functioning correctly. 

  We originally suspected that the lost IPI/csd lock hang was
  responsible for the hung task timeouts, but in the absence of the csd
  warning messages, the cause presumably lies elsewhere. 

  A KSM function appears in frame #6; this is the function that will
  search out the merged pages to handle them for the migration. 

  Gavin have tried to disassemble the code and finally find the 
  stable_node->hlist is as long as 2306920 entries:

  rmap_item list(stable_node->hlist): 
  stable_node: 0xffff881f836ba000 stable_node->hlist->first = 
0xffff883f3e5746b0 

  struct hlist_head { 
  [0] struct hlist_node *first; 
  } 
  struct hlist_node { 
  [0] struct hlist_node *next; 
  [8] struct hlist_node **pprev; 
  } 

  crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst

  $ wc -l rmap_item.lst 
  2306920 rmap_item.lst

  This is roughly 9 GB of pages. The theory is that KSM has merged a
  very large number of pages that are empty (the value of all locations
  in the page are zero).

  The bug can be observed by the perf flame graph[1]:

  [1]. http://kernel.ubuntu.com/~gavinguo/sf00131845/numa-131845.svg

  [Fix]
  Andrea Arcangeli already sent out the patch[2] in the 2015/11/10.
  Andrew Morton also said he will apply the patch. However, the patch
  finally disappears from the mmtom tree in April 2016. Andrea suggested
  apply the 3 patches[3].

  [2]. [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page 
  deduplication limit 
  http://www.spinics.net/lists/linux-mm/msg96866.html 

  [3]. Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page
  deduplication limit
  https://www.spinics.net/lists/linux-mm/msg113829.html

  [Test Case]
  The patches has been tested with 9 VMs and each has 32GB ram and 16
  VCPUs.  Numad/KSM are also enabled in the machine. After running for
  6 days, the system is stable and unstable CPU loading cannot be
  observed inside the virtual appliances monitor[4]. The numad cpu
  utilization rate is normal and guest hang also cannot be observed.

  Machine type: Dell PowerEdge R920
  Memory: 528GB with 4 NUMA nodes
  CPU: 120 cores

  [4].
  http://kernel.ubuntu.com/~gavinguo/sf00131845/virtual_appliances_loading.png

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1680513/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to