[Kernel-packages] [Bug 1680513] Re: Migrating KSM page causes the VM lock up as the KSM page merging list is too large

Launchpad Bug Tracker Mon, 28 Aug 2017 03:32:09 -0700

This bug was fixed in the package linux - 4.10.0-33.37

---------------
linux (4.10.0-33.37) zesty; urgency=low


  * linux: 4.10.0-33.37 -proposed tracker (LP: #1709303)

  * CVE-2017-1000112
    - Revert "udp: consistently apply ufo or fragmentation"
    - udp: consistently apply ufo or fragmentation

  * CVE-2017-1000111
    - Revert "net-packet: fix race in packet_set_ring on PACKET_RESERVE"
    - packet: fix tp_reserve race in packet_set_ring

  * ThunderX: soft lockup on 4.8+ kernels when running qemu-efi with vhost=on
    (LP: #1673564)
    - irqchip/gic-v3: Add missing system register definitions
    - arm64: KVM: Do not use stack-protector to compile EL2 code
    - KVM: arm/arm64: vgic-v3: Use PREbits to infer the number of ICH_APxRn_EL2
      registers
    - KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction
    - arm64: Add a facility to turn an ESR syndrome into a sysreg encoding
    - KVM: arm/arm64: vgic-v3: Add accessors for the ICH_APxRn_EL2 registers
    - KVM: arm64: Make kvm_condition_valid32() accessible from EL2
    - KVM: arm64: vgic-v3: Add hook to handle guest GICv3 sysreg accesses at EL2
    - KVM: arm64: vgic-v3: Add ICV_BPR1_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_IGRPEN1_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_IAR1_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_EOIR1_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_AP1Rn_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_HPPIR1_EL1 handler
    - KVM: arm64: vgic-v3: Enable trapping of Group-1 system registers
    - KVM: arm64: Enable GICv3 Group-1 sysreg trapping via command-line
    - KVM: arm64: vgic-v3: Add ICV_BPR0_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_IGNREN0_EL1 handler
    - KVM: arm64: vgic-v3: Add misc Group-0 handlers
    - KVM: arm64: vgic-v3: Enable trapping of Group-0 system registers
    - KVM: arm64: Enable GICv3 Group-0 sysreg trapping via command-line
    - arm64: Add MIDR values for Cavium cn83XX SoCs
    - [Config] CONFIG_CAVIUM_ERRATUM_30115=y
    - arm64: Add workaround for Cavium Thunder erratum 30115
    - KVM: arm64: vgic-v3: Add ICV_DIR_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_RPR_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_CTLR_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_PMR_EL1 handler
    - KVM: arm64: Enable GICv3 common sysreg trapping via command-line
    - KVM: arm64: vgic-v3: Log which GICv3 system registers are trapped
    - arm64: KVM: Make unexpected reads from WO registers inject an undef
    - KVM: arm64: Log an error if trapping a read-from-write-only GICv3 access
    - KVM: arm64: Log an error if trapping a write-to-read-only GICv3 access

  * ibmvscsis: Do not send aborted task response (LP: #1689365)
    - target: Fix unknown fabric callback queue-full errors
    - ibmvscsis: Do not send aborted task response
    - ibmvscsis: Clear left-over abort_cmd pointers
    - ibmvscsis: Fix the incorrect req_lim_delta

  * hisi_sas performance improvements (LP: #1708734)
    - scsi: hisi_sas: define hisi_sas_device.device_id as int
    - scsi: hisi_sas: optimise the usage of hisi_hba.lock
    - scsi: hisi_sas: relocate sata_done_v2_hw()
    - scsi: hisi_sas: optimise DMA slot memory

  * hisi_sas driver reports mistakes timed out task for internal abort
    (LP: #1708730)
    - scsi: hisi_sas: fix timeout check in hisi_sas_internal_task_abort()

  * scsi: hisi_sas: add null check before indirect pointer dereference
    (LP: #1708714)
    - scsi: hisi_sas: add null check before indirect pointer dereference

  * [LTCTest][Opal][FW860.20] HMI recoverable errors failed to recover and
    system goes to dump state. (LP: #1684054)
    - powerpc/64: Fix HMI exception on LE with CONFIG_RELOCATABLE=y

  * Set CONFIG_SATA_HIGHBANK=y on armhf (LP: #1703430)
    - [Config] CONFIG_SATA_HIGHBANK=y

  * Adt tests of src:linux time out often on armhf lxc containers (LP: #1705495)
    - [Packaging] tests -- reduce rebuild test to one flavour

  * support Hip07/08 I2C controller (LP: #1708293)
    - ACPI / APD: Add clock frequency for Hisilicon Hip07/08 I2C controller
    - i2c: designware: Add ACPI HID for Hisilicon Hip07/08 I2C controller

  * Mute key LED does not work on HP ProBook 440 (LP: #1705586)
    - ALSA: hda - Add HP ZBook 15u G3 Conexant CX20724 GPIO mute leds
    - ALSA: hda - Add mute led support for HP ProBook 440 G4

  * Hisilicon D05 onboard fibre NIC link indicator LEDs don't work
    (LP: #1704903)
    - net: hns: add acpi function of xge led control

  * zesty unable to handle kernel NULL pointer dereference (LP: #1680904)
    - drm/i915: Do not drop pagetables when empty

  * hns: use after free in hns_nic_net_xmit_hw (LP: #1704885)
    - net: hns: Fix a skb used after free bug

  * [ARM64] config EDAC_GHES=y depends on EDAC_MM_EDAC=y (LP: #1706141)
    - [Config] set EDAC_MM_EDAC=y for ARM64

  * [Hyper-V] hv_netvsc: Exclude non-TCP port numbers from vRSS hashing
    (LP: #1690174)
    - hv_netvsc: Exclude non-TCP port numbers from vRSS hashing

  * ath10k doesn't report full RSSI information (LP: #1706531)
    - ath10k: add per chain RSSI reporting

  * ideapad_laptop don't support v310-14isk (LP: #1705378)
    - platform/x86: ideapad-laptop: Add several models to no_hw_rfkill

  * hns: ethtool selftest crashes system (LP: #1705712)
    - net/hns:bugfix of ethtool -t phy self_test

  * ath9k freezes suspend resume Ubuntu 17.04 (LP: #1697027)
    - ath9k: fix an invalid pointer dereference in ath9k_rng_stop()

  * xhci_hcd: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2
    comp_code 13 (LP: #1667750)
    - xhci: Bad Ethernet performance plugged in ASM1042A host

  * Migrating KSM page causes the VM lock up as the KSM page merging list is too
    large (LP: #1680513)
    - ksm: introduce ksm_max_page_sharing per page deduplication limit
    - ksm: fix use after free with merge_across_nodes = 0
    - ksm: cleanup stable_node chain collapse case
    - ksm: swap the two output parameters of chain/chain_prune
    - ksm: optimize refile of stable_node_dup at the head of the chain

  * Change CONFIG_IBMVETH to module (LP: #1704479)
    - [Config] CONFIG_IBMVETH=m

  * CVE-2017-7487
    - ipx: call ipxitf_put() in ioctl error path

  * Hotkeys on new Thinkpad systems aren't working (LP: #1705169)
    - platform/x86: thinkpad_acpi: guard generic hotkey case
    - platform/x86: thinkpad_acpi: add mapping for new hotkeys

  * misleading kernel warning skb_warn_bad_offload during checksum calculation
    (LP: #1705447)
    - net: reduce skb_warn_bad_offload() noise

  * Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA
    (LP: #1692538)
    - ibmveth: Support to enable LSO/CSO for Trunk VEA.

  * bonding: stack dump when unregistering a netdev (LP: #1704102)
    - bonding: avoid NETDEV_CHANGEMTU event when unregistering slave

  * Ubuntu 16.04 IOB Error when the Mustang board rebooted (LP: #1693673)
    - drivers: net: xgene: Fix redundant prefetch buffer cleanup

  * Ubuntu16.04: NVMe 4K+T10 DIF/DIX format returns I/O error on dd with split
    op (LP: #1689946)
    - blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split
      op

  * linux >= 4.2: bonding 802.3ad does not work with 5G, 25G and 50G link speeds
    (LP: #1697892)
    - bonding: add 802.3ad support for 25G speeds
    - bonding: fix 802.3ad support for 5G and 50G speeds

  * [SRU][Zesty] arm64: Add support for handling memory corruption
    (LP: #1696852)
    - arm64: mm: Update perf accounting to handle poison faults
    - arm64: hugetlb: Fix huge_pte_offset to return poisoned page table entries
    - arm64: kconfig: allow support for memory failure handling
    - arm64: hwpoison: add VM_FAULT_HWPOISON[_LARGE] handling

  * [SRU][Zesty] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
    (LP: #1696570)
    - acpi: apei: read ack upon ghes record consumption
    - ras: acpi/apei: cper: add support for generic data v3 structure
    - cper: add timestamp print to CPER status printing
    - efi: parse ARM processor error
    - arm64: exception: handle Synchronous External Abort
    - acpi: apei: handle SEA notification type for ARMv8
    - acpi: apei: panic OS with fatal error status block
    - efi: print unrecognized CPER section
    - ras: acpi / apei: generate trace event for unrecognized CPER section
    - trace, ras: add ARM processor error trace event
    - ras: mark stub functions as 'inline'
    - arm/arm64: KVM: add guest SEA support
    - acpi: apei: check for pending errors when probing GHES entries
    - [Config] CONFIG_ACPI_APEI_SEA=y

 -- Stefan Bader <stefan.ba...@canonical.com>  Fri, 11 Aug 2017 11:40:30 +0200

** Changed in: linux (Ubuntu Zesty)
       Status: Fix Committed => Fix Released

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2017-1000111

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2017-1000112

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2017-7487

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1680513

Title:
  Migrating KSM page causes the VM lock up as the KSM page merging list
  is too large

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Zesty:
  Fix Released

Bug description:
  [Impact]
  After numad is enabled and several VMs are running on the same host
  machine (host kernel version: 4.4.0-72-generic #93), soft lockup
  messages can be observed in the VMs' dmesg.

  First, a crash dump was captured when the symptom was observed. At
  first glance, it looked like a lost-IPI issue. The numad process
  initiates a memory migration and, as part of this, needs to flush the
  TLB of another CPU. When the crash dump was taken, that other CPU had
  the TLB flush pending but not yet executed.

  The numad task holds the mmap_sem semaphore (for the VM's memory)
  while doing the migration, and the tasks that actually end up blocked
  are the other virtual CPUs of the same VM. These tasks need to access
  or modify the VM's memory map to handle guest page faults, but cannot
  acquire the semaphore.

  However, the initial theories about the root cause (an unhandled IPI
  or a csd lock issue) turned out to be incorrect.

  We originally suspected either a lost IPI (inter-processor interrupt)
  for the remote TLB flushes performed during page migration, or a
  known issue with the "csd" lock used to synchronize those remote
  flushes. A lost IPI would be a function of the system firmware or
  chipset (not a CPU issue), whereas the known csd issue is hardware
  independent.

  Gavin built a hotfix kernel with a modified csd_lock_wait function
  that times out if the unlock never happens (the end result of either
  cause) and prints messages to the console when the timeout occurs.
  The messages look like:

  csd_lock_wait called %d times

  csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting
  %Ld.%03Ld secs for CPU#%02d

  However, the VMs still experienced the hangs, and the csd_lock_wait
  timeout never fired. This suggests that the csd lock / lost IPI is
  not the actual cause.

  In the crash dump, the numad task has induced a migration, and the
  stack is as follows: 

  #1 [ffff885f8fb4fb78] smp_call_function_many 
  #2 [ffff885f8fb4fbc0] native_flush_tlb_others 
  #3 [ffff885f8fb4fc08] flush_tlb_page 
  #4 [ffff885f8fb4fc30] ptep_clear_flush 
  #5 [ffff885f8fb4fc60] try_to_unmap_one 
  #6 [ffff885f8fb4fcd0] rmap_walk_ksm 
  #7 [ffff885f8fb4fd28] rmap_walk 
  #8 [ffff885f8fb4fd80] try_to_unmap 
  #9 [ffff885f8fb4fdc8] migrate_pages 
  #10 [ffff885f8fb4fe80] do_migrate_pages 

  Frame #1 is actually inside the csd_lock_wait function mentioned
  above, but the compiler inlined that call, so it does not appear in
  the stack.

  What happens here is that do_migrate_pages (frame #10) acquires the
  semaphore that everything else is waiting for (and the waiters are
  what eventually produce the hang warnings), and it holds that
  semaphore for the duration of the page migration. This strongly
  suggests that this single do_migrate_pages call is taking more than
  10 seconds, and if the csd lock is not stuck, then something else in
  its call path is not functioning correctly.

  We originally suspected that the lost IPI/csd lock hang was
  responsible for the hung task timeouts, but in the absence of the csd
  warning messages, the cause presumably lies elsewhere. 

  A KSM function, rmap_walk_ksm, appears in frame #6; it walks all the
  mappings of a merged KSM page so that each one can be unmapped for the
  migration.

  Gavin disassembled the code and eventually found that
  stable_node->hlist is 2306920 entries long:

  rmap_item list(stable_node->hlist): 
  stable_node: 0xffff881f836ba000 stable_node->hlist->first = 0xffff883f3e5746b0

  struct hlist_head { 
  [0] struct hlist_node *first; 
  } 
  struct hlist_node { 
  [0] struct hlist_node *next; 
  [8] struct hlist_node **pprev; 
  } 

  crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst

  $ wc -l rmap_item.lst 
  2306920 rmap_item.lst

  With 4 KiB pages, this is roughly 9 GB of memory mapped to a single
  KSM page. The theory is that KSM has merged a very large number of
  empty pages (pages in which every byte is zero).

  The bug can be observed in the perf flame graph [1]:

  [1]. http://kernel.ubuntu.com/~gavinguo/sf00131845/numa-131845.svg

  [Fix]
  Andrea Arcangeli already sent out the patch [2] on 2015-11-10, and
  Andrew Morton said he would apply it. However, the patch eventually
  disappeared from the mmotm tree in April 2016. Andrea suggested
  applying the 3 patches in [3].

  [2]. [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page 
  deduplication limit 
  http://www.spinics.net/lists/linux-mm/msg96866.html 

  [3]. Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page
  deduplication limit
  https://www.spinics.net/lists/linux-mm/msg113829.html

  [Test Case]
  The patches have been tested with 9 VMs, each with 32 GB of RAM and
  16 vCPUs, with numad and KSM enabled on the host. After running for
  6 days, the system is stable, and no unstable CPU load is observed in
  the virtual appliances monitor [4]. The numad CPU utilization is
  normal, and no guest hangs have been observed.

  Machine type: Dell PowerEdge R920
  Memory: 528GB with 4 NUMA nodes
  CPU: 120 cores

  [4]. http://kernel.ubuntu.com/~gavinguo/sf00131845/virtual_appliances_loading.png

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1680513/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
