Interestingly, I hit this warning without KSM enabled:

```console
# cat /sys/kernel/mm/ksm/run
0
# uname -a
Linux compute12 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 
x86_64 x86_64 x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.3 LTS
Release:        18.04
Codename:       bionic
```

The log is:

[Sat May 15 11:28:32 2021] WARNING: CPU: 31 PID: 3196546 at 
/build/linux-E6MDAa/linux-4.15.0/include/linux/mm.h:857 
follow_page_pte+0x663/0x6d0
[Sat May 15 11:28:32 2021] Modules linked in: nls_iso8859_1 act_police cls_u32 
sch_ingress cls_fw sch_sfq sch_htb ip6table_raw xt_CT xt_mac vhost_net vhost 
tap ebtable_filter ebtables ip6table_filter devlink vxlan ip6_udp_tunnel 
udp_tunnel ip_gre gre xt_multiport xt_set iptable_raw iptable_mangle 
ip_set_hash_net ip_set_hash_ip ip_set ipip tunnel4 ip_tunnel veth xt_statistic 
xt_physdev xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_addrtype 
ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_nat ip6_tables xt_comment xt_mark 
iptable_filter xt_conntrack nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat aufs rbd libceph overlay 
openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_defrag_ipv6 nf_nat bonding dm_service_time dm_multipath
[Sat May 15 11:28:32 2021]  scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl 
skx_edac x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass intel_cstate 
intel_rapl_perf ipmi_ssif ioatdma joydev input_leds acpi_power_meter mei_me mei 
shpchp mac_hid ipmi_si ipmi_devintf ipmi_msghandler lpc_ich sch_fq_codel 
nf_conntrack ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp 
libiscsi scsi_transport_iscsi br_netfilter bridge stp llc ip_tables x_tables 
autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy 
async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 
ses enclosure scsi_transport_sas hid_generic crct10dif_pclmul crc32_pclmul 
usbhid ghash_clmulni_intel hid pcbc lpfc aesni_intel aes_x86_64 nvmet_fc 
crypto_simd ast glue_helper nvmet cryptd nvme_fc ttm nvme_fabrics
[Sat May 15 11:28:32 2021]  igb nvme_core drm_kms_helper dca scsi_transport_fc 
syscopyarea i2c_algo_bit sysfillrect sysimgblt i40e aacraid fb_sys_fops drm ptp 
pps_core ahci libahci wmi
[Sat May 15 11:28:32 2021] CPU: 31 PID: 3196546 Comm: CPU 2/KVM Not tainted 
4.15.0-72-generic #81-Ubuntu
[Sat May 15 11:28:32 2021] Hardware name: Inspur NF5280M5/YZMB-00882-104, BIOS 
4.0.8 10/17/2018
[Sat May 15 11:28:32 2021] RIP: 0010:follow_page_pte+0x663/0x6d0
[Sat May 15 11:28:32 2021] RSP: 0018:ffffb1eff4e5b8f8 EFLAGS: 00010286
[Sat May 15 11:28:32 2021] RAX: ffffe041b58cba40 RBX: ffffe043fed90cf0 RCX: 
0000000080000000
[Sat May 15 11:28:32 2021] RDX: ffffe041b58cba40 RSI: 00007f7306766000 RDI: 
8000000d632e9225
[Sat May 15 11:28:32 2021] RBP: ffffb1eff4e5b960 R08: 8000000d632e9225 R09: 
ffffa0249cceb1e0
[Sat May 15 11:28:32 2021] R10: 0000000000000000 R11: ffffb1eff4e5ba8c R12: 
ffffe041b58cba40
[Sat May 15 11:28:32 2021] R13: 00003ffffffff000 R14: 0000000000000326 R15: 
ffffa076af75a198
[Sat May 15 11:28:32 2021] FS:  00007f73f48ee700(0000) 
GS:ffffa0947f2c0000(0000) knlGS:fffff88001e81000
[Sat May 15 11:28:32 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sat May 15 11:28:32 2021] CR2: fffff8a016819000 CR3: 0000004e72518004 CR4: 
00000000007626e0
[Sat May 15 11:28:32 2021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[Sat May 15 11:28:32 2021] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[Sat May 15 11:28:32 2021] PKRU: 55555554
[Sat May 15 11:28:32 2021] Call Trace:
[Sat May 15 11:28:32 2021]  follow_pmd_mask+0x209/0x640
[Sat May 15 11:28:32 2021]  follow_page_mask+0x17a/0x210
[Sat May 15 11:28:32 2021]  __get_user_pages+0x18c/0x720
[Sat May 15 11:28:32 2021]  get_user_pages+0x42/0x50
[Sat May 15 11:28:32 2021]  __gfn_to_pfn_memslot+0x126/0x410 [kvm]
[Sat May 15 11:28:32 2021]  try_async_pf+0x66/0x1f0 [kvm]
[Sat May 15 11:28:32 2021]  tdp_page_fault+0x138/0x290 [kvm]
[Sat May 15 11:28:32 2021]  ? vmexit_fill_RSB+0x1c/0x40 [kvm_intel]
[Sat May 15 11:28:32 2021]  kvm_mmu_page_fault+0x62/0x160 [kvm]
[Sat May 15 11:28:32 2021]  handle_ept_violation+0xbb/0x150 [kvm_intel]
[Sat May 15 11:28:32 2021]  vmx_handle_exit+0xb3/0xe80 [kvm_intel]
[Sat May 15 11:28:32 2021]  ? vmexit_fill_RSB+0x1c/0x40 [kvm_intel]
[Sat May 15 11:28:32 2021]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
[Sat May 15 11:28:32 2021]  ? vmexit_fill_RSB+0x1c/0x40 [kvm_intel]
[Sat May 15 11:28:32 2021]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
[Sat May 15 11:28:32 2021]  ? vmx_vcpu_run+0x3fa/0x600 [kvm_intel]
[Sat May 15 11:28:32 2021]  vcpu_enter_guest+0x424/0x1260 [kvm]
[Sat May 15 11:28:32 2021]  ? __schedule+0x256/0x880
[Sat May 15 11:28:32 2021]  kvm_arch_vcpu_ioctl_run+0x203/0x3e0 [kvm]
[Sat May 15 11:28:32 2021]  ? kvm_arch_vcpu_ioctl_run+0x203/0x3e0 [kvm]
[Sat May 15 11:28:32 2021]  kvm_vcpu_ioctl+0x2a6/0x620 [kvm]
[Sat May 15 11:28:32 2021]  ? do_futex+0x185/0x590
[Sat May 15 11:28:32 2021]  do_vfs_ioctl+0xa8/0x630
[Sat May 15 11:28:32 2021]  ? SyS_futex+0x13b/0x180
[Sat May 15 11:28:32 2021]  SyS_ioctl+0x79/0x90
[Sat May 15 11:28:32 2021]  ? fire_user_return_notifiers+0x3e/0x50
[Sat May 15 11:28:32 2021]  do_syscall_64+0x73/0x130
[Sat May 15 11:28:32 2021]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Sat May 15 11:28:32 2021] RIP: 0033:0x7f73ff02f5d7
[Sat May 15 11:28:32 2021] RSP: 002b:00007f73f48ed818 EFLAGS: 00000246 
ORIG_RAX: 0000000000000010
[Sat May 15 11:28:32 2021] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 
00007f73ff02f5d7
[Sat May 15 11:28:32 2021] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 
0000000000000017
[Sat May 15 11:28:32 2021] RBP: 000055d6a736d3f0 R08: 000055d6a5edf270 R09: 
000000000000ffff
[Sat May 15 11:28:32 2021] R10: 000000000000000c R11: 0000000000000246 R12: 
0000000000000000
[Sat May 15 11:28:32 2021] R13: 00007f7404cd6000 R14: 0000000000000000 R15: 
000055d6a736d3f0
[Sat May 15 11:28:32 2021] Code: 20 41 bd ef ff ff ff 48 39 c8 74 17 48 8b 45 
d0 48 8b 75 c0 48 8b 55 b8 48 8b 78 40 e8 f7 c8 e5 ff 66 90 4d 63 e5 e9 1e fd 
ff ff <0f> 0b 49 c7 c4 f4 ff ff ff e9 10 fd ff ff f0 48 83 2f 01 0f 85 
[Sat May 15 11:28:32 2021] ---[ end trace 5685e985b988fffa ]---


Title:
  KVM: Fix zero_page reference counter overflow when using KSM on KVM
  compute host

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Focal:
  Fix Released

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/1837810

  [Impact]

  We are seeing a problem on OpenStack compute nodes, and KVM hosts,
  where a kernel oops is generated, and all running KVM machines are
  placed into the pause state.

  This is caused by the kernel's reserved zero_page reference counter
  overflowing from a positive number to a negative number, and hitting a
  (WARN_ON_ONCE(page_ref_count(page) <= 0)) condition in try_get_page().

  This only happens if the machine has Kernel Samepage Merging (KSM)
  enabled, with "use_zero_pages" turned on. Each time a new VM starts
  and the kernel does a KSM merge run during an EPT violation, the
  reference counter for the zero_page is incremented in try_async_pf()
  and never decremented. Eventually, the reference counter will
  overflow, causing the KVM subsystem to fail.
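
  The wrap described above can be illustrated outside the kernel. The
  following is a minimal Python sketch, not kernel code: it assumes the
  counter behaves like a signed 32-bit integer (the kernel's _refcount is
  an atomic_t), and the helper names increment_refcount and would_warn
  are made up for illustration. would_warn mirrors only the
  page_ref_count(page) <= 0 comparison quoted above.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 0x7fffffff, largest positive signed 32-bit value


def increment_refcount(value: int) -> int:
    """Add one and wrap the result to a signed 32-bit integer."""
    return ctypes.c_int32(value + 1).value


def would_warn(refcount: int) -> bool:
    """Mirror the `page_ref_count(page) <= 0` comparison from the report."""
    return refcount <= 0


if __name__ == "__main__":
    wrapped = increment_refcount(INT32_MAX)
    print(hex(INT32_MAX), "->", wrapped, "warns:", would_warn(wrapped))
```

  One increment past 0x7fffffff yields a negative value, which is the
  point at which the comparison starts to fail.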

  Syslog:
  error : qemuMonitorJSONCheckError:392 : internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

  QEMU Logs:
  error: kvm run failed Bad address
  EAX=000afe00 EBX=0000000b ECX=00000080 EDX=00000cfe
  ESI=0003fe00 EDI=000afe00 EBP=00000007 ESP=00006d74
  EIP=000ee344 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
  SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
  TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
  GDT=     000f7040 00000037
  IDT=     000f707e 00000000
  CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
  DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 
DR3=0000000000000000 
  DR6=00000000ffff0ff0 DR7=0000000000000400
  EFER=0000000000000000
  Code=c3 57 56 b8 00 fe 0a 00 be 00 fe 03 00 b9 80 00 00 00 89 c7 <f3> a5 a1 
00 80 03 00 8b 15 04 80 03 00 a3 00 80 0a 00 89 15 04 80 0a 00 b8 ae e2 00 00 31

  Kernel Oops:

  [  167.695986] WARNING: CPU: 1 PID: 3016 at 
/build/linux-hwe-FEhT7y/linux-hwe-4.15.0/include/linux/mm.h:852 
follow_page_pte+0x6f4/0x710
  [  167.696023] CPU: 1 PID: 3016 Comm: CPU 0/KVM Tainted: G           OE    
4.15.0-106-generic #107~16.04.1-Ubuntu
  [  167.696023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.13.0-1ubuntu1 04/01/2014
  [  167.696025] RIP: 0010:follow_page_pte+0x6f4/0x710
  [  167.696026] RSP: 0018:ffffa81802023908 EFLAGS: 00010286
  [  167.696027] RAX: ffffed8786e33a80 RBX: ffffed878c6d21b0 RCX: 
0000000080000000
  [  167.696027] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 
80000001b8cea225
  [  167.696028] RBP: ffffa81802023970 R08: 80000001b8cea225 R09: 
ffff90c4d55fa340
  [  167.696028] R10: 0000000000000000 R11: 0000000000000000 R12: 
ffffed8786e33a80
  [  167.696029] R13: 0000000000000326 R14: ffff90c4db94fc50 R15: 
ffff90c4d55fa340
  [  167.696030] FS:  00007f6a7798c700(0000) GS:ffff90c4edc80000(0000) 
knlGS:0000000000000000
  [  167.696030] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  167.696031] CR2: 0000000000000000 CR3: 0000000315580002 CR4: 
0000000000162ee0
  [  167.696033] Call Trace:
  [  167.696047]  follow_pmd_mask+0x273/0x630
  [  167.696049]  follow_page_mask+0x178/0x230
  [  167.696051]  __get_user_pages+0xb8/0x740
  [  167.696052]  get_user_pages+0x42/0x50
  [  167.696068]  __gfn_to_pfn_memslot+0x18b/0x3b0 [kvm]
  [  167.696079]  ? mmu_set_spte+0x1dd/0x3a0 [kvm]
  [  167.696090]  try_async_pf+0x66/0x220 [kvm]
  [  167.696101]  tdp_page_fault+0x14b/0x2b0 [kvm]
  [  167.696104]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
  [  167.696114]  kvm_mmu_page_fault+0x62/0x180 [kvm]
  [  167.696117]  handle_ept_violation+0xbc/0x160 [kvm_intel]
  [  167.696119]  vmx_handle_exit+0xa5/0x580 [kvm_intel]
  [  167.696129]  vcpu_enter_guest+0x414/0x1260 [kvm]
  [  167.696138]  ? kvm_arch_vcpu_load+0x4d/0x280 [kvm]
  [  167.696148]  kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
  [  167.696157]  ? kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
  [  167.696165]  kvm_vcpu_ioctl+0x33a/0x610 [kvm]
  [  167.696166]  ? do_futex+0x129/0x590
  [  167.696171]  ? __switch_to+0x34c/0x4e0
  [  167.696174]  ? __switch_to_asm+0x35/0x70
  [  167.696176]  do_vfs_ioctl+0xa4/0x600
  [  167.696177]  SyS_ioctl+0x79/0x90
  [  167.696180]  ? exit_to_usermode_loop+0xa5/0xd0
  [  167.696181]  do_syscall_64+0x73/0x130
  [  167.696182]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
  [  167.696184] RIP: 0033:0x7f6a80482007
  [  167.696184] RSP: 002b:00007f6a7798b8b8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
  [  167.696185] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 
00007f6a80482007
  [  167.696185] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 
0000000000000016
  [  167.696186] RBP: 000055fe135f3240 R08: 000055fe118be530 R09: 
0000000000000001
  [  167.696186] R10: 0000000000000000 R11: 0000000000000246 R12: 
0000000000000000
  [  167.696187] R13: 00007f6a85852000 R14: 0000000000000000 R15: 
000055fe135f3240
  [  167.696188] Code: 4d 63 e6 e9 f2 fc ff ff 4c 89 45 d0 48 8b 47 10 e8 22 f0 
9e 00 4c 8b 45 d0 e9 89 fc ff ff 4c 89 e7 e8 81 3f fd ff e9 aa fc ff ff <0f> 0b 
49 c7 c4 f4 ff ff ff e9 c1 fc ff ff 0f 1f 40 00 66 2e 0f 
  [  167.696200] ---[ end trace 7573f6868ea8f069 ]---
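
  When comparing reports of this warning across hosts, it can help to
  pull the source location and function out of the WARNING line. The
  sketch below assumes the line format shown in the oops above and joins
  mail-wrapped lines before matching; parse_warning and its field names
  are hypothetical.

```python
import re

# Matches e.g. "WARNING: CPU: 1 PID: 3016 at <path>:<line> <func>+0x.../0x..."
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<func>\w+)\+"
)


def parse_warning(text: str):
    """Return a dict describing the first WARNING line in text, or None."""
    # Mail clients wrap long log lines, so collapse all whitespace runs first.
    flat = " ".join(text.split())
    match = WARNING_RE.search(flat)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("cpu", "pid", "line"):
        info[key] = int(info[key])
    return info


if __name__ == "__main__":
    sample = (
        "[  167.695986] WARNING: CPU: 1 PID: 3016 at \n"
        "/build/linux-hwe-FEhT7y/linux-hwe-4.15.0/include/linux/mm.h:852 \n"
        "follow_page_pte+0x6f4/0x710"
    )
    print(parse_warning(sample))
```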

  [Fix]

  This was fixed in 5.6-rc1 with the following commit:

  commit 7df003c85218b5f5b10a7f6418208f31e813f38f
  Author: Zhuang Yanying <[email protected]>
  Date:   Sat Oct 12 11:37:31 2019 +0800
  Subject:  KVM: fix overflow of zero page refcount with ksm running
  Link: https://github.com/torvalds/linux/commit/7df003c85218b5f5b10a7f6418208f31e813f38f

  The fix adds a check to see if the Page Frame Number (pfn) is linked
  to the zero page, and if it is, treats it as reserved. This has the
  effect that put_page() is no longer called on the zero_page, and
  reference counting is no longer needed.

  This is a clean cherry pick to Bionic and Focal kernels.

  [Testcase]

  Create a new KVM host, and make sure it has plenty of RAM. 16 GB
  should be enough.

  Install KVM packages:

  $ sudo apt install -y qemu-kvm libvirt-bin qemu-utils genisoimage virtinst

  Enable Kernel Samepage Merging and use_zero_pages:

  $ echo 10000 | sudo tee /sys/kernel/mm/ksm/pages_to_scan
  $ echo 1 | sudo tee /sys/kernel/mm/ksm/run
  $ echo 1 | sudo tee /sys/kernel/mm/ksm/use_zero_pages
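
  To confirm the settings took effect, the same sysfs files can be read
  back. This is a minimal sketch: read_ksm_settings and
  zero_page_merging_active are hypothetical helpers, and the directory is
  a parameter so the function can be pointed at a test directory. It
  reads only the three files written above.

```python
from pathlib import Path

KSM_DIR = Path("/sys/kernel/mm/ksm")
KNOBS = ("run", "use_zero_pages", "pages_to_scan")


def read_ksm_settings(base: Path = KSM_DIR) -> dict:
    """Return {knob: int value} for each knob file that exists under base."""
    settings = {}
    for name in KNOBS:
        path = Path(base) / name
        if path.is_file():
            settings[name] = int(path.read_text().strip())
    return settings


def zero_page_merging_active(settings: dict) -> bool:
    """True when KSM is running and use_zero_pages is enabled."""
    return settings.get("run") == 1 and settings.get("use_zero_pages") == 1


if __name__ == "__main__":
    current = read_ksm_settings()
    print(current, "zero-page merging active:", zero_page_merging_active(current))
```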

  I wrote a script which creates and destroys xenial KVM VMs in an infinite loop:
  https://paste.ubuntu.com/p/CvRTsDkdC7/

  Save the script to disk, and execute it:

  $ chmod +x ksm_refcnt_overflow.sh
  $ ./ksm_refcnt_overflow.sh

  Each time a VM is created and destroyed the reference counter will
  increase.

  I wrote a kernel module which exposes a /proc interface, which we can
  use to look at the value of the zero_page reference counter. It works
  by taking the memory allocated for the zero page, empty_zero_page
  (defined in arch/x86/include/asm/pgtable.h), and running
  virt_to_page() on it to get the page struct, which we can then
  dereference to read _refcount:

  https://paste.ubuntu.com/p/MJMN8jMVds/

  Save the module to disk, create its Makefile from the included
  documentation, and build it:

  $ make
  $ sudo insmod zero_page_refcount.ko 

  From there, we can examine the reference counter with:

  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x687 or 1671
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x846 or 2118
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x9f8 or 2552
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0xcb2 or 3250 
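
  The readings above can be turned into deltas and remaining headroom.
  The sketch below assumes the output format printed by the module above
  and a signed 32-bit counter; parse_refcount, deltas and
  increments_until_negative are hypothetical helper names. The report
  does not give the interval between samples, so no rate per unit time
  is computed.

```python
import re

LINE_RE = re.compile(r"Zero Page Refcount: 0x[0-9a-fA-F]+ or (-?\d+)")
INT32_MAX = 2**31 - 1


def parse_refcount(line: str) -> int:
    """Extract the decimal counter value from one line of module output."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    return int(match.group(1))


def deltas(values):
    """Differences between consecutive samples."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def increments_until_negative(value: int) -> int:
    """Increments left before a positive signed 32-bit counter wraps below zero."""
    return INT32_MAX - value + 1


if __name__ == "__main__":
    lines = [
        "Zero Page Refcount: 0x687 or 1671",
        "Zero Page Refcount: 0x846 or 2118",
        "Zero Page Refcount: 0x9f8 or 2552",
        "Zero Page Refcount: 0xcb2 or 3250",
    ]
    samples = [parse_refcount(line) for line in lines]
    print(deltas(samples), increments_until_negative(samples[-1]))
```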

  We see it steadily increase. Instead of waiting months for it to
  overflow, I implemented a /proc entry to set it to near overflow. You
  can use it with:

  $ cat /proc/zero_page_refcount_set
  Zero Page Refcount set to 0x1FFFFFFFFF000 

  After that, wait a few seconds and the reference counter will
  overflow:

  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x7fffff16 or 2147483414
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x80000000 or -2147483648 

  All VMs will become paused:

  $ virsh list
  Id Name State
  ----------------------------------------------------
  1 instance-0 paused
  2 instance-1 paused 

  QEMU will error out, and the kernel will oops with the messages in the
  impact section.

  I built a test kernel, which is available here:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf290373-test

  If you install the test kernel and try to reproduce, you will notice
  the reference counter is never incremented past 1:

  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x1 or 1
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x1 or 1
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x1 or 1 

  This resolves the problem.

  [Regression Potential]

  While the change itself seems simple, it changes how the kernel treats
  the zero_page. The zero_page is important, since it is just a page
  full of zeros. Each time memory that is all zeros is allocated, the
  kernel maps it to the zero_page to save memory. When an application
  writes to the buffer, an EPT violation happens, and the kernel does a
  COW to new pages to hold the data.

  The change is limited to how the KVM subsystem handles the zero_page.
  This will not break the entire kernel if a regression occurs, only
  KVM.

  If a regression were to occur, users could turn off KSM and disable
  KSM use_zero_pages until a fix is ready, as this particular use of
  zero_pages is limited to KSM.

  The fix landed in upstream 5.6, and has not been backported to stable
  kernels.

  I have read a bit of the paging code, especially around where the
  zero_page is used, and where its reference counters were being
  incorrectly incremented.

  I think the fix is correct, and I believe it won't cause any
  regressions.
