I still have the same issue with kernel 3.16.0-36-generic and 3.13.0-51-generic (both from proposed-updates).

# KVM HOST (3.16.0-36-generic)
sudo apt-get install linux-signed-generic-lts-utopic/trusty-proposed

# KVM GUEST (3.16.0-36-generic)
sudo apt-get install linux-virtual-lts-utopic/trusty-proposed
sudo apt-get install cloud-installer
cloud-install
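
# Sanity check after rebooting into the proposed kernel (for example):
uname -r   # expect 3.16.0-36-generic on both host and guest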

[ 1196.920613] kvm: vmptrld           (null)/780000000000 failed
[ 1196.920953] vmwrite error: reg 401e value 31 (err 1)
[ 1196.921243] CPU: 23 PID: 5240 Comm: qemu-system-x86 Not tainted 3.16.0-36-generic #48~14.04.1-Ubuntu
[ 1196.921244] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 11/03/2014
[ 1196.921245]  0000000000000000 ffff88202018fb58 ffffffff81764a5f ffff880fe4960000
[ 1196.921248]  ffff88202018fb68 ffffffffc0a9320d ffff88202018fb78 ffffffffc0a878bf
[ 1196.921250]  ffff88202018fba8 ffffffffc0a8e1cf ffff880fe4960000 0000000000000000
[ 1196.921252] Call Trace:
[ 1196.921262]  [<ffffffff81764a5f>] dump_stack+0x45/0x56
[ 1196.921277]  [<ffffffffc0a9320d>] vmwrite_error+0x2c/0x2e [kvm_intel]
[ 1196.921280]  [<ffffffffc0a878bf>] vmcs_writel+0x1f/0x30 [kvm_intel]
[ 1196.921283]  [<ffffffffc0a8e1cf>] free_nested.part.73+0x5f/0x170 [kvm_intel]
[ 1196.921286]  [<ffffffffc0a8e363>] vmx_free_vcpu+0x33/0x70 [kvm_intel]
[ 1196.921305]  [<ffffffffc03fd774>] kvm_arch_vcpu_free+0x44/0x50 [kvm]
[ 1196.921312]  [<ffffffffc03fe3e2>] kvm_arch_destroy_vm+0xf2/0x1f0 [kvm]
[ 1196.921318]  [<ffffffff810d28ed>] ? synchronize_srcu+0x1d/0x20
[ 1196.921323]  [<ffffffffc03e635e>] kvm_put_kvm+0x10e/0x220 [kvm]
[ 1196.921328]  [<ffffffffc03e64a8>] kvm_vcpu_release+0x18/0x20 [kvm]
[ 1196.921331]  [<ffffffff811d5e64>] __fput+0xe4/0x220
[ 1196.921333]  [<ffffffff811d5fee>] ____fput+0xe/0x10
[ 1196.921337]  [<ffffffff8108e284>] task_work_run+0xc4/0xe0
[ 1196.921342]  [<ffffffff810702c8>] do_exit+0x2b8/0xa60
[ 1196.921345]  [<ffffffff810e3792>] ? __unqueue_futex+0x32/0x70
[ 1196.921347]  [<ffffffff810e47b6>] ? futex_wait+0x126/0x290
[ 1196.921349]  [<ffffffff8109eb35>] ? check_preempt_curr+0x85/0xa0
[ 1196.921351]  [<ffffffff81070aef>] do_group_exit+0x3f/0xa0
[ 1196.921353]  [<ffffffff81080530>] get_signal_to_deliver+0x1d0/0x6f0
[ 1196.921357]  [<ffffffff81012548>] do_signal+0x48/0xad0
[ 1196.921359]  [<ffffffff81011627>] ? __switch_to+0x167/0x590
[ 1196.921361]  [<ffffffff81013039>] do_notify_resume+0x69/0xb0
[ 1196.921364]  [<ffffffff8176d44a>] int_signal+0x12/0x17
[ 1196.921365] vmwrite error: reg 2800 value ffffffffffffffff (err -255)
[ 1196.921733] CPU: 23 PID: 5240 Comm: qemu-system-x86 Not tainted 3.16.0-36-generic #48~14.04.1-Ubuntu
[ 1196.921734] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 11/03/2014
[ 1196.921735]  0000000000000000 ffff88202018fb58 ffffffff81764a5f ffff880fe4960000
[ 1196.921736]  ffff88202018fb68 ffffffffc0a9320d ffff88202018fb78 ffffffffc0a878bf
[ 1196.921737]  ffff88202018fba8 ffffffffc0a8e1e0 ffff880fe4960000 0000000000000000
[ 1196.921739] Call Trace:
[ 1196.921741]  [<ffffffff81764a5f>] dump_stack+0x45/0x56
[ 1196.921744]  [<ffffffffc0a9320d>] vmwrite_error+0x2c/0x2e [kvm_intel]
[ 1196.921746]  [<ffffffffc0a878bf>] vmcs_writel+0x1f/0x30 [kvm_intel]
[ 1196.921748]  [<ffffffffc0a8e1e0>] free_nested.part.73+0x70/0x170 [kvm_intel]
[ 1196.921751]  [<ffffffffc0a8e363>] vmx_free_vcpu+0x33/0x70 [kvm_intel]
[ 1196.921757]  [<ffffffffc03fd774>] kvm_arch_vcpu_free+0x44/0x50 [kvm]
[ 1196.921763]  [<ffffffffc03fe3e2>] kvm_arch_destroy_vm+0xf2/0x1f0 [kvm]
[ 1196.921765]  [<ffffffff810d28ed>] ? synchronize_srcu+0x1d/0x20
[ 1196.921770]  [<ffffffffc03e635e>] kvm_put_kvm+0x10e/0x220 [kvm]
[ 1196.921774]  [<ffffffffc03e64a8>] kvm_vcpu_release+0x18/0x20 [kvm]
[ 1196.921775]  [<ffffffff811d5e64>] __fput+0xe4/0x220
[ 1196.921777]  [<ffffffff811d5fee>] ____fput+0xe/0x10
[ 1196.921778]  [<ffffffff8108e284>] task_work_run+0xc4/0xe0
[ 1196.921780]  [<ffffffff810702c8>] do_exit+0x2b8/0xa60
[ 1196.921782]  [<ffffffff810e3792>] ? __unqueue_futex+0x32/0x70
[ 1196.921783]  [<ffffffff810e47b6>] ? futex_wait+0x126/0x290
[ 1196.921784]  [<ffffffff8109eb35>] ? check_preempt_curr+0x85/0xa0
[ 1196.921786]  [<ffffffff81070aef>] do_group_exit+0x3f/0xa0
[ 1196.921788]  [<ffffffff81080530>] get_signal_to_deliver+0x1d0/0x6f0
[ 1196.921790]  [<ffffffff81012548>] do_signal+0x48/0xad0
[ 1196.921791]  [<ffffffff81011627>] ? __switch_to+0x167/0x590
[ 1196.921793]  [<ffffffff81013039>] do_notify_resume+0x69/0xb0
[ 1196.921795]  [<ffffffff8176d44a>] int_signal+0x12/0x17
[ 1270.766540] device vnet3 entered promiscuous mode
[ 1270.865885] device vnet4 entered promiscuous mode
[ 1273.824576] kvm: zapping shadow pages for mmio generation wraparound
[ 1447.725335] kvm [6152]: vcpu0 unhandled rdmsr: 0x606

uvt-kvm create \
--memory 16384 \
--disk 100 \
--cpu 2 \
--ssh-public-key-file uvt-authorized_keys \
--template uvt-template.xml \
test release=trusty arch=amd64

....
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='23'/>
    <vcpupin vcpu='1' cpuset='31'/>
    <emulatorpin cpuset='23,31'/>
  </cputune>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>SandyBridge</model>
    <vendor>Intel</vendor>
    <topology sockets='1' cores='2' threads='1'/>
    <feature policy='require' name='vmx'/>
....

https://bugs.launchpad.net/bugs/1413540

Title:
  Trusty soft lockup issues with nested KVM

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Fix Committed

Bug description:
  [Impact]
  Upstream discussion: https://lkml.org/lkml/2015/2/11/247

  Certain workloads that need to execute functions on a non-local CPU
  using smp_call_function_* can result in soft lockups with the
  following backtrace:

  PID: 22262  TASK: ffff8804274bb000  CPU: 1   COMMAND: "qemu-system-x86"
   #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
   #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
   #2 [ffff88043fd03e30] panic at ffffffff81719ff4
   #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
   #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
   #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
   #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
   #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
   #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
  --- <IRQ stack> ---
   #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd
      [exception RIP: generic_exec_single+130]
      RIP: ffffffff810dbe62  RSP: ffff880426f0da00  RFLAGS: 00000202
      RAX: 0000000000000002  RBX: ffff880426f0d9d0  RCX: 0000000000000001
      RDX: ffffffff8180ad60  RSI: 0000000000000000  RDI: 0000000000000286
      RBP: ffff880426f0da30   R8: ffffffff8180ad48   R9: ffff88042713bc68
      R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: ffff8804274bb000
      R13: 0000000000000000  R14: ffff880407670280  R15: 0000000000000000
      ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
  #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75
  #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6
  #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7
  #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb
  #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d
  #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b
  #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8
  #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956
  #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d
  #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341
  #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd
  #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
  #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d
      RIP: 00007fe7ca2cc647  RSP: 00007fe7be9febf0  RFLAGS: 00000293
      RAX: 000000000000001c  RBX: ffffffff8173196d  RCX: ffffffffffffffff
      RDX: 0000000000000004  RSI: 00000000007fb000  RDI: 00007fe7be1ff000
      RBP: 0000000000000000   R8: 0000000000000000   R9: 00007fe7d1cd2738
      R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: 00007fe7be9ff700
      R13: 00007fe7be9ff9c0  R14: 0000000000000000  R15: 0000000000000000
      ORIG_RAX: 000000000000001c  CS: 0033  SS: 002b

  [Fix]

  Commit 9242b5b60df8b13b469bc6b7be08ff6ebb551ad3 mitigates this issue when
  b6b8a1451fc40412c57d1 is applied (as is the case for the affected 3.13
  distro kernel). However, the issue can still occur in some cases.

  
  [Workaround]

  In order to avoid this issue, the workload needs to be pinned to CPUs
  such that the function always executes locally. For the nested VM
  case, this means that each vCPU of the L1 VM needs to be pinned to a
  distinct host CPU. This can be accomplished with the following (for 2 vCPUs):

  virsh vcpupin <domain> 0 0
  virsh vcpupin <domain> 1 1
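
  A more general sketch of the same workaround, assuming the L1 domain is
  named "l1-guest" (an example name) and that host CPUs 0..N-1 are available:

  DOMAIN=l1-guest
  NCPUS=$(virsh vcpucount --active --live "$DOMAIN")
  for ((i = 0; i < NCPUS; i++)); do
      # pin vCPU i to host CPU i so the function always executes locally
      virsh vcpupin "$DOMAIN" "$i" "$i" --live
  done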

  [Test Case]
  - Deploy openstack on openstack
  - Run tempest on L1 cloud
  - Check kernel log of L1 nova-compute nodes (see the example below)
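
  For the last step, something like the following can be used on each L1
  nova-compute node (illustrative only; adjust to your logging setup):

  dmesg -T | grep -E "soft lockup|vmwrite error"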

  (Although this may not necessarily be related to nested KVM)
  Potentially related: https://lkml.org/lkml/2014/11/14/656

  Another test case is to do the following (on affected hardware):

  1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
  2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
  3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

  Sometimes this is sufficient to reproduce the issue; I've observed that
  running KSM in the L1 VM can aggravate it (KSM calls
  native_flush_tlb_others). If this doesn't reproduce the problem, then do
  the following:
  4) Migrate the L2 vCPU randomly (via virsh vcpupin --live or taskset)
  between L1 vCPUs until the hang occurs (a sketch of this step follows
  below).
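
  A minimal sketch of step 4, assuming the L2 domain is named "l2-guest" (an
  example name) and the L1 VM has 2 vCPUs:

  while true; do
      # move the single L2 vCPU to a random L1 vCPU every few seconds
      virsh vcpupin l2-guest 0 $((RANDOM % 2)) --live
      sleep 5
  done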

  --

  Original Description:

  When installing qemu-kvm on a VM, KSM is enabled.

  I have encountered this problem in trusty:
  $ lsb_release -a
  Distributor ID: Ubuntu
  Description:    Ubuntu 14.04.1 LTS
  Release:        14.04
  Codename:       trusty
  $ uname -a
  Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

  The way to see the behaviour:
  1) $ more /sys/kernel/mm/ksm/run
  0
  2) $ sudo apt-get install qemu-kvm
  3) $ more /sys/kernel/mm/ksm/run
  1
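
  If needed for testing, KSM can be switched back off by hand (this just
  reverses the setting shown above; it is not a fix for the lockups):

  echo 0 | sudo tee /sys/kernel/mm/ksm/run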

  To see the soft lockups, deploy a cloud on a virtualised environment like
  ctsstack and run tempest on it (at least twice); the compute nodes of the
  virtualised deployment will eventually stop responding with:
  [24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]

  I am not sure whether the problem is that we are enabling KSM on a VM,
  or that nested KSM is not behaving properly. Either way I can easily
  reproduce it; please contact me if you need further details.
