Hi, just wanted to chime in that this bug also affected me - running OpenStack Juno with KVM inside a KVM hypervisor.
CPU on the host machine is:

  vendor_id  : GenuineIntel
  cpu family : 6
  model      : 58
  model name : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

I'm running Ubuntu 14.04 with the latest packages applied as of today (2015-03-27) for both the host and the guest. The lockup appeared to happen with one host-guest VM after I altered the number of CPUs allocated to another VM (I had yet to reboot that VM for the change to take effect), though I had also recently booted a new host-guest-guest VM.

Mar 27 15:12:43 compute ntpd[1775]: peers refreshed
Mar 27 15:12:43 compute ntpd[1775]: new interface(s) found: waking up resolver
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPDISCOVER(br100) fa:16:3e:c3:81:22
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPOFFER(br100) 203.0.113.27 fa:16:3e:c3:81:22
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPREQUEST(br100) 203.0.113.27 fa:16:3e:c3:81:22
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPACK(br100) 203.0.113.27 fa:16:3e:c3:81:22 test03
Mar 27 15:15:40 compute kernel: [  436.100002] BUG: soft lockup - CPU#5 stuck for 23s! [ksmd:68]
Mar 27 15:15:40 compute kernel: [  436.100002] Modules linked in: vhost_net vhost macvtap macvlan xt_CHECKSUM ebt_ip ebt_arp ebtable_filter bridge stp llc xt_conntrack xt_nat xt_tcpudp iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables nbd ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_intel cirrus snd_hda_codec ttm snd_hwdep drm_kms_helper snd_pcm drm snd_page_alloc snd_timer syscopyarea snd sysfillrect soundcore sysimgblt dm_multipath i2c_piix4 kvm_intel scsi_dh serio_raw kvm mac_hid lp parport 8139too psmouse 8139cp mii floppy pata_acpi
Mar 27 15:15:40 compute kernel: [  436.100002] CPU: 5 PID: 68 Comm: ksmd Not tainted 3.13.0-46-generic #79-Ubuntu
Mar 27 15:15:40 compute kernel: [  436.100002] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Mar 27 15:15:40 compute kernel: [  436.100002] task: ffff8802306db000 ti: ffff8802306e4000 task.ti: ffff8802306e4000
Mar 27 15:15:40 compute kernel: [  436.100002] RIP: 0010:[<ffffffff810dbf56>]  [<ffffffff810dbf56>] generic_exec_single+0x86/0xb0
Mar 27 15:15:40 compute kernel: [  436.100002] RSP: 0018:ffff8802306e5c00  EFLAGS: 00000202
Mar 27 15:15:40 compute kernel: [  436.100002] RAX: 0000000000000006 RBX: ffff8802306e5bd0 RCX: 0000000000000005
Mar 27 15:15:40 compute kernel: [  436.100002] RDX: ffffffff8180ade0 RSI: 0000000000000000 RDI: 0000000000000286
Mar 27 15:15:40 compute kernel: [  436.100002] RBP: ffff8802306e5c30 R08: ffffffff8180adc8 R09: ffff880232989b48
Mar 27 15:15:40 compute kernel: [  436.100002] R10: 0000000000000867 R11: 0000000000000000 R12: 0000000000000000
Mar 27 15:15:40 compute kernel: [  436.100002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 27 15:15:40 compute kernel: [  436.100002] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
Mar 27 15:15:40 compute kernel: [  436.100002] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 27 15:15:40 compute kernel: [  436.100002] CR2: 00007fb0557bf000 CR3: 0000000036b7d000 CR4: 00000000000026e0
Mar 27 15:15:40 compute kernel: [  436.100002] Stack:
Mar 27 15:15:40 compute kernel: [  436.100002]  ffff88023fd13f80 0000000000000004 0000000000000005 ffffffff81d14300
Mar 27 15:15:40 compute kernel: [  436.100002]  ffffffff8105c7a0 ffff88023212c380 ffff8802306e5ca8 ffffffff810dc065
Mar 27 15:15:40 compute kernel: [  436.100002]  00000000000134c0 00000000000134c0 ffff88023fd13f80 ffff88023fd13f80
Mar 27 15:15:40 compute kernel: [  436.100002] Call Trace:
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff810dc065>] smp_call_function_single+0xe5/0x190
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffffa008c21a>] ? kvm_handle_hva_range+0x11a/0x180 [kvm]
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffffa008f300>] ? rmap_write_protect+0x80/0x80 [kvm]
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff810dc496>] smp_call_function_many+0x286/0x2d0
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8105c8f7>] native_flush_tlb_others+0x37/0x40
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8105cbf6>] flush_tlb_page+0x56/0xa0
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8118a4c8>] ptep_clear_flush+0x48/0x60
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8119cd9f>] try_to_merge_with_ksm_page+0x14f/0x650
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8119de36>] ksm_do_scan+0xb96/0xdb0
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8119e0cf>] ksm_scan_thread+0x7f/0x200
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff810ab100>] ? prepare_to_wait_event+0x100/0x100
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8119e050>] ? ksm_do_scan+0xdb0/0xdb0
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8108b592>] kthread+0xd2/0xf0
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8108b4c0>] ? kthread_create_on_node+0x1c0/0x1c0
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff81731ccc>] ret_from_fork+0x7c/0xb0
Mar 27 15:15:40 compute kernel: [  436.100002]  [<ffffffff8108b4c0>] ? kthread_create_on_node+0x1c0/0x1c0
Mar 27 15:15:40 compute kernel: [  436.100002] Code: 4c 89 23 48 89 4b 08 48 89 19 48 89 55 d0 e8 f2 d1 64 00 4c 3b 65 d0 74 23 45 85 ed 75 09 eb 0d 0f 1f 44 00 00 f3 90 f6 43 20 01 <75> f8 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 f7 ff
Mar 27 15:16:08 compute kernel: [  464.100002] BUG: soft lockup - CPU#5 stuck for 23s! [ksmd:68]

I've got to keep working to get this environment up for a test, but if any specific logs or tests would be helpful, let me know and I'll see what I can provide.

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1413540

Title:
  Trusty soft lockup issues with nested KVM

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [Impact]

  Certain workloads that need to execute functions on a non-local CPU
  using smp_call_function_* can result in soft lockups with the
  following backtrace:

  PID: 22262  TASK: ffff8804274bb000  CPU: 1  COMMAND: "qemu-system-x86"
   #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
   #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
   #2 [ffff88043fd03e30] panic at ffffffff81719ff4
   #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
   #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
   #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
   #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
   #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
   #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
  --- <IRQ stack> ---
   #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd
      [exception RIP: generic_exec_single+130]
      RIP: ffffffff810dbe62  RSP: ffff880426f0da00  RFLAGS: 00000202
      RAX: 0000000000000002  RBX: ffff880426f0d9d0  RCX: 0000000000000001
      RDX: ffffffff8180ad60  RSI: 0000000000000000  RDI: 0000000000000286
      RBP: ffff880426f0da30   R8: ffffffff8180ad48   R9: ffff88042713bc68
      R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: ffff8804274bb000
      R13: 0000000000000000  R14: ffff880407670280  R15: 0000000000000000
      ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
  #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75
  #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6
  #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7
  #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb
  #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d
  #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b
  #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8
  #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956
  #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d
  #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341
  #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd
  #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
  #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d
      RIP: 00007fe7ca2cc647  RSP: 00007fe7be9febf0  RFLAGS: 00000293
      RAX: 000000000000001c  RBX: ffffffff8173196d  RCX: ffffffffffffffff
      RDX: 0000000000000004  RSI: 00000000007fb000  RDI: 00007fe7be1ff000
      RBP: 0000000000000000   R8: 0000000000000000   R9: 00007fe7d1cd2738
      R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: 00007fe7be9ff700
      R13: 00007fe7be9ff9c0  R14: 0000000000000000  R15: 0000000000000000
      ORIG_RAX: 000000000000001c  CS: 0033  SS: 002b

  [Workaround]

  To avoid this issue, the workload needs to be pinned to CPUs such
  that the function always executes locally. For the nested VM case,
  this means that the L1 VM needs to have each vCPU pinned to a unique
  CPU. This can be accomplished with the following (for 2 vCPUs):

  virsh vcpupin <domain> 0 0
  virsh vcpupin <domain> 1 1

  [Test Case]

  - Deploy openstack on openstack
  - Run tempest on L1 cloud
  - Check kernel log of L1 nova-compute nodes

  (Although this may not necessarily be related to nested KVM.)

  Potentially related: https://lkml.org/lkml/2014/11/14/656

  --

  Original Description:

  When installing qemu-kvm on a VM, KSM is enabled.
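As a side note on the [Workaround] above: the two vcpupin commands generalise to any vCPU count. A minimal dry-run sketch that prints the commands to apply (the domain name "l1-guest" is a placeholder; substitute your own libvirt domain, and pipe the output to sh to actually apply it):

```shell
#!/bin/sh
# Generate 1:1 pinning commands for the workaround above: vCPU N -> host CPU N.
# Prints one "virsh vcpupin" command per vCPU; review, then pipe to sh.
pin_commands() {
    domain=$1   # libvirt domain name (placeholder in the example call below)
    vcpus=$2    # number of vCPUs; check with: virsh vcpucount <domain>
    i=0
    while [ "$i" -lt "$vcpus" ]; do
        echo "virsh vcpupin $domain $i $i"
        i=$((i + 1))
    done
}

pin_commands l1-guest 2
# prints:
#   virsh vcpupin l1-guest 0 0
#   virsh vcpupin l1-guest 1 1
```

For two vCPUs this emits exactly the two commands from the workaround; the pinning only helps if the L0 host has at least as many physical CPUs as the L1 VM has vCPUs.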
  I have encountered this problem in trusty:

  $ lsb_release -a
  Distributor ID: Ubuntu
  Description:    Ubuntu 14.04.1 LTS
  Release:        14.04
  Codename:       trusty

  $ uname -a
  Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

  The way to see the behaviour:

  1) $ more /sys/kernel/mm/ksm/run
     0
  2) $ sudo apt-get install qemu-kvm
  3) $ more /sys/kernel/mm/ksm/run
     1

  To see the soft lockups, deploy a cloud on a virtualised environment
  like ctsstack and run tempest on it (at least twice); the compute
  nodes of the virtualised deployment will eventually stop responding
  with:

  [24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]

  I am not sure whether the problem is that we are enabling KSM on a VM
  or that nested KSM is not behaving properly. Either way I can easily
  reproduce it; please contact me if you need further details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp