I can observe this on only one of the Intel Xeon machines (which has 48
CPUs and 1 TB of memory), but it is very reliably reproducible there.
Reproducer:
- Ensure the physical host (L0) and the guest hypervisor (L1) are running
  the 3.20.0-0.rc0.git5.1 kernel (I used the one from Fedora Rawhide),
  preferably on an Intel Xeon machine, as that's where I could reproduce
  this issue; I could not reproduce it on a Haswell machine.
- Boot an L2 guest: run `qemu-sanity-check --accel=kvm` in L1 (or use
  your own preferred method of booting an L2 KVM guest).
- On a different terminal, attached to L1's serial console, observe L1
  reboot. (A minimal sketch of this flow follows this list.)
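For concreteness, here is a minimal sketch of the above flow. The L1
domain name "l1-guest-hypervisor" is only an example (substitute your
own), and any other way of booting an L2 guest works just as well:

  # In L1 (the guest hypervisor): boot a throwaway L2 guest
  $ qemu-sanity-check --accel=kvm

  # On L0, in a separate terminal: attach to L1's serial console
  $ virsh console l1-guest-hypervisor

  # Also on L0: watch for new kernel messages while the L2 guest boots
  $ dmesg -w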
The only thing I notice in `dmesg` (on L0) is the trace below. _However_,
this trace does not occur when the L1 reboot is triggered while watching
`dmesg -w` (to wait for new messages) as I boot an L2 guest -- which means
the trace below is not the root cause of L1 being rebooted. When the L2
gets rebooted, all you observe is one of the "vcpu0 unhandled rdmsr: 0x1a6"
messages shown below.
. . .
[Feb16 13:44] ------------[ cut here ]------------
[ +0.004632] WARNING: CPU: 4 PID: 1837 at arch/x86/kvm/vmx.c:9190
nested_vmx_vmexit+0x96e/0xb00 [kvm_intel]()
[ +0.009835] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ip6table_filter
ip6_tables cfg80211 rfkill iTCO_wdt iTCO_vendor_support ipmi_devintf gpio_ich
dcdbas coretemp kvm_intel kvm crc32c_intel ipmi_ssif serio_raw acpi_power_meter
ipmi_si tpm_tis ipmi_msghandler tpm lpc_ich i7core_edac mfd_core edac_core
acpi_cpufreq shpchp wmi mgag200 i2c_algo_bit drm_kms_helper ttm ata_generic drm
pata_acpi megaraid_sas bnx2
[ +0.050289] CPU: 4 PID: 1837 Comm: qemu-system-x86 Not tainted
3.20.0-0.rc0.git5.1.fc23.x86_64 #1
[ +0.008902] Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.8.2
10/25/2012
[ +0.007469] 0000000000000000 00000000ee6c0c54 ffff88bf60bf7c18
ffffffff818760f7
[ +0.007542] 0000000000000000 0000000000000000 ffff88bf60bf7c58
ffffffff810ab80a
[ +0.007519] ffff88ff625b8000 ffff883f55f9b000 0000000000000000
0000000000000014
[ +0.007489] Call Trace:
[ +0.002471] [<ffffffff818760f7>] dump_stack+0x4c/0x65
[ +0.005152] [<ffffffff810ab80a>] warn_slowpath_common+0x8a/0xc0
[ +0.006020] [<ffffffff810ab93a>] warn_slowpath_null+0x1a/0x20
[ +0.005851] [<ffffffffa130957e>] nested_vmx_vmexit+0x96e/0xb00 [kvm_intel]
[ +0.006974] [<ffffffffa130c5f7>] ? vmx_handle_exit+0x1e7/0xcb2 [kvm_intel]
[ +0.006999] [<ffffffffa02ca972>] ? kvm_arch_vcpu_ioctl_run+0x6d2/0x1b50 [kvm]
[ +0.007239] [<ffffffffa130992a>] vmx_queue_exception+0x10a/0x150 [kvm_intel]
[ +0.007136] [<ffffffffa02cb30b>] kvm_arch_vcpu_ioctl_run+0x106b/0x1b50 [kvm]
[ +0.007162] [<ffffffffa02ca972>] ? kvm_arch_vcpu_ioctl_run+0x6d2/0x1b50 [kvm]
[ +0.007241] [<ffffffff8110760d>] ? trace_hardirqs_on+0xd/0x10
[ +0.005864] [<ffffffffa02b2df6>] ? vcpu_load+0x26/0x70 [kvm]
[ +0.005761] [<ffffffff81103c0f>] ? lock_release_holdtime.part.29+0xf/0x200
[ +0.006979] [<ffffffffa02c5f88>] ? kvm_arch_vcpu_load+0x58/0x210 [kvm]
[ +0.006634] [<ffffffffa02b3203>] kvm_vcpu_ioctl+0x383/0x7e0 [kvm]
[ +0.006197] [<ffffffff81027b9d>] ? native_sched_clock+0x2d/0xa0
[ +0.006026] [<ffffffff810d5fc6>] ? creds_are_invalid.part.1+0x16/0x50
[ +0.006537] [<ffffffff810d6021>] ? creds_are_invalid+0x21/0x30
[ +0.005930] [<ffffffff813a61da>] ? inode_has_perm.isra.48+0x2a/0xa0
[ +0.006365] [<ffffffff8128c7b8>] do_vfs_ioctl+0x2e8/0x530
[ +0.005496] [<ffffffff8128ca81>] SyS_ioctl+0x81/0xa0
[ +0.005065] [<ffffffff8187f8e9>] system_call_fastpath+0x12/0x17
[ +0.006014] ---[ end trace 2f24e0820b44f686 ]---
[ +5.870886] kvm [1783]: vcpu0 unhandled rdmsr: 0x1c9
[ +0.004991] kvm [1783]: vcpu0 unhandled rdmsr: 0x1a6
[ +0.005020] kvm [1783]: vcpu0 unhandled rdmsr: 0x3f6
[Feb16 14:18] kvm [1783]: vcpu0 unhandled rdmsr: 0x1c9
[ +0.005020] kvm [1783]: vcpu0 unhandled rdmsr: 0x1a6
[ +0.004998] kvm [1783]: vcpu0 unhandled rdmsr: 0x3f6
. . .
Version
-------
The exact versions below were used on both L0 and L1:
$ uname -r; rpm -q qemu-system-x86
3.20.0-0.rc0.git5.1.fc23.x86_64
qemu-system-x86-2.2.0-5.fc22.x86_64
Other info
----------
- Unpacking kernel-3.20.0-0.rc0.git5.1.fc23.src.rpm and looking at the
  file arch/x86/kvm/vmx.c, line 9190 is shown below, with contextual
  code (a sketch of one way to extract these lines follows this list):
[. . .]
9178  * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
9179  * and modify vmcs12 to make it see what it would expect to see there if
9180  * L2 was its real guest. Must only be called when in L2 (is_guest_mode())
9181  */
9182 static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
9183                               u32 exit_intr_info,
9184                               unsigned long exit_qualification)
9185 {
9186         struct vcpu_vmx *vmx = to_vmx(vcpu);
9187         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
9188
9189         /* trying to cancel vmlaunch/vmresume is a bug */
9190         WARN_ON_ONCE(vmx->nested.nested_run_pending);
9191
9192         leave_guest_mode(vcpu);
9193         prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info,
9194                        exit_qualification);
9195
9196         vmx_load_vmcs01(vcpu);
9197
9198         if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
9199             && nested_exit_intr_ack_set(vcpu)) {
9200                 int irq = kvm_cpu_get_interrupt(vcpu);
9201                 WARN_ON(irq < 0);
9202                 vmcs12->vm_exit_intr_info = irq |
9203                         INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
9204         }
- The above line 9190 was introduced in this commit:
$ git log -S'WARN_ON_ONCE(vmx->nested.nested_run_pending)' \
-- ./arch/x86/kvm/vmx.c
commit 5f3d5799974b89100268ba813cec8db7bd0693fb
Author: Jan Kiszka <[email protected]>
Date: Sun Apr 14 12:12:46 2013 +0200
KVM: nVMX: Rework event injection and recovery
The basic idea is to always transfer the pending event injection on
vmexit into the architectural state of the VCPU and then drop it from
there if it turns out that we left L2 to enter L1, i.e. if we enter
prepare_vmcs12.
vmcs12_save_pending_events takes care to transfer pending L0 events into
the queue of L1. That is mandatory as L1 may decide to switch the guest
state completely, invalidating or preserving the pending events for
later injection (including on a different node, once we support
migration).
This concept is based on the rule that a pending vmlaunch/vmresume is
not canceled. Otherwise, we would risk to lose injected events or leak
them into the wrong queues. Encode this rule via a WARN_ON_ONCE at the
entry of nested_vmx_vmexit.
Signed-off-by: Jan Kiszka <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
- `dmesg`, `dmidecode`, and `x86info -a` details for L0 and L1 are here:
  https://kashyapc.fedorapeople.org/virt/Info-L0-Intel-Xeon-and-L1-nVMX-test/
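For reference, a minimal sketch of one way to unpack the source RPM and
view the contextual code quoted above. The paths are illustrative and may
differ on your system; note that Fedora applies its patches on top of the
base tarball, so prepare the tree first (e.g. with `rpmbuild -bp`) before
matching line numbers:

  # Install the source RPM into ~/rpmbuild, then unpack and patch the tree
  $ rpm -i kernel-3.20.0-0.rc0.git5.1.fc23.src.rpm
  $ rpmbuild -bp ~/rpmbuild/SPECS/kernel.spec

  # View the contextual code around the WARN_ON_ONCE at line 9190
  $ sed -n '9178,9204p' ~/rpmbuild/BUILD/kernel-*/linux-*/arch/x86/kvm/vmx.c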
--
/kashyap