Re: VIA Eden X4
"Matwey V. Kornilov"writes: > Hello, > > According to WikiPedia VIA claims x86 hardware assisted virtualization > for VIA Eden X4 CPU. > Does anybody know if it is supported by Linux KVM? > I can't say for sure but my guess is that it should work since VIA implements VT-x like virtualization extensions, so KVM will find VMX capable hardware. Bandan > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: x86: Don't report guest userspace emulation error to userspace, why ?
Paolo Bonzini <pbonz...@redhat.com> writes:
> On 10/12/2015 18:58, Bandan Das wrote:
>>> > Allowing userspace to stop the guest with an emulation failure is a
>> This one I don't :) Userspace started the guest after all, there are other
>> ways for it to kill the guest if it wanted to.
>
> I mean allowing guest userspace to stop the guest.

Sure! Userspace (Qemu) can just reenter the guest when it sees the failure.
Doing it in the host kvm seems overkill.

> Paolo
>
>>> > possible denial of service, similar to L2 stopping L1 with an emulation
>>> > failure.
Re: x86: Don't report guest userspace emulation error to userspace, why ?
Paolo Bonzini <pbonz...@redhat.com> writes:
>> Paolo Bonzini <pbonz...@redhat.com> writes:
>> > On 10/12/2015 18:58, Bandan Das wrote:
>> >>> > Allowing userspace to stop the guest with an emulation failure is a
>> >> This one I don't :) Userspace started the guest after all, there are other
>> >> ways for it to kill the guest if it wanted to.
>> >
>> > I mean allowing guest userspace to stop the guest.
>>
>> Sure! Userspace (Qemu) can just reenter the guest when it sees the failure.
>> Doing it in the host kvm seems overkill.
>
> Most userspaces will get it wrong. Doing it once makes sure that you
> do it right.

I don't buy that. As userspace, I prefer to know what error the guest I
launched hit and decide what to do. Well, at least whenever I can. This
seems to be one such case.

> Paolo
Re: x86: Don't report guest userspace emulation error to userspace, why ?
Paolo Bonzini <pbonz...@redhat.com> writes:
> On 09/12/2015 23:18, Bandan Das wrote:
>> Commit a2b9e6c1a35afcc09:
>>
>> KVM: x86: Don't report guest userspace emulation error to userspace
>>
>> Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to
>> user-space") disabled the reporting of L2 (nested guest) emulation
>> failures to userspace due to a race condition between a vmexit and the
>> instruction emulator. The same rationale applies also to userspace
>> applications that are permitted by the guest OS to access MMIO area or
>> perform PIO.
>>
>> This patch extends the current behavior - of injecting a #UD instead of
>> reporting it to userspace - also for guest userspace code.
>>
>> I searched the archives but failed to find anything. Can someone please
>> explain why this is needed? Or, why not let userspace decide what to do
>> based on the cpl, whether to continue execution or kill the guest? Is
>> the assumption here that this is what userspace always wants?
>
> Not what userspace always wants, but what the guest kernel always wants.

Thanks Paolo, this one I agree with.

> Allowing userspace to stop the guest with an emulation failure is a

This one I don't :) Userspace started the guest after all, there are other
ways for it to kill the guest if it wanted to.

> possible denial of service, similar to L2 stopping L1 with an emulation
> failure.
>
> Paolo
x86: Don't report guest userspace emulation error to userspace, why ?
Commit a2b9e6c1a35afcc09:

KVM: x86: Don't report guest userspace emulation error to userspace

Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to
user-space") disabled the reporting of L2 (nested guest) emulation failures
to userspace due to a race condition between a vmexit and the instruction
emulator. The same rationale applies also to userspace applications that
are permitted by the guest OS to access MMIO area or perform PIO.

This patch extends the current behavior - of injecting a #UD instead of
reporting it to userspace - also for guest userspace code.

I searched the archives but failed to find anything. Can someone please
explain why this is needed? Or, why not let userspace decide what to do
based on the cpl, whether to continue execution or kill the guest? Is the
assumption here that this is what userspace always wants?
Re: [PATCH] KVM: nVMX: remove incorrect vpid check in nested invvpid emulation
Haozhong Zhang <haozhong.zh...@intel.com> writes:
> On 11/25/15 10:45, Bandan Das wrote:
>> Haozhong Zhang <haozhong.zh...@intel.com> writes:
>>
>> > This patch removes the vpid check when emulating nested invvpid
>> > instruction of type all-contexts invalidation. The existing code is
>> > incorrect because:
>> > (1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate
>> > Translations Based on VPID", invvpid instruction does not check
>> > vpid in the invvpid descriptor when its type is all-contexts
>> > invalidation.
>>
>> But iirc isn't vpid=0 reserved for root mode?
> Yes,
>
>> I think we don't want the
>> L1 hypervisor to be able to do an invvpid(0).
>
> but the invvpid emulated here is doing the all-contexts invalidation that
> does not use the given vpid and "invalidates all mappings tagged with all
> VPIDs except VPID H" (from Intel SDM).

Actually, vpid_sync_context will always try single invalidation first and
I was concerned that we would end up calling vpid_sync_context(0). But
that will not happen since nested.vpid02 is always allocated a vpid.
So... we are good :)

>>
>> > (2) According to the same document, invvpid of type all-contexts
>> > invalidation does not require there is an active VMCS, so/and
>> > get_vmcs12() in the existing code may result in a NULL-pointer
>> > dereference. In practice, it can crash both KVM itself and L1
>> > hypervisors that use invvpid (e.g. Xen).
>>
>> If that is the case, then just check if it's null and return without
>> doing anything.
>
> (according to Intel SDM) invvpid of type all-contexts invalidation
> should not trigger a valid vmx fail if vpid in the current VMCS is 0.

No, I meant just skip the instruction and return, but I doubt there's any
overhead in flushing mappings that don't exist in the first place. Anyway,
better to do as the spec says.

> However, this check and its following operation do change this semantics
> in nested VMX, so it should be completely removed.
>
>>
>> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
>> > ---
>> > arch/x86/kvm/vmx.c | 5 -
>> > 1 file changed, 5 deletions(-)
>> >
>> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> > index 87acc52..af823a3 100644
>> > --- a/arch/x86/kvm/vmx.c
>> > +++ b/arch/x86/kvm/vmx.c
>> > @@ -7394,11 +7394,6 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
>> >
>> >    switch (type) {
>> >    case VMX_VPID_EXTENT_ALL_CONTEXT:
>> > -          if (get_vmcs12(vcpu)->virtual_processor_id == 0) {
>> > -                  nested_vmx_failValid(vcpu,
>> > -                          VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
>> > -                  return 1;
>> > -          }
>> >            __vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02);
>> >            nested_vmx_succeed(vcpu);
>> >            break;
>>
>> I also noticed a BUG() here in the default. It might be a good idea to
>> replace it with a WARN.
>
> Or, in nested_vmx_setup_ctls_msrs():
>
> if (enable_vpid)
> -     vmx->nested.nested_vmx_vpid_caps = VMX_VPID_INVVPID_BIT |
> -             VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT;
> +     vmx->nested.nested_vmx_vpid_caps = VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT;
>
> because the current handle_invvpid() only handles all-contexts invalidation.
>
> Haozhong
Re: [PATCH] KVM: nVMX: remove incorrect vpid check in nested invvpid emulation
Haozhong Zhang writes:
> This patch removes the vpid check when emulating nested invvpid
> instruction of type all-contexts invalidation. The existing code is
> incorrect because:
> (1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate
> Translations Based on VPID", invvpid instruction does not check
> vpid in the invvpid descriptor when its type is all-contexts
> invalidation.

But iirc isn't vpid=0 reserved for root mode? I think we don't want the
L1 hypervisor to be able to do an invvpid(0).

> (2) According to the same document, invvpid of type all-contexts
> invalidation does not require there is an active VMCS, so/and
> get_vmcs12() in the existing code may result in a NULL-pointer
> dereference. In practice, it can crash both KVM itself and L1
> hypervisors that use invvpid (e.g. Xen).

If that is the case, then just check if it's null and return without
doing anything.

> Signed-off-by: Haozhong Zhang
> ---
> arch/x86/kvm/vmx.c | 5 -
> 1 file changed, 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 87acc52..af823a3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7394,11 +7394,6 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
>
>	switch (type) {
>	case VMX_VPID_EXTENT_ALL_CONTEXT:
> -		if (get_vmcs12(vcpu)->virtual_processor_id == 0) {
> -			nested_vmx_failValid(vcpu,
> -				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> -			return 1;
> -		}
>		__vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02);
>		nested_vmx_succeed(vcpu);
>		break;

I also noticed a BUG() here in the default. It might be a good idea to
replace it with a WARN.
Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Joerg Roedel <j...@8bytes.org> writes:
> On Tue, Oct 06, 2015 at 01:59:27PM -0400, Bandan Das wrote:
>> Joerg Roedel <j...@8bytes.org> writes:
>> >
>> > So svm->vmcb->control.next_rip is only written by hardware or in
>> > svm_check_intercept(). Both cases write only to this field, if the
>> > hardware supports X86_FEATURE_NRIPS. The write in nested_svm_vmexit only
>>
>> Not until commit f104765b4f81fd74d69e0eb161e89096deade2db. So, an older L1
>> kernel will trigger it.
>
> But we don't care if L1 writes something into its own next_rip, as we
> never read this value from its VMCB. We only copy the next_rip value we
> get from our shadow-vmcb to it on an emulated vmexit. So I still don't
> understand what triggers the reported problem or why the WARN_ON is
> necessary.

Ok, looks like I am making some incorrect "vmx" assumptions here. When we
exit from L2 to L0, aren't we looking at the VMCB L1 is using to run L2?
Wouldn't that trigger the warning if the host processor does not support
nrips and the field is set?

> Joerg
Re: [PATCH] kvm: svm: Only propagate next_rip when guest supports it
Joerg Roedel writes:
> On Wed, Oct 07, 2015 at 01:03:35PM +0200, Joerg Roedel wrote:
>> But we don't care if L1 writes something into its own next_rip, as we
>> never read this value from its VMCB. We only copy the next_rip value we
>> get from our shadow-vmcb to it on an emulated vmexit. So I still don't
>> understand what triggers the reported problem or why the WARN_ON is
>> necessary.
>
> Okay, I think I have an idea now. I talked a bit with Dirk and the
> WARN_ON triggers in the guest, and not on the host. This makes a lot
> more sense.
>
> In nested-svm we always copy the next_rip from the shadow-vmcb to the
> guests vmcb, even when the nrips bit in cpuid is not set for the guest.
> This obviously triggers the WARN_ON() in the L1 KVM (I still don't
> understand why the WARN_ON was introduced in the first place).

Ok, understood now. The WARN_ON would trigger in L1 only if it has decided
to disable nrips for some reason, as was the case here. So, my reasoning
behind putting the warning there was incorrect.

> So the right fix is to only copy next_rip to the guests vmcb when its
> cpuid indicates that next_rip is supported there, like in this patch:

Yep, agreed.

> From 019afc60507618b8e44e0c67d5ea2d850d88c9dd Mon Sep 17 00:00:00 2001
> From: Joerg Roedel
> Date: Wed, 7 Oct 2015 13:38:19 +0200
> Subject: [PATCH] kvm: svm: Only propagate next_rip when guest supports it
>
> Currently we always write the next_rip of the shadow vmcb to
> the guests vmcb when we emulate a vmexit. This could confuse
> the guest when its cpuid indicated no support for the
> next_rip feature.
>
> Fix this by only propagating next_rip if the guest actually
> supports it.
>
> Signed-off-by: Joerg Roedel
> ---
> arch/x86/kvm/cpuid.h | 21 +
> arch/x86/kvm/svm.c | 7 ++-
> 2 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index dd05b9c..effca1f 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -133,4 +133,25 @@ static inline bool guest_cpuid_has_mpx(struct kvm_vcpu *vcpu)
>	best = kvm_find_cpuid_entry(vcpu, 7, 0);
>	return best && (best->ebx & bit(X86_FEATURE_MPX));
> }
> +
> +/*
> + * NRIPS is provided through cpuid fn 0x8000_000a.edx bit 3
> + */
> +#define BIT_NRIPS	3
> +
> +static inline bool guest_cpuid_has_nrips(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_cpuid_entry2 *best;
> +
> +	best = kvm_find_cpuid_entry(vcpu, 0x8000000a, 0);
> +
> +	/*
> +	 * NRIPS is a scattered cpuid feature, so we can't use
> +	 * X86_FEATURE_NRIPS here (X86_FEATURE_NRIPS would be bit
> +	 * position 8, not 3).
> +	 */
> +	return best && (best->edx & bit(BIT_NRIPS));
> +}
> +#undef BIT_NRIPS
> +
> #endif
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 94b7d15..e1a8824 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2459,7 +2459,9 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
>	nested_vmcb->control.exit_info_2 = vmcb->control.exit_info_2;
>	nested_vmcb->control.exit_int_info = vmcb->control.exit_int_info;
>	nested_vmcb->control.exit_int_info_err = vmcb->control.exit_int_info_err;
> -	nested_vmcb->control.next_rip = vmcb->control.next_rip;
> +
> +	if (guest_cpuid_has_nrips(vcpu))
> +		nested_vmcb->control.next_rip = vmcb->control.next_rip;
>
>	/*
>	 * If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have
> @@ -2714,6 +2716,9 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
>	svm->vmcb->control.event_inj = nested_vmcb->control.event_inj;
>	svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err;
>
> +	/* Clear next_rip, as real hardware would do */
> +	nested_vmcb->control.next_rip = 0;
> +

Why do we need this?
And are you sure this is what real hardware does? I couldn't find anything
in the spec.

>	nested_svm_unmap(page);
>
>	/* Enter Guest-Mode */
Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Joerg Roedel <j...@8bytes.org> writes:
> On Thu, Oct 01, 2015 at 06:31:27PM -0400, Bandan Das wrote:
>> >> @@ -514,7 +514,7 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
>> >>	struct vcpu_svm *svm = to_svm(vcpu);
>> >>
>> >>	if (svm->vmcb->control.next_rip != 0) {
>> >> -		WARN_ON(!static_cpu_has(X86_FEATURE_NRIPS));
>> >> +		WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
>> >>		svm->next_rip = svm->vmcb->control.next_rip;
>> >>	}
>
> I looked again how this could possibly be triggered, and I am somewhat
> confused now.
>
> So svm->vmcb->control.next_rip is only written by hardware or in
> svm_check_intercept(). Both cases write only to this field, if the
> hardware supports X86_FEATURE_NRIPS. The write in nested_svm_vmexit only

Not until commit f104765b4f81fd74d69e0eb161e89096deade2db. So, an older L1
kernel will trigger it.

> targets the guests VMCB, and we don't use that one again.
>
> So I can't see how the WARN_ON above could be triggered. Do I miss
> something or might this also be a miscompilation of static_cpu_has?
>
> Joerg
Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Joerg Roedel <j...@8bytes.org> writes:
> On Mon, Oct 05, 2015 at 01:42:43PM -0400, Bandan Das wrote:
>> Joerg Roedel <j...@8bytes.org> writes:
>>
>> > On Mon, Oct 05, 2015 at 12:54:43PM -0400, Bandan Das wrote:
>> >> Joerg Roedel <j...@8bytes.org> writes:
>> >> The problem is that the next_rip field could be stale. If the processor
>> >> supports next_rip, then it will clear it out on the next entry. If it
>> >> doesn't, an old value just sits there (no matter who wrote it) and the
>> >> problem happens when skip_emulated_instruction advances the rip with an
>> >> incorrect value.
>> >
>> > So the right fix would be to just set the guests next_rip to 0 when we
>> > emulate vmrun, just like real hardware does, no?
>>
>> Agreed, resetting to 0 if nrips isn't supported seems right. It would still
>> help having a printk_once in this case IMO :)
>
> I meant to reset it always to 0 on vmrun, like real hardware does.

At least the spec doesn't mention this; I don't know how I got that idea :)
The spec just says that it gets written to by hardware on certain intercepts
and for others it gets reset to 0 on #VMEXIT.

> Joerg
Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Joerg Roedel <j...@8bytes.org> writes:
> On Mon, Oct 05, 2015 at 12:54:43PM -0400, Bandan Das wrote:
>> Joerg Roedel <j...@8bytes.org> writes:
>> The problem is that the next_rip field could be stale. If the processor
>> supports next_rip, then it will clear it out on the next entry. If it
>> doesn't, an old value just sits there (no matter who wrote it) and the
>> problem happens when skip_emulated_instruction advances the rip with an
>> incorrect value.
>
> So the right fix would be to just set the guests next_rip to 0 when we
> emulate vmrun, just like real hardware does, no?

Agreed, resetting to 0 if nrips isn't supported seems right. It would still
help having a printk_once in this case IMO :)

> Joerg
Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Dirk Müller writes:
>> So the right fix would be to just set the guests next_rip to 0 when we
>> emulate vmrun, just like real hardware does, no?
>
> Like this? (Note: I’m not sure what I’m doing here..). I agree with you
> that the warning for this seems excessive, I’ve just removed it.
>
> From 47db81837b6fe58aa302317bbf2a092b40af0a36 Mon Sep 17 00:00:00 2001
> From: Dirk Mueller
> Date: Fri, 2 Oct 2015 08:35:24 +0200
> Subject: [PATCH] KVM: nSVM: Check for NRIP support before passing on
> control.next_rip
>
> NRIP support itself depends on cpuid Fn8000_000A_EDX[NRIPS], so
> ignore it if the cpu feature is not available. Remove the
> guest-triggerable WARN_ON for this as it just emulates real
> hardware behavior.
>
> Signed-off-by: Dirk Mueller
> ---
> arch/x86/kvm/svm.c | 6 --
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 0a42859..66e3a4c 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -514,8 +514,10 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
>	struct vcpu_svm *svm = to_svm(vcpu);
>
>	if (svm->vmcb->control.next_rip != 0) {
> -		WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
> -		svm->next_rip = svm->vmcb->control.next_rip;
> +		if (static_cpu_has(X86_FEATURE_NRIPS))
> +			svm->next_rip = svm->vmcb->control.next_rip;
> +		else
> +			svm->next_rip = 0;

skip_emulated_instruction probably isn't the right place.

>	}

So, L1 writes this field in svm_check_intercept. I went back to the spec
and it says (15.7.1):

  The next sequential instruction pointer (nRIP) is saved in the guest
  VMCB control area at location C8h on all #VMEXITs that are due to
  instruction intercepts, as defined in Section 15.9 on page 458, as well
  as MSR and IOIO intercepts and exceptions caused by the INT3, INTO, and
  BOUND instructions. For all other intercepts, nRIP is reset to zero.

So, I think a better way would be to reset the field on vmexit from L2 to
L0 if NRIP isn't supported.
Maybe it's enough to do this only for the intercepts mentioned above, but
I am not sure. Also, please include a debug message (pr_debug_once will
do). It seems unnecessary today, but in 2 years when someone is scratching
his head over why his nested guests won't run, he will thank you!

BTW, it seems possible to unconditionally advertise SVM_FEATURE_NRIP..

>	if (!svm->next_rip) {
Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Joerg Roedel <j...@8bytes.org> writes:
> On Thu, Oct 01, 2015 at 06:31:27PM -0400, Bandan Das wrote:
>> Paolo Bonzini <pbonz...@redhat.com> writes:
>>
>> > On 01/10/2015 13:43, Dirk Müller wrote:
>> >> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> >> index 94b7d15..0a42859 100644
>> >> --- a/arch/x86/kvm/svm.c
>> >> +++ b/arch/x86/kvm/svm.c
>> >> @@ -514,7 +514,7 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
>> >>	struct vcpu_svm *svm = to_svm(vcpu);
>> >>
>> >>	if (svm->vmcb->control.next_rip != 0) {
>> >> -		WARN_ON(!static_cpu_has(X86_FEATURE_NRIPS));
>> >> +		WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
>> >>		svm->next_rip = svm->vmcb->control.next_rip;
>> >>	}
>> >>
>> >
>> > Bandan, what was the reason for warning here?
>>
>> I added the warning so that we catch if the next_rip field is being written
>> to (even if the feature isn't supported) by a buggy L1 hypervisor.
>
> Even if the L1 hypervisor writes to the next_rip field in the VMCB, we
> would never see it in this code path, as we access the shadow VMCB in
> this statement.
>
> We don't even care if the L1 hypervisor writes to its next_rip field
> because we only write to this field on an emulated VMEXIT and never
> read it back.

The problem is that the next_rip field could be stale. If the processor
supports next_rip, then it will clear it out on the next entry. If it
doesn't, an old value just sits there (no matter who wrote it) and the
problem happens when skip_emulated_instruction advances the rip with an
incorrect value.

> So what's the point in adding a guest-triggerable warning at all?

So, yes, maybe this doesn't have to be a guest-specific warning, but we
still need to warn if this unsupported field is being written to.

> Joerg
Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Dirk Müller writes:
>> I added the warning so that we catch if the next_rip field is being written
>> to (even if the feature isn't supported) by a buggy L1 hypervisor.
>
> Interesting, so how about this patch?
>
> From c5c8ea255d680f972cbdfc835cdf352fa78897ae Mon Sep 17 00:00:00 2001
> From: Dirk Mueller
> Date: Fri, 2 Oct 2015 08:35:24 +0200
> Subject: [PATCH] KVM: nSVM: Check for NRIP support before accepting
> control.next_rip
>
> NRIP support itself depends on cpuid Fn8000_000A_EDX[NRIPS], remove
> a WARN_ON_(once) and check for it directly.
>
> Signed-off-by: Dirk Mueller
> ---
> arch/x86/kvm/svm.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 0a42859..33d36da 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -513,8 +513,8 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
> {
>	struct vcpu_svm *svm = to_svm(vcpu);
>
> -	if (svm->vmcb->control.next_rip != 0) {
> -		WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
> +	if (static_cpu_has(X86_FEATURE_NRIPS) &&
> +	    svm->vmcb->control.next_rip != 0) {
>		svm->next_rip = svm->vmcb->control.next_rip;
>	}

Ok, looks good to me. Still, probably a good idea to let the user know if
this condition is hit.

Bandan
Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
Paolo Bonzini writes:
> On 01/10/2015 13:43, Dirk Müller wrote:
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 94b7d15..0a42859 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -514,7 +514,7 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
>>	struct vcpu_svm *svm = to_svm(vcpu);
>>
>>	if (svm->vmcb->control.next_rip != 0) {
>> -		WARN_ON(!static_cpu_has(X86_FEATURE_NRIPS));
>> +		WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
>>		svm->next_rip = svm->vmcb->control.next_rip;
>>	}
>>
>
> Bandan, what was the reason for warning here?

I added the warning so that we catch if the next_rip field is being written
to (even if the feature isn't supported) by a buggy L1 hypervisor.

From the commit:

-	if (svm->vmcb->control.next_rip != 0)
+	if (svm->vmcb->control.next_rip != 0) {
+		WARN_ON(!static_cpu_has(X86_FEATURE_NRIPS));
		svm->next_rip = svm->vmcb->control.next_rip;
+	}

	if (!svm->next_rip) {
		if (emulate_instruction(vcpu, EMULTYPE_SKIP) !=
@@ -4317,7 +4319,9 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
		break;
	}

-	vmcb->control.next_rip  = info->next_rip;
+	/* TODO: Advertise NRIPS to guest hypervisor unconditionally */
+	if (static_cpu_has(X86_FEATURE_NRIPS))
+		vmcb->control.next_rip  = info->next_rip;

	vmcb->control.exit_code = icpt_info.exit_code;
	vmexit = nested_svm_exit_handled(svm);

...

> Should we change the "if" condition to static_cpu_has(X86_FEATURE_NRIPS)
> instead of Dirk's patch?

Yes, seems ok to me. If decodeassist isn't supported then it's mostly a
stale value. It's interesting that L1 still works even after we hit this
warning!

> Paolo
Re: [PATCH] KVM: nVMX: emulate the INVVPID instruction
Paolo Bonzini <pbonz...@redhat.com> writes:
> On 24/09/2015 17:45, Bandan Das wrote:
>> > However, I have applied the patch to kvm/queue. Please send the changes
>> > separately, and I will squash them in the existing VPID patch.
>>
>> Please don't do this. It's making it really difficult to review these
>> patches individually :( Why not let them get some review time before
>> applying them all together ?
>
> Ok---I did it because it makes sense to keep this patch separate from
> the others. You can expose VPID even if vpid02 == vpid01 (in fact
> that's what happens if KVM cannot find a vpid02) and in that case this
> patch provides a valid implementation of INVVPID.
>
> Do you think it would help if I posted the whole kvm/queue contents a
> few days before pushing it to kvm/next?

Oh that would be great. Thank you!

> Paolo
Re: [PATCH] KVM: nVMX: emulate the INVVPID instruction
Paolo Bonzini writes:
...
>> @@ -7189,7 +7189,28 @@ static int handle_invept(struct kvm_vcpu *vcpu)
>>
>> static int handle_invvpid(struct kvm_vcpu *vcpu)
>> {
>> -	kvm_queue_exception(vcpu, UD_VECTOR);
>> +	u32 vmx_instruction_info;
>> +	unsigned long type;
>> +
>> +	if (!nested_vmx_check_permission(vcpu))
>> +		return 1;
>> +
>> +	vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>> +	type = kvm_register_readl(vcpu, (vmx_instruction_info >> 28) & 0xf);
>> +
>> +	switch (type) {
>> +	case VMX_VPID_EXTENT_INDIVIDUAL_ADDR:
>> +	case VMX_VPID_EXTENT_SINGLE_CONTEXT:
>> +	case VMX_VPID_EXTENT_ALL_CONTEXT:
>> +		vmx_flush_tlb(vcpu);
>> +		nested_vmx_succeed(vcpu);
>> +		break;
>> +	default:
>> +		nested_vmx_failInvalid(vcpu);
>> +		break;
>> +	}
>> +
>> +	skip_emulated_instruction(vcpu);
>> 	return 1;
>> }
>>
>
> This is not enough. You need to add a VPID argument to
> vpid_sync_vcpu_single, and inline vmx_flush_tlb in handle_invvpid so
> that it can use the new VPID argument of vpid_sync_vcpu_single.
>
> Note that the "all context" variant can be mapped to
> vpid_sync_vcpu_single with vpid02 as the argument (a nice side effect of
> your vpid02 design).
>
> However, I have applied the patch to kvm/queue. Please send the changes
> separately, and I will squash them in the existing VPID patch.

Please don't do this. It's making it really difficult to review these
patches individually :( Why not let them get some review time before
applying them all together?

> Paolo
Re: [PATCH 1/2] KVM: introduce __vmx_flush_tlb to handle specific vpid
Wanpeng Li writes:
> Introduce __vmx_flush_tlb() to handle specific vpid.
>
> Signed-off-by: Wanpeng Li
> ---
> arch/x86/kvm/vmx.c | 21 +
> 1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 794c529..7188c5e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1343,13 +1343,13 @@ static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
>		__loaded_vmcs_clear, loaded_vmcs, 1);
> }
>
> -static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
> +static inline void vpid_sync_vcpu_single(int vpid)
> {
> -	if (vmx->vpid == 0)
> +	if (vpid == 0)
>		return;
>
>	if (cpu_has_vmx_invvpid_single())
> -		__invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT, vmx->vpid, 0);
> +		__invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT, vpid, 0);
> }
>
> static inline void vpid_sync_vcpu_global(void)
> @@ -1358,10 +1358,10 @@ static inline void vpid_sync_vcpu_global(void)
>	__invvpid(VMX_VPID_EXTENT_ALL_CONTEXT, 0, 0);
> }
>
> -static inline void vpid_sync_context(struct vcpu_vmx *vmx)
> +static inline void vpid_sync_context(int vpid)
> {
>	if (cpu_has_vmx_invvpid_single())
> -		vpid_sync_vcpu_single(vmx);
> +		vpid_sync_vcpu_single(vpid);
>	else
>		vpid_sync_vcpu_global();
> }

Not sure myself what's the right thing to do, but this may be undesirable
in a nested environment. Assuming the processor supports global
invalidation only, this seems like an easy way for the nested guest to
invalidate *all* mappings - even the L1 specific mappings.
> @@ -3450,9 +3450,9 @@ static void exit_lmode(struct kvm_vcpu *vcpu) > > #endif > > -static void vmx_flush_tlb(struct kvm_vcpu *vcpu) > +static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid) > { > - vpid_sync_context(to_vmx(vcpu)); > + vpid_sync_context(vpid); > if (enable_ept) { > if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) > return; > @@ -3460,6 +3460,11 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu) > } > } > > +static void vmx_flush_tlb(struct kvm_vcpu *vcpu) > +{ > + __vmx_flush_tlb(vcpu, to_vmx(vcpu)->vpid); > +} > + > static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu) > { > ulong cr0_guest_owned_bits = vcpu->arch.cr0_guest_owned_bits; > @@ -4795,7 +4800,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool > init_event) > vmx_fpu_activate(vcpu); > update_exception_bitmap(vcpu); > > - vpid_sync_context(vmx); > + vpid_sync_context(vmx->vpid); > } > > /* -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
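The concern about global-only invalidation can be made concrete with a toy model (Python, illustrative only — none of these names are KVM code, and the real TLB is hardware state): when the CPU supports only global INVVPID, a flush requested on behalf of one VPID wipes every other VPID's entries as well.

```python
# Toy model of a VPID-tagged TLB. It only illustrates why falling back to
# global invalidation also drops L1's mappings, not how real hardware works.

class ToyTlb:
    def __init__(self, has_invvpid_single):
        self.has_invvpid_single = has_invvpid_single
        self.entries = {}  # (vpid, vaddr) -> paddr

    def insert(self, vpid, vaddr, paddr):
        self.entries[(vpid, vaddr)] = paddr

    def vpid_sync_context(self, vpid):
        # Mirrors the shape of vpid_sync_context(): single-context
        # invalidation if supported, otherwise global invalidation.
        if self.has_invvpid_single:
            self.entries = {k: v for k, v in self.entries.items() if k[0] != vpid}
        else:
            self.entries.clear()

L1_VPID, L2_VPID = 1, 2

# Hardware with global invalidation only:
tlb = ToyTlb(has_invvpid_single=False)
tlb.insert(L1_VPID, 0x1000, 0xAA000)
tlb.insert(L2_VPID, 0x1000, 0xBB000)
tlb.vpid_sync_context(L2_VPID)   # flush requested only for L2...
assert len(tlb.entries) == 0     # ...but L1's mapping is gone too

# Hardware with single-context invalidation:
tlb2 = ToyTlb(has_invvpid_single=True)
tlb2.insert(L1_VPID, 0x1000, 0xAA000)
tlb2.insert(L2_VPID, 0x1000, 0xBB000)
tlb2.vpid_sync_context(L2_VPID)
assert len(tlb2.entries) == 1    # only L2's mapping was dropped
```

The asymmetry between the two cases is exactly what makes the fallback path interesting in a nested setup.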
Re: [PATCH] KVM: nVMX: nested VPID emulation
Wanpeng Li writes: > VPID is used to tag address space and avoid a TLB flush. Currently L0 use > the same VPID to run L1 and all its guests. KVM flushes VPID when switching > between L1 and L2. > > This patch advertises VPID to the L1 hypervisor, then address space of L1 and > L2 can be separately treated and avoid TLB flush when switching between L1 and > L2. This patch gets ~3x performance improvement for lmbench 8p/64k ctxsw. TLB flush does context invalidation and while that should result in some improvement, I never expected a 3x improvement for any workload! Interesting :) > Signed-off-by: Wanpeng Li > --- > arch/x86/kvm/vmx.c | 39 ++++++++++++++++++++++++++++++++------- > 1 file changed, 32 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index da1590e..06bc31e 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -1157,6 +1157,11 @@ static inline bool > nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12) > return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); > } > > +static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12) > +{ > + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID); > +} > + > static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12) > { > return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT); > @@ -2471,6 +2476,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx > *vmx) > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | > SECONDARY_EXEC_RDTSCP | > SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | > + SECONDARY_EXEC_ENABLE_VPID | > SECONDARY_EXEC_APIC_REGISTER_VIRT | > SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | > SECONDARY_EXEC_WBINVD_EXITING | > @@ -4160,7 +4166,7 @@ static void allocate_vpid(struct vcpu_vmx *vmx) > int vpid; > > vmx->vpid = 0; > - if (!enable_vpid) > + if (!enable_vpid || is_guest_mode(&vmx->vcpu)) > return; > spin_lock(&vmx_vpid_lock); > vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS); > @@ -6738,6 +6744,14 @@ static int handle_vmclear(struct kvm_vcpu *vcpu) > 
} > vmcs12 = kmap(page); > vmcs12->launch_state = 0; > + if (enable_vpid) { > + if (nested_cpu_has_vpid(vmcs12)) { > + spin_lock(&vmx_vpid_lock); > + if (vmcs12->virtual_processor_id != 0) > + __clear_bit(vmcs12->virtual_processor_id, > vmx_vpid_bitmap); > + spin_unlock(&vmx_vpid_lock); > + } > + } > kunmap(page); > nested_release_page(page); I don't think this is enough, we should also check for set "nested" bits in free_vpid() and clear them. There should be some sort of a mapping between the nested guest vpid and the actual vpid so that we can just clear those bits. > @@ -9189,6 +9203,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, > struct vmcs12 *vmcs12) > { > struct vcpu_vmx *vmx = to_vmx(vcpu); > u32 exec_control; > + int vpid; > > vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector); > vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector); > @@ -9438,13 +9453,21 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, > struct vmcs12 *vmcs12) > else > vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); > > + Empty space here. > if (enable_vpid) { > - /* > - * Trivially support vpid by letting L2s share their parent > - * L1's vpid. TODO: move to a more elaborate solution, giving > - * each L2 its own vpid and exposing the vpid feature to L1. > - */ > - vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid); > + if (nested_cpu_has_vpid(vmcs12)) { > + if (vmcs12->virtual_processor_id == 0) { Ok, so if we advertise vpid to the nested hypervisor, isn't it going to attempt writing this field when setting up ? At least that's what Linux does, no ? 
> + spin_lock(&vmx_vpid_lock); > + vpid = find_first_zero_bit(vmx_vpid_bitmap, > VMX_NR_VPIDS); > + if (vpid < VMX_NR_VPIDS) > + __set_bit(vpid, vmx_vpid_bitmap); > + spin_unlock(&vmx_vpid_lock); > + vmcs_write16(VIRTUAL_PROCESSOR_ID, vpid); > + } else > + vmcs_write16(VIRTUAL_PROCESSOR_ID, > vmcs12->virtual_processor_id); > + } else > + vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid); > + I guess L1 shouldn't know what vpid L0 chose to run L2. If L1 vmreads, it should get what it expects for the value of vpid, not the one L0 chose. > vmx_flush_tlb(vcpu); >
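The mapping idea discussed in the thread can be sketched as a toy model (Python, illustrative only — the names `vpid02_map` and `prepare_vmcs02` here are made up for this sketch, not KVM symbols): L0 keeps a private map from the vpid L1 wrote into vmcs12 to a host-allocated vpid, so a VMREAD by L1 still returns L1's value while the hardware TLB is tagged with L0's choice.

```python
# Toy sketch of per-L2 vpid allocation. vpid 0 is reserved for the host,
# mirroring the real vmx_vpid_bitmap convention.
NR_VPIDS = 8

class VpidAllocator:
    def __init__(self):
        self.bitmap = [False] * NR_VPIDS
        self.bitmap[0] = True              # vpid 0 is reserved

    def alloc(self):
        for vpid, used in enumerate(self.bitmap):
            if not used:
                self.bitmap[vpid] = True
                return vpid
        return 0                            # out of vpids: fall back to 0

    def free(self, vpid):
        if vpid != 0:
            self.bitmap[vpid] = False       # e.g. on VMCLEAR of the L2 vmcs

alloc = VpidAllocator()
vpid02_map = {}                             # vpid12 -> host vpid, private to L0

def prepare_vmcs02(vpid12):
    """Return the vpid to program into the hardware VMCS for a vmcs12 vpid."""
    if vpid12 == 0:
        return 0                            # L1 did not enable VPID for L2
    if vpid12 not in vpid02_map:
        vpid02_map[vpid12] = alloc.alloc()
    return vpid02_map[vpid12]

hw_vpid = prepare_vmcs02(5)                 # L1 wrote vpid 5 into vmcs12
assert prepare_vmcs02(5) == hw_vpid         # stable across vmentries
# An L1 VMREAD of VIRTUAL_PROCESSOR_ID must still return 5, never hw_vpid.
```

The last comment captures Bandan's point: the allocation is L0-internal bookkeeping, invisible to L1.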
Re: [RFC PATCH 4/5] KVM: x86: enable unhandled MSR exits for vmx
Peter Hornyack peterhorny...@google.com writes: Set the vm's unhandled_msr_exits flag when user space calls the KVM_ENABLE_CAP ioctl with KVM_CAP_UNHANDLED_MSR_EXITS. After kvm fails to handle a guest rdmsr or wrmsr, check this flag and exit to user space with KVM_EXIT_MSR rather than immediately injecting a GP fault. On reentry into kvm, use the complete_userspace_io callback path to call vmx_complete_userspace_msr. Complete the MSR access if user space was able to handle it successfully, or fail the MSR access and inject a GP fault if user space could not handle the access. ... +static int vmx_complete_userspace_msr(struct kvm_vcpu *vcpu) +{ + struct msr_data msr; + + if (vcpu->run->msr.index != vcpu->arch.regs[VCPU_REGS_RCX]) { + pr_debug("msr.index 0x%x changed, does not match ecx 0x%lx\n", + vcpu->run->msr.index, + vcpu->arch.regs[VCPU_REGS_RCX]); + goto err_out; + } + msr.index = vcpu->run->msr.index; + msr.data = vcpu->run->msr.data; + msr.host_initiated = false; + + switch (vcpu->run->msr.direction) { + case KVM_EXIT_MSR_RDMSR: + if (vcpu->run->msr.handled == KVM_EXIT_MSR_HANDLED) + complete_rdmsr(vcpu, &msr); + else + fail_rdmsr(vcpu, &msr); + break; Should we check for MSR_UNHANDLED ? Could be debugging relief if userspace is filling it up with random crap! I think it should be ok if the trace function catches that too. + case KVM_EXIT_MSR_WRMSR: + if (vcpu->run->msr.handled == KVM_EXIT_MSR_HANDLED) + complete_wrmsr(vcpu, &msr); + else + fail_wrmsr(vcpu, &msr); + break; + default: + pr_debug("bad msr.direction %u\n", vcpu->run->msr.direction); + goto err_out; + } + + return 1; +err_out: + vcpu->run->exit_reason = KVM_EXIT_MSR; + vcpu->run->msr.direction = KVM_EXIT_MSR_COMPLETION_FAILED; + return 0; +} + +/* + * Returns 1 if the rdmsr handling is complete; returns 0 if kvm should exit to + * user space to handle this rdmsr. 
+ */ static int handle_rdmsr(struct kvm_vcpu *vcpu) { u32 ecx = vcpu->arch.regs[VCPU_REGS_RCX]; @@ -5499,6 +5547,16 @@ static int handle_rdmsr(struct kvm_vcpu *vcpu) msr_info.index = ecx; msr_info.host_initiated = false; if (vmx_get_msr(vcpu, &msr_info)) { + if (vcpu->kvm->arch.unhandled_msr_exits) { + vcpu->run->exit_reason = KVM_EXIT_MSR; + vcpu->run->msr.direction = KVM_EXIT_MSR_RDMSR; + vcpu->run->msr.index = msr_info.index; + vcpu->run->msr.data = 0; + vcpu->run->msr.handled = 0; + vcpu->arch.complete_userspace_io = + vmx_complete_userspace_msr; + return 0; + } fail_rdmsr(vcpu, &msr_info); return 1; } @@ -5508,6 +5566,10 @@ static int handle_rdmsr(struct kvm_vcpu *vcpu) return 1; } +/* + * Returns 1 if the wrmsr handling is complete; returns 0 if kvm should exit to + * user space to handle this wrmsr. + */ static int handle_wrmsr(struct kvm_vcpu *vcpu) { struct msr_data msr; @@ -5519,6 +5581,16 @@ static int handle_wrmsr(struct kvm_vcpu *vcpu) msr.index = ecx; msr.host_initiated = false; if (kvm_set_msr(vcpu, &msr) != 0) { + if (vcpu->kvm->arch.unhandled_msr_exits) { I think kvm should ultimately decide which msrs it can let userspace handle even if userspace has explicitly enabled the ioctl. 
Bandan + vcpu->run->exit_reason = KVM_EXIT_MSR; + vcpu->run->msr.direction = KVM_EXIT_MSR_WRMSR; + vcpu->run->msr.index = msr.index; + vcpu->run->msr.data = msr.data; + vcpu->run->msr.handled = 0; + vcpu->arch.complete_userspace_io = + vmx_complete_userspace_msr; + return 0; + } fail_wrmsr(vcpu, &msr); return 1; } @@ -8163,7 +8235,7 @@ static bool vmx_xsaves_supported(void) static bool vmx_msr_exits_supported(void) { - return false; + return true; } static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4bbc2a1676c9..5c22f4655741 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2461,6 +2461,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_ENABLE_CAP_VM: case KVM_CAP_DISABLE_QUIRKS: case KVM_CAP_SET_BOOT_CPU_ID: + case KVM_CAP_UNHANDLED_MSR_EXITS: #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT case KVM_CAP_ASSIGN_DEV_IRQ: case KVM_CAP_PCI_2_3: @@ -3568,6
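The exit/reentry protocol in the patch can be summarized with a toy state machine (Python, illustrative only — this models the RFC's proposed interface, which was never a merged KVM API, and all class/variable names are made up): kvm fails an rdmsr, exits to user space with KVM_EXIT_MSR, and on reentry either completes the read or injects #GP.

```python
# Toy model of the proposed KVM_EXIT_MSR flow for the rdmsr direction.
HANDLED, UNHANDLED = 1, 0

class ToyVcpu:
    def __init__(self, kvm_msrs, userspace_msrs):
        self.kvm_msrs = kvm_msrs            # MSRs emulated in the kernel
        self.userspace_msrs = userspace_msrs  # MSRs user space chose to emulate
        self.gp_injected = False
        self.last_read = None

    def rdmsr(self, index):
        if index in self.kvm_msrs:          # fast path: handled inside kvm
            self.last_read = self.kvm_msrs[index]
            return
        # kvm can't handle it: exit to user space (KVM_EXIT_MSR), then the
        # reentry path runs the completion callback.
        if index in self.userspace_msrs:
            self.complete(index, self.userspace_msrs[index], HANDLED)
        else:
            self.complete(index, 0, UNHANDLED)

    def complete(self, index, data, handled):
        # Mirrors the role of vmx_complete_userspace_msr() for rdmsr.
        if handled == HANDLED:
            self.last_read = data           # complete_rdmsr() equivalent
        else:
            self.gp_injected = True         # fail_rdmsr(): kvm's default #GP

MSR_PLATFORM_INFO = 0xCE
vcpu = ToyVcpu(kvm_msrs={0x10: 1234},
               userspace_msrs={MSR_PLATFORM_INFO: 0x1234})  # value is made up
vcpu.rdmsr(MSR_PLATFORM_INFO)               # emulated by user space, no fault
assert not vcpu.gp_injected
vcpu.rdmsr(0xDEAD)                          # nobody handles it -> #GP
assert vcpu.gp_injected
```

The last two calls show the two outcomes being debated in the thread: user space saves the guest, or the guest gets the same #GP it would have gotten anyway.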
Re: [RFC PATCH 3/5] KVM: x86: add msr_exits_supported to kvm_x86_ops
Peter Hornyack peterhorny...@google.com writes: msr_exits_supported will be checked when user space attempts to enable the KVM_CAP_UNHANDLED_MSR_EXITS capability for the vm. This is needed because MSR exit support will be implemented for vmx but not svm later in this patchset. Is svm future work ? :) Are there any such svm msrs ? Signed-off-by: Peter Hornyack peterhorny...@google.com --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/svm.c | 6 ++ arch/x86/kvm/vmx.c | 6 ++ 3 files changed, 13 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c12e845f59e6..a6e145b1e271 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -854,6 +854,7 @@ struct kvm_x86_ops { void (*handle_external_intr)(struct kvm_vcpu *vcpu); bool (*mpx_supported)(void); bool (*xsaves_supported)(void); + bool (*msr_exits_supported)(void); int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 74d825716f4f..bcbb56f49b9f 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -4249,6 +4249,11 @@ static bool svm_xsaves_supported(void) return false; } +static bool svm_msr_exits_supported(void) +{ + return false; +} + static bool svm_has_wbinvd_exit(void) { return true; @@ -4540,6 +4545,7 @@ static struct kvm_x86_ops svm_x86_ops = { .invpcid_supported = svm_invpcid_supported, .mpx_supported = svm_mpx_supported, .xsaves_supported = svm_xsaves_supported, + .msr_exits_supported = svm_msr_exits_supported, .set_supported_cpuid = svm_set_supported_cpuid, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index acc38e27d221..27fec385d79d 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8161,6 +8161,11 @@ static bool vmx_xsaves_supported(void) SECONDARY_EXEC_XSAVES; } +static bool vmx_msr_exits_supported(void) +{ + return false; +} + static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx) { u32 exit_intr_info; @@ -10413,6 +10418,7 @@ 
static struct kvm_x86_ops vmx_x86_ops = { .handle_external_intr = vmx_handle_external_intr, .mpx_supported = vmx_mpx_supported, .xsaves_supported = vmx_xsaves_supported, + .msr_exits_supported = vmx_msr_exits_supported, .check_nested_events = vmx_check_nested_events,
Re: [RFC PATCH 0/5] KVM: x86: exit to user space on unhandled MSR accesses
Peter Hornyack peterhorny...@google.com writes: On Wed, Aug 19, 2015 at 2:43 PM, Bandan Das b...@redhat.com wrote: Peter Hornyack peterhorny...@google.com writes: There are numerous MSRs that kvm does not currently handle. On Intel platforms we have observed guest VMs accessing some of these MSRs (for example, MSR_PLATFORM_INFO) and behaving poorly (to the point of guest OS crashes) when they receive a GP fault because the MSR is not emulated. This patchset adds a new kvm exit path for unhandled MSR accesses that allows user space to emulate additional MSRs without having to implement them in kvm. ^^ So, I am trying to understand this motivation. A while back when a patch was posted to emulate MSR_PLATFORM_INFO, it was rejected. Why ? Because it seemed impossible to emulate it correctly (most concerns were related to migration iirc). Although I haven't reviewed all patches in this series yet, what I understand from the above message is: It's ok to emulate MSR_PLATFORM_INFO *incorrectly* as long as we are doing it in the user space. I understand the part where it makes sense to move stuff to userspace. But if kvm isn't emulating certain msrs yet, either we should add support, or they haven't been added because it's not possible to emulate them correctly. The logic that it's probably ok to let userspace do the (incorrect) emulation is something I don't understand. I wasn't aware of the past discussion of MSR_PLATFORM_INFO, I'll search for it - thanks for pointing it out. MSR_PLATFORM_INFO is just one example though. In addition to the benefits I already mentioned for user space MSR emulation (agility, reduced attack surface), this patchset offers a benefit even if user space declines to emulate the MSR by returning into kvm with KVM_EXIT_MSR_UNHANDLED: it makes it easier to monitor in the first place, via logging mechanisms in user space instead of in the kernel, for when guests are accessing MSRs that kvm doesn't implement. 
Agreed, that would be more useful than how it's being handled currently. This patchset has already revealed a few other MSRs (IA32_MPERF, IA32_APERF) that guests in our test environment are accessing but which kvm doesn't implement yet. I haven't yet investigated deeply what these MSRs are used for and how they might be emulated (or if they can't actually be emulated correctly), but if we do discover an unimplemented non-performance-sensitive MSR that we could emulate correctly, with this patchset we would quickly just implement it in user space. Ok, makes sense. But it seems like this could be an area of abuse as well. For example, this could lessen the incentive to add emulation for new msrs in kvm and rather just emulate them in userspace ? However, as you mentioned, it's probably ok as long as there's a clear demarcation of which msrs this is done for. This shouldn't be the default behavior. On a different note, if there's a related qemu change, can you please point me to it ? Thanks, Bandan It seems like the next in line is to let userspace emulate their own version of unimplemented x86 instructions. The core of the patchset modifies the vmx handle_rdmsr and handle_wrmsr functions to exit to user space on MSR reads/writes that kvm can't handle itself. Then, on the return path into kvm we check for outstanding user space MSR completions and either complete the MSR access successfully or inject a GP fault as kvm would do by default. This new exit path must be enabled for the vm via the KVM_CAP_UNHANDLED_MSR_EXITS capability. In the future we plan to extend this functionality to allow user space to register the MSRs that it would like to handle itself, even if kvm already provides an implementation. In the long-term we will move the I seriously hope we don't do this! Bandan implementation of all non-performance-sensitive MSRs to user space, reducing the potential attack surface of kvm and allowing us to respond to bugs more quickly. 
This patchset has been tested with our non-qemu user space hypervisor on vmx platforms; svm support is not implemented. Peter Hornyack (5): KVM: x86: refactor vmx rdmsr/wrmsr completion into new functions KVM: add KVM_EXIT_MSR exit reason and capability. KVM: x86: add msr_exits_supported to kvm_x86_ops KVM: x86: enable unhandled MSR exits for vmx KVM: x86: add trace events for unhandled MSR exits Documentation/virtual/kvm/api.txt | 48 +++ arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/svm.c| 6 ++ arch/x86/kvm/trace.h | 28 + arch/x86/kvm/vmx.c| 126 ++ arch/x86/kvm/x86.c| 13 include/trace/events/kvm.h| 2 +- include/uapi/linux/kvm.h | 14 + 8 files changed, 227 insertions(+), 12 deletions(-)
Re: [RFC PATCH 0/5] KVM: x86: exit to user space on unhandled MSR accesses
Peter Hornyack peterhorny...@google.com writes: There are numerous MSRs that kvm does not currently handle. On Intel platforms we have observed guest VMs accessing some of these MSRs (for example, MSR_PLATFORM_INFO) and behaving poorly (to the point of guest OS crashes) when they receive a GP fault because the MSR is not emulated. This patchset adds a new kvm exit path for unhandled MSR accesses that allows user space to emulate additional MSRs without having to implement them in kvm. ^^ So, I am trying to understand this motivation. A while back when a patch was posted to emulate MSR_PLATFORM_INFO, it was rejected. Why ? Because it seemed impossible to emulate it correctly (most concerns were related to migration iirc). Although I haven't reviewed all patches in this series yet, what I understand from the above message is: It's ok to emulate MSR_PLATFORM_INFO *incorrectly* as long as we are doing it in the user space. I understand the part where it makes sense to move stuff to userspace. But if kvm isn't emulating certain msrs yet, either we should add support, or they haven't been added because it's not possible to emulate them correctly. The logic that it's probably ok to let userspace do the (incorrect) emulation is something I don't understand. It seems like the next in line is to let userspace emulate their own version of unimplemented x86 instructions. The core of the patchset modifies the vmx handle_rdmsr and handle_wrmsr functions to exit to user space on MSR reads/writes that kvm can't handle itself. Then, on the return path into kvm we check for outstanding user space MSR completions and either complete the MSR access successfully or inject a GP fault as kvm would do by default. This new exit path must be enabled for the vm via the KVM_CAP_UNHANDLED_MSR_EXITS capability. In the future we plan to extend this functionality to allow user space to register the MSRs that it would like to handle itself, even if kvm already provides an implementation. 
In the long-term we will move the I seriously hope we don't do this! Bandan implementation of all non-performance-sensitive MSRs to user space, reducing the potential attack surface of kvm and allowing us to respond to bugs more quickly. This patchset has been tested with our non-qemu user space hypervisor on vmx platforms; svm support is not implemented. Peter Hornyack (5): KVM: x86: refactor vmx rdmsr/wrmsr completion into new functions KVM: add KVM_EXIT_MSR exit reason and capability. KVM: x86: add msr_exits_supported to kvm_x86_ops KVM: x86: enable unhandled MSR exits for vmx KVM: x86: add trace events for unhandled MSR exits Documentation/virtual/kvm/api.txt | 48 +++ arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/svm.c| 6 ++ arch/x86/kvm/trace.h | 28 + arch/x86/kvm/vmx.c| 126 ++ arch/x86/kvm/x86.c| 13 include/trace/events/kvm.h| 2 +- include/uapi/linux/kvm.h | 14 + 8 files changed, 227 insertions(+), 12 deletions(-)
Re: One off question: Hot vertical scaling of a KVM?
ROZUMNY, VICTOR vr9...@att.com writes: Hello- Your site states that one off questions might could be answered via email so here I am. I have limited knowledge of KVM, but it is my understanding that in order to vertically scale RAM for a KVM the guest needs to be shut down, resized, and rebooted, resulting in a 5-10 minute interruption of service. What's vertical scaling ? You mean, increase the amount of memory the guest sees ? You could use Qemu memory hotplug I think but that requires that the guest has been booted with enough DIMM slots. Bandan Is this true and if so do you know of any efforts to change this behavior? Thank you! v. Vic Rozumny | Principal Technical Architect AIC - AT&T Integrated Cloud | Complex Engineering
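For reference, the QEMU memory-hotplug flow mentioned above looks roughly like this (a sketch, not a tested recipe — exact syntax depends on the QEMU version, memory hotplug landed around QEMU 2.1, and `disk.img` is a placeholder):

```shell
# Boot the guest with 2G of RAM, reserving room to grow to 8G
# across up to 4 hotpluggable DIMM slots.
qemu-system-x86_64 -enable-kvm \
    -m 2G,slots=4,maxmem=8G \
    -monitor stdio \
    disk.img

# Later, in the QEMU monitor, plug in another 1G DIMM without
# shutting the guest down:
#   (qemu) object_add memory-backend-ram,id=mem1,size=1G
#   (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
```

The guest OS still has to online the new memory (recent Linux guests do this via udev rules or `CONFIG_MEMORY_HOTPLUG`), so the interruption is avoided only if the guest cooperates.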
Re: Debian 8
Mathieu Charrette mathieu.charre...@phoenixsolutions.ca writes: Hi, I'm trying to install KVM on the latest version of Debian (so Debian 8) and it's not working. It always gives me a dependencies error. But when I install it on Debian 7.8, it works no problem. I was wondering if you would have any solution for me. This is a distro question. Try the Debian forums. http://forums.debian.net/
Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users
Michael S. Tsirkin m...@redhat.com writes: On Mon, Jul 13, 2015 at 12:07:32AM -0400, Bandan Das wrote: vhost threads are per-device, but in most cases a single thread is enough. This change creates a single thread that is used to serve all guests. However, this complicates cgroups associations. The current policy is to attach the per-device thread to all cgroups of the parent process that the device is associated it. This is no longer possible if we have a single thread. So, we end up moving the thread around to cgroups of whichever device that needs servicing. This is a very inefficient protocol but seems to be the only way to integrate cgroups support. Signed-off-by: Razya Ladelsky ra...@il.ibm.com Signed-off-by: Bandan Das b...@redhat.com BTW, how does this interact with virtio net MQ? It would seem that MQ gains from more parallelism and CPU locality. Hm.. Good point. As of this version, this design will always have one worker thread servicing a guest. Now suppose we have 10 virtio queues for a guest, surely, we could benefit from spawning off another worker just like we are doing in case of a new guest/device with the devs_per_worker parameter. 
--- drivers/vhost/scsi.c | 15 +++-- drivers/vhost/vhost.c | 150 -- drivers/vhost/vhost.h | 19 +-- 3 files changed, 97 insertions(+), 87 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index ea32b38..6c42936 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct vhost_scsi_cmd *cmd) llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list); -vhost_work_queue(&vs->dev, &vs->vs_completion_work); +vhost_work_queue(vs->dev.worker, &vs->vs_completion_work); } static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd) @@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs, } llist_add(&evt->list, &vs->vs_event_list); -vhost_work_queue(&vs->dev, &vs->vs_event_work); +vhost_work_queue(vs->dev.worker, &vs->vs_event_work); } static void vhost_scsi_evt_handle_kick(struct vhost_work *work) @@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs) /* Flush both the vhost poll and vhost work */ for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) vhost_scsi_flush_vq(vs, i); -vhost_work_flush(&vs->dev, &vs->vs_completion_work); -vhost_work_flush(&vs->dev, &vs->vs_event_work); +vhost_work_flush(vs->dev.worker, &vs->vs_completion_work); +vhost_work_flush(vs->dev.worker, &vs->vs_event_work); /* Wait for all reqs issued before the flush to be finished */ for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) @@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) if (!vqs) goto err_vqs; -vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work); -vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work); +vhost_work_init(&vs->dev, &vs->vs_completion_work, +vhost_scsi_complete_cmd_work); + +vhost_work_init(&vs->dev, &vs->vs_event_work, +vhost_scsi_evt_work); vs->vs_events_nr = 0; vs->vs_events_missed = false; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 2ee2826..951c96b 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -11,6 +11,8 @@ * Generic code for virtio server in 
host kernel. */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + #include <linux/eventfd.h> #include <linux/vhost.h> #include <linux/uio.h> @@ -28,6 +30,9 @@ #include "vhost.h" +/* Just one worker thread to service all devices */ +static struct vhost_worker *worker; + enum { VHOST_MEMORY_MAX_NREGIONS = 64, VHOST_MEMORY_F_LOG = 0x1, @@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync, return 0; } -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn) +void vhost_work_init(struct vhost_dev *dev, + struct vhost_work *work, vhost_work_fn_t fn) { INIT_LIST_HEAD(&work->node); work->fn = fn; init_waitqueue_head(&work->done); work->flushing = 0; work->queue_seq = work->done_seq = 0; +work->dev = dev; } EXPORT_SYMBOL_GPL(vhost_work_init); @@ -78,7 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, poll->dev = dev; poll->wqh = NULL; -vhost_work_init(&poll->work, fn); +vhost_work_init(dev, &poll->work, fn); } EXPORT_SYMBOL_GPL(vhost_poll_init); @@ -116,30 +123,30 @@ void vhost_poll_stop(struct vhost_poll *poll) } EXPORT_SYMBOL_GPL(vhost_poll_stop); -static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work, -unsigned seq) +static bool vhost_work_seq_done(struct vhost_worker *worker, +struct
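The shared-worker idea in the patch above can be sketched with a toy model (Python, illustrative only — scheduling, wakeups, and the cgroup handling discussed in the thread are deliberately ignored): one worker drains the work items of every vhost device instead of one kernel thread per device.

```python
# Toy model of a single shared vhost worker servicing many devices.
from collections import deque

class Worker:
    def __init__(self):
        self.work_list = deque()           # one queue shared by all devices

    def queue(self, fn):
        # Plays the role of vhost_work_queue(dev->worker, ...).
        self.work_list.append(fn)

    def run(self):
        # The single thread's main loop: drain whatever any device queued.
        while self.work_list:
            self.work_list.popleft()()

class Device:
    def __init__(self, name, worker):
        self.name, self.worker = name, worker
        self.handled = 0

    def kick(self):                        # guest kicked a virtqueue
        self.worker.queue(self.handle)

    def handle(self):                      # the device's work function
        self.handled += 1

worker = Worker()                          # one worker, many devices
devs = [Device(f"vhost-{i}", worker) for i in range(3)]
for d in devs:
    d.kick()
worker.run()
assert all(d.handled == 1 for d in devs)
```

It also makes the fairness question raised later in the thread visible: a device that queues a long-running item delays everyone behind it in `work_list`, which is why a per-guest work limit was suggested.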
Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users
Bandan Das b...@redhat.com writes: Michael S. Tsirkin m...@redhat.com writes: On Mon, Jul 13, 2015 at 12:07:32AM -0400, Bandan Das wrote: vhost threads are per-device, but in most cases a single thread is enough. This change creates a single thread that is used to serve all guests. However, this complicates cgroups associations. The current policy is to attach the per-device thread to all cgroups of the parent process that the device is associated it. This is no longer possible if we have a single thread. So, we end up moving the thread around to cgroups of whichever device that needs servicing. This is a very inefficient protocol but seems to be the only way to integrate cgroups support. Signed-off-by: Razya Ladelsky ra...@il.ibm.com Signed-off-by: Bandan Das b...@redhat.com BTW, how does this interact with virtio net MQ? It would seem that MQ gains from more parallelism and CPU locality. Hm.. Good point. As of this version, this design will always have one worker thread servicing a guest. Now suppose we have 10 virtio queues for a guest, surely, we could benefit from spawning off another worker just like we are doing in case of a new guest/device with the devs_per_worker parameter. So, I did a quick smoke test with virtio-net and the Elvis patches. virtio net MQ already spawns a new worker thread for every queue, it seems ? So, the above setup already works! :) I will run some tests and post back the results. 
--- drivers/vhost/scsi.c | 15 +++-- drivers/vhost/vhost.c | 150 -- drivers/vhost/vhost.h | 19 +-- 3 files changed, 97 insertions(+), 87 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index ea32b38..6c42936 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct vhost_scsi_cmd *cmd) llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list); - vhost_work_queue(&vs->dev, &vs->vs_completion_work); + vhost_work_queue(vs->dev.worker, &vs->vs_completion_work); } static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd) @@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs, } llist_add(&evt->list, &vs->vs_event_list); - vhost_work_queue(&vs->dev, &vs->vs_event_work); + vhost_work_queue(vs->dev.worker, &vs->vs_event_work); } static void vhost_scsi_evt_handle_kick(struct vhost_work *work) @@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs) /* Flush both the vhost poll and vhost work */ for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) vhost_scsi_flush_vq(vs, i); - vhost_work_flush(&vs->dev, &vs->vs_completion_work); - vhost_work_flush(&vs->dev, &vs->vs_event_work); + vhost_work_flush(vs->dev.worker, &vs->vs_completion_work); + vhost_work_flush(vs->dev.worker, &vs->vs_event_work); /* Wait for all reqs issued before the flush to be finished */ for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) @@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) if (!vqs) goto err_vqs; - vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work); - vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work); + vhost_work_init(&vs->dev, &vs->vs_completion_work, + vhost_scsi_complete_cmd_work); + + vhost_work_init(&vs->dev, &vs->vs_event_work, + vhost_scsi_evt_work); vs->vs_events_nr = 0; vs->vs_events_missed = false; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 2ee2826..951c96b 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -11,6 +11,8 @@ * Generic code for
virtio server in host kernel. */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + #include <linux/eventfd.h> #include <linux/vhost.h> #include <linux/uio.h> @@ -28,6 +30,9 @@ #include "vhost.h" +/* Just one worker thread to service all devices */ +static struct vhost_worker *worker; + enum { VHOST_MEMORY_MAX_NREGIONS = 64, VHOST_MEMORY_F_LOG = 0x1, @@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync, return 0; } -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn) +void vhost_work_init(struct vhost_dev *dev, +struct vhost_work *work, vhost_work_fn_t fn) { INIT_LIST_HEAD(&work->node); work->fn = fn; init_waitqueue_head(&work->done); work->flushing = 0; work->queue_seq = work->done_seq = 0; + work->dev = dev; } EXPORT_SYMBOL_GPL(vhost_work_init); @@ -78,7 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, poll->dev = dev; poll->wqh = NULL; - vhost_work_init(&poll->work, fn); + vhost_work_init(dev, &poll->work, fn); } EXPORT_SYMBOL_GPL(vhost_poll_init); @@ -116,30 +123,30 @@ void vhost_poll_stop(struct vhost_poll *poll) } EXPORT_SYMBOL_GPL(vhost_poll_stop
Re: [RFC PATCH 0/4] Shared vhost design
Michael S. Tsirkin m...@redhat.com writes: On Sat, Aug 08, 2015 at 07:06:38PM -0400, Bandan Das wrote: Hi Michael, ... - does the design address the issue of VM 1 being blocked (e.g. because it hits swap) and blocking VM 2? Good question. I haven't thought of this yet. But IIUC, the worker thread will complete VM1's job and then move on to executing VM2's scheduled work. It doesn't matter if VM1 is blocked currently. I think it would be a problem though if/when polling is introduced. Sorry, I wasn't clear. If VM1's memory is in swap, attempts to access it might block the service thread, so it won't complete VM2's job. Ah ok, I understand now. I am pretty sure the current RFC doesn't take care of this :) I will add this to my todo list for v2. Bandan #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed. # I/O thread runs on CPU 14 or 15 depending on which guest it's serving There's a simple graph at http://people.redhat.com/~bdas/elvis/data/results.png that shows how task affinity results in a jump and even without it, as the number of guests increase, the shared vhost design performs slightly better. Observations: 1. In terms of stock performance, the results are comparable. 2. However, with a tuned setup, even without polling, we see an improvement with the new design. 3. Making the new design simulate old behavior would be a matter of setting the number of guests per vhost threads to 1. 4. Maybe, setting a per guest limit on the work being done by a specific vhost thread is needed for it to be fair. 5. cgroup associations needs to be figured out. I just slightly hacked the current cgroup association mechanism to work with the new model. Ccing cgroups for input/comments. Many thanks to Razya Ladelsky and Eyal Moscovici, IBM for the initial patches, the helpful testing suggestions and discussions. 
Bandan Das (4): vhost: Introduce a universal thread to serve all users vhost: Limit the number of devices served by a single worker thread cgroup: Introduce a function to compare cgroups vhost: Add cgroup-aware creation of worker threads drivers/vhost/net.c| 6 +- drivers/vhost/scsi.c | 18 ++-- drivers/vhost/vhost.c | 272 +++-- drivers/vhost/vhost.h | 32 +- include/linux/cgroup.h | 1 + kernel/cgroup.c| 40 6 files changed, 275 insertions(+), 94 deletions(-) -- 2.4.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users
Eyal Moscovici eya...@il.ibm.com writes: Hi, Do you know what is the overhead of switching the vhost thread from one cgroup to another? I misinterpreted this question earlier. I think what you are asking here is that when the vm process is moved from one cgroup to another, what is the overhead of moving the vhost thread to the new cgroup. This design does not provide any hooks for the vhost thread to move to a new cgroup. Rather, I think a better approach is to create a new vhost thread and bind the process to it if the process is migrated to a new cgroup. This is much less complicated, and there's a good chance that it's impossible to migrate the vhost thread since it's serving other guests. I will address this in v2. Eyal Moscovici HL-Cloud Infrastructure Solutions IBM Haifa Research Lab From: Bandan Das b...@redhat.com To: kvm@vger.kernel.org Cc: net...@vger.kernel.org, linux-ker...@vger.kernel.org, m...@redhat.com, Eyal Moscovici/Haifa/IBM@IBMIL, Razya Ladelsky/Haifa/IBM@IBMIL, cgro...@vger.kernel.org, jasow...@redhat.com Date: 07/13/2015 07:08 AM Subject:[RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users vhost threads are per-device, but in most cases a single thread is enough. This change creates a single thread that is used to serve all guests. However, this complicates cgroups associations. The current policy is to attach the per-device thread to all cgroups of the parent process that the device is associated it. This is no longer possible if we have a single thread. So, we end up moving the thread around to cgroups of whichever device that needs servicing. This is a very inefficient protocol but seems to be the only way to integrate cgroups support. 
Signed-off-by: Razya Ladelsky ra...@il.ibm.com Signed-off-by: Bandan Das b...@redhat.com --- drivers/vhost/scsi.c | 15 +++-- drivers/vhost/vhost.c | 150 -- drivers/vhost/vhost.h | 19 +-- 3 files changed, 97 insertions(+), 87 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index ea32b38..6c42936 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct vhost_scsi_cmd *cmd) llist_add(cmd-tvc_completion_list, vs-vs_completion_list); -vhost_work_queue(vs-dev, vs-vs_completion_work); +vhost_work_queue(vs-dev.worker, vs-vs_completion_work); } static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd) @@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs, } llist_add(evt-list, vs-vs_event_list); -vhost_work_queue(vs-dev, vs-vs_event_work); +vhost_work_queue(vs-dev.worker, vs-vs_event_work); } static void vhost_scsi_evt_handle_kick(struct vhost_work *work) @@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs) /* Flush both the vhost poll and vhost work */ for (i = 0; i VHOST_SCSI_MAX_VQ; i++) vhost_scsi_flush_vq(vs, i); -vhost_work_flush(vs-dev, vs-vs_completion_work); -vhost_work_flush(vs-dev, vs-vs_event_work); +vhost_work_flush(vs-dev.worker, vs-vs_completion_work); +vhost_work_flush(vs-dev.worker, vs-vs_event_work); /* Wait for all reqs issued before the flush to be finished */ for (i = 0; i VHOST_SCSI_MAX_VQ; i++) @@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) if (!vqs) goto err_vqs; -vhost_work_init(vs-vs_completion_work, vhost_scsi_complete_cmd_work); -vhost_work_init(vs-vs_event_work, vhost_scsi_evt_work); +vhost_work_init(vs-dev, vs-vs_completion_work, + vhost_scsi_complete_cmd_work); + +vhost_work_init(vs-dev, vs-vs_event_work, +vhost_scsi_evt_work); vs-vs_events_nr = 0; vs-vs_events_missed = false; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 2ee2826..951c96b 100644 --- 
a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -11,6 +11,8 @@ * Generic code for virtio server in host kernel. */ +#define pr_fmt(fmt) KBUILD_MODNAME : fmt + #include linux/eventfd.h #include linux/vhost.h #include linux/uio.h @@ -28,6 +30,9 @@ #include vhost.h +/* Just one worker thread to service all devices */ +static struct vhost_worker *worker; + enum { VHOST_MEMORY_MAX_NREGIONS = 64, VHOST_MEMORY_F_LOG = 0x1, @@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync
Re: [RFC PATCH 0/4] Shared vhost design
Hi Michael, Michael S. Tsirkin m...@redhat.com writes: On Mon, Jul 13, 2015 at 12:07:31AM -0400, Bandan Das wrote: Hello, There have been discussions on improving the current vhost design. The first attempt, to my knowledge was Shirley Ma's patch to create a dedicated vhost worker per cgroup. http://comments.gmane.org/gmane.linux.network/224730 Later, I posted a cmwq based approach for performance comparisions http://comments.gmane.org/gmane.linux.network/286858 More recently was the Elvis work that was presented in KVM Forum 2013 http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf The Elvis patches rely on common vhost thread design for scalability along with polling for performance. Since there are two major changes being proposed, we decided to split up the work. The first (this RFC), proposing a re-design of the vhost threading model and the second part (not posted yet) to focus more on improving performance. I am posting this with the hope that we can have a meaningful discussion on the proposed new architecture. We have run some tests to show that the new design is scalable and in terms of performance, is comparable to the current stable design. Test Setup: The testing is based on the setup described in the Elvis proposal. The initial tests are just an aggregate of Netperf STREAM and MAERTS but as we progress, I am happy to run more tests. The hosts are two identical 16 core Haswell systems with point to point network links. For the first 10 runs, with n=1 upto n=10 guests running in parallel, I booted the target system with nr_cpus=8 and mem=12G. The purpose was to do a comparision of resource utilization and how it affects performance. Finally, with the number of guests set at 14, I didn't limit the number of CPUs booted on the host or limit memory seen by the kernel but boot the kernel with isolcpus=14,15 that will be used to run the vhost threads. 
The guests are pinned to cpus 0-13 and based on which cpu the guest is running on, the corresponding I/O thread is either pinned to cpu 14 or 15. Results # X axis is number of guests # Y axis is netperf number # nr_cpus=8 and mem=12G #Number of Guests#Baseline#ELVIS 11119.3.0 2 1135.6 1130.2 3 1135.5 1131.6 4 1136.0 1127.1 5 1118.6 1129.3 6 1123.4 1129.8 7 1128.7 1135.4 8 1129.9 1137.5 9 1130.6 1135.1 101129.3 1138.9 14* 1173.8 1216.9 I'm a bit too busy now, with 2.4 and related stuff, will review once we finish 2.4. But I'd like to ask two things: - did you actually test a config where cgroups were used? Here are some numbers with a simple cgroup setup. Three cgroups with cpusets cpu=0,2,4 for cgroup1, cpu=1,3,5 for cgroup2 and cpu=6,7 for cgroup3 (even though 6,7 have different numa nodes) I run netperf for 1 to 9 guests starting with assigning the first guest to cgroup1, second to cgroup2, third to cgroup3 and repeat this sequence upto 9 guests. The numbers - (TCP_STREAM + TCP_MAERTS)/2 #Number of Guests #ELVIS (Mbps) 11056.9 21122.5 31122.8 41123.2 51122.6 61110.3 71116.3 81121.8 91118.5 Maybe, my cgroup setup was too simple but these numbers are comparable to the no cgroups results above. I wrote some tracing code to trace cgroup_match_groups() and find cgroup search overhead but it seemed unnecessary for this particular test. - does the design address the issue of VM 1 being blocked (e.g. because it hits swap) and blocking VM 2? Good question. I haven't thought of this yet. But IIUC, the worker thread will complete VM1's job and then move on to executing VM2's scheduled work. It doesn't matter if VM1 is blocked currently. I think it would be a problem though if/when polling is introduced. #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed. 
# I/O thread runs on CPU 14 or 15 depending on which guest it's serving There's a simple graph at http://people.redhat.com/~bdas/elvis/data/results.png that shows how task affinity results in a jump and even without it, as the number of guests increase, the shared vhost design performs slightly better. Observations: 1. In terms of stock performance, the results are comparable. 2. However, with a tuned setup, even without polling, we see an improvement with the new design. 3. Making the new design
Re: [PATCH kvm-unit-tests] x86: fix last commit
Shih-Wei Li shih...@cs.columbia.edu writes: Hi Paolo, I've tried to apply the patch, and found that it passed most of the problematic tests I mentioned earlier (IPI related, kvmclock_test). However, it stopped still at s3 and couldn't finish it. Do you know what might go wrong? Nothing is wrong, that's the way the test is. You need to resume from qemu for it to proceed and it should quit with 1 for error or 0 for success. Thanks, Shih-Wei On Thu, Jul 30, 2015 at 9:38 AM, Paolo Bonzini pbonz...@redhat.com wrote: Keeping the selector that was loaded from the 32-bit GDT is okay, because only code segment descriptors differ between 32- and 64-bit mode. In fact the same is true for %ss as well, so let's just remove the whole segment loading from load_tss. Thanks to Bandan Das for debugging. Reported-by: Shih-Wei Li shih...@cs.columbia.edu Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- x86/cstart64.S | 6 -- 1 file changed, 6 deletions(-) diff --git a/x86/cstart64.S b/x86/cstart64.S index 8d5ee2d..e947888 100644 --- a/x86/cstart64.S +++ b/x86/cstart64.S @@ -213,12 +213,6 @@ idt_descr: load_tss: lidtq idt_descr - mov $0x10, %eax - mov %ax, %ds - mov %ax, %es - mov %ax, %fs - mov %ax, %gs - mov %ax, %ss mov $(APIC_DEFAULT_PHYS_BASE + APIC_ID), %eax mov (%rax), %eax shr $24, %eax -- 1.8.3.1
Re: [RFC PATCH 0/4] Shared vhost design
Eyal Moscovici eya...@il.ibm.com writes: ... We can start to implement polling, but I am unsure if the cgroups integration will be sufficient. The polling vhost thread should be scheduled all the time and just moving it from one cgroup to the other won't be sufficient. I think it needs a deeper integration to the point where either we have a vhost thread for each cgroup or the vhost thread enforces the cgroup policies over its polled VM guests. Agreed, what we have with cgroups is not sufficient. I am waiting for Tejun et al to comment on our approach :) Michael mentioned whether it's possible to integrate cgroups into workgroups which I think is a more generic and the preferred solution. I just don't know yet how easy/difficult it is to implement this with the new cgroups unified hierarchy. BTW, I am working on the numbers you had asked for. Honestly, I think the cost of cgroups could be similar to running a vhost thread/guest since that is how cgroups integration currently works. But it's good to have the numbers before us. So simple polling by vhost is kind of ok for some guests, but I think to really make it work for a reasonably wide selection of guests/workloads you need to combine it with 1. polling the NIC - it makes no sense to me to only poll one side of the equation; and probably 2. - polling in guest. I agree that we need polling on the NIC which could probably be achieved by using the polling interface introduced in kernel 3.11: http://lwn.net/Articles/551284/ although I never tried using it myself. About your point about polling inside the guest, I think it is orthogonal to polling in the host. Eyal Moscovici HL-Cloud Infrastructure Solutions IBM Haifa Research Lab
Re: [PATCH kvm-unit-tests] x86: fix last commit
Paolo Bonzini pbonz...@redhat.com writes: On 01/08/2015 21:05, Bandan Das wrote: Shih-Wei Li shih...@cs.columbia.edu writes: Hi Paolo, I've tried to apply the patch, and found that it passed most of the problematic tests I mentioned earlier (IPI related, kvmclock_test). However, it stopped still at s3 and couldn't finish it. Do you know what might go wrong? Nothing is wrong, that's the way the test is. You need to resume from qemu for it to proceed and it should quit with 1 for error or 0 for success. Actually it should be using the RTC alarm to wake itself up. But the firmware changed recently and the ACPI PMBASE moved from 0xb000 to 0x600. Try this (untested): Ah thanks! Your patch works for me. Is this one of the static entries in the ACPI tables? I am wondering if we can read this value so it works for everybody. diff --git a/x86/s3.c b/x86/s3.c index d568aa7..d6cfef3 100644 --- a/x86/s3.c +++ b/x86/s3.c @@ -177,7 +177,7 @@ int main(int argc, char **argv) rtc_out(RTC_REG_B, rtc_in(RTC_REG_B) | REG_B_AIE); *(volatile int*)0 = 0; - asm volatile("outw %0, %1" :: "a"((short)0x2400), "d"((short)0xb004) : "memory"); + asm volatile("outw %0, %1" :: "a"((short)0x2400), "d"((short)0x604) : "memory"); while(1) *(volatile int*)0 = 1; It's on my todo list to fix a very similar issue in vmexit.flat. Paolo
Re: kvm-unit-test failed to complete
Shih-Wei Li shih...@cs.columbia.edu writes: On Wed, Jul 29, 2015 at 2:38 PM, Shih-Wei Li shih...@cs.columbia.edu wrote: On Wed, Jul 29, 2015 at 2:30 PM, Christoffer Dall cd...@cs.columbia.edu wrote: Hi Shih-Wei, [Something weird happened when sending these e-mails, you sent two where one seems to be a slight modification of the other?] yes, the previous one just got rejected by the mailing list. sorry about the spam. On Wed, Jul 29, 2015 at 02:18:23PM -0400, Shih-Wei Li wrote: Hi all, This is Shih-Wei, I'm Christoffer's colleague at Columbia University. We have experienced some problems in running kvm-unit-tests in our environment. Here's what we did: ./configure make ./run_test.sh run_test.sh halted in some specific test items and couldn't finish the run. I managed to get it finish running after removing the following items: x86/apic.c: - removed: test_sti_nmi(); test_multiple_nmi(); I'm wondering if there's a dependency on the version of QEMU? Have you tried with the most recent upstream QEMU? I haven't, but I will try. We are using qemu 2.2.50 now. Here's the update. I just used the upstream QEMU to run kvm-unit-test, but it still failed to finish the same test items I mentioned earlier. (Just a cursory look) It seems this commit introduced it - commit 402d4596789335f540a0697955f6dcf43e2bb796 Author: Paolo Bonzini pbonz...@redhat.com Date: Thu Jul 23 09:16:45 2015 +0200 x86: load 64-bit segments into the segment registers kvm-unit-tests was keeping DS/ES/FS/GS loaded with the segment descriptors provided by the multiboot boot loader (which are 32-bit), and instead loading SS with 0. The vmx.flat test failed because KVM did not like doing writes into such an SS. Load again the segment registers after entering 64-bit mode, for both the BSP and the APs. 
Signed-off-by: Paolo Bonzini pbonz...@redhat.com diff --git a/x86/cstart64.S b/x86/cstart64.S index 8d0d95d..8d5ee2d 100644 --- a/x86/cstart64.S +++ b/x86/cstart64.S @@ -213,7 +213,11 @@ idt_descr: load_tss: lidtq idt_descr - mov $0, %eax + mov $0x10, %eax + mov %ax, %ds + mov %ax, %es + mov %ax, %fs + mov %ax, %gs mov %ax, %ss mov $(APIC_DEFAULT_PHYS_BASE + APIC_ID), %eax mov (%rax), %eax Paolo, what is the purpose of initializing %gs and %fs ? It seems if I comment out mov %ax, %gs, the test works. handle_irq() does seem to trigger but the handler doesn't seem to get called if %gs is initialized. Weird.
Re: [RFC PATCH 0/4] Shared vhost design
Eyal Moscovici eya...@il.ibm.com writes: Hi, The test showed the same relative numbers as we got in our internal testing. I was wondering about the configuration in regards to NUMA. From Thanks for confirming. our testing we saw that if the VMs are spread across 2 NUMA nodes then having a shared vhost thread per node performs better then having the two threads in the same core. IIUC, this is similar to my test setup and observations i.e 14* 1173.8 1216.9 In this case, there's a shared vhost thread on CPU 14 for numa node 0 and another on CPU 15 for numa node 1. Guests running on CPUs 0,2,4,6,8,10,12 are serviced by vhost-0 that runs on CPU 14 and guests running on CPUs 1,3,5,7,9,11,13 get serviced by vhost-1 (Numa node 1). I tried some other configurations but this configuration gave me the best results. Eyal, I think it makes sense to add polling on top of these patches and get numbers for them too. Thoughts ? Bandan Eyal Moscovici HL-Cloud Infrastructure Solutions IBM Haifa Research Lab From: Bandan Das b...@redhat.com To: kvm@vger.kernel.org Cc: net...@vger.kernel.org, linux-ker...@vger.kernel.org, m...@redhat.com, Eyal Moscovici/Haifa/IBM@IBMIL, Razya Ladelsky/Haifa/IBM@IBMIL, cgro...@vger.kernel.org, jasow...@redhat.com Date: 07/13/2015 07:08 AM Subject:[RFC PATCH 0/4] Shared vhost design Hello, There have been discussions on improving the current vhost design. The first attempt, to my knowledge was Shirley Ma's patch to create a dedicated vhost worker per cgroup. http://comments.gmane.org/gmane.linux.network/224730 Later, I posted a cmwq based approach for performance comparisions http://comments.gmane.org/gmane.linux.network/286858 More recently was the Elvis work that was presented in KVM Forum 2013 http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf The Elvis patches rely on common vhost thread design for scalability along with polling for performance. Since there are two major changes being proposed, we decided to split up the work. 
The first (this RFC), proposing a re-design of the vhost threading model and the second part (not posted yet) to focus more on improving performance. I am posting this with the hope that we can have a meaningful discussion on the proposed new architecture. We have run some tests to show that the new design is scalable and in terms of performance, is comparable to the current stable design. Test Setup: The testing is based on the setup described in the Elvis proposal. The initial tests are just an aggregate of Netperf STREAM and MAERTS but as we progress, I am happy to run more tests. The hosts are two identical 16 core Haswell systems with point to point network links. For the first 10 runs, with n=1 upto n=10 guests running in parallel, I booted the target system with nr_cpus=8 and mem=12G. The purpose was to do a comparision of resource utilization and how it affects performance. Finally, with the number of guests set at 14, I didn't limit the number of CPUs booted on the host or limit memory seen by the kernel but boot the kernel with isolcpus=14,15 that will be used to run the vhost threads. The guests are pinned to cpus 0-13 and based on which cpu the guest is running on, the corresponding I/O thread is either pinned to cpu 14 or 15. Results # X axis is number of guests # Y axis is netperf number # nr_cpus=8 and mem=12G #Number of Guests#Baseline#ELVIS 11119.3.0 2 1135.6 1130.2 3 1135.5 1131.6 4 1136.0 1127.1 5 1118.6 1129.3 6 1123.4 1129.8 7 1128.7 1135.4 8 1129.9 1137.5 9 1130.6 1135.1 101129.3 1138.9 14* 1173.8 1216.9 #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed. # I/O thread runs on CPU 14 or 15 depending on which guest it's serving There's a simple graph at http://people.redhat.com/~bdas/elvis/data/results.png that shows how task affinity results in a jump and even without it, as the number of guests increase, the shared vhost design performs slightly better. Observations: 1. 
In terms of stock performance, the results are comparable. 2. However, with a tuned setup, even without polling, we see an improvement with the new design. 3. Making the new design simulate old behavior would be a matter of setting the number of guests per vhost threads to 1. 4
Re: max number of VCPUs
Ozgur O Kilic okil...@binghamton.edu writes: Hi, I was trying to find out the max number of vCPUs and the reason for that limit. I found two numbers: one is the recommended max by KVM, which is 160, and the other number was 255. Can someone tell me the reasons for these limits? Why is 160 the recommended max, and why is the absolute max 255? 255 comes from the APIC limitation and 160 because that's the max someone has ever tested? Bandan Thank you. Ozgur Ozan Kilic
Re: [PATCH v3 01/10] KVM: MMU: fix decoding cache type from MTRR
Alex Williamson alex.william...@redhat.com writes: On Wed, 2015-05-13 at 14:42 +0800, Xiao Guangrong wrote: There are some bugs in current get_mtrr_type(); 1: bit 1 of mtrr_state-enabled is corresponding bit 11 of IA32_MTRR_DEF_TYPE MSR which completely control MTRR's enablement that means other bits are ignored if it is cleared 2: the fixed MTRR ranges are controlled by bit 0 of mtrr_state-enabled (bit 10 of IA32_MTRR_DEF_TYPE) 3: if MTRR is disabled, UC is applied to all of physical memory rather than mtrr_state-def_type Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com --- arch/x86/kvm/mmu.c | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) I'm seeing a significant regression in boot performance on Intel hardware with assigned devices that bisects back to this patch. There's a long delay with Seabios between the version splash and execution of option ROMs, and a _very_ long delay with OVMF before the display is initialized. The delay is long enough that users are reporting their previously working VM is hung with 100% CPU usage on v4.2-rc1. Thanks, Alex diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index b78e83f..d00cebd 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2393,19 +2393,20 @@ EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page); static int get_mtrr_type(struct mtrr_state_type *mtrr_state, u64 start, u64 end) { -int i; u64 base, mask; u8 prev_match, curr_match; -int num_var_ranges = KVM_NR_VAR_MTRR; +int i, num_var_ranges = KVM_NR_VAR_MTRR; -if (!mtrr_state-enabled) -return 0xFF; +/* MTRR is completely disabled, use UC for all of physical memory. */ +if (!(mtrr_state-enabled 0x2)) +return MTRR_TYPE_UNCACHABLE; Setting this to UC if MTRR is disabled is too restrictive maybe.. Bandan /* Make end inclusive end, instead of exclusive */ end--; /* Look in fixed ranges. 
Just return the type as per start */ -	if (mtrr_state->have_fixed && (start < 0x100000)) { +	if (mtrr_state->have_fixed && (mtrr_state->enabled & 0x1) && +	    (start < 0x100000)) { int idx; if (start < 0x80000) { @@ -2428,9 +2429,6 @@ static int get_mtrr_type(struct mtrr_state_type *mtrr_state, * Look of multiple ranges matching this address and pick type * as per MTRR precedence */ -	if (!(mtrr_state->enabled & 2)) -		return mtrr_state->def_type; - prev_match = 0xFF; for (i = 0; i < num_var_ranges; ++i) { unsigned short start_state, end_state;
[RFC PATCH 2/4] vhost: Limit the number of devices served by a single worker thread
When the number of devices increase, the universal thread model (introduced in the preceding patch) may end up being the bottleneck. Moreover, a single worker thread also forces us to change cgroups based on the device we are serving. We introduce a worker pool struct that starts with one worker thread and we keep adding more threads when the numbers of devs reaches a certain threshold. The default value is set at 7 but is not based on any empirical data. The value can also be changed by the user with the devs_per_worker module parameter. Note that this patch doesn't change how cgroups work. We still keep moving around the worker thread to the cgroups of the device we are serving at the moment. Signed-off-by: Razya Ladelsky ra...@il.ibm.com Signed-off-by: Bandan Das b...@redhat.com --- drivers/vhost/net.c | 6 +-- drivers/vhost/scsi.c | 3 +- drivers/vhost/vhost.c | 135 +- drivers/vhost/vhost.h | 13 - 4 files changed, 128 insertions(+), 29 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 7d137a4..7bfa019 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -705,7 +705,8 @@ static int vhost_net_open(struct inode *inode, struct file *f) n-vqs[i].vhost_hlen = 0; n-vqs[i].sock_hlen = 0; } - vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX); + if (vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX)) + return dev-err; vhost_poll_init(n-poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev); vhost_poll_init(n-poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev); @@ -801,9 +802,6 @@ static int vhost_net_release(struct inode *inode, struct file *f) sockfd_put(rx_sock); /* Make sure no callbacks are outstanding */ synchronize_rcu_bh(); - /* We do an extra flush before freeing memory, -* since jobs can re-queue themselves. 
*/ - vhost_net_flush(n); kfree(n-dev.vqs); kvfree(n); return 0; diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 6c42936..97de2db 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1601,7 +1601,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) vqs[i] = vs-vqs[i].vq; vs-vqs[i].vq.handle_kick = vhost_scsi_handle_kick; } - vhost_dev_init(vs-dev, vqs, VHOST_SCSI_MAX_VQ); + if (vhost_dev_init(vs-dev, vqs, VHOST_SCSI_MAX_VQ)) + return vs-dev.err; vhost_scsi_init_inflight(vs, NULL); diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 951c96b..6a5d4c0 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -27,11 +27,19 @@ #include linux/kthread.h #include linux/cgroup.h #include linux/module.h +#include linux/moduleparam.h #include vhost.h -/* Just one worker thread to service all devices */ -static struct vhost_worker *worker; +static int __read_mostly devs_per_worker = 7; +module_param(devs_per_worker, int, S_IRUGO); +MODULE_PARM_DESC(devs_per_worker, Setup the number of devices being served by a worker thread); + +/* Only used to give a unique id to a vhost thread at the moment */ +static unsigned int total_vhost_workers; + +/* Pool of vhost threads */ +static struct vhost_pool *vhost_pool; enum { VHOST_MEMORY_MAX_NREGIONS = 64, @@ -270,6 +278,63 @@ static int vhost_worker(void *data) return 0; } +static void vhost_create_worker(struct vhost_dev *dev) +{ + struct vhost_worker *worker; + struct vhost_pool *pool = vhost_pool; + + worker = kzalloc(sizeof(*worker), GFP_KERNEL); + if (!worker) { + dev-err = -ENOMEM; + return; + } + + worker-thread = kthread_create(vhost_worker, + worker, + vhost-%d, + total_vhost_workers); + if (IS_ERR(worker-thread)) { + dev-err = PTR_ERR(worker-thread); + goto therror; + } + + spin_lock_init(worker-work_lock); + INIT_LIST_HEAD(worker-work_list); + list_add(worker-node, pool-workers); + worker-owner = NULL; + worker-num_devices++; + total_vhost_workers++; + dev-worker = 
worker; + dev-worker_assigned = true; + return; + +therror: + if (worker-thread) + kthread_stop(worker-thread); + kfree(worker); +} + +static int vhost_dev_assign_worker(struct vhost_dev *dev) +{ + struct vhost_worker *worker; + + mutex_lock(vhost_pool-pool_lock); + list_for_each_entry(worker, vhost_pool-workers, node) { + if (worker-num_devices devs_per_worker) { + dev-worker = worker; + dev-worker_assigned = true; + worker-num_devices++; + break
[RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users
[RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users
vhost threads are per-device, but in most cases a single thread is enough. This change creates a single thread that is used to serve all guests.

However, this complicates cgroup associations. The current policy is to attach the per-device thread to all cgroups of the parent process that the device is associated with. This is no longer possible if we have a single thread, so we end up moving the thread around to the cgroups of whichever device needs servicing. This is a very inefficient protocol but seems to be the only way to integrate cgroups support.

Signed-off-by: Razya Ladelsky <ra...@il.ibm.com>
Signed-off-by: Bandan Das <b...@redhat.com>
---
 drivers/vhost/scsi.c  |  15 +++--
 drivers/vhost/vhost.c | 150 ++++++++++++++++++++++----------------------------
 drivers/vhost/vhost.h |  19 ++++---
 3 files changed, 97 insertions(+), 87 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index ea32b38..6c42936 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct vhost_scsi_cmd *cmd)
 	llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
-	vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+	vhost_work_queue(vs->dev.worker, &vs->vs_completion_work);
 }
 
 static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd)
@@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
 	}
 	llist_add(&evt->list, &vs->vs_event_list);
-	vhost_work_queue(&vs->dev, &vs->vs_event_work);
+	vhost_work_queue(vs->dev.worker, &vs->vs_event_work);
 }
 
 static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
@@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
 	/* Flush both the vhost poll and vhost work */
 	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
 		vhost_scsi_flush_vq(vs, i);
-	vhost_work_flush(&vs->dev, &vs->vs_completion_work);
-	vhost_work_flush(&vs->dev, &vs->vs_event_work);
+	vhost_work_flush(vs->dev.worker, &vs->vs_completion_work);
+	vhost_work_flush(vs->dev.worker, &vs->vs_event_work);
 
 	/* Wait for all reqs issued before the flush to be finished */
 	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
@@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
-	vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
+	vhost_work_init(&vs->dev, &vs->vs_completion_work,
+			vhost_scsi_complete_cmd_work);
+
+	vhost_work_init(&vs->dev, &vs->vs_event_work,
+			vhost_scsi_evt_work);
 
 	vs->vs_events_nr = 0;
 	vs->vs_events_missed = false;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..951c96b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -11,6 +11,8 @@
  * Generic code for virtio server in host kernel.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/eventfd.h>
 #include <linux/vhost.h>
 #include <linux/uio.h>
@@ -28,6 +30,9 @@
 
 #include "vhost.h"
 
+/* Just one worker thread to service all devices */
+static struct vhost_worker *worker;
+
 enum {
 	VHOST_MEMORY_MAX_NREGIONS = 64,
 	VHOST_MEMORY_F_LOG = 0x1,
@@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	return 0;
 }
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_dev *dev,
+		     struct vhost_work *work, vhost_work_fn_t fn)
 {
 	INIT_LIST_HEAD(&work->node);
 	work->fn = fn;
 	init_waitqueue_head(&work->done);
 	work->flushing = 0;
 	work->queue_seq = work->done_seq = 0;
+	work->dev = dev;
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
@@ -78,7 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 	poll->dev = dev;
 	poll->wqh = NULL;
 
-	vhost_work_init(&poll->work, fn);
+	vhost_work_init(dev, &poll->work, fn);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
@@ -116,30 +123,30 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
-				unsigned seq)
+static bool vhost_work_seq_done(struct vhost_worker *worker,
+				struct vhost_work *work, unsigned seq)
 {
 	int left;
 
-	spin_lock_irq(&dev->work_lock);
+	spin_lock_irq(&worker->work_lock);
 	left = seq - work->done_seq;
-	spin_unlock_irq(&dev->work_lock);
+	spin_unlock_irq(&worker->work_lock);
 	return left <= 0;
 }
 
-void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+void vhost_work_flush(struct vhost_worker *worker, struct vhost_work *work)
 {
 	unsigned seq;
 	int flushing;
 
-	spin_lock_irq(&dev->work_lock
[RFC PATCH 4/4] vhost: Add cgroup-aware creation of worker threads
With the help of the cgroup function to compare groups introduced in the previous patch, this changes the worker creation policy: if the new device belongs to different cgroups than any of the devices we are currently serving, we create a new worker thread even if we haven't reached the devs_per_worker threshold.

Signed-off-by: Bandan Das <b...@redhat.com>
---
 drivers/vhost/vhost.c | 47 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6a5d4c0..dc0fa37 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -261,12 +261,6 @@ static int vhost_worker(void *data)
 			use_mm(dev->mm);
 		}
 
-		/* TODO: Consider a more elegant solution */
-		if (worker->owner != dev->owner) {
-			/* Should check for return value */
-			cgroup_attach_task_all(dev->owner, current);
-			worker->owner = dev->owner;
-		}
 		work->fn(work);
 		if (need_resched())
 			schedule();
@@ -278,6 +272,36 @@ static int vhost_worker(void *data)
 	return 0;
 }
 
+struct vhost_attach_cgroups_struct {
+	struct vhost_work work;
+	struct task_struct *owner;
+	int ret;
+};
+
+static void vhost_attach_cgroups_work(struct vhost_work *work)
+{
+	struct vhost_attach_cgroups_struct *s;
+
+	s = container_of(work, struct vhost_attach_cgroups_struct, work);
+	s->ret = cgroup_attach_task_all(s->owner, current);
+}
+
+static void vhost_attach_cgroups(struct vhost_dev *dev,
+				 struct vhost_worker *worker)
+{
+	struct vhost_attach_cgroups_struct attach;
+
+	attach.owner = dev->owner;
+	vhost_work_init(dev, &attach.work, vhost_attach_cgroups_work);
+	vhost_work_queue(worker, &attach.work);
+	vhost_work_flush(worker, &attach.work);
+
+	if (!attach.ret)
+		worker->owner = dev->owner;
+
+	dev->err = attach.ret;
+}
+
 static void vhost_create_worker(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker;
@@ -300,8 +324,14 @@ static void vhost_create_worker(struct vhost_dev *dev)
 	spin_lock_init(&worker->work_lock);
 	INIT_LIST_HEAD(&worker->work_list);
+
+	/* attach to the cgroups of the process that created us */
+	vhost_attach_cgroups(dev, worker);
+	if (dev->err)
+		goto therror;
+	worker->owner = dev->owner;
+
 	list_add(&worker->node, &pool->workers);
-	worker->owner = NULL;
 	worker->num_devices++;
 	total_vhost_workers++;
 	dev->worker = worker;
@@ -320,7 +350,8 @@ static int vhost_dev_assign_worker(struct vhost_dev *dev)
 
 	mutex_lock(&vhost_pool->pool_lock);
 	list_for_each_entry(worker, &vhost_pool->workers, node) {
-		if (worker->num_devices < devs_per_worker) {
+		if (worker->num_devices < devs_per_worker &&
+		    (!cgroup_match_groups(dev->owner, worker->owner))) {
 			dev->worker = worker;
 			dev->worker_assigned = true;
 			worker->num_devices++;
-- 
2.4.3
[RFC PATCH 3/4] cgroup: Introduce a function to compare cgroups
This function takes two tasks and iterates through all hierarchies to check if they belong to the same cgroups. It ignores the check for default hierarchies or for hierarchies with no subsystems attached. This function will be used by the next patch to add rudimentary cgroup support to vhost workers.

Signed-off-by: Bandan Das <b...@redhat.com>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b9cb94c..606fb5b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -933,6 +933,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
 void css_task_iter_end(struct css_task_iter *it);
 
+int cgroup_match_groups(struct task_struct *tsk1, struct task_struct *tsk2);
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
 int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 469dd54..ba4121e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2465,6 +2465,46 @@ out_unlock_cgroup:
 }
 
 /**
+ * cgroup_match_groups - check if tsk1 and tsk2 belong to
+ * same cgroups in all hierarchies
+ * Returns 0 on success
+ */
+int cgroup_match_groups(struct task_struct *tsk1, struct task_struct *tsk2)
+{
+	struct cgroup_root *root;
+	int retval = 0;
+
+	WARN_ON(!tsk1 || !tsk2);
+
+	mutex_lock(&cgroup_mutex);
+	for_each_root(root) {
+		struct cgroup *cg_tsk1;
+		struct cgroup *cg_tsk2;
+
+		/* Default hierarchy */
+		if (root == &cgrp_dfl_root)
+			continue;
+		/* No subsystems attached */
+		if (!root->subsys_mask)
+			continue;
+
+		down_read(&css_set_rwsem);
+		cg_tsk1 = task_cgroup_from_root(tsk1, root);
+		cg_tsk2 = task_cgroup_from_root(tsk2, root);
+		up_read(&css_set_rwsem);
+
+		if (cg_tsk1 != cg_tsk2) {
+			retval = 1;
+			break;
+		}
+	}
+	mutex_unlock(&cgroup_mutex);
+
+	return retval;
+}
+EXPORT_SYMBOL_GPL(cgroup_match_groups);
+
+/**
 * cgroup_attach_task_all - attach task 'tsk' to all cgroups of task 'from'
 * @from: attach to all cgroups of a given task
 * @tsk: the task to be attached
-- 
2.4.3
[RFC PATCH 0/4] Shared vhost design
Hello,

There have been discussions on improving the current vhost design. The first attempt, to my knowledge, was Shirley Ma's patch to create a dedicated vhost worker per cgroup:
http://comments.gmane.org/gmane.linux.network/224730

Later, I posted a cmwq-based approach for performance comparisons:
http://comments.gmane.org/gmane.linux.network/286858

More recent is the Elvis work that was presented at KVM Forum 2013:
http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf

The Elvis patches rely on a common vhost thread design for scalability, along with polling for performance. Since there are two major changes being proposed, we decided to split up the work: the first part (this RFC) proposes a re-design of the vhost threading model, and the second part (not posted yet) focuses more on improving performance. I am posting this with the hope that we can have a meaningful discussion on the proposed new architecture. We have run some tests to show that the new design is scalable and, in terms of performance, comparable to the current stable design.

Test setup:

The testing is based on the setup described in the Elvis proposal. The initial tests are just an aggregate of netperf STREAM and MAERTS, but as we progress, I am happy to run more tests. The hosts are two identical 16-core Haswell systems with point-to-point network links. For the first 10 runs, with n=1 up to n=10 guests running in parallel, I booted the target system with nr_cpus=8 and mem=12G. The purpose was to compare resource utilization and how it affects performance. Finally, with the number of guests set at 14, I didn't limit the number of CPUs booted on the host or the memory seen by the kernel, but booted the kernel with isolcpus=14,15; those two CPUs are used to run the vhost threads. The guests are pinned to cpus 0-13 and, based on which cpu a guest is running on, the corresponding I/O thread is pinned to either cpu 14 or 15.
Results:

# X axis is number of guests
# Y axis is netperf number
# nr_cpus=8 and mem=12G

#Number of Guests    #Baseline    #ELVIS
  1                   1119.3         .0
  2                   1135.6      1130.2
  3                   1135.5      1131.6
  4                   1136.0      1127.1
  5                   1118.6      1129.3
  6                   1123.4      1129.8
  7                   1128.7      1135.4
  8                   1129.9      1137.5
  9                   1130.6      1135.1
 10                   1129.3      1138.9
 14*                  1173.8      1216.9

#* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed.
# I/O thread runs on CPU 14 or 15 depending on which guest it's serving

There's a simple graph at
http://people.redhat.com/~bdas/elvis/data/results.png
that shows how task affinity results in a jump, and that even without it, as the number of guests increases, the shared vhost design performs slightly better.

Observations:
1. In terms of stock performance, the results are comparable.
2. However, with a tuned setup, even without polling, we see an improvement with the new design.
3. Making the new design simulate the old behavior is a matter of setting the number of guests per vhost thread to 1.
4. Maybe a per-guest limit on the work being done by a specific vhost thread is needed for it to be fair.
5. The cgroup associations need to be figured out. I just slightly hacked the current cgroup association mechanism to work with the new model. Cc'ing cgroups for input/comments.

Many thanks to Razya Ladelsky and Eyal Moscovici, IBM, for the initial patches, the helpful testing suggestions and discussions.
Bandan Das (4):
  vhost: Introduce a universal thread to serve all users
  vhost: Limit the number of devices served by a single worker thread
  cgroup: Introduce a function to compare cgroups
  vhost: Add cgroup-aware creation of worker threads

 drivers/vhost/net.c    |   6 +-
 drivers/vhost/scsi.c   |  18 ++--
 drivers/vhost/vhost.c  | 272 +++++++++++++++++++++++++++++++++----------------
 drivers/vhost/vhost.h  |  32 +++++-
 include/linux/cgroup.h |   1 +
 kernel/cgroup.c        |  40 ++++++++
 6 files changed, 275 insertions(+), 94 deletions(-)

-- 
2.4.3
Re: [PATCH] KVM: x86: Add host physical address width capability
Laszlo Ersek <ler...@redhat.com> writes:

> On 07/10/15 16:59, Paolo Bonzini wrote:
>> On 10/07/2015 16:57, Laszlo Ersek wrote:
>>> ...
>>> In any case, please understand that I'm not campaigning for this
>>> warning :) IIRC the warning was your (very welcome!) idea after I
>>> reported the problem; I'm just trying to ensure that the warning
>>> match the exact issue I encountered.
>>
>> Yup. I think the right thing to do would be to hide memory above the
>> limit.
>
> How so? -- The stack would not be doing what the user asks for. Pass
> -m a_lot, and the guest would silently see less memory. If the user
> found out, he'd immediately ask (or set out debugging) why. I think if
> the user's request cannot be satisfied, the stack should fail hard.
>
>> That's another possibility. I think both of them are wrong depending
>> on _why_ you're using -m a lot in the first place. Considering that
>> this really happens (on Xeons) only for 1TB+ guests,
>
> I reported this issue because I ran into it with a ~64GB guest. From
> my /proc/cpuinfo:
>
> model name    : Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz
> address sizes : 36 bits physical, 48 bits virtual
>
> I was specifically developing 64GB+ support for OVMF, and this
> limitation caused me to think that there was a bug in my OVMF patches.
> (There wasn't.) An error message from QEMU, advising me to turn off
> EPT, would have saved me many hours.

Right, I specifically reserved a system with 36 bits physical to
reproduce this and it was very easy to reproduce. If it's a hardware
bug, I would say it's a very annoying one (if not serious). I wonder if
the Intel folks can chime in.

Thanks Laszlo!

>> it's probably just for debugging and then hiding the memory makes
>> some sense.

Actually, I agree with Laszlo here. Hiding memory amounts to forcing
the user to pass a smaller -m argument, just as failing does. But
failing and letting the user adjust it himself can save hours of
debugging.

Regards,
The confused teenager who can't make up his mind.

> Paolo
Re: [PATCH] KVM: x86: Add host physical address width capability
Paolo Bonzini <pbonz...@redhat.com> writes:

> On 09/07/2015 08:43, Laszlo Ersek wrote:
>> On 07/09/15 08:09, Paolo Bonzini wrote:
>>> On 09/07/2015 00:36, Bandan Das wrote:
>>>> Let userspace inquire the maximum physical address width of the
>>>> host processors; this can be used to identify maximum memory that
>>>> can be assigned to the guest.
>>>>
>>>> Reported-by: Laszlo Ersek <ler...@redhat.com>
>>>> Signed-off-by: Bandan Das <b...@redhat.com>
>>>> ---
>>>>  arch/x86/kvm/x86.c       | 3 +++
>>>>  include/uapi/linux/kvm.h | 1 +
>>>>  2 files changed, 4 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index bbaf44e..97d6746 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>  	case KVM_CAP_NR_MEMSLOTS:
>>>>  		r = KVM_USER_MEM_SLOTS;
>>>>  		break;
>>>> +	case KVM_CAP_PHY_ADDR_WIDTH:
>>>> +		r = boot_cpu_data.x86_phys_bits;
>>>> +		break;
>>>
>>> Userspace can just use CPUID, can't it?
>>
>> I believe KVM's cooperation is necessary, for the following reason:
>>
>> The truncation only occurs when the guest-phys -> host-phys
>> translation is done in hardware, *and* the phys bits of the host
>> processor are insufficient to represent the highest guest-phys
>> address that the guest will ever face.
>>
>> The first condition (of course) means that the truncation depends on
>> EPT being enabled. (I didn't test on AMD so I don't know if RVI has
>> the same issue.) If EPT is disabled, either because the host
>> processor lacks it, or because the respective kvm_intel module
>> parameter is set so, then the issue cannot be experienced.
>>
>> Therefore I believe a KVM patch is necessary. However, this specific
>> patch doesn't seem sufficient; it should also consider whether EPT is
>> enabled. (And the ioctl should be perhaps renamed to reflect that --
>> what QEMU needs to know is not the raw physical address width of the
>> host processor, but whether that width will cause EPT to silently
>> truncate high guest-phys addresses.)
>
> Right; if you want to consider whether EPT is enabled (which is the
> right thing to do, albeit it makes for a much bigger patch) a KVM
> patch is necessary. In that case you also need to patch the API
> documentation.

Note that this patch really doesn't do anything except for printing a
message that something might potentially go wrong. Without EPT, you
don't hit the processor limitation with your setup, but the user should
nevertheless still be notified. In fact, I think the shadow paging code
should also emulate this behavior if the gpa is out of range.

> Paolo
Re: [Qemu-devel] [PATCH] target-i386: Sanity check host processor physical address width
Laszlo Ersek <ler...@redhat.com> writes:

> ...
> First, see my comments on the KVM patch.
>
> Second, ram_size is not the right thing to compare. What should be
> checked is whether the highest guest-physical address that maps to RAM
> can be represented in the address width of the host processor (and
> only if EPT is enabled, but that sub-condition belongs to the KVM
> patch). Note that this is not the same as the check written in the
> patch.
>
> For example, if you assume a 32-bit PCI hole with size 1 GB, then a
> total guest RAM of size 63 GB will result in the highest guest-phys
> memory address being 0xF_FFFF_FFFF, which just fits into 36 bits.
> Correspondingly, the above code would not print the warning for
> -m $((63 * 1024 + 1)) on my laptop (which has "address sizes : 36 bits
> physical, ..."), even though such a guest would not boot for me (with
> EPT enabled). Please see
> http://thread.gmane.org/gmane.comp.bios.tianocore.devel/15418/focus=15447
>
> So, ram_size in the controlling expression should be replaced with
> maximum_guest_ram_address (which should be inclusive, and the >= relop
> should be preserved).
>
>> also with memory hotplug turned on we should check if the end of the
>> hotplug memory area is less than the limit, i.e.:
>> pcms->hotplug_memory.base + hotplug_mem_size < (1ULL << max_phys_bits)
>
> Seems reasonable, thanks for the hint!

Thanks Igor and Laszlo, makes sense. I am wondering if this 1GB PCI
hole is always fixed, so that I can simply include it in calculating
the maximum guest ram address? Or do we have to figure that out every
time?

> (The LHS in this instance is exclusive though, so equality should
> *not* trigger the warning. maximum_guest_ram_address is inclusive, and
> equality should trigger the warning. (Although equality seems quite
> impossible in practice.))
>
> Thanks!
> Laszlo
Re: [PATCH] KVM: x86: Add host physical address width capability
Bandan Das <b...@redhat.com> writes:

> Laszlo Ersek <ler...@redhat.com> writes:
>
>> ...
>>> Yes. Without EPT, you don't hit the processor limitation with your
>>> setup, but the user should nevertheless still be notified.
>>
>> I disagree.
>>
>>> In fact, I think shadow paging code should also emulate this
>>> behavior if the gpa is out of range.
>>
>> I disagree. There is no "out of range" gpa. QEMU allocates enough
>> memory, and it should be completely transparent to the guest. The
>> fact that it silently breaks with nested paging if the host processor
>> doesn't have enough address bits is a bug (maybe a hardware bug,
>> maybe a KVM bug; I'm not sure, but I suspect it's a hardware bug). In
>> any case the guest shouldn't care at all. It is a *virtual* machine,
>> and the VMM should lie to it plausibly enough. How much RAM, and how
>> many phys address bits the host has, is a performance question, but
>> it should not be a correctness question. A 256 GB guest should run
>> (slowly, but correctly) on a laptop that has only 4 GB of RAM and
>> only 36 phys addr bits, but plenty of swap space.
>>
>> Because otherwise your argument could be extrapolated as "TCG should
>> break too if the gpa is 'out of range'". So, I disagree. Whatever
>> memory you give to the guest should just work (unless of course you
>> want to emulate a small address width for the *VCPU*, but that's
>> absolutely not the use case here).
>>
>> What we have here is a leaky abstraction: a PCPU limitation giving
>> away a lie that the guest should never notice. The guest should be
>> able to use all memory that was specified with QEMU's -m, regardless
>> of TCG vs. KVM-without-EPT vs. KVM-with-EPT. If the last case cannot
>> work (due to hardware limitations), that's fine, but then (and only
>> then) a warning should be printed.
>
> Hmm... Ok, I understand your point. So, this is more like an EPT
> limitation/bug, in that Qemu isn't complaining about the memory
> assigned to the guest but the EPT code is breaking owing to the
> processor's physical address width.
>
> And honestly, I now think that this patch just makes the whole
> situation more confusing :) I am wondering if it's just possible for
> kvm to simply throw an error like an EPT misconfiguration or
> something. Or in other words, if using a hardware-assisted mechanism
> is just not possible, KVM would simply refuse to run instead of
> letting the guest get stuck at boot.

I noticed that when the guest gets stuck, the trace shows an endless
loop of EXTERNAL_INTERRUPT exits with code 14 (PF). There's a note in
section 28.2.2 of the spec that "No processors supporting the Intel64
architecture support more than 48 physical-address bits", and that an
attempt to use such an address causes a page fault. So, my first guess
was to print out the guest physical address. That seems to be well
beyond the range and is always 0xff000 (when the guest is stuck). The
other thing I can think of is that the EPT entries have bits set in the
51:N range, which is reserved and always 0. I haven't verified, but it
looks like ept_misconfig_inspect_spte() should already catch this
condition. I am out of ideas for today :)

>> ...
>> In any case, please understand that I'm not campaigning for this
>> warning :) IIRC the warning was your (very welcome!) idea after I
>> reported the problem; I'm just trying to ensure that the warning
>> match the exact issue I encountered.
>>
>> Thanks!
>> Laszlo
Re: [PATCH] target-i386: Sanity check host processor physical address width
Paolo Bonzini <pbonz...@redhat.com> writes:

> On 09/07/2015 10:26, Laszlo Ersek wrote:
>>> Perhaps KVM could simply hide memory above the limit (i.e. treat it
>>> as MMIO), and the BIOS could remove RAM above the limit from the
>>> e820 memory map?
>>
>> I'd prefer to leave the guest firmware*s* out of this... :)
>>
>> E820 is a legacy BIOS concept. In OVMF we'd have to hack the memory
>> resource descriptor HOBs (which in turn control the DXE memory space
>> map, which in turn controls the UEFI memory map). Those HOBs are
>> currently based on what the CMOS reports about the RAM available
>> under and above 4GB. It's pretty complex already (will get more
>> complex with SMM support), and TBH, for working around such an
>> obscure issue, I wouldn't like to complicate it even further...
>>
>> After all, this is a host platform limitation. The solution should be
>> to either move to a more capable host, or do it in software (disable
>> EPT).
>
> The reason I mentioned the firmware is because you could in principle
> have the same issue on real hardware -- say, putting 128 GB on your
> laptop. The firmware should cope with it.

Agreed, it's probably not a good idea to deviate too much from how real
hardware would behave, IMO. As a simplification of Paolo's idea, is it
possible for qemu to completely ignore memory above the limit? Will
that break anything? :)

> If OVMF does not use etc/e820, it can instead hack the values it reads
> from CMOS, bounding them according to the CPUID value.
>
> Paolo
Re: [PATCH] KVM: x86: Add host physical address width capability
Laszlo Ersek <ler...@redhat.com> writes:

> ...
>> Yes. Without EPT, you don't hit the processor limitation with your
>> setup, but the user should nevertheless still be notified.
>
> I disagree.
>
>> In fact, I think shadow paging code should also emulate this behavior
>> if the gpa is out of range.
>
> I disagree. There is no "out of range" gpa. QEMU allocates enough
> memory, and it should be completely transparent to the guest. The fact
> that it silently breaks with nested paging if the host processor
> doesn't have enough address bits is a bug (maybe a hardware bug, maybe
> a KVM bug; I'm not sure, but I suspect it's a hardware bug). In any
> case the guest shouldn't care at all. It is a *virtual* machine, and
> the VMM should lie to it plausibly enough. How much RAM, and how many
> phys address bits the host has, is a performance question, but it
> should not be a correctness question. A 256 GB guest should run
> (slowly, but correctly) on a laptop that has only 4 GB of RAM and only
> 36 phys addr bits, but plenty of swap space.
>
> Because otherwise your argument could be extrapolated as "TCG should
> break too if the gpa is 'out of range'". So, I disagree. Whatever
> memory you give to the guest should just work (unless of course you
> want to emulate a small address width for the *VCPU*, but that's
> absolutely not the use case here).
>
> What we have here is a leaky abstraction: a PCPU limitation giving
> away a lie that the guest should never notice. The guest should be
> able to use all memory that was specified with QEMU's -m, regardless
> of TCG vs. KVM-without-EPT vs. KVM-with-EPT. If the last case cannot
> work (due to hardware limitations), that's fine, but then (and only
> then) a warning should be printed.

Hmm... Ok, I understand your point. So, this is more like an EPT
limitation/bug, in that Qemu isn't complaining about the memory
assigned to the guest but the EPT code is breaking owing to the
processor's physical address width.

And honestly, I now think that this patch just makes the whole
situation more confusing :) I am wondering if it's just possible for
kvm to simply throw an error like an EPT misconfiguration or something.
Or in other words, if using a hardware-assisted mechanism is just not
possible, KVM would simply refuse to run instead of letting the guest
get stuck at boot.

> ...
> In any case, please understand that I'm not campaigning for this
> warning :) IIRC the warning was your (very welcome!) idea after I
> reported the problem; I'm just trying to ensure that the warning
> match the exact issue I encountered.
>
> Thanks!
> Laszlo
Re: [Qemu-devel] [PATCH] target-i386: Sanity check host processor physical address width
Igor Mammedov <imamm...@redhat.com> writes:

> On Wed, 08 Jul 2015 18:42:01 -0400
> Bandan Das <b...@redhat.com> wrote:
>
>> If a Linux guest is assigned more memory than is supported by the
>> host processor, the guest is unable to boot. That is expected;
>> however, there's no message indicating to the user what went wrong.
>> This change prints a message to stderr if KVM has the corresponding
>> capability.
>>
>> Reported-by: Laszlo Ersek <ler...@redhat.com>
>> Signed-off-by: Bandan Das <b...@redhat.com>
>> ---
>>  linux-headers/linux/kvm.h | 1 +
>>  target-i386/kvm.c         | 6 ++++++
>>  2 files changed, 7 insertions(+)
>>
>> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
>> index 3bac873..6afad49 100644
>> --- a/linux-headers/linux/kvm.h
>> +++ b/linux-headers/linux/kvm.h
>> @@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
>>  #define KVM_CAP_DISABLE_QUIRKS 116
>>  #define KVM_CAP_X86_SMM 117
>>  #define KVM_CAP_MULTI_ADDRESS_SPACE 118
>> +#define KVM_CAP_PHY_ADDR_WIDTH 119
>>  
>>  #ifdef KVM_CAP_IRQ_ROUTING
>> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
>> index 066d03d..66e3448 100644
>> --- a/target-i386/kvm.c
>> +++ b/target-i386/kvm.c
>> @@ -892,6 +892,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>      uint64_t shadow_mem;
>>      int ret;
>>      struct utsname utsname;
>> +    int max_phys_bits;
>>  
>>      ret = kvm_get_supported_msrs(s);
>>      if (ret < 0) {
>> @@ -945,6 +946,11 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>          }
>>      }
>>  
>> +    max_phys_bits = kvm_check_extension(s, KVM_CAP_PHY_ADDR_WIDTH);
>
> max_phys_bits seems generic enough and could be applied to other
> targets as well.

I am a little clueless about other targets, but is figuring this out
from userspace as simple as it is on x86 (cpuid)? If not, then I agree,
this could be made a generic value.

Bandan

> making it a property of machine would make accessing/manipulating it
> easier. define a default value for machine/TCG mode, and when KVM is
> enabled it would override/set its own limit. then any board could
> easily access machine->max_gpa to make board-specific checks.
>
>> +    if (max_phys_bits && (1ULL << max_phys_bits) <= ram_size)
>> +        fprintf(stderr, "Warning: The amount of memory assigned to the guest"
>> +                " is more than that supported by the host CPU(s). Guest may be unstable.\n");
>> +
>>      if (kvm_check_extension(s, KVM_CAP_X86_SMM)) {
>>          smram_machine_done.notify = register_smram_listener;
>>          qemu_add_machine_init_done_notifier(smram_machine_done);
[PATCH] KVM: x86: Add host physical address width capability
Let userspace inquire the maximum physical address width of the host processors; this can be used to identify the maximum memory that can be assigned to the guest.

Reported-by: Laszlo Ersek <ler...@redhat.com>
Signed-off-by: Bandan Das <b...@redhat.com>
---
 arch/x86/kvm/x86.c       | 3 +++
 include/uapi/linux/kvm.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bbaf44e..97d6746 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_NR_MEMSLOTS:
 		r = KVM_USER_MEM_SLOTS;
 		break;
+	case KVM_CAP_PHY_ADDR_WIDTH:
+		r = boot_cpu_data.x86_phys_bits;
+		break;
 	case KVM_CAP_PV_MMU:	/* obsolete */
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 716ad4a..e7949a1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_DISABLE_QUIRKS 116
 #define KVM_CAP_X86_SMM 117
 #define KVM_CAP_MULTI_ADDRESS_SPACE 118
+#define KVM_CAP_PHY_ADDR_WIDTH 119
 
 #ifdef KVM_CAP_IRQ_ROUTING
-- 
2.4.3
[PATCH] target-i386: Sanity check host processor physical address width
If a Linux guest is assigned more memory than is supported by the host processor, the guest is unable to boot. That is expected; however, there's no message indicating to the user what went wrong. This change prints a message to stderr if KVM has the corresponding capability.

Reported-by: Laszlo Ersek <ler...@redhat.com>
Signed-off-by: Bandan Das <b...@redhat.com>
---
 linux-headers/linux/kvm.h | 1 +
 target-i386/kvm.c         | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 3bac873..6afad49 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_DISABLE_QUIRKS 116
 #define KVM_CAP_X86_SMM 117
 #define KVM_CAP_MULTI_ADDRESS_SPACE 118
+#define KVM_CAP_PHY_ADDR_WIDTH 119
 
 #ifdef KVM_CAP_IRQ_ROUTING
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 066d03d..66e3448 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -892,6 +892,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
     uint64_t shadow_mem;
     int ret;
     struct utsname utsname;
+    int max_phys_bits;
 
     ret = kvm_get_supported_msrs(s);
     if (ret < 0) {
@@ -945,6 +946,11 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
+    max_phys_bits = kvm_check_extension(s, KVM_CAP_PHY_ADDR_WIDTH);
+    if (max_phys_bits && (1ULL << max_phys_bits) <= ram_size)
+        fprintf(stderr, "Warning: The amount of memory assigned to the guest"
+                " is more than that supported by the host CPU(s). Guest may be unstable.\n");
+
     if (kvm_check_extension(s, KVM_CAP_X86_SMM)) {
         smram_machine_done.notify = register_smram_listener;
         qemu_add_machine_init_done_notifier(smram_machine_done);
-- 
2.4.3
Re: [Bug 100671] New: vmwrite error in vmx_vcpu_run
bugzilla-dae...@bugzilla.kernel.org writes: Can you please provide a little bit more information ? When does the write error happen and what guest/host are you running ? If it's a regression, would it be possible for you to bisect it ? Is Bug 100661 related to the same hardware ? Bandan https://bugzilla.kernel.org/show_bug.cgi?id=100671 Bug ID: 100671 Summary: vmwrite error in vmx_vcpu_run Product: Virtualization Version: unspecified Kernel Version: 4.1.0+ Hardware: Intel OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: kvm Assignee: virtualization_...@kernel-bugs.osdl.org Reporter: jemmy858...@gmail.com Regression: No Created attachment 181321 -- https://bugzilla.kernel.org/attachment.cgi?id=181321action=edit dmesg.txt [ 1614.416934] vmwrite error: reg 6820 value 6 (err 1) [ 1614.416939] CPU: 1 PID: 15701 Comm: qemu-system-x86 Not tainted 4.1.0+ #7 [ 1614.416940] Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET94WW (2.54 ) 04/30/2013 [ 1614.416941] 6138f41b 880007d1fc98 8166ab0d [ 1614.416943] 880007c78000 880007d1fca8 a082121a [ 1614.416945] 880007d1fcb8 a082123f 880007d1fcc8 a08215ba [ 1614.416946] Call Trace: [ 1614.416952] [8166ab0d] dump_stack+0x45/0x57 [ 1614.416961] [a082121a] vmwrite_error+0x2a/0x30 [kvm_intel] [ 1614.416963] [a082123f] vmcs_writel+0x1f/0x30 [kvm_intel] [ 1614.416966] [a08215ba] vmx_set_rflags+0x3a/0x40 [kvm_intel] [ 1614.416978] [a07a97fa] __kvm_set_rflags+0x4a/0x60 [kvm] [ 1614.416985] [a07af7a0] x86_emulate_instruction+0x590/0x740 [kvm] [ 1614.416988] [810863b1] ? __set_task_blocked+0x41/0xa0 [ 1614.416995] [a07afc6c] complete_emulated_pio+0x3c/0x60 [kvm] [ 1614.417003] [a07b34bc] kvm_arch_vcpu_ioctl_run+0x3cc/0x420 [kvm] [ 1614.417009] [a079b81e] kvm_vcpu_ioctl+0x33e/0x600 [kvm] [ 1614.417011] [81204fa4] do_vfs_ioctl+0x2c4/0x4a0 [ 1614.417013] [8129e05d] ? 
selinux_file_ioctl+0x4d/0xc0 [ 1614.417015] [812051f9] SyS_ioctl+0x79/0x90 [ 1614.417017] [8167126e] entry_SYSCALL_64_fastpath+0x12/0x71 [ 1614.417020] vmwrite error: reg c08 value 0 (err 1) [ 1614.417021] CPU: 1 PID: 15701 Comm: qemu-system-x86 Not tainted 4.1.0+ #7 [ 1614.417022] Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET94WW (2.54 ) 04/30/2013 [ 1614.417023] 6138f41b 880007d1fc78 8166ab0d [ 1614.417024] 880007c78000 880007d1fc88 a082121a [ 1614.417026] 880007d1fc98 a082123f 880007d1fcd8 a0823bdc [ 1614.417027] Call Trace: [ 1614.417029] [8166ab0d] dump_stack+0x45/0x57 [ 1614.417032] [a082121a] vmwrite_error+0x2a/0x30 [kvm_intel] [ 1614.417034] [a082123f] vmcs_writel+0x1f/0x30 [kvm_intel] [ 1614.417036] [a0823bdc] vmx_save_host_state+0x1ac/0x1e0 [kvm_intel] [ 1614.417043] [a07ac577] vcpu_enter_guest+0x4f7/0xdb0 [kvm] [ 1614.417050] [a07af7a0] ? x86_emulate_instruction+0x590/0x740 [kvm] [ 1614.417051] [810863b1] ? __set_task_blocked+0x41/0xa0 [ 1614.417058] [a07b33e5] kvm_arch_vcpu_ioctl_run+0x2f5/0x420 [kvm] [ 1614.417063] [a079b81e] kvm_vcpu_ioctl+0x33e/0x600 [kvm] [ 1614.417065] [81204fa4] do_vfs_ioctl+0x2c4/0x4a0 [ 1614.417067] [8129e05d] ? 
selinux_file_ioctl+0x4d/0xc0 [ 1614.417069] [812051f9] SyS_ioctl+0x79/0x90 [ 1614.417070] [8167126e] entry_SYSCALL_64_fastpath+0x12/0x71 [ 1614.417071] vmwrite error: reg c0a value 0 (err 1) [ 1614.417073] CPU: 1 PID: 15701 Comm: qemu-system-x86 Not tainted 4.1.0+ #7 [ 1614.417073] Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET94WW (2.54 ) 04/30/2013 [ 1614.417074] 6138f41b 880007d1fc78 8166ab0d [ 1614.417075] 880007c78000 880007d1fc88 a082121a [ 1614.417077] 880007d1fc98 a082123f 880007d1fcd8 a0823acb [ 1614.417078] Call Trace: [ 1614.417080] [8166ab0d] dump_stack+0x45/0x57 [ 1614.417082] [a082121a] vmwrite_error+0x2a/0x30 [kvm_intel] [ 1614.417084] [a082123f] vmcs_writel+0x1f/0x30 [kvm_intel] [ 1614.417087] [a0823acb] vmx_save_host_state+0x9b/0x1e0 [kvm_intel] [ 1614.417093] [a07ac577] vcpu_enter_guest+0x4f7/0xdb0 [kvm] [ 1614.417099] [a07af7a0] ? x86_emulate_instruction+0x590/0x740 [kvm] [ 1614.417100] [810863b1] ? __set_task_blocked+0x41/0xa0 [ 1614.417107] [a07b33e5] kvm_arch_vcpu_ioctl_run+0x2f5/0x420 [kvm] [ 1614.417112] [a079b81e] kvm_vcpu_ioctl+0x33e/0x600 [kvm] [ 1614.417114]
Re: Hang on reboot in FreeBSD guest on Linux KVM host
Paolo Bonzini pbonz...@redhat.com writes:

> ...
> I did this:
>
> 1) download http://download.pcbsd.org/iso/10.1-RELEASE/amd64/PCBSD10.1.2-x64-trueos-server.raw.xz and unpack it
> 2) run it with qemu-kvm -drive if=virtio,PCBSD10.1.2-x64-trueos-server.raw -smp 2
> 3) login as root/pcbsd, type reboot
>
> I would like to know if I'm doing anything wrong. My machine is a Xeon E5 v3 (Haswell).

My SeaBIOS build doesn't have the atkbd0 bug, but I thought (from the bug report) that the bug is only reproducible on v2 Xeons ? Just to rule that out, can you send me your SeaBIOS binary (/usr/share/qemu/bios*.bin) as well?

> Paolo
Re: Hang on reboot in FreeBSD guest on Linux KVM host
John Nielsen li...@jnielsen.net writes: On Jun 17, 2014, at 10:48 AM, John Nielsen li...@jnielsen.net wrote: On Jun 17, 2014, at 12:05 AM, Gleb Natapov g...@kernel.org wrote: On Tue, Jun 17, 2014 at 06:21:23AM +0200, Paolo Bonzini wrote: Il 16/06/2014 18:47, John Nielsen ha scritto: On Jun 16, 2014, at 10:39 AM, Paolo Bonzini pbonz...@redhat.com wrote: Il 16/06/2014 18:09, John Nielsen ha scritto: The only substantial difference on the hardware side is the CPU. The hosts where the problem occurs use Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, while the hosts that don't show the problem use the prior revision, Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz. Can you do grep . /sys/module/kvm_intel/parameters/* on both hosts please? No differences that I can see. Output below. Not really: Working host: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz # grep . /sys/module/kvm_intel/parameters/* /sys/module/kvm_intel/parameters/enable_apicv:N Problem host: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz # grep . /sys/module/kvm_intel/parameters/* /sys/module/kvm_intel/parameters/enable_apicv:Y So we have a clue. Let me study the code more, I'll try to get back with a suggestion. Wow, can't believe I missed that. Good catch! Does disabling apicv on E5-2650 v2 make reboot problem go away? Yes it does! 
# modprobe kvm_intel
/sys/module/kvm_intel/parameters/enable_apicv:Y
# /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512 -smp 2,sockets=1,cores=1,threads=2 -drive file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2 -device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net none
[problem occurs]

# rmmod kvm_intel
# modprobe kvm_intel enable_apicv=N
/sys/module/kvm_intel/parameters/enable_apicv:N
# /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512 -smp 2,sockets=1,cores=1,threads=2 -drive file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2 -device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net none
[problem does not occur]

Thank you. This both narrows the problem considerably and provides an acceptable workaround. It would still be nice to see it fixed, of course. Keep me CC'ed as I'm not on the KVM list.

I'm resurrecting an old thread since I haven't heard anything in a while. Has anyone looked in to the KVM+apicv bug documented above as well as here: https://bugs.launchpad.net/qemu/+bug/1329956 ? If appropriate, where should I go to file a KVM bug (since this isn't really Qemu's problem)?

Hi John, does this happen with the latest upstream kernel version ?

Bandan

Thanks, JN
[PATCH] KVM: nSVM: Check for NRIPS support before updating control field
If hardware doesn't support DecodeAssist - a feature that provides more information about the intercept in the VMCB - KVM decodes the instruction and then updates the next_rip vmcb control field. However, NRIP support itself depends on CPUID Fn8000_000A_EDX[NRIPS]. Since skip_emulated_instruction() doesn't verify NRIP support before accepting control.next_rip as valid, avoid writing this field if support isn't present.

Signed-off-by: Bandan Das b...@redhat.com
---
 arch/x86/kvm/svm.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 9afa233..4911bf1 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -511,8 +511,10 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
-	if (svm->vmcb->control.next_rip != 0)
+	if (svm->vmcb->control.next_rip != 0) {
+		WARN_ON(!static_cpu_has(X86_FEATURE_NRIPS));
 		svm->next_rip = svm->vmcb->control.next_rip;
+	}
 
 	if (!svm->next_rip) {
 		if (emulate_instruction(vcpu, EMULTYPE_SKIP) !=
@@ -4317,7 +4319,9 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
 		break;
 	}
 
-	vmcb->control.next_rip  = info->next_rip;
+	/* TODO: Advertise NRIPS to guest hypervisor unconditionally */
+	if (static_cpu_has(X86_FEATURE_NRIPS))
+		vmcb->control.next_rip  = info->next_rip;
 	vmcb->control.exit_code = icpt_info.exit_code;
 	vmexit = nested_svm_exit_handled(svm);
-- 
2.1.0
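The decision the patch encodes can be modeled roughly like this (a sketch under our own names, not the kernel's types): control.next_rip is only trustworthy when the CPU advertises NRIPS; otherwise the instruction has to be decoded in software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of skip_emulated_instruction()'s fixed behavior: use the
 * hardware-provided next RIP only if NRIPS is supported and the field
 * is non-zero; a return of 0 means "decode the instruction instead". */
static uint64_t next_rip_after_skip(uint64_t vmcb_next_rip, bool nrips)
{
    if (nrips && vmcb_next_rip != 0)
        return vmcb_next_rip;   /* hardware-provided, trustworthy */
    return 0;                   /* caller falls back to software decode */
}
```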
Re: [PATCH v2] [x86/kvm] emulate.c: Fix comppilation warning
Rajat Jain raja...@google.com writes:

> Fix the following warning:
>
> arch/x86/kvm/emulate.c: In function '__do_insn_fetch_bytes':
> arch/x86/kvm/emulate.c:814:47: warning: 'linear' may be used uninitialized in this function [-Wmaybe-uninitialized]
> arch/x86/kvm/emulate.c:793:16: note: 'linear' was declared here

Is this recent gcc behavior ? If yes, please mention that in the commit message. Please follow the convention for the subject line and fix the typo -

KVM: emulate: Fix compilation warning

Thanks,
Bandan

> Signed-off-by: Rajat Jain raja...@google.com
> Signed-off-by: Rajat Jain rajatxj...@gmail.com
> ---
> v2: Fix the wrong email address.
>
>  arch/x86/kvm/emulate.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 630bcb0..f06e0ca 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -790,7 +790,7 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
>  {
>  	int rc;
>  	unsigned size, max_size;
> -	unsigned long linear;
> +	unsigned long linear = 0;
>  	int cur_size = ctxt->fetch.end - ctxt->fetch.data;
>  	struct segmented_address addr = { .seg = VCPU_SREG_CS,
>  					  .ea = ctxt->eip + cur_size };
Re: [PATCH] kvm: x86: default legacy PCI device assignment support to n
Paolo Bonzini pbonz...@redhat.com writes:

> VFIO has proved itself a much better option than KVM's built-in device assignment. It is mature, provides better isolation because it enforces ACS, and even the userspace code is being tested on a wider variety of hardware these days than the legacy support.
>
> Disable legacy device assignment by default.

Shouldn't we mark it as "Deprecated" then ?

Bandan

> Signed-off-by: Paolo Bonzini pbonz...@redhat.com
> ---
>  arch/x86/kvm/Kconfig | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 413a7bf9efbb..a0f06a5947c5 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -88,13 +88,14 @@ config KVM_MMU_AUDIT
>  config KVM_DEVICE_ASSIGNMENT
>  	bool "KVM legacy PCI device assignment support"
>  	depends on KVM && PCI && IOMMU_API
> -	default y
> +	default n
>  	---help---
>  	  Provide support for legacy PCI device assignment through KVM.  The
>  	  kernel now also supports a full featured userspace device driver
> -	  framework through VFIO, which supersedes much of this support.
> +	  framework through VFIO, which supersedes this support and provides
> +	  better security.
>
> -	  If unsure, say Y.
> +	  If unsure, say N.
>
>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>  # the virtualization menu.
Re: [PATCH] kvm: x86: default legacy PCI device assignment support to n
Paolo Bonzini pbonz...@redhat.com writes: On 04/06/2015 16:31, Bandan Das wrote: VFIO has proved itself a much better option than KVM's built-in device assignment. It is mature, provides better isolation because it enforces ACS, and even the userspace code is being tested on a wider variety of hardware these days than the legacy support. Disable legacy device assignment by default. Shouldn't we mark it as Deprecated then ? Yes, good idea! You mean just (DEPRECATED) after the string, right? Yeah, that is what I meant. The Help message does mention VFIO as the newer option. Sometimes, seeing the word DEPRECATED makes one rethink if they really need to enable it. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 98841] New: android emulator broken since d7a2a24 kernel commit
bugzilla-dae...@bugzilla.kernel.org writes:

> https://bugzilla.kernel.org/show_bug.cgi?id=98841
>
> Bug ID: 98841
> Summary: android emulator broken since d7a2a24 kernel commit
> Product: Virtualization
> Version: unspecified
> Kernel Version: 3.17
> Hardware: All
> OS: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: kvm
> Assignee: virtualization_...@kernel-bugs.osdl.org
> Reporter: adrian.sa...@asandu.eu
> Regression: No
>
> Hello,
>
> For a while now I've noticed the android emulator not wanting to work anymore. With a lil' bit of git bisect I've found this kernel commit https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d7a2a246a1b5a0b0c803e800019600051e1e6f1a that seems to break it. I've mailed the author, and submitted a report to the google guys (not sure if I did it right): https://code.google.com/p/android/issues/detail?id=174557
> The problem was first reported by someone @ fedora: https://bugzilla.redhat.com/show_bug.cgi?id=1187982
> A lot of details (run logs) are there. Let me know if I can help test stuff.

Ok, a quick look and it seems the patch is doing the right thing (as the spec says). For PHYSBASE it's masking out bits 8-11 and for PHYSMASK bits 0-10. Are you sure the Android emulator isn't attempting to write to these reserved regions ? What happens if you enable ignore_msrs ?

echo Y > /sys/module/kvm/parameters/ignore_msrs

Bandan

> Thanks in advance.
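For reference, the reserved-bit layout described above can be sketched like this (helper names are ours; the width parameter stands in for the CPU's physical address bits, since bits at or above that width are also reserved):

```c
#include <assert.h>
#include <stdint.h>

/* MTRR PhysBase reserves bits 8-11 (bits 0-7 hold the memory type);
 * PhysMask reserves bits 0-10 (bit 11 is the Valid flag). */
static uint64_t mtrr_physbase_rsvd(int phys_bits)
{
    return (0xfULL << 8) | ~((1ULL << phys_bits) - 1);
}

static uint64_t mtrr_physmask_rsvd(int phys_bits)
{
    return 0x7ffULL | ~((1ULL << phys_bits) - 1);
}

/* An MSR write with any reserved bit set should fault (#GP). */
static int mtrr_write_ok(uint64_t val, uint64_t rsvd)
{
    return (val & rsvd) == 0;
}
```

So a write of, say, a 4 GiB-aligned base with type WB passes, while a value touching bits 8-11 of PhysBase or bits 0-10 of PhysMask is rejected, which is the behavior the bisected commit enforces.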
Re: [PATCH 1/2] KVM: MMU: fix SMAP virtualization
Boris Ostrovsky boris.ostrov...@oracle.com writes:

> On 05/11/2015 10:55 AM, Xiao Guangrong wrote:
>> KVM may turn a user page to a kernel page when the kernel writes a read-only user page and CR0.WP = 1. This shadow page entry will be reused after SMAP is enabled, so the kernel is allowed to access this user page.
>>
>> Fix it by setting SMAP && !CR0.WP into the shadow page's role and resetting the mmu once CR4.SMAP is updated.
>>
>> Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com
>> ---
>> @@ -4208,12 +4211,18 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes)
>>  {
>>  	gfn_t gfn = gpa >> PAGE_SHIFT;
>> -	union kvm_mmu_page_role mask = { .word = 0 };
>>  	struct kvm_mmu_page *sp;
>>  	LIST_HEAD(invalid_list);
>>  	u64 entry, gentry, *spte;
>>  	int npte;
>>  	bool remote_flush, local_flush, zap_page;
>> +	union kvm_mmu_page_role mask = (union kvm_mmu_page_role) {
>> +		.cr0_wp = 1,
>> +		.cr4_pae = 1,
>> +		.nxe = 1,
>> +		.smep_andnot_wp = 1,
>> +		.smap_andnot_wp = 1,
>> +	};
>
> This breaks older compilers that can't initialize anon structures.

How old ? Even gcc 3.1 says you can use unnamed struct/union fields, and 3.2 is the minimum version required to compile the kernel, as mentioned in the README. We could simply just name the structure, but I doubt this is the only place in the kernel code where it's being used this way :)

Bandan

> -boris
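The construct under discussion - a compound literal with designated initializers naming fields of an unnamed struct inside a union - looks like this in miniature (a toy analogue of ours, not the kernel's actual kvm_mmu_page_role):

```c
#include <assert.h>
#include <stdint.h>

/* A union of an unnamed bitfield struct overlaid with a word, built
 * with a compound literal and designated initializers.  This is the
 * anonymous-member initialization some very old compilers rejected. */
union role {
    struct {
        unsigned cr0_wp:1;
        unsigned cr4_pae:1;
        unsigned nxe:1;
    };
    uint32_t word;
};

static uint32_t make_mask(void)
{
    union role mask = (union role) {
        .cr0_wp  = 1,   /* initializes a member of the unnamed struct */
        .cr4_pae = 1,
        .nxe     = 1,
    };
    return mask.word;
}
```

Reading back through the bitfields round-trips regardless of how the compiler lays the bits out in the word.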
Re: APIC_ID in apic_reg_write()
Jan Kiszka jan.kis...@siemens.com writes:

> ...
>> Setting r/w permissions on a per-model basis is a little overkill, don't you think ?
>
> If we want accurate behaviour, we should do this. If not, we probably better leave the code alone to avoid surprises for preexisting host/guest setups. Modern OSes do not care anyway, but special ones may get unhappy.

Agreed, better to leave it alone.

> Jan
Re: [PATCH] KVM: nVMX: Don't return error on nested bitmap memory allocation failure
Jan Kiszka jan.kis...@siemens.com writes:

> On 2015-04-28 at 21:55, Bandan Das wrote:
>> If get_free_page() fails for the nested bitmap area, it's evident that we are gonna get screwed anyway, but returning failure because we failed to allocate memory for a nested structure seems like an unnecessarily big hammer. Also, save the call for later; after we are done with other non-nested allocations.
>
> Frankly, I prefer failures over automatic degradations. And, as you noted, the whole system will probably explode anyway if allocation of a single page already fails. So what does this buy us?

Yeah... I hear you. Ok, let me put it this way - assume that we can defer this allocation up until the point that the nested subsystem is actually used, i.e. L1 tries running a guest and we try to allocate this area. If get_free_page() failed in that case, would we still want to kill L1 too ? I guess not. Also, assume we had a printk in there - "Failed allocating memory for nested bitmap" - the novice user is going to get confused about why he's getting an error about nested virtualization (for the not so distant future when nested is enabled by default :))

> What could make sense is making the allocation of the vmread/write bitmap depend on enable_shadow_vmcs, and that again depend on nested.

Thanks for the suggestion. I will take a look at this one.

> Jan
[PATCH RESEND kvm-unit-tests] x86/run: Rearrange the valid binary and testdev support checks
This extends the sanity checks done on known common Qemu binary paths to the case when the user supplies QEMU= on the command line.

Fixes: b895b967db94937d5b593c51b95eb32d2889a764
Signed-off-by: Bandan Das b...@redhat.com
---
 x86/run | 43 +++++++++++++++++++------------------------
 1 file changed, 19 insertions(+), 24 deletions(-)

diff --git a/x86/run b/x86/run
index 219a93b..5281fca 100755
--- a/x86/run
+++ b/x86/run
@@ -2,33 +2,28 @@
 NOTFOUND=1
 TESTDEVNOTSUPP=2
 
-qemukvm=${QEMU:-qemu-kvm}
-qemusystem=${QEMU:-qemu-system-x86_64}
+qemubinarysearch="${QEMU:-qemu-kvm qemu-system-x86_64}"
 
-if ! [ -z "${QEMU}" ]
-then
-	qemu=${QEMU}
-else
-	for qemucmds in ${qemukvm} ${qemusystem}
-	do
-		unset QEMUFOUND
-		unset qemu
-		if ! [ -z "${QEMUFOUND=$(${qemucmds} --help 2>/dev/null | grep "QEMU")}" ] &&
-			${qemucmds} -device '?' 2>&1 | grep -F -e "testdev" -e "pc-testdev" > /dev/null;
-		then
-			qemu=${qemucmds}
-			break
-		fi
-	done
-	if [ -z "${QEMUFOUND}" ]
-	then
-		echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
-		exit ${NOTFOUND}
-	elif [ -z "${qemu}" ]
+for qemucmd in ${qemubinarysearch}
+do
+	unset QEMUFOUND
+	unset qemu
+	if ! [ -z "${QEMUFOUND=$(${qemucmd} --help 2>/dev/null | grep "QEMU")}" ] &&
+		${qemucmd} -device '?' 2>&1 | grep -F -e "testdev" -e "pc-testdev" > /dev/null;
 	then
-		echo "No Qemu test device support found"
-		exit ${TESTDEVNOTSUPP}
+		qemu=${qemucmd}
+		break
 	fi
+done
+
+if [ -z "${QEMUFOUND}" ]
+then
+	echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
+	exit ${NOTFOUND}
+elif [ -z "${qemu}" ]
+then
+	echo "No Qemu test device support found"
+	exit ${TESTDEVNOTSUPP}
 fi
 
 if
-- 
2.1.0
[PATCH] KVM: nVMX: Don't return error on nested bitmap memory allocation failure
If get_free_page() fails for the nested bitmap area, it's evident that we are gonna get screwed anyway, but returning failure because we failed to allocate memory for a nested structure seems like an unnecessarily big hammer. Also, save the call for later; after we are done with other non-nested allocations.

Signed-off-by: Bandan Das b...@redhat.com
---
 arch/x86/kvm/vmx.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f7b6168..200bc5c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6039,20 +6039,22 @@ static __init int hardware_setup(void)
 	if (!vmx_msr_bitmap_longmode_x2apic)
 		goto out4;
 
-	if (nested) {
-		vmx_msr_bitmap_nested =
-			(unsigned long *)__get_free_page(GFP_KERNEL);
-		if (!vmx_msr_bitmap_nested)
-			goto out5;
-	}
-
 	vmx_vmread_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
 	if (!vmx_vmread_bitmap)
-		goto out6;
+		goto out5;
 
 	vmx_vmwrite_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
 	if (!vmx_vmwrite_bitmap)
-		goto out7;
+		goto out6;
+
+	if (nested) {
+		vmx_msr_bitmap_nested =
+			(unsigned long *)__get_free_page(GFP_KERNEL);
+		if (!vmx_msr_bitmap_nested) {
+			printk(KERN_WARNING
+			       "vmx: Failed getting memory for nested bitmap\n");
+			nested = 0;
+		}
+	}
 
 	memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE);
 	memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE);
@@ -6073,7 +6075,7 @@ static __init int hardware_setup(void)
 
 	if (setup_vmcs_config(&vmcs_config) < 0) {
 		r = -EIO;
-		goto out8;
+		goto out7;
 	}
 
 	if (boot_cpu_has(X86_FEATURE_NX))
@@ -6190,13 +6192,12 @@ static __init int hardware_setup(void)
 
 	return alloc_kvm_area();
 
-out8:
-	free_page((unsigned long)vmx_vmwrite_bitmap);
 out7:
-	free_page((unsigned long)vmx_vmread_bitmap);
-out6:
 	if (nested)
 		free_page((unsigned long)vmx_msr_bitmap_nested);
+	free_page((unsigned long)vmx_vmwrite_bitmap);
+out6:
+	free_page((unsigned long)vmx_vmread_bitmap);
 out5:
 	free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic);
 out4:
-- 
2.1.0
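The degrade-instead-of-fail pattern this patch argues for, reduced to a userspace sketch (all names illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

static int   nested_enabled = 1;   /* the optional feature */
static void *nested_bitmap;

/* If memory for an optional feature can't be obtained, disable the
 * feature and let setup succeed, rather than failing setup outright.
 * `page` stands in for the result of a page allocation (NULL = failure). */
static int setup_nested_bitmap(void *page)
{
    if (nested_enabled) {
        nested_bitmap = page;
        if (!nested_bitmap) {
            fprintf(stderr, "nested bitmap allocation failed, disabling nested\n");
            nested_enabled = 0;    /* degrade */
        }
    }
    return 0;                      /* overall setup still succeeds */
}
```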
Re: [PATCH kvm-unit-tests] x86/run: Rearrange the valid binary and testdev support checks
Ping, Paolo, did this slip through the cracks ?

Bandan Das b...@redhat.com writes:

> This extends the sanity checks done on known common Qemu binary paths to the case when the user supplies QEMU= on the command line.
>
> Fixes: b895b967db94937d5b593c51b95eb32d2889a764
> Signed-off-by: Bandan Das b...@redhat.com
> ---
>  x86/run | 43 +++++++++++++++++++------------------------
>  1 file changed, 19 insertions(+), 24 deletions(-)
>
> diff --git a/x86/run b/x86/run
> index 219a93b..5281fca 100755
> --- a/x86/run
> +++ b/x86/run
> @@ -2,33 +2,28 @@
>  NOTFOUND=1
>  TESTDEVNOTSUPP=2
> 
> -qemukvm=${QEMU:-qemu-kvm}
> -qemusystem=${QEMU:-qemu-system-x86_64}
> +qemubinarysearch="${QEMU:-qemu-kvm qemu-system-x86_64}"
> 
> -if ! [ -z "${QEMU}" ]
> -then
> -	qemu=${QEMU}
> -else
> -	for qemucmds in ${qemukvm} ${qemusystem}
> -	do
> -		unset QEMUFOUND
> -		unset qemu
> -		if ! [ -z "${QEMUFOUND=$(${qemucmds} --help 2>/dev/null | grep "QEMU")}" ] &&
> -			${qemucmds} -device '?' 2>&1 | grep -F -e "testdev" -e "pc-testdev" > /dev/null;
> -		then
> -			qemu=${qemucmds}
> -			break
> -		fi
> -	done
> -	if [ -z "${QEMUFOUND}" ]
> -	then
> -		echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
> -		exit ${NOTFOUND}
> -	elif [ -z "${qemu}" ]
> +for qemucmd in ${qemubinarysearch}
> +do
> +	unset QEMUFOUND
> +	unset qemu
> +	if ! [ -z "${QEMUFOUND=$(${qemucmd} --help 2>/dev/null | grep "QEMU")}" ] &&
> +		${qemucmd} -device '?' 2>&1 | grep -F -e "testdev" -e "pc-testdev" > /dev/null;
> 	then
> -		echo "No Qemu test device support found"
> -		exit ${TESTDEVNOTSUPP}
> +		qemu=${qemucmd}
> +		break
> 	fi
> +done
> +
> +if [ -z "${QEMUFOUND}" ]
> +then
> +	echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
> +	exit ${NOTFOUND}
> +elif [ -z "${qemu}" ]
> +then
> +	echo "No Qemu test device support found"
> +	exit ${TESTDEVNOTSUPP}
> fi
> 
> if
Re: x2apic issues with Solaris and Xen guests
Michael Tokarev m...@tls.msk.ru writes:

> 20.04.2015 20:29, Jan Kiszka wrote:
>> On 2015-04-20 19:07, Stefan Hajnoczi wrote:
>>> I wonder whether the following two x2apic issues are related:
>>>
>>> Solaris 10 U11 network doesn't work
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1040500
>>>
>>> kvm - fails to setup timer interrupt via io-apic (Thanks to Michael Tokarev for posting this link)
>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=528077#68
>> [...]
>>> Has anyone looked into this?
>>
>> Not yet. Is there a handy reproduction guest image? Or maybe someone would like to start with tracing what the guest and the host do.
>
> The second link gives a trivial reproducer, you need just the xen hypervisor binary and some kernel. This should be easy too, because it happens right on boot. But I guess it requires some inner knowledge of xen early boot machinery.

Have you tried Radim's patch ?

commit c806a6ad35bfa6c92249cd0ca4772d5ac3f8cb68
Author: Radim Krčmář rkrc...@redhat.com
Date:   Wed Mar 18 19:38:22 2015 +0100

    KVM: x86: call irq notifiers with directed EOI

> Thanks, /mjt
Re: [PATCH] vt-x: Preserve host CR4.MCE value while in guest mode.
Ben Serebrin sereb...@google.com writes:

I suggest changing the subject to "KVM: VMX: Preserve host CR4.MCE value while in guest mode". That keeps it consistent with the $SUBJECT naming convention KVM follows. If Paolo is ok with changing it in his tree directly, that's fine too.

Bandan

> The host's decision to enable machine check exceptions should remain in force during non-root mode. KVM was writing 0 to cr4 on VCPU reset and passed a slightly-modified 0 to the vmcs.guest_cr4 value.
>
> Tested: Built. On an earlier version, tested by injecting a machine check while a guest is spinning. Before the change, if guest CR4.MCE==0, then the machine check is escalated to Catastrophic Error (CATERR) and the machine dies. If guest CR4.MCE==1, then the machine check causes VMEXIT and is handled normally by host Linux. After the change, injecting a machine check causes normal Linux machine check handling.
>
> Signed-off-by: Ben Serebrin sereb...@google.com
> ---
>  arch/x86/kvm/vmx.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index f5e8dce..f7b6168 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3622,8 +3622,16 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>  static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  {
> -	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
> -		KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
> +	/*
> +	 * Pass through host's Machine Check Enable value to hw_cr4, which
> +	 * is in force while we are in guest mode.  Do not let guests control
> +	 * this bit, even if host CR4.MCE == 0.
> +	 */
> +	unsigned long hw_cr4 =
> +		(cr4_read_shadow() & X86_CR4_MCE) |
> +		(cr4 & ~X86_CR4_MCE) |
> +		(to_vmx(vcpu)->rmode.vm86_active ?
> +		 KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
>
>  	if (cr4 & X86_CR4_VMXE) {
>  		/*
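The net effect of the hunk above, on just the MCE bit, is a merge like this (a simplified sketch of ours; the ALWAYS_ON terms are left out):

```c
#include <assert.h>

#define X86_CR4_MCE (1UL << 6)

/* The MCE bit always comes from the host's CR4 shadow; every other
 * bit comes from the guest-requested value.  The guest can neither
 * clear MCE when the host has it set nor set it when the host doesn't. */
static unsigned long merge_hw_cr4(unsigned long host_cr4, unsigned long guest_cr4)
{
    return (host_cr4 & X86_CR4_MCE) | (guest_cr4 & ~X86_CR4_MCE);
}
```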
[PATCH] vhost: use kthread_run macro
Code cleanup: kthread_run() is a convenient wrapper around kthread_create() that also wakes up the process for us. Use it and remove the no-longer-needed temporary task_struct variable.

Signed-off-by: Bandan Das b...@redhat.com
---
 drivers/vhost/vhost.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..9ac66f7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -366,7 +366,6 @@ EXPORT_SYMBOL_GPL(vhost_dev_has_owner);
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
 	int err;
 
 	/* Is there an owner already? */
@@ -377,15 +376,12 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	/* No owner, become one */
 	dev->mm = get_task_mm(current);
-	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker)) {
-		err = PTR_ERR(worker);
+	dev->worker = kthread_run(vhost_worker, dev, "vhost-%d", current->pid);
+	if (IS_ERR(dev->worker)) {
+		err = PTR_ERR(dev->worker);
 		goto err_worker;
 	}
 
-	dev->worker = worker;
-	wake_up_process(worker);	/* avoid contributing to loadavg */
-
 	err = vhost_attach_cgroups(dev);
 	if (err)
 		goto err_cgroup;
@@ -396,7 +392,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	return 0;
 err_cgroup:
-	kthread_stop(worker);
+	kthread_stop(dev->worker);
 	dev->worker = NULL;
 err_worker:
 	if (dev->mm)
-- 
2.1.0
Re: x86: Question regarding the reset value of LINT0
Nadav Amit nadav.a...@gmail.com writes: Avi Kivity avi.kiv...@gmail.com wrote: On 04/09/2015 09:21 PM, Nadav Amit wrote: Bandan Das b...@redhat.com wrote: Nadav Amit nadav.a...@gmail.com writes: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 19:40, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:59, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:40, Nadav Amit wrote: Hi, I would appreciate if someone explains the reason for enabling LINT0 during APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local Vector Table” that says all LVT registers are reset to 0x1. In kvm_lapic_reset, I see: apic_set_reg(apic, APIC_LVT0, SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); Which is actually pretty similar to QEMU’s apic_reset_common: if (bsp) { /* * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization * time typically by BIOS, so PIC interrupt can be delivered to the * processor when local APIC is enabled. */ s-lvt[APIC_LVT_LINT0] = 0x700; } Yet, in both cases, I miss the point - if it is typically done by the BIOS, why does QEMU or KVM enable it? BTW: KVM seems to run fine without it, and I think setting it causes me problems in certain cases. I suspect it has some historic BIOS backgrounds. Already tried to find more information in the git logs of both code bases? Or something that indicates of SeaBIOS or BochsBIOS once didn't do this initialization? Thanks. I found no indication of such thing. QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says: Don't route PIC interrupts through the local APIC if the local APIC config says so. By Ari Kivity. Maybe Avi Kivity knows this guy. ths? That should have been Thiemo Seufer (IIRC), but he just committed the code back then (and is no longer with us, sadly). Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny joke about Avi knowing “Ari”). Ah. No problem. My brain apparently fixed that typo up unnoticed. 
But if that commit went in without any BIOS changes around it, QEMU simply had to do the job of the latter to keep things working. So should I leave it as is? Can I at least disable in KVM during INIT (and leave it as is for RESET)? No, I don't think there is a need to leave this inaccurate for QEMU if our included BIOS gets it right. I don't know what the backward bug-compatibility of KVM is, though. Maybe you can identify since when our BIOS is fine so that we can discuss time frames. I think that it was addressed in commit 19c1a7692bf65fc40e56f93ad00cc3eefaad22a4 (Initialize the LINT LVTs on the local APIC of the BSP.”) So it should be included in seabios 0.5.0, which means qemu 0.12 - so we are talking about the end of 2009 or start of 2010. The probability that someone will use a newer version of kernel with something as old as 0.12 is probably minimal. I think it's ok to change it with a comment indicating the reason. To be on the safe side, however, a user changeable switch is something worth considering. I don’t see any existing mechanism for KVM to be aware of its user type and version. I do see another case of KVM hacks that are intended for fixing very old QEMU bugs (see 3a624e29c75 changes in vmx_set_segment, which are from pretty much the same time-frame of the issue I try to fix). Since this is something which would follow around, please advise what would be the format. A new ioctl that would supply the userspace “type” (according to predefined constants) and version? That would be madness. KVM shouldn't even know that qemu exists, let alone track its versions. Simply add a new toggle KVM_USE_STANDARD_LAPIC_LVT_INIT and document that userspace MUST use it. Old userspace won't, and will get the old buggy behavior. I fully agree it would be madness. Yet it appears to be a recurring problem. Here are similar problems found from a short search: 1. vmx_set_segment setting segment accessed (3a624e29c75) 2. svm_set_cr0 clearing CD and NW (709ddebf81c) 3. 
Limited number of MTRRs due to a SeaBIOS bug (0d234daf7e0a) Excluding (1), all of the other issues are related to the VM BIOS. Perhaps KVM should somehow realize which VM BIOS runs? (yes, it sounds just as bad.) How about renaming the toggle Avi mentioned above to something more generic (KVM_DISABLE_LEGACY_QUIRKS ?) and grouping all the issues together? Modern userspace will always enable it and get the new correct behavior. When more cases are discovered, KVM can just add them to the list.

Nadav
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86-run: Print a meaningful message if the qemu binary isn't found
Bandan Das b...@redhat.com writes:

... I think we still want to print this if the user supplied qemu has no test device support. I think this is better -

--- a/x86/run
+++ b/x86/run
@@ -8,6 +8,7 @@ qemusystem=${QEMU:-qemu-system-x86_64}
 if ! [ -z "${QEMU}" ]
 then
     qemu=${QEMU}
+    QEMUFOUND=1
 else

Damn, please ignore the previous message. This does nothing and I need more sleep.

Bandan

     for qemucmds in ${qemukvm} ${qemusystem}
     do
+            exit ${TESTDEVNOTSUPP}
+        fi
 fi

 if

which apart from reindentation is just moving a fi down.

Paolo
Re: [PATCH] x86-run: Print a meaningful message if the qemu binary isn't found
Paolo Bonzini pbonz...@redhat.com writes:

... This breaks QEMU=... ./x86-run ./x86/msr.flat

Fixed as follows:

diff --git a/x86/run b/x86/run
index d5adf8d..219a93b 100755
--- a/x86/run
+++ b/x86/run
@@ -20,16 +20,15 @@ else
         break
     fi
 done
-fi
-
-if [ -z "${QEMUFOUND}" ]
-then
-    echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
-    exit ${NOTFOUND}
-elif [ -z "${qemu}" ]
-then
-    echo "No Qemu test device support found"
-    exit ${TESTDEVNOTSUPP}
+    if [ -z "${QEMUFOUND}" ]
+    then
+        echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
+        exit ${NOTFOUND}
+    elif [ -z "${qemu}" ]
+    then
+        echo "No Qemu test device support found"

I think we still want to print this if the user supplied qemu has no test device support. I think this is better -

--- a/x86/run
+++ b/x86/run
@@ -8,6 +8,7 @@ qemusystem=${QEMU:-qemu-system-x86_64}
 if ! [ -z "${QEMU}" ]
 then
     qemu=${QEMU}
+    QEMUFOUND=1
 else

     for qemucmds in ${qemukvm} ${qemusystem}
     do
+            exit ${TESTDEVNOTSUPP}
+        fi
 fi

 if

which apart from reindentation is just moving a fi down.

Paolo
Re: x86: Question regarding the reset value of LINT0
Nadav Amit nadav.a...@gmail.com writes: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 19:40, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:59, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:40, Nadav Amit wrote: Hi, I would appreciate if someone explains the reason for enabling LINT0 during APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local Vector Table” that says all LVT registers are reset to 0x1. In kvm_lapic_reset, I see: apic_set_reg(apic, APIC_LVT0, SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); Which is actually pretty similar to QEMU’s apic_reset_common: if (bsp) { /* * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization * time typically by BIOS, so PIC interrupt can be delivered to the * processor when local APIC is enabled. */ s-lvt[APIC_LVT_LINT0] = 0x700; } Yet, in both cases, I miss the point - if it is typically done by the BIOS, why does QEMU or KVM enable it? BTW: KVM seems to run fine without it, and I think setting it causes me problems in certain cases. I suspect it has some historic BIOS backgrounds. Already tried to find more information in the git logs of both code bases? Or something that indicates of SeaBIOS or BochsBIOS once didn't do this initialization? Thanks. I found no indication of such thing. QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says: Don't route PIC interrupts through the local APIC if the local APIC config says so. By Ari Kivity. Maybe Avi Kivity knows this guy. ths? That should have been Thiemo Seufer (IIRC), but he just committed the code back then (and is no longer with us, sadly). Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny joke about Avi knowing “Ari”). Ah. No problem. My brain apparently fixed that typo up unnoticed. But if that commit went in without any BIOS changes around it, QEMU simply had to do the job of the latter to keep things working. 
So should I leave it as is? Can I at least disable in KVM during INIT (and leave it as is for RESET)? No, I don't think there is a need to leave this inaccurate for QEMU if our included BIOS gets it right. I don't know what the backward bug-compatibility of KVM is, though. Maybe you can identify since when our BIOS is fine so that we can discuss time frames. I think that it was addressed in commit 19c1a7692bf65fc40e56f93ad00cc3eefaad22a4 (“Initialize the LINT LVTs on the local APIC of the BSP.”) So it should be included in SeaBIOS 0.5.0, which means qemu 0.12 - so we are talking about the end of 2009 or start of 2010. The probability that someone will use a newer version of kernel with something as old as 0.12 is probably minimal. I think it's ok to change it with a comment indicating the reason. To be on the safe side, however, a user changeable switch is something worth considering. What is the verdict?

Nadav
Re: [PATCH v2 1/4] KVM: x86: INIT and reset sequences are different
Paolo Bonzini pbonz...@redhat.com writes: On 07/04/2015 18:17, Bandan Das wrote: I think a bool argument is good enough. QEMU has different functions, and init ends up doing save/reset/restore which is pretty ugly. Right, I meant that init could just be a wrapper so that it at least shows up in a backtrace - could be helpful for debugging. I suspect that the compiler would inline any sensible implementation and it wouldn't show up in the backtraces. :( noinline ? :) Anyway, it's probably not worth the trouble, that could be easily figured out. Thanks, Bandan Paolo
Re: [PATCH v2 1/4] KVM: x86: INIT and reset sequences are different
Paolo Bonzini pbonz...@redhat.com writes: On 02/04/2015 04:17, Bandan Das wrote: x86 architecture defines differences between the reset and INIT sequences. INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU, MSRs (in general), MTRRs, machine-check, APIC ID, APIC arbitration ID and BSP. EFER is supposed NOT to be reset according to the SDM, but leaving the LMA and LME untouched causes failed VM-entry. Therefore we reset EFER (although it is unclear whether the rest of EFER bits should be reset). Thanks! This was actually in my todo list. #INIT and #RESET are actually separate pins on the processor. So, shouldn't we differentiate between the two too by having (*vcpu_init) and (*vcpu_reset) separate ? I think a bool argument is good enough. QEMU has different functions, and init ends up doing save/reset/restore which is pretty ugly. Right, I meant that init could just be a wrapper so that it at least shows up in a backtrace - could be helpful for debugging. Paolo
Re: [PATCH v2 1/4] KVM: x86: INIT and reset sequences are different
Nadav Amit na...@cs.technion.ac.il writes:

x86 architecture defines differences between the reset and INIT sequences. INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU, MSRs (in general), MTRRs, machine-check, APIC ID, APIC arbitration ID and BSP. EFER is supposed NOT to be reset according to the SDM, but leaving the LMA and LME untouched causes failed VM-entry. Therefore we reset EFER (although it is unclear whether the rest of EFER bits should be reset).

Thanks! This was actually in my todo list. #INIT and #RESET are actually separate pins on the processor. So, shouldn't we differentiate between the two too by having (*vcpu_init) and (*vcpu_reset) separate ?

Bandan

References (from Intel SDM):

If the MP protocol has completed and a BSP is chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated. [8.4.2: MP Initialization Protocol Requirements and Restrictions]

[Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]

If the processor is reset by asserting the INIT# pin, the x87 FPU state is not changed. [9.2: X87 FPU INITIALIZATION]

The state of the local APIC following an INIT reset is the same as it is after a power-up or hardware reset, except that the APIC ID and arbitration ID registers are not affected. [10.4.7.3: Local APIC State After an INIT Reset (“Wait-for-SIPI” State)]

Signed-off-by: Nadav Amit na...@cs.technion.ac.il
---
 arch/x86/include/asm/kvm_host.h |  6 +++---
 arch/x86/kvm/lapic.c            | 11 ++-
 arch/x86/kvm/lapic.h            |  2 +-
 arch/x86/kvm/svm.c              |  2 +-
 arch/x86/kvm/vmx.c              | 30 +-
 arch/x86/kvm/x86.c              | 17 ++---
 6 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bf5a160..59f4374 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -701,7 +701,7 @@ struct kvm_x86_ops {
 	/* Create, but do not attach this VCPU */
 	struct kvm_vcpu *(*vcpu_create)(struct kvm *kvm, unsigned id);
 	void (*vcpu_free)(struct kvm_vcpu *vcpu);
-	void (*vcpu_reset)(struct kvm_vcpu *vcpu);
+	void (*vcpu_reset)(struct kvm_vcpu *vcpu, bool init_event);

 	void (*prepare_guest_switch)(struct kvm_vcpu *vcpu);
 	void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
@@ -989,7 +989,7 @@
 void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);

 void kvm_inject_nmi(struct kvm_vcpu *vcpu);

-int fx_init(struct kvm_vcpu *vcpu);
+int fx_init(struct kvm_vcpu *vcpu, bool init_event);

 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		       const u8 *new, int bytes);
@@ -1134,7 +1134,7 @@
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
-void kvm_vcpu_reset(struct kvm_vcpu *vcpu);
+void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu);
 void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
 					   unsigned long address);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bd4e34d..17da6fc 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1534,7 +1534,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 }

-void kvm_lapic_reset(struct kvm_vcpu *vcpu)
+void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
 {
 	struct kvm_lapic *apic;
 	int i;
@@ -1548,7 +1548,8 @@
 	/* Stop the timer in case it's a reset to an active apic */
 	hrtimer_cancel(&apic->lapic_timer.timer);

-	kvm_apic_set_id(apic, vcpu->vcpu_id);
+	if (!init_event)
+		kvm_apic_set_id(apic, vcpu->vcpu_id);
 	kvm_apic_set_version(apic->vcpu);

 	for (i = 0; i < APIC_LVT_NUM; i++)
@@ -1689,7 +1690,7 @@
 			 APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE);
 	static_key_slow_inc(&apic_sw_disabled.key); /* sw disabled at reset */
-	kvm_lapic_reset(vcpu);
+	kvm_lapic_reset(vcpu, false);
 	kvm_iodevice_init(&apic->dev, &apic_mmio_ops);

 	return 0;
@@ -2023,8 +2024,8 @@
 	pe = xchg(&apic->pending_events, 0);

 	if (test_bit(KVM_APIC_INIT, &pe)) {
-		kvm_lapic_reset(vcpu);
-		kvm_vcpu_reset(vcpu);
+		kvm_lapic_reset(vcpu, true);
+		kvm_vcpu_reset(vcpu, true);
 		if (kvm_vcpu_is_bsp(apic->vcpu))
 			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 		else
diff --git
[PATCH] x86-run: Print a meaningful message if the qemu binary isn't found
Before:

./x86-run ./x86/msr.flat
QEMU binary has no support for test device. Exiting.

After:

./x86-run ./x86/msr.flat
A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable

Signed-off-by: Bandan Das b...@redhat.com
---
 x86/run | 37 ++---
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/x86/run b/x86/run
index af37eb4..d5adf8d 100755
--- a/x86/run
+++ b/x86/run
@@ -1,20 +1,35 @@
 #!/bin/bash
+NOTFOUND=1
+TESTDEVNOTSUPP=2
 qemukvm=${QEMU:-qemu-kvm}
 qemusystem=${QEMU:-qemu-system-x86_64}
-if
-    ${qemukvm} -device '?' 2>&1 | grep -F -e \"testdev\" -e \"pc-testdev\" > /dev/null;
+
+if ! [ -z "${QEMU}" ]
 then
-    qemu=${qemukvm}
+    qemu=${QEMU}
 else
-    if
-        ${qemusystem} -device '?' 2>&1 | grep -F -e \"testdev\" -e \"pc-testdev\" > /dev/null;
-    then
-        qemu=${qemusystem}
-    else
-        echo "QEMU binary ${QEMU} has no support for test device. Exiting."
-        exit 2
-    fi
+    for qemucmds in ${qemukvm} ${qemusystem}
+    do
+        unset QEMUFOUND
+        unset qemu
+        if ! [ -z "${QEMUFOUND=$(${qemucmds} --help 2>/dev/null | grep QEMU)}" ] &&
+            ${qemucmds} -device '?' 2>&1 | grep -F -e \"testdev\" -e \"pc-testdev\" > /dev/null;
+        then
+            qemu=${qemucmds}
+            break
+        fi
+    done
+fi
+
+if [ -z "${QEMUFOUND}" ]
+then
+    echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
+    exit ${NOTFOUND}
+elif [ -z "${qemu}" ]
+then
+    echo "No Qemu test device support found"
+    exit ${TESTDEVNOTSUPP}
 fi

 if
--
2.1.0
Re: [Qemu-devel] E5-2620v2 - emulation stop error
Bandan Das b...@redhat.com writes: Andrey Korolyov and...@xdel.ru writes: ... http://xdel.ru/downloads/kvm-e5v2-issue/another-tracepoint-fail-with-apicv.dat.gz Something a bit more interesting, but the mess is happening just *after* NMI firing. What happens if NMI is turned off on the host ? Sorry, I meant the watchdog..
Re: [Qemu-devel] E5-2620v2 - emulation stop error
Andrey Korolyov and...@xdel.ru writes: ... http://xdel.ru/downloads/kvm-e5v2-issue/another-tracepoint-fail-with-apicv.dat.gz Something a bit more interesting, but the mess is happening just *after* NMI firing. What happens if NMI is turned off on the host ?
Re: Error upgrading vm from windows 8 to 8.1
coreys cor...@affinitygs.com writes: When I try to upgrade my guest Windows 8 VM to Windows 8.1, I get the error that the processor doesn't support CompareExchange128. Have not been able to find any information about this error. I think that's because cmpxchg16b is not emulated yet:

static int em_cmpxchg8b(struct x86_emulate_ctxt *ctxt)
{
	u64 old = ctxt->dst.orig_val64;

	if (ctxt->dst.bytes == 16)
		return X86EMUL_UNHANDLEABLE;
...
Re: [Qemu-devel] E5-2620v2 - emulation stop error
Radim Krčmář rkrc...@redhat.com writes: 2015-03-26 21:24+0300, Andrey Korolyov: On Thu, Mar 26, 2015 at 8:40 PM, Radim Krčmář rkrc...@redhat.com wrote: 2015-03-26 20:08+0300, Andrey Korolyov: KVM internal error. Suberror: 2 extra data[0]: 80ef extra data[1]: 8b0d Btw. does this part ever change? I see that first report had: KVM internal error. Suberror: 2 extra data[0]: 80d1 extra data[1]: 8b0d Was that a Windows guest by any chance? Yes, exactly, different extra data output was from a Windows VMs. Windows uses vector 0xd1 for timer interrupts. I second Bandan -- checking that it reproduces on other machine would be great for sanity :) (Although a bug in our APICv is far more likely.) If it's APICv related, a run without apicv enabled could give more hints. Your devices not getting reset hypothesis makes the most sense to me, maybe the timer vector in the error message is just one part of the whole story. Another misbehaving interrupt from the dark comes in at the same time and leads to a double fault. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] E5-2620v2 - emulation stop error
Hi Andrey, Andrey Korolyov and...@xdel.ru writes: On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov and...@xdel.ru wrote: For now, it looks like bug have a mixed Murphy-Heisenberg nature, as it appearance is very rare (compared to the number of actual launches) and most probably bounded to the physical characteristics of my production nodes. As soon as I reach any reproducible path for a regular workstation environment, I`ll let everyone know. Also I am starting to think that issue can belong to the particular motherboard firmware revision, despite fact that the CPU microcode is the same everywhere. I will take the risk and say this - could it be a processor bug ? :) Hello everyone, I`ve managed to reproduce this issue *deterministically* with latest seabios with smp fix and 3.18.3. The error occuring just *once* per vm until hypervisor reboots, at least in my setup, this is definitely crazy... - launch two VMs (Centos 7 in my case), - wait a little while they are booting, - attach serial console (I am using virsh list for this exact purpose), - issue acpi reboot or reset, does not matter, - VM always hangs at boot, most times with sgabios initialization string printed out [1], but sometimes it hangs a bit later [2], - no matter how many times I try to relaunch the QEMU afterwards, the issue does not appear on VM which experienced problem once; - trace and sample args can be seen in [3] and [4] respectively. My system is a Dell R720 dual socket which has 2620v2s. I tried your setup but couldn't reproduce (my qemu cmdline isn't exactly the same as yours), although, if you could simplify your command line a bit, I can try again. Bandan 1) Google, Inc. Serial Graphics Adapter 06/11/14 SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $ (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014 Term: 211x62 4 0 2) Google, Inc. 
Serial Graphics Adapter 06/11/14 SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $ (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014 Term: 211x62 4 0 [...empty screen...] SeaBIOS (version 1.8.1-20150325_230423-testnode) Machine UUID 3c78721f-7317-4f85-bcbe-f5ad46d293a1 iPXE (http://ipxe.org) 00:02.0 C100 PCI2.10 PnP PMM+3FF95BA0+3FEF5BA0 C10 3) KVM internal error. Suberror: 2 extra data[0]: 80ef extra data[1]: 8b0d EAX= EBX= ECX= EDX= ESI= EDI= EBP= ESP=6d2c EIP=d331 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES = 9300 CS =f000 000f 9b00 SS = 9300 DS = 9300 FS = 9300 GS = 9300 LDT= 8200 TR = 8b00 GDT= 000f6cb0 0037 IDT= 03ff CR0=0010 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb cd 4a cb fa fc 66 ba 47 d3 0f 00 e9 ad fe f3 90 f0 0f ba 2d d4 fe fb 3f 4) /usr/bin/qemu-system-x86_64 -name centos71 -S -machine pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -bios /usr/share/seabios/bios.bin -m 1024 -realtime mlock=off -smp 12,sockets=1,cores=12,threads=12 -uuid 3c78721f-7317-4f85-bcbe-f5ad46d293a1 -nographic -no-user-config -nodefaults -device sga -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/centos71.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device nec-usb-xhci,id=usb,bus=pci.0,addr=0x3 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=rbd:dev-rack2/centos7-1.raw:id=qemukvm:key=XX:auth_supported=cephx\;none:mon_host=10.6.0.1\:6789\;10.6.0.3\:6789\;10.6.0.4\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev 
socket,id=charchannel0,path=/var/lib/libvirt/qemu/centos71.sock,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.1 -msg timestamp=on -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: slow networking with VM virtio
[Ccing qemu-devel and kvm]

Boylan, Ross ross.boy...@ucsf.edu writes: Running a Windows 7 VM under KVM, with virtio drivers, the networking was so slow as to be non-functional. As suggested in comment 11 of https://bugzilla.redhat.com/show_bug.cgi?id=855640, I did ethtool -K eth1 gro off and speeds got much better, ~150Mb/s, although using speedtest.net it was unable to connect to the server for the upload test. However, ssh in to the host machine, or to linux VM's running on the same host, became extremely erratic (prone to dropped connections) and slow after making these changes. I was connected via sshuttle for VPN. Does anyone have any suggestions? I would appreciate a cc. Thanks. Ross Boylan

Details: Host is running Debian wheezy on amd64. Using virt-manager libvirt kvm and bridged networking. Linux markov00 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux

Here's one of the network specifications for libvirt:

<interface type='direct'>
  <mac address='52:54:00:61:'/>
  <source dev='eth1' mode='vepa'/>
  <model type='virtio'/>
  <address type='pci' domain='0x' bus='0x00' slot='0x03' function='0x0'/>
</interface>

The Windows drivers came from http://alt.fedoraproject.org/pub/alt/virtio-win/latest/images/virtio-win-0.1-100.iso in the 32 bit windows 7 directory of the iso (host is 64 bit, as is the emulated machine; Windows 7 is 32 bit).
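One thing worth checking, though not a confirmed fix for this report: mode='vepa' on a type='direct' (macvtap) interface requires the adjacent physical switch to support reflective relay (hairpin), and guest-to-host traffic over the same NIC typically does not work without it, which matches the erratic ssh-to-host symptom. A macvtap bridge-mode variant of the same interface would look like this (the MAC below is a placeholder, not the reporter's truncated address):

```xml
<!-- Illustrative alternative, assuming no VEPA-capable switch:
     macvtap bridge mode forwards between guests on the same dev
     without relying on the external switch reflecting frames. -->
<interface type='direct'>
  <mac address='52:54:00:61:aa:bb'/>  <!-- placeholder MAC -->
  <source dev='eth1' mode='bridge'/>
  <model type='virtio'/>
</interface>
```

Even in bridge mode, macvtap guest-to-host traffic is limited; a regular Linux bridge (type='bridge') is the usual choice when guests must talk to the host itself.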
Re: [PATCH v2] KVM: nVMX: Add support for rdtscp
Jan Kiszka jan.kis...@web.de writes:

...

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2467,6 +2467,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
 	vmx->nested.nested_vmx_secondary_ctls_low = 0;
 	vmx->nested.nested_vmx_secondary_ctls_high &=
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+		SECONDARY_EXEC_RDTSCP |
 		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
 		SECONDARY_EXEC_APIC_REGISTER_VIRT |
 		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
@@ -7510,7 +7511,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 		return nested_cpu_has(vmcs12, CPU_BASED_INVLPG_EXITING);
 	case EXIT_REASON_RDPMC:
 		return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING);
-	case EXIT_REASON_RDTSC:
+	case EXIT_REASON_RDTSC: case EXIT_REASON_RDTSCP:
 		return nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
 	case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
 	case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
@@ -8517,6 +8518,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 					  exec_control);
 			}
 		}
+	if (!vmx->rdtscp_enabled)
+		vmx->nested.nested_vmx_secondary_ctls_high &=
+			~SECONDARY_EXEC_RDTSCP;

No need to do this if nested is not enabled ? Or just a if (nested) in the prior if else loop should be enough I think.

Bandan

 	}

 	/* Exposing INVPCID only when PCID is exposed */
@@ -9146,8 +9150,9 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 			exec_control &= ~SECONDARY_EXEC_RDTSCP;
 		/* Take the following fields only from vmcs12 */
 		exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+				  SECONDARY_EXEC_RDTSCP |
 				  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
-				  SECONDARY_EXEC_APIC_REGISTER_VIRT);
+				  SECONDARY_EXEC_APIC_REGISTER_VIRT);
 		if (nested_cpu_has(vmcs12,
 				   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
 			exec_control |= vmcs12->secondary_vm_exec_control;
Re: [PATCH v2] KVM: nVMX: Add support for rdtscp
Jan Kiszka jan.kis...@web.de writes: On 2015-03-23 18:01, Bandan Das wrote: Jan Kiszka jan.kis...@web.de writes:

...

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2467,6 +2467,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
 	vmx->nested.nested_vmx_secondary_ctls_low = 0;
 	vmx->nested.nested_vmx_secondary_ctls_high &=
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+		SECONDARY_EXEC_RDTSCP |
 		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
 		SECONDARY_EXEC_APIC_REGISTER_VIRT |
 		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
@@ -7510,7 +7511,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 		return nested_cpu_has(vmcs12, CPU_BASED_INVLPG_EXITING);
 	case EXIT_REASON_RDPMC:
 		return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING);
-	case EXIT_REASON_RDTSC:
+	case EXIT_REASON_RDTSC: case EXIT_REASON_RDTSCP:
 		return nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
 	case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
 	case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
@@ -8517,6 +8518,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 					  exec_control);
 			}
 		}
+	if (!vmx->rdtscp_enabled)
+		vmx->nested.nested_vmx_secondary_ctls_high &=
+			~SECONDARY_EXEC_RDTSCP;

No need to do this if nested is not enabled ? Or just a if (nested) in the prior if else loop should be enough I think.

I can add this - but this is far away from being a hotpath. What would be the benefit?

Right, definitely not a hotpath, just seems unnecessary if nested is not enabled.

Bandan

Thanks, Jan
Re: [Qemu-devel] PCI passthrough of 40G ethernet interface (Openstack/KVM)
[Ccing netdev and Stefan] Bandan Das b...@redhat.com writes: jacob jacob opstk...@gmail.com writes: On Mon, Mar 16, 2015 at 2:12 PM, Bandan Das b...@redhat.com wrote: jacob jacob opstk...@gmail.com writes: I also see the following in dmesg in the VM. [0.095758] ACPI: PCI Root Bridge [PCI0] (domain [bus 00-ff]) [0.096006] acpi PNP0A03:00: ACPI _OSC support notification failed, disabling PCIe ASPM [0.096915] acpi PNP0A03:00: Unable to request _OSC control (_OSC support mask: 0x08) IIRC, For OSC control, after BIOS is done with (whatever initialization it needs to do), it clears a bit so that the OS can take over. This message, you are getting is a sign of a bug in the BIOS (usually). But I don't know if this is related to your problem. Does dmesg | grep -e DMAR -e IOMMU give anything useful ? Do not see anything useful in the output.. Ok, Thanks. Can you please post the output as well ? [0.097072] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge. Does this indicate any issue related to PCI passthrough? Would really appreciate any input on how to bebug this further. Did you get a chance to try a newer kernel ? Currently am using 3.18.7-200.fc21.x86_64 which is pretty recent. Are you suggesting trying the newer kernel just on the host? (or VM too?) Both preferably to 3.19. But it's just a wild guess. I saw i40e related fixes, particularly i40e: fix un-necessary Tx hangs in 3.19-rc5. This is not exactly what you are seeing but I was still wondering if it could help. Actually, Stefan suggests that support for this card is still sketchy and your best bet is to try out net-next http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git Also, could you please post more information about your hardware setup (chipset/processor/firmware version on the card etc) ? Thanks, Bandan Meanwhile, I am trying to get hold of a card myself to try and reproduce it at my end. 
Thanks, Bandan On Fri, Mar 13, 2015 at 10:08 AM, jacob jacob opstk...@gmail.com wrote: So, it could be the i40e driver then ? Because IIUC, VFs use a separate driver. Just to rule out the possibility that there might be some driver fixes that could help with this, it might be a good idea to try a 3.19 or later upstream kernel. I tried with the latest DPDK release too (dpdk-1.8.0) and see the same issue. As mentioned earlier, i do not see any issues at all when running tests using either i40e or dpdk on the host itself. This is the reason why i am suspecting if it is anything to do with KVM/libvirt. Both with regular PCI passthrough and VF passthrough i see issues. It is always pointing to some issue with packet transmission. Receive seems to work ok. On Thu, Mar 12, 2015 at 8:02 PM, Bandan Das b...@redhat.com wrote: jacob jacob opstk...@gmail.com writes: On Thu, Mar 12, 2015 at 3:07 PM, Bandan Das b...@redhat.com wrote: jacob jacob opstk...@gmail.com writes: Hi, Seeing failures when trying to do PCI passthrough of Intel XL710 40G interface to KVM vm. 0a:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 01) You are assigning the PF right ? Does assigning VFs work or it's the same behavior ? Yes.Assigning VFs worked ok.But this had other issues while bringing down VMs. Interested in finding out if PCI passthrough of 40G intel XL710 interface is qualified in some specific kernel/kvm release. So, it could be the i40e driver then ? Because IIUC, VFs use a separate driver. Just to rule out the possibility that there might be some driver fixes that could help with this, it might be a good idea to try a 3.19 or later upstream kernel. 
From dmesg on host: [80326.559674] kvm: zapping shadow pages for mmio generation wraparound [80327.271191] kvm [175994]: vcpu0 unhandled rdmsr: 0x1c9 [80327.271689] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a6 [80327.272201] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a7 [80327.272681] kvm [175994]: vcpu0 unhandled rdmsr: 0x3f6 [80327.376186] kvm [175994]: vcpu0 unhandled rdmsr: 0x606 These are harmless and are related to unimplemented PMU msrs, not VFIO. Bandan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86: call irq notifiers with directed EOI
Radim Krčmář rkrc...@redhat.com writes:

kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled. We need to do that for irq notifiers. (Like with edge interrupts.)

Fix it by skipping EOI broadcast only.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211
Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/ioapic.c | 4 +++-
 arch/x86/kvm/lapic.c  | 3 +--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index b1947e0f3e10..46d4449772bc 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -422,6 +422,7 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu, struct kvm_ioapic *ioapic, int vector, int trigger_mode)
 {
 	int i;
+	struct kvm_lapic *apic = vcpu->arch.apic;

 	for (i = 0; i < IOAPIC_NUM_PINS; i++) {
 		union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
@@ -443,7 +444,8 @@
 		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
 		spin_lock(&ioapic->lock);
-		if (trigger_mode != IOAPIC_LEVEL_TRIG)
+		if (trigger_mode != IOAPIC_LEVEL_TRIG ||
+		    kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI)
 			continue;

 		ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bd4e34de24c7..4ee827d7bf36 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -833,8 +833,7 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
 {
-	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
-	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
+	if (kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
 		int trigger_mode;
 		if (apic_test_vector(vector, apic->regs + APIC_TMR))
 			trigger_mode = IOAPIC_LEVEL_TRIG;

Works on my Xen 4.4 L1 setup with Intel E5 v2 host. Without this patch, L1 panics as reported in the bug referenced above.
Tested-by: Bandan Das b...@redhat.com
Re: [PATCH] KVM: x86: call irq notifiers with directed EOI
Radim Krčmář rkrc...@redhat.com writes:

kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled. We need to do that for irq notifiers. (Like with edge interrupts.)

Wow! It's interesting that this path is only hit with Xen as guest. I always thought of directed EOI as a security feature since broadcast could lead to interrupt storms (or something like that) :)

Bandan

Fix it by skipping EOI broadcast only.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211
Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/ioapic.c | 4 +++-
 arch/x86/kvm/lapic.c  | 3 +--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index b1947e0f3e10..46d4449772bc 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -422,6 +422,7 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu, struct kvm_ioapic *ioapic, int vector, int trigger_mode)
 {
 	int i;
+	struct kvm_lapic *apic = vcpu->arch.apic;

 	for (i = 0; i < IOAPIC_NUM_PINS; i++) {
 		union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
@@ -443,7 +444,8 @@
 		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
 		spin_lock(&ioapic->lock);
-		if (trigger_mode != IOAPIC_LEVEL_TRIG)
+		if (trigger_mode != IOAPIC_LEVEL_TRIG ||
+		    kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI)
 			continue;

 		ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bd4e34de24c7..4ee827d7bf36 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -833,8 +833,7 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
 {
-	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
-	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
+	if (kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
 		int trigger_mode;
 		if (apic_test_vector(vector, apic->regs + APIC_TMR))
 			trigger_mode = IOAPIC_LEVEL_TRIG;
Re: KVM emulation failure with recent kernel and QEMU Seabios
Gerd Hoffmann kra...@redhat.com writes: On Mo, 2015-03-16 at 14:16 -0400, Bandan Das wrote: Jan Kiszka jan.kis...@siemens.com writes: Am 2015-03-12 um 09:11 schrieb Gerd Hoffmann: On Do, 2015-03-12 at 09:09 +0100, Jan Kiszka wrote: Hi, apparently since the latest QEMU updates I'm getting this once in a while: http://www.seabios.org/pipermail/seabios/2015-March/008897.html OK... So we are waiting on a stable release to pull in new binaries with this fix? A matter of days? 1.8.1 is now available on the downloads page. I know, I've uploaded it ;) Yep, I believe you ;) I copied this information here from one of your other emails for Jan's benefit :) Pull request for the update was sent yesterday too. It's merged meanwhile, so go fetch qemu master and be happy. cheers, Gerd
Re: [Qemu-devel] PCI passthrough of 40G ethernet interface (Openstack/KVM)
jacob jacob opstk...@gmail.com writes:

I also see the following in dmesg in the VM.

[0.095758] ACPI: PCI Root Bridge [PCI0] (domain [bus 00-ff])
[0.096006] acpi PNP0A03:00: ACPI _OSC support notification failed, disabling PCIe ASPM
[0.096915] acpi PNP0A03:00: Unable to request _OSC control (_OSC support mask: 0x08)

IIRC, for _OSC control, after the BIOS is done with (whatever initialization it needs to do), it clears a bit so that the OS can take over. This message you are getting is a sign of a bug in the BIOS (usually). But I don't know if this is related to your problem. Does dmesg | grep -e DMAR -e IOMMU give anything useful ?

[0.097072] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.

Does this indicate any issue related to PCI passthrough? Would really appreciate any input on how to debug this further.

Did you get a chance to try a newer kernel ?

On Fri, Mar 13, 2015 at 10:08 AM, jacob jacob opstk...@gmail.com wrote:

So, it could be the i40e driver then ? Because IIUC, VFs use a separate driver. Just to rule out the possibility that there might be some driver fixes that could help with this, it might be a good idea to try a 3.19 or later upstream kernel.

I tried with the latest DPDK release too (dpdk-1.8.0) and see the same issue. As mentioned earlier, I do not see any issues at all when running tests using either i40e or dpdk on the host itself. This is the reason why I suspect it has something to do with KVM/libvirt. With both regular PCI passthrough and VF passthrough I see issues. It always points to some issue with packet transmission. Receive seems to work ok.

On Thu, Mar 12, 2015 at 8:02 PM, Bandan Das b...@redhat.com wrote:

jacob jacob opstk...@gmail.com writes:

On Thu, Mar 12, 2015 at 3:07 PM, Bandan Das b...@redhat.com wrote:

jacob jacob opstk...@gmail.com writes:

Hi, Seeing failures when trying to do PCI passthrough of Intel XL710 40G interface to KVM vm.
0a:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 01)

You are assigning the PF right ? Does assigning VFs work or is it the same behavior ?

Yes. Assigning VFs worked OK. But this had other issues while bringing down VMs. Interested in finding out if PCI passthrough of the 40G Intel XL710 interface is qualified in some specific kernel/kvm release.

So, it could be the i40e driver then ? Because IIUC, VFs use a separate driver. Just to rule out the possibility that there might be some driver fixes that could help with this, it might be a good idea to try a 3.19 or later upstream kernel.

From dmesg on host:

[80326.559674] kvm: zapping shadow pages for mmio generation wraparound
[80327.271191] kvm [175994]: vcpu0 unhandled rdmsr: 0x1c9
[80327.271689] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a6
[80327.272201] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a7
[80327.272681] kvm [175994]: vcpu0 unhandled rdmsr: 0x3f6
[80327.376186] kvm [175994]: vcpu0 unhandled rdmsr: 0x606

These are harmless and are related to unimplemented PMU MSRs, not VFIO. Bandan
Re: KVM emulation failure with recent kernel and QEMU Seabios
Jan Kiszka jan.kis...@siemens.com writes: Am 2015-03-12 um 09:11 schrieb Gerd Hoffmann: On Do, 2015-03-12 at 09:09 +0100, Jan Kiszka wrote: Hi, apparently since the latest QEMU updates I'm getting this once in a while: http://www.seabios.org/pipermail/seabios/2015-March/008897.html OK... So we are waiting on a stable release to pull in new binaries with this fix? A matter of days? 1.8.1 is now available on the downloads page. Bandan Jan
[PATCH] KVM: SVM: Fix confusing message if no exit handlers are installed
I hit this path on an AMD box and thought someone was playing an April Fool's joke on me.

Signed-off-by: Bandan Das b...@redhat.com
---
 arch/x86/kvm/svm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cc618c8..4e77110 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3555,7 +3555,7 @@ static int handle_exit(struct kvm_vcpu *vcpu)
 	if (exit_code >= ARRAY_SIZE(svm_exit_handlers)
 	    || !svm_exit_handlers[exit_code]) {
-		WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_code);
+		WARN_ONCE(1, "svm: unexpected exit reason 0x%x\n", exit_code);
 		kvm_queue_exception(vcpu, UD_VECTOR);
 		return 1;
 	}
--
2.1.0
Re: [Qemu-devel] PCI passthrough of 40G ethernet interface (Openstack/KVM)
jacob jacob opstk...@gmail.com writes:

On Mon, Mar 16, 2015 at 2:12 PM, Bandan Das b...@redhat.com wrote:

jacob jacob opstk...@gmail.com writes:

I also see the following in dmesg in the VM.

[0.095758] ACPI: PCI Root Bridge [PCI0] (domain [bus 00-ff])
[0.096006] acpi PNP0A03:00: ACPI _OSC support notification failed, disabling PCIe ASPM
[0.096915] acpi PNP0A03:00: Unable to request _OSC control (_OSC support mask: 0x08)

IIRC, for _OSC control, after the BIOS is done with (whatever initialization it needs to do), it clears a bit so that the OS can take over. This message you are getting is a sign of a bug in the BIOS (usually). But I don't know if this is related to your problem. Does dmesg | grep -e DMAR -e IOMMU give anything useful ?

Do not see anything useful in the output..

Ok, Thanks. Can you please post the output as well ?

[0.097072] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.

Does this indicate any issue related to PCI passthrough? Would really appreciate any input on how to debug this further.

Did you get a chance to try a newer kernel ?

Currently am using 3.18.7-200.fc21.x86_64 which is pretty recent. Are you suggesting trying the newer kernel just on the host? (or VM too?)

Both preferably to 3.19. But it's just a wild guess. I saw i40e related fixes, particularly "i40e: fix un-necessary Tx hangs" in 3.19-rc5. This is not exactly what you are seeing but I was still wondering if it could help. Meanwhile, I am trying to get hold of a card myself to try and reproduce it at my end.

Thanks, Bandan

On Fri, Mar 13, 2015 at 10:08 AM, jacob jacob opstk...@gmail.com wrote:

So, it could be the i40e driver then ? Because IIUC, VFs use a separate driver. Just to rule out the possibility that there might be some driver fixes that could help with this, it might be a good idea to try a 3.19 or later upstream kernel.

I tried with the latest DPDK release too (dpdk-1.8.0) and see the same issue.
As mentioned earlier, I do not see any issues at all when running tests using either i40e or dpdk on the host itself. This is the reason why I suspect it has something to do with KVM/libvirt. With both regular PCI passthrough and VF passthrough I see issues. It always points to some issue with packet transmission. Receive seems to work OK.

On Thu, Mar 12, 2015 at 8:02 PM, Bandan Das b...@redhat.com wrote:

jacob jacob opstk...@gmail.com writes:

On Thu, Mar 12, 2015 at 3:07 PM, Bandan Das b...@redhat.com wrote:

jacob jacob opstk...@gmail.com writes:

Hi, Seeing failures when trying to do PCI passthrough of Intel XL710 40G interface to KVM vm.

0a:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 01)

You are assigning the PF right ? Does assigning VFs work or is it the same behavior ?

Yes. Assigning VFs worked OK. But this had other issues while bringing down VMs. Interested in finding out if PCI passthrough of the 40G Intel XL710 interface is qualified in some specific kernel/kvm release.

So, it could be the i40e driver then ? Because IIUC, VFs use a separate driver. Just to rule out the possibility that there might be some driver fixes that could help with this, it might be a good idea to try a 3.19 or later upstream kernel.

From dmesg on host:

[80326.559674] kvm: zapping shadow pages for mmio generation wraparound
[80327.271191] kvm [175994]: vcpu0 unhandled rdmsr: 0x1c9
[80327.271689] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a6
[80327.272201] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a7
[80327.272681] kvm [175994]: vcpu0 unhandled rdmsr: 0x3f6
[80327.376186] kvm [175994]: vcpu0 unhandled rdmsr: 0x606

These are harmless and are related to unimplemented PMU MSRs, not VFIO.
Bandan
Re: [PATCH] kvm: x86: svm: remove SVM_EXIT_READ_CR* intercepts
Joel Schopp joel.sch...@amd.com writes:

There isn't really a valid reason for kvm to intercept cr* reads on svm hardware. The current kvm code just ends up returning the register

Signed-off-by: Joel Schopp joel.sch...@amd.com
---
 arch/x86/kvm/svm.c | 41 -
 1 file changed, 4 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cc618c8..c3d10e6 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1090,9 +1090,6 @@ static void init_vmcb(struct vcpu_svm *svm)
 	svm->vcpu.fpu_active = 1;
 	svm->vcpu.arch.hflags = 0;

-	set_cr_intercept(svm, INTERCEPT_CR0_READ);
-	set_cr_intercept(svm, INTERCEPT_CR3_READ);
-	set_cr_intercept(svm, INTERCEPT_CR4_READ);
 	set_cr_intercept(svm, INTERCEPT_CR0_WRITE);
 	set_cr_intercept(svm, INTERCEPT_CR3_WRITE);
 	set_cr_intercept(svm, INTERCEPT_CR4_WRITE);
@@ -1174,7 +1171,6 @@
 		control->nested_ctl = 1;
 		clr_intercept(svm, INTERCEPT_INVLPG);
 		clr_exception_intercept(svm, PF_VECTOR);
-		clr_cr_intercept(svm, INTERCEPT_CR3_READ);
 		clr_cr_intercept(svm, INTERCEPT_CR3_WRITE);
 		save->g_pat = 0x0007040600070406ULL;
 		save->cr3 = 0;
@@ -2968,29 +2964,10 @@ static int cr_interception(struct vcpu_svm *svm)
 			kvm_queue_exception(&svm->vcpu, UD_VECTOR);
 			return 1;
 		}
-	} else { /* mov from cr */
-		switch (cr) {
-		case 0:
-			val = kvm_read_cr0(&svm->vcpu);
-			break;
-		case 2:
-			val = svm->vcpu.arch.cr2;
-			break;
-		case 3:
-			val = kvm_read_cr3(&svm->vcpu);
-			break;
-		case 4:
-			val = kvm_read_cr4(&svm->vcpu);
-			break;
-		case 8:
-			val = kvm_get_cr8(&svm->vcpu);
-			break;
-		default:
-			WARN(1, "unhandled read from CR%d", cr);
-			kvm_queue_exception(&svm->vcpu, UD_VECTOR);
-			return 1;
-		}
-		kvm_register_write(&svm->vcpu, reg, val);
+	} else { /* mov from cr, should never trap in svm */
+		WARN(1, "unhandled read from CR%d", cr);
+		kvm_queue_exception(&svm->vcpu, UD_VECTOR);
+		return 1;

Can we end up here if a nested hypervisor sets cr read interception ?

Bandan

 	}
 	kvm_complete_insn_gp(&svm->vcpu, err);
@@ -3321,10 +3298,6 @@ static int mwait_interception(struct vcpu_svm *svm)
 }

 static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
-	[SVM_EXIT_READ_CR0]		= cr_interception,
-	[SVM_EXIT_READ_CR3]		= cr_interception,
-	[SVM_EXIT_READ_CR4]		= cr_interception,
-	[SVM_EXIT_READ_CR8]		= cr_interception,
 	[SVM_EXIT_CR0_SEL_WRITE]	= emulate_on_interception,
 	[SVM_EXIT_WRITE_CR0]		= cr_interception,
 	[SVM_EXIT_WRITE_CR3]		= cr_interception,
@@ -4151,11 +4124,9 @@ static const struct __x86_intercept {
 	u32 exit_code;
 	enum x86_intercept_stage stage;
 } x86_intercept_map[] = {
-	[x86_intercept_cr_read]		= POST_EX(SVM_EXIT_READ_CR0),
 	[x86_intercept_cr_write]	= POST_EX(SVM_EXIT_WRITE_CR0),
 	[x86_intercept_clts]		= POST_EX(SVM_EXIT_WRITE_CR0),
 	[x86_intercept_lmsw]		= POST_EX(SVM_EXIT_WRITE_CR0),
-	[x86_intercept_smsw]		= POST_EX(SVM_EXIT_READ_CR0),
 	[x86_intercept_dr_read]		= POST_EX(SVM_EXIT_READ_DR0),
 	[x86_intercept_dr_write]	= POST_EX(SVM_EXIT_WRITE_DR0),
 	[x86_intercept_sldt]		= POST_EX(SVM_EXIT_LDTR_READ),
@@ -4221,10 +4192,6 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
 		goto out;

 	switch (icpt_info.exit_code) {
-	case SVM_EXIT_READ_CR0:
-		if (info->intercept == x86_intercept_cr_read)
-			icpt_info.exit_code += info->modrm_reg;
-		break;
 	case SVM_EXIT_WRITE_CR0: {
 		unsigned long cr0, val;
 		u64 intercept;
Re: PCI passthrough of 40G ethernet interface (Openstack/KVM)
jacob jacob opstk...@gmail.com writes:

Hi, Seeing failures when trying to do PCI passthrough of Intel XL710 40G interface to KVM vm.

0a:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 01)

You are assigning the PF right ? Does assigning VFs work or is it the same behavior ?

From dmesg on host:

[80326.559674] kvm: zapping shadow pages for mmio generation wraparound
[80327.271191] kvm [175994]: vcpu0 unhandled rdmsr: 0x1c9
[80327.271689] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a6
[80327.272201] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a7
[80327.272681] kvm [175994]: vcpu0 unhandled rdmsr: 0x3f6
[80327.376186] kvm [175994]: vcpu0 unhandled rdmsr: 0x606

These are harmless and are related to unimplemented PMU MSRs, not VFIO. Bandan
Re: PCI passthrough of 40G ethernet interface (Openstack/KVM)
jacob jacob opstk...@gmail.com writes:

On Thu, Mar 12, 2015 at 3:07 PM, Bandan Das b...@redhat.com wrote:

jacob jacob opstk...@gmail.com writes:

Hi, Seeing failures when trying to do PCI passthrough of Intel XL710 40G interface to KVM vm.

0a:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 01)

You are assigning the PF right ? Does assigning VFs work or is it the same behavior ?

Yes. Assigning VFs worked OK. But this had other issues while bringing down VMs. Interested in finding out if PCI passthrough of the 40G Intel XL710 interface is qualified in some specific kernel/kvm release.

So, it could be the i40e driver then ? Because IIUC, VFs use a separate driver. Just to rule out the possibility that there might be some driver fixes that could help with this, it might be a good idea to try a 3.19 or later upstream kernel.

From dmesg on host:

[80326.559674] kvm: zapping shadow pages for mmio generation wraparound
[80327.271191] kvm [175994]: vcpu0 unhandled rdmsr: 0x1c9
[80327.271689] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a6
[80327.272201] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a7
[80327.272681] kvm [175994]: vcpu0 unhandled rdmsr: 0x3f6
[80327.376186] kvm [175994]: vcpu0 unhandled rdmsr: 0x606

These are harmless and are related to unimplemented PMU MSRs, not VFIO. Bandan
Re: [Qemu-devel] E5-2620v2 - emulation stop error
Kevin O'Connor ke...@koconnor.net writes:

... Something is very odd here. When I run the above command (on an older AMD machine) I get:

Found 128 cpu(s)
max supported 128 cpu(s)

That first value (1 vs 128) comes from QEMU (via cmos index 0x5f). That is, during smp init, SeaBIOS expects QEMU to tell it how many cpus are active, and SeaBIOS waits until that many CPUs check in from its SIPI request before proceeding. I wonder if QEMU reported only 1 active cpu via that cmos register, but more were actually active. If that was the case, it could

I was daring enough to try this and I don't see the crash :)

diff --git a/src/fw/smp.c b/src/fw/smp.c
index a466ea6..a346d46 100644
--- a/src/fw/smp.c
+++ b/src/fw/smp.c
@@ -49,6 +49,7 @@ int apic_id_is_present(u8 apic_id)
 void VISIBLE32FLAT
 handle_smp(void)
 {
+    dprintf(DEBUG_HDL_smp, "Calling handle_smp\n");
     if (!CONFIG_QEMU)
         return;
@@ -128,6 +129,8 @@ smp_setup(void)
     // Wait for other CPUs to process the SIPI.
     u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
+    while (cmos_smp_count == 1)
+        cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
     while (cmos_smp_count != CountCPUs)
         asm volatile(
             // Release lock and allow other processors to use the stack.

So, the while loop results in a race somehow ?

Bandan

certainly explain the failure - as multiple cpus could be running without the sipi trampoline in place. What does the log look like on a non-failure case?

-Kevin
Re: [Qemu-devel] E5-2620v2 - emulation stop error
Kevin O'Connor ke...@koconnor.net writes:

On Wed, Mar 11, 2015 at 01:09:42PM -0400, Bandan Das wrote:

Kevin O'Connor ke...@koconnor.net writes:

... Something is very odd here. When I run the above command (on an older AMD machine) I get:

Found 128 cpu(s)
max supported 128 cpu(s)

That first value (1 vs 128) comes from QEMU (via cmos index 0x5f). That is, during smp init, SeaBIOS expects QEMU to tell it how many cpus are active, and SeaBIOS waits until that many CPUs check in from its SIPI request before proceeding. I wonder if QEMU reported only 1 active cpu via that cmos register, but more were actually active. If that was the case, it could

I was daring enough to try this and I don't see the crash :)

diff --git a/src/fw/smp.c b/src/fw/smp.c
index a466ea6..a346d46 100644
--- a/src/fw/smp.c
+++ b/src/fw/smp.c
@@ -49,6 +49,7 @@ int apic_id_is_present(u8 apic_id)
 void VISIBLE32FLAT
 handle_smp(void)
 {
+    dprintf(DEBUG_HDL_smp, "Calling handle_smp\n");
     if (!CONFIG_QEMU)
         return;
@@ -128,6 +129,8 @@ smp_setup(void)
     // Wait for other CPUs to process the SIPI.
     u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
+    while (cmos_smp_count == 1)
+        cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;

That would loop forever if you had -smp 1.

Sorry, I should have been more clear. What I meant is if I run with -smp 128 (from Dave's original reproducer), sticking this while loop avoids the crash. So, the rtc_read eventually returns the right number (127), as the above while loop keeps polling.

Bandan

     while (cmos_smp_count != CountCPUs)
         asm volatile(
             // Release lock and allow other processors to use the stack.

So, the while loop results in a race somehow ?

No, the problem is that loop doesn't run at all, and as a result the other cpus end up running random code. SeaBIOS sets up an entry vector for multiple cpus, wakes up the cpus, then tears down the entry vector.
If it tears down the entry vector before all the cpus have run through it, then the other cpus can end up running random code. -Kevin
Re: [Qemu-devel] E5-2620v2 - emulation stop error
Dr. David Alan Gilbert dgilb...@redhat.com writes:

* Kevin O'Connor (ke...@koconnor.net) wrote:

On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:

On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:

For what it's worth, I can't seem to trigger the problem if I move the cmos read above the SIPI/LAPIC code (see patch below).

Ugh! That's a seabios bug. Main processor modifies the rtc index (rtc_read()) while APs try to clear the NMI bit by modifying the rtc index (romlayout.S:transition32). I'll put together a fix.

The seabios patch below resolves the issue for me.

Thanks! Looks good here. Andrey, Paolo, Bandan: Does it fix it for you as well?

Works for me too, thanks Kevin!

Bandan

Dave

-Kevin

--- a/src/romlayout.S
+++ b/src/romlayout.S
@@ -22,7 +22,8 @@
 // %edx = return location (in 32bit mode)
 // Clobbers: ecx, flags, segment registers, cr0, idt/gdt
     DECLFUNC transition32
-transition32_for_smi:
+transition32_nmi_off:
+    // transition32 when NMI and A20 are already initialized
     movl %eax, %ecx
     jmp 1f
 transition32:
@@ -205,7 +206,7 @@ __farcall16:
 entry_smi:
     // Transition to 32bit mode.
     movl $1f + BUILD_BIOS_ADDR, %edx
-    jmp transition32_for_smi
+    jmp transition32_nmi_off
     .code32
 1:  movl $BUILD_SMM_ADDR + 0x8000, %esp
     calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
@@ -216,8 +217,10 @@ entry_smi:
     DECLFUNC entry_smp
 entry_smp:
     // Transition to 32bit mode.
+    cli
+    cld
     movl $2f + BUILD_BIOS_ADDR, %edx
-    jmp transition32
+    jmp transition32_nmi_off
     .code32
     // Acquire lock and take ownership of shared stack
 1:  rep ; nop

-- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] E5-2620v2 - emulation stop error
Dr. David Alan Gilbert dgilb...@redhat.com writes:

* Kevin O'Connor (ke...@koconnor.net) wrote:

On Wed, Mar 11, 2015 at 04:52:03PM +, Dr. David Alan Gilbert wrote:

* Kevin O'Connor (ke...@koconnor.net) wrote:

So, I couldn't get this to fail on my older AMD machine at all with the default SeaBIOS code. But, when I change the code with the patch below, it failed right away. [...]

And the failed debug output looks like:

SeaBIOS (version rel-1.8.0-7-gd23eba6-dirty-20150311_121819-morn.localdomain) [...]
cmos_smp_count0=20 [...]
cmos_smp_count=1
cmos_smp_count2=1/20
Found 1 cpu(s) max supported 20 cpu(s)

I'm going to check the assembly for a compiler error, but is it possible QEMU is returning incorrect data in cmos index 0x5f?

I checked the SeaBIOS assembler and it looks sane. So, I think the question is, why is QEMU sometimes returning a 0 instead of 127 from cmos 0x5f.

My reading of the logs I've just created is that qemu doesn't think it's ever being asked to read 5f in the failed case:

good:
pc_cmos_init 5f setting smp_cpus=20
cmos: read index=0x0f val=0x00
cmos: read index=0x34 val=0x00
cmos: read index=0x35 val=0x3f
cmos: read index=0x38 val=0x30
cmos: read index=0x3d val=0x12
cmos: read index=0x38 val=0x30
cmos: read index=0x0b val=0x02
cmos: read index=0x0d val=0x80
cmos: read index=0x5f val=0x13   Yeh!
cmos: read index=0x0f val=0x00
cmos: read index=0x0f val=0x00
cmos: read index=0x0f val=0x00

bad:
pc_cmos_init 5f setting smp_cpus=20
cmos: read index=0x0f val=0x00
cmos: read index=0x34 val=0x00
cmos: read index=0x35 val=0x3f
cmos: read index=0x38 val=0x30
cmos: read index=0x3d val=0x12
cmos: read index=0x38 val=0x30
cmos: read index=0x0b val=0x02
cmos: read index=0x0d val=0x80   Oh!
cmos: read index=0x0f val=0x00
cmos: read index=0x0f val=0x00
cmos: read index=0x0f val=0x00

rtc_read in hw/rt.c is:

u8 rtc_read(u8 index)
{
    index |= NMI_DISABLE_BIT;
    outb(index, PORT_CMOS_INDEX);
    return inb(PORT_CMOS_DATA);
}

Is it possible that the read would happen before the index has been committed ?

Dave