Re: VIA Eden X4

2016-01-04 Thread Bandan Das
"Matwey V. Kornilov"  writes:

> Hello,
>
> According to Wikipedia, VIA claims x86 hardware-assisted virtualization
> for the VIA Eden X4 CPU.
> Does anybody know if it is supported by Linux KVM?
>

I can't say for sure, but my guess is that it should work: VIA implements
VT-x-like virtualization extensions, so KVM will find VMX-capable hardware.
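
FWIW, a quick way to confirm on the box itself is to check CPUID leaf 1,
ECX bit 5 (the VMX flag), which is what kvm_intel relies on. A minimal
userspace sketch (untested on an Eden X4; plain gcc on any x86 host):

/* Print whether the host CPU advertises VMX (CPUID.01H:ECX bit 5).
 * If it does, the kvm_intel module should be able to load. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("VMX %s\n", (ecx & (1u << 5)) ? "present" : "absent");
	return 0;
}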

Bandan



Re: x86: Don't report guest userspace emulation error to userspace, why ?

2015-12-10 Thread Bandan Das
Paolo Bonzini <pbonz...@redhat.com> writes:

> On 10/12/2015 18:58, Bandan Das wrote:
>>> > Allowing userspace to stop the guest with an emulation failure is a
>> This one I don't :) Userspace started the guest after all, there are other
>> ways for it to kill the guest if it wanted to.
>
> I mean allowing guest userspace to stop the guest.

Sure! Userspace (QEMU) can just re-enter the guest when it sees the failure.
Doing it in the host KVM seems like overkill.
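
For the cases that do reach userspace today, the exit shows up as
KVM_EXIT_INTERNAL_ERROR with suberror KVM_INTERNAL_ERROR_EMULATION, so a VMM
could just log and keep running, roughly along these lines (rough sketch,
untested; assumes the vcpu's mmap'ed kvm_run struct is already set up):

/* Rough sketch: on an emulation failure reported by KVM, log it and tell
 * the run loop to re-enter the guest instead of tearing it down. */
#include <linux/kvm.h>
#include <stdio.h>

static int handle_exit(struct kvm_run *run)
{
	if (run->exit_reason == KVM_EXIT_INTERNAL_ERROR &&
	    run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
		fprintf(stderr, "emulation failure, re-entering guest\n");
		return 0;	/* caller issues KVM_RUN again */
	}
	return -1;		/* anything else: let the caller decide */
}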

> Paolo
>
>>> > possible denial of service, similar to L2 stopping L1 with an emulation
>>> > failure.


Re: x86: Don't report guest userspace emulation error to userspace, why ?

2015-12-10 Thread Bandan Das
Paolo Bonzini <pbonz...@redhat.com> writes:

>> Paolo Bonzini <pbonz...@redhat.com> writes:
>> > On 10/12/2015 18:58, Bandan Das wrote:
>> >>> > Allowing userspace to stop the guest with an emulation failure is a
>> >> This one I don't :) Userspace started the guest after all, there are other
>> >> ways for it to kill the guest if it wanted to.
>> >
>> > I mean allowing guest userspace to stop the guest.
>> 
>> Sure! Userspace (Qemu) can just reenter the guest when it sees the failure.
>> Doing it in the host kvm seems overkill.
>
> Most userspaces will get it wrong.  Doing it once makes sure that you
> do it right.

I don't buy that. As userspace, I'd prefer to know what error the guest
I launched hit and to decide what to do, at least whenever I can. This seems
to be one such case.

> Paolo


Re: x86: Don't report guest userspace emulation error to userspace, why ?

2015-12-10 Thread Bandan Das
Paolo Bonzini <pbonz...@redhat.com> writes:

> On 09/12/2015 23:18, Bandan Das wrote:
>> Commit a2b9e6c1a35afcc09:
>> 
>> KVM: x86: Don't report guest userspace emulation error to userspace
>> 
>> Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to
>> user-space") disabled the reporting of L2 (nested guest) emulation failures
>> to userspace due to a race condition between a vmexit and the instruction
>> emulator. The same rationale also applies to userspace applications that
>> are permitted by the guest OS to access MMIO areas or perform PIO.
>> 
>> This patch extends the current behavior - of injecting a #UD instead of
>> reporting it to userspace - also to guest userspace code.
>> 
>> I searched the archives but failed to find anything. Can someone please
>> explain why this is needed? Or, why not let userspace decide what to do
>> based on the CPL, whether to continue execution or kill the guest? Is the
>> assumption here that this is what userspace always wants?
>
> Not what userspace always wants, but what the guest kernel always wants.

Thanks Paolo, this one I agree with.

> Allowing userspace to stop the guest with an emulation failure is a

This one I don't :) Userspace started the guest, after all; there are other
ways for it to kill the guest if it wanted to.

> possible denial of service, similar to L2 stopping L1 with an emulation
> failure.
>
> Paolo


x86: Don't report guest userspace emulation error to userspace, why ?

2015-12-09 Thread Bandan Das

Commit a2b9e6c1a35afcc09:

KVM: x86: Don't report guest userspace emulation error to userspace

Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to
user-space") disabled the reporting of L2 (nested guest) emulation failures to
userspace due to a race condition between a vmexit and the instruction
emulator. The same rationale also applies to userspace applications that are
permitted by the guest OS to access MMIO areas or perform PIO.

This patch extends the current behavior - of injecting a #UD instead of
reporting it to userspace - also to guest userspace code.

I searched the archives but failed to find anything. Can someone please
explain why this is needed? Or, why not let userspace decide what to do based
on the CPL, whether to continue execution or kill the guest? Is the assumption
here that this is what userspace always wants?


Re: [PATCH] KVM: nVMX: remove incorrect vpid check in nested invvpid emulation

2015-11-25 Thread Bandan Das
Haozhong Zhang <haozhong.zh...@intel.com> writes:

> On 11/25/15 10:45, Bandan Das wrote:
>> Haozhong Zhang <haozhong.zh...@intel.com> writes:
>> 
>> > This patch removes the vpid check when emulating nested invvpid
>> > instruction of type all-contexts invalidation. The existing code is
>> > incorrect because:
>> >  (1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate
>> >  Translations Based on VPID", invvpid instruction does not check
>> >  vpid in the invvpid descriptor when its type is all-contexts
>> >  invalidation.
>> 
>> But iirc isn't vpid=0 reserved for root mode ?
> Yes,
>
>> I think we don't want
>> L1 hypervisor to be able do a invvpid(0).
>
> but the invvpid emulated here is doing the all-contexts invalidation that
> does not use the given vpid and "invalidates all mappings tagged with all
> VPIDs except VPID H" (from Intel SDM).

Actually, vpid_sync_context will always try single invalidation first, and
I was concerned that we would end up calling vpid_sync_context(0). But that
will not happen, since nested.vpid02 is always allocated a vpid.
So... we are good :)

>> 
>> >  (2) According to the same document, invvpid of type all-contexts
>> >  invalidation does not require there is an active VMCS, so/and
>> >  get_vmcs12() in the existing code may result in a NULL-pointer
>> >  dereference. In practice, it can crash both KVM itself and L1
>> >  hypervisors that use invvpid (e.g. Xen).
>> 
>> If that is the case, then just check if it's null and return without
>> doing anything.
>
> (according to Intel SDM) invvpid of type all-contexts invalidation
> should not trigger a valid vmx fail if vpid in the current VMCS is 0.

No, I meant just skip the instruction and return, but I doubt there's
much overhead in flushing mappings that don't exist in the first place.
Anyway, it's better to do as the spec says.

> However, this check and its following operation do change this semantics
> in nested VMX, so it should be completely removed.
>
>> 
>> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
>> > ---
>> >  arch/x86/kvm/vmx.c | 5 -
>> >  1 file changed, 5 deletions(-)
>> >
>> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> > index 87acc52..af823a3 100644
>> > --- a/arch/x86/kvm/vmx.c
>> > +++ b/arch/x86/kvm/vmx.c
>> > @@ -7394,11 +7394,6 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
>> >  
>> >switch (type) {
>> >case VMX_VPID_EXTENT_ALL_CONTEXT:
>> > -  if (get_vmcs12(vcpu)->virtual_processor_id == 0) {
>> > -  nested_vmx_failValid(vcpu,
>> > -  VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
>> > -  return 1;
>> > -  }
>> >__vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02);
>> >nested_vmx_succeed(vcpu);
>> >break;
>> 
>> I also noticed a BUG() here in the default case. It might be a good idea
>> to replace it with a WARN.
>
> Or, in nested_vmx_setup_ctls_msrs():
>
> if (enable_vpid)
> - vmx->nested.nested_vmx_vpid_caps = VMX_VPID_INVVPID_BIT |
> - VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT;
> +   vmx->nested.nested_vmx_vpid_caps = VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT;
>
> because the current handle_invvpid() only handles all-contexts invalidation.
>
> Haozhong
>


Re: [PATCH] KVM: nVMX: remove incorrect vpid check in nested invvpid emulation

2015-11-25 Thread Bandan Das
Haozhong Zhang  writes:

> This patch removes the vpid check when emulating nested invvpid
> instruction of type all-contexts invalidation. The existing code is
> incorrect because:
>  (1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate
>  Translations Based on VPID", invvpid instruction does not check
>  vpid in the invvpid descriptor when its type is all-contexts
>  invalidation.

But IIRC isn't vpid=0 reserved for root mode? I think we don't want the
L1 hypervisor to be able to do an invvpid(0).

>  (2) According to the same document, invvpid of type all-contexts
>  invalidation does not require there is an active VMCS, so/and
>  get_vmcs12() in the existing code may result in a NULL-pointer
>  dereference. In practice, it can crash both KVM itself and L1
>  hypervisors that use invvpid (e.g. Xen).

If that is the case, then just check if it's null and return without
doing anything.

> Signed-off-by: Haozhong Zhang 
> ---
>  arch/x86/kvm/vmx.c | 5 -
>  1 file changed, 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 87acc52..af823a3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7394,11 +7394,6 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
>  
>   switch (type) {
>   case VMX_VPID_EXTENT_ALL_CONTEXT:
> - if (get_vmcs12(vcpu)->virtual_processor_id == 0) {
> - nested_vmx_failValid(vcpu,
> - VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> - return 1;
> - }
>   __vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02);
>   nested_vmx_succeed(vcpu);
>   break;

I also noticed a BUG() here in the default case. It might be a good idea to
replace it with a WARN.


Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS

2015-10-07 Thread Bandan Das
Joerg Roedel <j...@8bytes.org> writes:

> On Tue, Oct 06, 2015 at 01:59:27PM -0400, Bandan Das wrote:
>> Joerg Roedel <j...@8bytes.org> writes:
>> >
>> > So svm->vmcb->control.next_rip is only written by hardware or in
>> > svm_check_intercept(). Both cases write only to this field, if the
>> > hardware supports X86_FEATURE_NRIPS. The write in nested_svm_vmexit only
>> 
>> Not until commit f104765b4f81fd74d69e0eb161e89096deade2db. So, an older L1
>> kernel will trigger it.
>
> But we don't care if L1 writes something into its own next_rip, as we
> never read this value from its VMCB. We only copy the next_rip value we
> get from our shadow-vmcb to it on an emulated vmexit. So I still don't
> understand what triggers the reported problem or why the WARN_ON is
> necessary.

Ok, looks like I am making some incorrect "vmx" assumptions here. When we exit
from L2 to L0, aren't we looking at the VMCB that L1 is using to run L2?
Wouldn't that trigger the warning if the host processor does not support
nrips and the field is set?


>
>   Joerg


Re: [PATCH] kvm: svm: Only propagate next_rip when guest supports it

2015-10-07 Thread Bandan Das
Joerg Roedel  writes:

> On Wed, Oct 07, 2015 at 01:03:35PM +0200, Joerg Roedel wrote:
>> But we don't care if L1 writes something into its own next_rip, as we
>> never read this value from its VMCB. We only copy the next_rip value we
>> get from our shadow-vmcb to it on an emulated vmexit. So I still don't
>> understand what triggers the reported problem or why the WARN_ON is
>> necessary.
>
> Okay, I think I have an idea now. I talked a bit with Dirk and the
> WARN_ON triggers in the guest, and not on the host. This makes a lot
> more sense.
>
> In nested-svm we always copy the next_rip from the shadow-vmcb to the
> guests vmcb, even when the nrips bit in cpuid is not set for the guest.
> This obviously triggers the WARN_ON() in the L1 KVM (I still don't
> understand why the WARN_ON was introduced in the first place).

Ok, understood now. The WARN_ON would trigger in L1 only if it has
decided to disable nrips for some reason, as was the case here. So,
my reasoning behind putting the warning there was incorrect.

> So the right fix is to only copy next_rip to the guests vmcb when its
> cpuid indicates that next_rip is supported there, like in this patch:

Yep, agreed.

> From 019afc60507618b8e44e0c67d5ea2d850d88c9dd Mon Sep 17 00:00:00 2001
> From: Joerg Roedel 
> Date: Wed, 7 Oct 2015 13:38:19 +0200
> Subject: [PATCH] kvm: svm: Only propagate next_rip when guest supports it
>
> Currently we always write the next_rip of the shadow vmcb to
> the guests vmcb when we emulate a vmexit. This could confuse
> the guest when its cpuid indicated no support for the
> next_rip feature.
>
> Fix this by only propagating next_rip if the guest actually
> supports it.
>
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/kvm/cpuid.h | 21 +
>  arch/x86/kvm/svm.c   |  7 ++-
>  2 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index dd05b9c..effca1f 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -133,4 +133,25 @@ static inline bool guest_cpuid_has_mpx(struct kvm_vcpu 
> *vcpu)
>   best = kvm_find_cpuid_entry(vcpu, 7, 0);
>   return best && (best->ebx & bit(X86_FEATURE_MPX));
>  }
> +
> +/*
> + * NRIPS is provided through cpuid fn 0x8000_000a.edx bit 3
> + */
> +#define BIT_NRIPS 3
> +
> +static inline bool guest_cpuid_has_nrips(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_cpuid_entry2 *best;
> +
> + best = kvm_find_cpuid_entry(vcpu, 0x8000000a, 0);
> +
> + /*
> +  * NRIPS is a scattered cpuid feature, so we can't use
> +  * X86_FEATURE_NRIPS here (X86_FEATURE_NRIPS would be bit
> +  * position 8, not 3).
> +  */
> + return best && (best->edx & bit(BIT_NRIPS));
> +}
> +#undef BIT_NRIPS
> +
>  #endif
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 94b7d15..e1a8824 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2459,7 +2459,9 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
>   nested_vmcb->control.exit_info_2   = vmcb->control.exit_info_2;
>   nested_vmcb->control.exit_int_info = vmcb->control.exit_int_info;
>   nested_vmcb->control.exit_int_info_err = 
> vmcb->control.exit_int_info_err;
> - nested_vmcb->control.next_rip  = vmcb->control.next_rip;
> +
> + if (guest_cpuid_has_nrips(vcpu))
> + nested_vmcb->control.next_rip  = vmcb->control.next_rip;
>  
>   /*
>* If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have
> @@ -2714,6 +2716,9 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
>   svm->vmcb->control.event_inj = nested_vmcb->control.event_inj;
>   svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err;
>  
> + /* Clear next_rip, as real hardware would do */
> + nested_vmcb->control.next_rip = 0;
> +

Why do we need this? And are you sure this is what real hardware does?
I couldn't find anything in the spec.

>   nested_svm_unmap(page);
>  
>   /* Enter Guest-Mode */


Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS

2015-10-06 Thread Bandan Das
Joerg Roedel <j...@8bytes.org> writes:

> On Thu, Oct 01, 2015 at 06:31:27PM -0400, Bandan Das wrote:
>> >> @@ -514,7 +514,7 @@ static void skip_emulated_instruction(struct kvm_vcpu 
>> >> *vcpu)
>> >>   struct vcpu_svm *svm = to_svm(vcpu);
>> >>  
>> >>   if (svm->vmcb->control.next_rip != 0) {
>> >> - WARN_ON(!static_cpu_has(X86_FEATURE_NRIPS));
>> >> + WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
>> >>   svm->next_rip = svm->vmcb->control.next_rip;
>> >>   }
>
> I looked again how this could possibly be triggered, and I am somewhat
> confused now.
>
> So svm->vmcb->control.next_rip is only written by hardware or in
> svm_check_intercept(). Both cases write only to this field, if the
> hardware supports X86_FEATURE_NRIPS. The write in nested_svm_vmexit only

Not until commit f104765b4f81fd74d69e0eb161e89096deade2db. So, an older L1
kernel will trigger it.

> targets the guests VMCB, and we don't use that one again.
>
> So I can't see how the WARN_ON above could be triggered. Do I miss
> something or might this also be a miscompilation of static_cpu_has?
>
>
>   Joerg


Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS

2015-10-06 Thread Bandan Das
Joerg Roedel <j...@8bytes.org> writes:

> On Mon, Oct 05, 2015 at 01:42:43PM -0400, Bandan Das wrote:
>> Joerg Roedel <j...@8bytes.org> writes:
>> 
>> > On Mon, Oct 05, 2015 at 12:54:43PM -0400, Bandan Das wrote:
>> >> Joerg Roedel <j...@8bytes.org> writes:
>> >> The problems is that the next_rip field could be stale. If the processor 
>> >> supports
>> >> next_rip, then it will clear it out on the next entry. If it doesn't,
>> >> an old value just sits there (no matter who wrote it) and the problem
>> >> happens when skip_emulated_instruction advances the rip with an incorrect
>> >> value.
>> >
>> > So the right fix would be to just set the guests next_rip to 0 when we
>> > emulate vmrun, just like real hardware does, no?
>> 
>> Agreed, resetting to 0 if nrips isn't supported seems right. It would still
>> help having a printk_once in this case IMO :)
>
> I meant to reset it always to 0 on vmrun, like real hardware does.

At least the spec doesn't mention this; I don't know how I got that idea :)
The spec just says that it gets written to by hardware on certain intercepts,
and for the others it gets reset to 0 on #VMEXIT.

>
>
>   Joerg


Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS

2015-10-05 Thread Bandan Das
Joerg Roedel <j...@8bytes.org> writes:

> On Mon, Oct 05, 2015 at 12:54:43PM -0400, Bandan Das wrote:
>> Joerg Roedel <j...@8bytes.org> writes:
>> The problems is that the next_rip field could be stale. If the processor 
>> supports
>> next_rip, then it will clear it out on the next entry. If it doesn't,
>> an old value just sits there (no matter who wrote it) and the problem
>> happens when skip_emulated_instruction advances the rip with an incorrect
>> value.
>
> So the right fix would be to just set the guests next_rip to 0 when we
> emulate vmrun, just like real hardware does, no?

Agreed, resetting to 0 if nrips isn't supported seems right. It would still
help having a printk_once in this case IMO :)

>
>   Joerg
>


Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS

2015-10-05 Thread Bandan Das
Dirk Müller  writes:

>> So the right fix would be to just set the guests next_rip to 0 when we
>> emulate vmrun, just like real hardware does, no?
>
> Like this? (Note: I’m not sure what I’m doing here...). I agree with you
> that the warning for this seems excessive, I’ve just removed it.
>
>
> From 47db81837b6fe58aa302317bbf2a092b40af0a36 Mon Sep 17 00:00:00 2001
> From: Dirk Mueller 
> Date: Fri, 2 Oct 2015 08:35:24 +0200
> Subject: [PATCH] KVM: nSVM: Check for NRIP support before passing on
>  control.next_rip
>
> NRIP support itself depends on cpuid Fn8000_000A_EDX[NRIPS], so
> ignore it if the cpu feature is not available. Remove the
> guest-triggerable WARN_ON for this as it just emulates real
> hardware behavior.
>
> Signed-off-by: Dirk Mueller 
> ---
>  arch/x86/kvm/svm.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 0a42859..66e3a4c 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -514,8 +514,10 @@ static void skip_emulated_instruction(struct kvm_vcpu 
> *vcpu)
>   struct vcpu_svm *svm = to_svm(vcpu);
>  
>   if (svm->vmcb->control.next_rip != 0) {
> - WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
> - svm->next_rip = svm->vmcb->control.next_rip;
> + if (static_cpu_has(X86_FEATURE_NRIPS))
> + svm->next_rip = svm->vmcb->control.next_rip;
> + else
> + svm->next_rip = 0;

skip_emulated_instruction probably isn't the right place.

>   }

So, L1 writes this field in svm_check_intercept.

I went back to the spec and it says (15.7.1):
The next sequential instruction pointer (nRIP) is saved in the guest VMCB
control area at location C8h on all #VMEXITs that are due to instruction
intercepts, as defined in Section 15.9 on page 458, as well as MSR and
IOIO intercepts and exceptions caused by the INT3, INTO, and BOUND
instructions. For all other intercepts, nRIP is reset to zero.

So, I think a better way would be to reset the field on vmexit from
L2 to L0 if NRIP isn't supported. Maybe it's enough to do this only for
the intercepts mentioned above but I am not sure.
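
Roughly something like this in nested_svm_vmexit() (untested sketch, assuming
a guest_cpuid_has_nrips() helper like the one in Joerg's follow-up patch):

	/* Sketch: only propagate next_rip to L1's VMCB when L1's CPUID
	 * advertises NRIPS; otherwise reset it to zero, per the APM text
	 * quoted above. */
	if (guest_cpuid_has_nrips(&svm->vcpu))
		nested_vmcb->control.next_rip = vmcb->control.next_rip;
	else
		nested_vmcb->control.next_rip = 0;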

Also, please include a debug message (pr_debug_once will do). It seems
unnecessary today, but in 2 years, when someone is scratching their head over
why their nested guests won't run, they will thank you!

BTW, it seems possible to unconditionally advertise SVM_FEATURE_NRIP..

>   if (!svm->next_rip) {


Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS

2015-10-05 Thread Bandan Das
Joerg Roedel <j...@8bytes.org> writes:

> On Thu, Oct 01, 2015 at 06:31:27PM -0400, Bandan Das wrote:
>> Paolo Bonzini <pbonz...@redhat.com> writes:
>> 
>> > On 01/10/2015 13:43, Dirk Müller wrote:
>> >> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> >> index 94b7d15..0a42859 100644
>> >> --- a/arch/x86/kvm/svm.c
>> >> +++ b/arch/x86/kvm/svm.c
>> >> @@ -514,7 +514,7 @@ static void skip_emulated_instruction(struct kvm_vcpu 
>> >> *vcpu)
>> >>   struct vcpu_svm *svm = to_svm(vcpu);
>> >>  
>> >>   if (svm->vmcb->control.next_rip != 0) {
>> >> - WARN_ON(!static_cpu_has(X86_FEATURE_NRIPS));
>> >> + WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
>> >>   svm->next_rip = svm->vmcb->control.next_rip;
>> >>   }
>> >>  
>> >
>> > Bandan, what was the reason for warning here?
>> 
>> I added the warning so that we catch if the next_rip field is being written
>> to (even if the feature isn't supported) by a buggy L1 hypervisor.
>
> Even if the L1 hypervisor writes to the next_rip field in the VMCB, we
> would never see it in this code path, as we access the shadow VMCB in
> this statement.
>
> We don't even care if the L1 hypervisor writes to its next_rip field
> because we only write to this field on an emulatated VMEXIT and never
> read it back.

The problem is that the next_rip field could be stale. If the processor
supports next_rip, then it will clear it out on the next entry. If it doesn't,
an old value just sits there (no matter who wrote it), and the problem
happens when skip_emulated_instruction advances the rip with an incorrect
value.
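
To spell out the failure mode (illustrative only, not a fix):

	/* On !NRIPS hardware nothing ever clears control.next_rip, so a
	 * stale value written earlier gets picked up here... */
	if (svm->vmcb->control.next_rip != 0)
		svm->next_rip = svm->vmcb->control.next_rip;
	...
	/* ...and the guest resumes at the wrong place. */
	kvm_rip_write(&svm->vcpu, svm->next_rip);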

> So what's the point in adding a guest-triggerable warning at all?

So, yes, maybe this doesn't have to be a guest-specific warning, but we still
need to warn if this unsupported field is being written to.

>
>
>   Joerg


Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS

2015-10-04 Thread Bandan Das
Dirk Müller  writes:

>> I added the warning so that we catch if the next_rip field is being written
>> to (even if the feature isn't supported) by a buggy L1 hypervisor.
>
> Interesting, so how about this patch?
>
>
> From c5c8ea255d680f972cbdfc835cdf352fa78897ae Mon Sep 17 00:00:00 2001
> From: Dirk Mueller 
> Date: Fri, 2 Oct 2015 08:35:24 +0200
> Subject: [PATCH] KVM: nSVM: Check for NRIP support before accepting
>  control.next_rip
>
> NRIP support itself depends on cpuid Fn8000_000A_EDX[NRIPS], remove
> a WARN_ON_(once) and check for it directly.
>
> Signed-off-by: Dirk Mueller 
> ---
>  arch/x86/kvm/svm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 0a42859..33d36da 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -513,8 +513,8 @@ static void skip_emulated_instruction(struct kvm_vcpu 
> *vcpu)
>  {
>   struct vcpu_svm *svm = to_svm(vcpu);
>  
> - if (svm->vmcb->control.next_rip != 0) {
> - WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
> + if (static_cpu_has(X86_FEATURE_NRIPS) &&
> + svm->vmcb->control.next_rip != 0) {
>   svm->next_rip = svm->vmcb->control.next_rip;
>   }

Ok, looks good to me. Still, it's probably a good idea to let the user know
if this condition is hit.
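
Something like this would do (placement illustrative):

	/* One-time note so the user knows next_rip was set on hardware
	 * without NRIPS; where exactly to put it is illustrative. */
	if (svm->vmcb->control.next_rip != 0 &&
	    !static_cpu_has(X86_FEATURE_NRIPS))
		pr_warn_once("kvm: next_rip set but NRIPS is not supported\n");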

Bandan



Re: [PATCH] Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS

2015-10-01 Thread Bandan Das
Paolo Bonzini  writes:

> On 01/10/2015 13:43, Dirk Müller wrote:
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 94b7d15..0a42859 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -514,7 +514,7 @@ static void skip_emulated_instruction(struct kvm_vcpu 
>> *vcpu)
>>  struct vcpu_svm *svm = to_svm(vcpu);
>>  
>>  if (svm->vmcb->control.next_rip != 0) {
>> -WARN_ON(!static_cpu_has(X86_FEATURE_NRIPS));
>> +WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
>>  svm->next_rip = svm->vmcb->control.next_rip;
>>  }
>>  
>
> Bandan, what was the reason for warning here?

I added the warning so that we catch if the next_rip field is being written
to (even if the feature isn't supported) by a buggy L1 hypervisor.

From the commit:

 
-   if (svm->vmcb->control.next_rip != 0)
+   if (svm->vmcb->control.next_rip != 0) {
+   WARN_ON(!static_cpu_has(X86_FEATURE_NRIPS));
svm->next_rip = svm->vmcb->control.next_rip;
+   }
 
if (!svm->next_rip) {
if (emulate_instruction(vcpu, EMULTYPE_SKIP) !=
@@ -4317,7 +4319,9 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
break;
}
 
-   vmcb->control.next_rip  = info->next_rip;
+   /* TODO: Advertise NRIPS to guest hypervisor unconditionally */
+   if (static_cpu_has(X86_FEATURE_NRIPS))
+   vmcb->control.next_rip  = info->next_rip;
vmcb->control.exit_code = icpt_info.exit_code;
vmexit = nested_svm_exit_handled(svm);
...

> Should we change the "if" condition to static_cpu_has(X86_FEATURE_NRIPS)
> instead of Dirk's patch?

Yes, seems ok to me. If decode assist isn't supported, then it's
mostly a stale value. It's interesting that L1 still works even
after we hit this warning!

> Paolo


Re: [PATCH] KVM: nVMX: emulate the INVVPID instruction

2015-09-25 Thread Bandan Das
Paolo Bonzini <pbonz...@redhat.com> writes:

> On 24/09/2015 17:45, Bandan Das wrote:
>> > However, I have applied the patch to kvm/queue.  Please send the changes
>> > separately, and I will squash them in the existing VPID patch.
>> 
>> Please don't do this. It's making it really difficult to review these
>> patches individually :( Why not let them get some review time before
>> applying them all together ?
>
> Ok---I did it because it makes sense to keep this patch separate from
> the others.  You can expose VPID even if vpid02 == vpid01 (in fact
> that's what happens if KVM cannot find a vpid02) and in that case this
> patch provides a valid implementation of INVVPID.
>
> Do you think it would help if I posted the whole kvm/queue contents a
> few days before pushing it to kvm/next?

Oh that would be great. Thank you!

> Paolo


Re: [PATCH] KVM: nVMX: emulate the INVVPID instruction

2015-09-24 Thread Bandan Das
Paolo Bonzini  writes:
...
>> @@ -7189,7 +7189,28 @@ static int handle_invept(struct kvm_vcpu *vcpu)
>>  
>>  static int handle_invvpid(struct kvm_vcpu *vcpu)
>>  {
>> -kvm_queue_exception(vcpu, UD_VECTOR);
>> +u32 vmx_instruction_info;
>> +unsigned long type;
>> +
>> +if (!nested_vmx_check_permission(vcpu))
>> +return 1;
>> +
>> +vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>> +type = kvm_register_readl(vcpu, (vmx_instruction_info >> 28) & 0xf);
>> +
>> +switch (type) {
>> +case VMX_VPID_EXTENT_INDIVIDUAL_ADDR:
>> +case VMX_VPID_EXTENT_SINGLE_CONTEXT:
>> +case VMX_VPID_EXTENT_ALL_CONTEXT:
>> +vmx_flush_tlb(vcpu);
>> +nested_vmx_succeed(vcpu);
>> +break;
>> +default:
>> +nested_vmx_failInvalid(vcpu);
>> +break;
>> +}
>> +
>> +skip_emulated_instruction(vcpu);
>>  return 1;
>>  }
>>  
>> 
>
> This is not enough.  You need to add a VPID argument to
> vpid_sync_vcpu_single, and inline vmx_flush_tlb in handle_invvpid so
> that it can use the new VPID argument of vpid_sync_vcpu_single.
>
> Note that the "all context" variant can be mapped to
> vpid_sync_vcpu_single with vpid02 as the argument (a nice side effect of
> your vpid02 design).
>
> However, I have applied the patch to kvm/queue.  Please send the changes
> separately, and I will squash them in the existing VPID patch.

Please don't do this. It's making it really difficult to review these
patches individually :( Why not let them get some review time before
applying them all together?


> Paolo


Re: [PATCH 1/2] KVM: introduce __vmx_flush_tlb to handle specific vpid

2015-09-24 Thread Bandan Das
Wanpeng Li  writes:

> Introduce __vmx_flush_tlb() to handle specific vpid.
>
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 21 +
>  1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 794c529..7188c5e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1343,13 +1343,13 @@ static void loaded_vmcs_clear(struct loaded_vmcs 
> *loaded_vmcs)
>__loaded_vmcs_clear, loaded_vmcs, 1);
>  }
>  
> -static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
> +static inline void vpid_sync_vcpu_single(int vpid)
>  {
> - if (vmx->vpid == 0)
> + if (vpid == 0)
>   return;
>  
>   if (cpu_has_vmx_invvpid_single())
> - __invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT, vmx->vpid, 0);
> + __invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT, vpid, 0);
>  }
>  
>  static inline void vpid_sync_vcpu_global(void)
> @@ -1358,10 +1358,10 @@ static inline void vpid_sync_vcpu_global(void)
>   __invvpid(VMX_VPID_EXTENT_ALL_CONTEXT, 0, 0);
>  }
>  
> -static inline void vpid_sync_context(struct vcpu_vmx *vmx)
> +static inline void vpid_sync_context(int vpid)
>  {
>   if (cpu_has_vmx_invvpid_single())
> - vpid_sync_vcpu_single(vmx);
> + vpid_sync_vcpu_single(vpid);
>   else
>   vpid_sync_vcpu_global();
>  }

I'm not sure myself what the right thing to do is, but this may be undesirable
in a nested environment. Assuming the processor supports global invalidation
only, this seems like an easy way for the nested guest to invalidate *all*
mappings - even the L1-specific mappings.


> @@ -3450,9 +3450,9 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
>  
>  #endif
>  
> -static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
> +static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid)
>  {
> - vpid_sync_context(to_vmx(vcpu));
> + vpid_sync_context(vpid);
>   if (enable_ept) {
>   if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
>   return;
> @@ -3460,6 +3460,11 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
>   }
>  }
>  
> +static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
> +{
> + __vmx_flush_tlb(vcpu, to_vmx(vcpu)->vpid);
> +}
> +
>  static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu)
>  {
>   ulong cr0_guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;
> @@ -4795,7 +4800,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool 
> init_event)
>   vmx_fpu_activate(vcpu);
>   update_exception_bitmap(vcpu);
>  
> - vpid_sync_context(vmx);
> + vpid_sync_context(vmx->vpid);
>  }
>  
>  /*


Re: [PATCH] KVM: nVMX: nested VPID emulation

2015-09-14 Thread Bandan Das
Wanpeng Li  writes:

> VPID is used to tag address space and avoid a TLB flush. Currently L0 use 
> the same VPID to run L1 and all its guests. KVM flushes VPID when switching 
> between L1 and L2. 
>
> This patch advertises VPID to the L1 hypervisor, then address space of L1 and 
> L2 can be separately treated and avoid TLB flush when swithing between L1 and 
> L2. This patch gets ~3x performance improvement for lmbench 8p/64k ctxsw.

TLB flush does context invalidation and while that should result in
some improvement, I never expected a 3x improvement for any workload!
Interesting :)

> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 39 ---
>  1 file changed, 32 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index da1590e..06bc31e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1157,6 +1157,11 @@ static inline bool 
> nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
>  }
>  
> +static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12)
> +{
> + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID);
> +}
> +
>  static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12)
>  {
>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT);
> @@ -2471,6 +2476,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
> *vmx)
>   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
>   SECONDARY_EXEC_RDTSCP |
>   SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
> + SECONDARY_EXEC_ENABLE_VPID |
>   SECONDARY_EXEC_APIC_REGISTER_VIRT |
>   SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
>   SECONDARY_EXEC_WBINVD_EXITING |
> @@ -4160,7 +4166,7 @@ static void allocate_vpid(struct vcpu_vmx *vmx)
>   int vpid;
>  
>   vmx->vpid = 0;
> - if (!enable_vpid)
> + if (!enable_vpid || is_guest_mode(&vmx->vcpu))
>   return;
> spin_lock(&vmx_vpid_lock);
>   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
> @@ -6738,6 +6744,14 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
>   }
>   vmcs12 = kmap(page);
>   vmcs12->launch_state = 0;
> + if (enable_vpid) {
> + if (nested_cpu_has_vpid(vmcs12)) {
> + spin_lock(&vmx_vpid_lock);
> + if (vmcs12->virtual_processor_id != 0)
> + __clear_bit(vmcs12->virtual_processor_id, 
> vmx_vpid_bitmap);
> + spin_unlock(&vmx_vpid_lock);
> + }
> + }
>   kunmap(page);
>   nested_release_page(page);

I don't think this is enough; we should also check for set "nested" bits
in free_vpid() and clear them. There should be some sort of mapping between
the nested guest vpid and the actual vpid so that we can just clear those bits.
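
Roughly the kind of mapping I have in mind (purely illustrative, the names
are made up):

/* Illustrative only: remember which host vpid backs the vpid L1 wrote
 * into vmcs12, so both can be cleared together instead of leaking bits
 * in vmx_vpid_bitmap. */
struct nested_vpid_map {
	u16 guest_vpid;		/* what L1 put in vmcs12 */
	u16 host_vpid;		/* what we actually program into vmcs02 */
};

static void nested_free_vpid(struct nested_vpid_map *map)
{
	if (map->host_vpid) {
		spin_lock(&vmx_vpid_lock);
		__clear_bit(map->host_vpid, vmx_vpid_bitmap);
		spin_unlock(&vmx_vpid_lock);
		map->host_vpid = 0;
	}
}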

> @@ -9189,6 +9203,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12)
>  {
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>   u32 exec_control;
> + int vpid;
>  
>   vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
>   vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
> @@ -9438,13 +9453,21 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12)
>   else
>   vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>  
> +

Empty space here.

>   if (enable_vpid) {
> - /*
> -  * Trivially support vpid by letting L2s share their parent
> -  * L1's vpid. TODO: move to a more elaborate solution, giving
> -  * each L2 its own vpid and exposing the vpid feature to L1.
> -  */
> - vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
> + if (nested_cpu_has_vpid(vmcs12)) {
> + if (vmcs12->virtual_processor_id == 0) {

Ok, so if we advertise vpid to the nested hypervisor, isn't it going to
attempt writing this field when setting up? At least
that's what Linux does, no?

> + spin_lock(&vmx_vpid_lock);
> + vpid = find_first_zero_bit(vmx_vpid_bitmap, 
> VMX_NR_VPIDS);
> + if (vpid < VMX_NR_VPIDS)
> + __set_bit(vpid, vmx_vpid_bitmap);
> + spin_unlock(&vmx_vpid_lock);
> + vmcs_write16(VIRTUAL_PROCESSOR_ID, vpid);
> + } else
> + vmcs_write16(VIRTUAL_PROCESSOR_ID, 
> vmcs12->virtual_processor_id);
> + } else
> + vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
> +

I guess L1 shouldn't know what vpid L0 chose to run L2. If L1 vmreads,
it should get what it expects for the value of vpid, not the one L0 chose.

>   vmx_flush_tlb(vcpu);
> 

Re: [RFC PATCH 4/5] KVM: x86: enable unhandled MSR exits for vmx

2015-08-24 Thread Bandan Das
Peter Hornyack <peterhorny...@google.com> writes:

 Set the vm's unhandled_msr_exits flag when user space calls the
 KVM_ENABLE_CAP ioctl with KVM_CAP_UNHANDLED_MSR_EXITS. After kvm fails
 to handle a guest rdmsr or wrmsr, check this flag and exit to user space
 with KVM_EXIT_MSR rather than immediately injecting a GP fault.

 On reentry into kvm, use the complete_userspace_io callback path to call
 vmx_complete_userspace_msr. Complete the MSR access if user space was
 able to handle it successfully, or fail the MSR access and inject a GP
 fault if user space could not handle the access.

...
 +static int vmx_complete_userspace_msr(struct kvm_vcpu *vcpu)
 +{
 + struct msr_data msr;
 +
 + if (vcpu->run->msr.index != vcpu->arch.regs[VCPU_REGS_RCX]) {
 + pr_debug("msr.index 0x%x changed, does not match ecx 0x%lx\n",
 +  vcpu->run->msr.index,
 +  vcpu->arch.regs[VCPU_REGS_RCX]);
 + goto err_out;
 + }
 + msr.index = vcpu->run->msr.index;
 + msr.data = vcpu->run->msr.data;
 + msr.host_initiated = false;
 +
 + switch (vcpu->run->msr.direction) {
 + case KVM_EXIT_MSR_RDMSR:
 + if (vcpu->run->msr.handled == KVM_EXIT_MSR_HANDLED)
 + complete_rdmsr(vcpu, &msr);
 + else
 + fail_rdmsr(vcpu, &msr);
 + break;

Should we check for MSR_UNHANDLED? It could be a debugging relief if userspace
is filling it up with random crap! I think it should be ok if the trace
function catches that too.

 + case KVM_EXIT_MSR_WRMSR:
 + if (vcpu->run->msr.handled == KVM_EXIT_MSR_HANDLED)
 + complete_wrmsr(vcpu, &msr);
 + else
 + fail_wrmsr(vcpu, &msr);
 + break;
 + default:
 + pr_debug("bad msr.direction %u\n", vcpu->run->msr.direction);
 + goto err_out;
 + }
 +
 + return 1;
 +err_out:
 + vcpu->run->exit_reason = KVM_EXIT_MSR;
 + vcpu->run->msr.direction = KVM_EXIT_MSR_COMPLETION_FAILED;
 + return 0;
 +}
 +
 +/*
 + * Returns 1 if the rdmsr handling is complete; returns 0 if kvm should exit 
 to
 + * user space to handle this rdmsr.
 + */
  static int handle_rdmsr(struct kvm_vcpu *vcpu)
  {
   u32 ecx = vcpu->arch.regs[VCPU_REGS_RCX];
 @@ -5499,6 +5547,16 @@ static int handle_rdmsr(struct kvm_vcpu *vcpu)
   msr_info.index = ecx;
   msr_info.host_initiated = false;
   if (vmx_get_msr(vcpu, &msr_info)) {
 + if (vcpu->kvm->arch.unhandled_msr_exits) {
 + vcpu->run->exit_reason = KVM_EXIT_MSR;
 + vcpu->run->msr.direction = KVM_EXIT_MSR_RDMSR;
 + vcpu->run->msr.index = msr_info.index;
 + vcpu->run->msr.data = 0;
 + vcpu->run->msr.handled = 0;
 + vcpu->arch.complete_userspace_io =
 + vmx_complete_userspace_msr;
 + return 0;
 + }
   fail_rdmsr(vcpu, &msr_info);
   return 1;
   }
 @@ -5508,6 +5566,10 @@ static int handle_rdmsr(struct kvm_vcpu *vcpu)
   return 1;
  }
  
 +/*
 + * Returns 1 if the wrmsr handling is complete; returns 0 if kvm should exit 
 to
 + * user space to handle this wrmsr.
 + */
  static int handle_wrmsr(struct kvm_vcpu *vcpu)
  {
   struct msr_data msr;
 @@ -5519,6 +5581,16 @@ static int handle_wrmsr(struct kvm_vcpu *vcpu)
   msr.index = ecx;
   msr.host_initiated = false;
   if (kvm_set_msr(vcpu, &msr) != 0) {
 + if (vcpu->kvm->arch.unhandled_msr_exits) {

I think kvm should ultimately decide which msrs it can let userspace
handle even if userspace has explicitly enabled the ioctl.
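
Something like a kvm-side allowlist consulted before deferring, I mean
(hypothetical sketch, the helper name is made up):

/* Hypothetical: even with the capability enabled, kvm decides which MSRs
 * it is willing to punt to userspace. */
static bool kvm_msr_ok_for_userspace(u32 index)
{
	switch (index) {
	case 0xce:		/* MSR_PLATFORM_INFO, example only */
		return true;
	default:
		return false;
	}
}

and then gate the unhandled_msr_exits path in handle_rdmsr/handle_wrmsr on it.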

Bandan

 + vcpu->run->exit_reason = KVM_EXIT_MSR;
 + vcpu->run->msr.direction = KVM_EXIT_MSR_WRMSR;
 + vcpu->run->msr.index = msr.index;
 + vcpu->run->msr.data = msr.data;
 + vcpu->run->msr.handled = 0;
 + vcpu->arch.complete_userspace_io =
 + vmx_complete_userspace_msr;
 + return 0;
 + }
   fail_wrmsr(vcpu, &msr);
   return 1;
   }
 @@ -8163,7 +8235,7 @@ static bool vmx_xsaves_supported(void)
  
  static bool vmx_msr_exits_supported(void)
  {
 - return false;
 + return true;
  }
  
  static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx)
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 4bbc2a1676c9..5c22f4655741 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2461,6 +2461,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
 ext)
   case KVM_CAP_ENABLE_CAP_VM:
   case KVM_CAP_DISABLE_QUIRKS:
   case KVM_CAP_SET_BOOT_CPU_ID:
 + case KVM_CAP_UNHANDLED_MSR_EXITS:
  #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
   case KVM_CAP_ASSIGN_DEV_IRQ:
   case KVM_CAP_PCI_2_3:
 @@ -3568,6 

Re: [RFC PATCH 3/5] KVM: x86: add msr_exits_supported to kvm_x86_ops

2015-08-24 Thread Bandan Das
Peter Hornyack <peterhorny...@google.com> writes:

 msr_exits_supported will be checked when user space attempts to enable
 the KVM_CAP_UNHANDLED_MSR_EXITS capability for the vm. This is needed
 because MSR exit support will be implemented for vmx but not svm later
 in this patchset.

Is svm future work? :) Are there any such svm msrs?


 Signed-off-by: Peter Hornyack peterhorny...@google.com
 ---
  arch/x86/include/asm/kvm_host.h | 1 +
  arch/x86/kvm/svm.c  | 6 ++
  arch/x86/kvm/vmx.c  | 6 ++
  3 files changed, 13 insertions(+)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index c12e845f59e6..a6e145b1e271 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -854,6 +854,7 @@ struct kvm_x86_ops {
   void (*handle_external_intr)(struct kvm_vcpu *vcpu);
   bool (*mpx_supported)(void);
   bool (*xsaves_supported)(void);
 + bool (*msr_exits_supported)(void);
  
   int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
  
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index 74d825716f4f..bcbb56f49b9f 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -4249,6 +4249,11 @@ static bool svm_xsaves_supported(void)
   return false;
  }
  
 +static bool svm_msr_exits_supported(void)
 +{
 + return false;
 +}
 +
  static bool svm_has_wbinvd_exit(void)
  {
   return true;
 @@ -4540,6 +4545,7 @@ static struct kvm_x86_ops svm_x86_ops = {
   .invpcid_supported = svm_invpcid_supported,
   .mpx_supported = svm_mpx_supported,
   .xsaves_supported = svm_xsaves_supported,
 + .msr_exits_supported = svm_msr_exits_supported,
  
   .set_supported_cpuid = svm_set_supported_cpuid,
  
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index acc38e27d221..27fec385d79d 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -8161,6 +8161,11 @@ static bool vmx_xsaves_supported(void)
   SECONDARY_EXEC_XSAVES;
  }
  
 +static bool vmx_msr_exits_supported(void)
 +{
 + return false;
 +}
 +
  static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx)
  {
   u32 exit_intr_info;
 @@ -10413,6 +10418,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
   .handle_external_intr = vmx_handle_external_intr,
   .mpx_supported = vmx_mpx_supported,
   .xsaves_supported = vmx_xsaves_supported,
 + .msr_exits_supported = vmx_msr_exits_supported,
  
   .check_nested_events = vmx_check_nested_events,


Re: [RFC PATCH 0/5] KVM: x86: exit to user space on unhandled MSR accesses

2015-08-24 Thread Bandan Das
Peter Hornyack <peterhorny...@google.com> writes:

 On Wed, Aug 19, 2015 at 2:43 PM, Bandan Das <b...@redhat.com> wrote:
 Peter Hornyack <peterhorny...@google.com> writes:

 There are numerous MSRs that kvm does not currently handle. On Intel
 platforms we have observed guest VMs accessing some of these MSRs (for
 example, MSR_PLATFORM_INFO) and behaving poorly (to the point of guest OS
 crashes) when they receive a GP fault because the MSR is not emulated. This
 patchset adds a new kvm exit path for unhandled MSR accesses that allows
 user space to emulate additional MSRs without having to implement them in
 kvm.

 ^^ So, I am trying to understand this motivation. A while back when
 a patch was posted to emulate MSR_PLATFORM_INFO, it was rejected.
 Why ? Because it seemed impossible to emulate it correctly (most concerns 
 were
 related to migration iirc). Although I haven't reviewed all patches in this 
 series
 yet, what I understand from the above message is-It's ok to emulate
 MSR_PLATFORM_INFO *incorrectly* as long as we are doing it in the user 
 space.

 I understand the part where it makes sense to move stuff to userspace.
 But if kvm isn't emulating certain msrs yet, either we should add support,
 or they haven't been added because it's not possible to emulate them
 correctly. The logic that it's probably ok to let userspace do the 
 (incorrect)
 emulation is something I don't understand.

 I wasn't aware of the past discussion of MSR_PLATFORM_INFO, I'll
 search for it - thanks for pointing it out.

 MSR_PLATFORM_INFO is just one example though. In addition to the
 benefits I already mentioned for user space MSR emulation (agility,
 reduced attack surface), this patchset offers a benefit even if user
 space declines to emulate the MSR by returning into kvm with
 KVM_EXIT_MSR_UNHANDLED: it makes it easier to monitor in the first
 place, via logging mechanisms in user space instead of in the kernel,
 for when guests are accessing MSRs that kvm doesn't implement.

Agreed, that would be more useful than how it's being handled currently.

 This patchset has already revealed a few other MSRs (IA32_MPERF,
 IA32_APERF) that guests in our test environment are accessing but
 which kvm doesn't implement yet. I haven't yet investigated deeply
 what these MSRs are used for and how they might be emulated (or if
 they can't actually be emulated correctly), but if we do discover an
 unimplemented non-performance-sensitive MSR that we could emulate
 correctly, with this patchset we would quickly just implement it in
 user space.

Ok, makes sense. But it seems like this could be an area of abuse as
well. For example, this could lessen the incentive to add emulation for
new msrs in kvm and instead just emulate them in userspace?

However, as you mentioned, it's probably ok as long as there's a clear
demarcation of which msrs this is done for. This shouldn't be
the default behavior.

On a different note, if there's a related qemu change, can you please
point me to it?

Thanks,
Bandan

 It seems like the next in line
 is to let userspace emulate their own version of unimplemented x86 
 instructions.


 The core of the patchset modifies the vmx handle_rdmsr and handle_wrmsr
 functions to exit to user space on MSR reads/writes that kvm can't handle
 itself. Then, on the return path into kvm we check for outstanding user
 space MSR completions and either complete the MSR access successfully or
 inject a GP fault as kvm would do by default. This new exit path must be
 enabled for the vm via the KVM_CAP_UNHANDLED_MSR_EXITS capability.

 In the future we plan to extend this functionality to allow user space to
 register the MSRs that it would like to handle itself, even if kvm already
 provides an implementation. In the long-term we will move the

 I seriously hope we don't do this!

 Bandan
 implementation of all non-performance-sensitive MSRs to user space,
 reducing the potential attack surface of kvm and allowing us to respond to
 bugs more quickly.

 This patchset has been tested with our non-qemu user space hypervisor on
 vmx platforms; svm support is not implemented.

 Peter Hornyack (5):
   KVM: x86: refactor vmx rdmsr/wrmsr completion into new functions
   KVM: add KVM_EXIT_MSR exit reason and capability.
   KVM: x86: add msr_exits_supported to kvm_x86_ops
   KVM: x86: enable unhandled MSR exits for vmx
   KVM: x86: add trace events for unhandled MSR exits

  Documentation/virtual/kvm/api.txt |  48 +++
  arch/x86/include/asm/kvm_host.h   |   2 +
  arch/x86/kvm/svm.c|   6 ++
  arch/x86/kvm/trace.h  |  28 +
  arch/x86/kvm/vmx.c| 126 
 ++
  arch/x86/kvm/x86.c|  13 
  include/trace/events/kvm.h|   2 +-
  include/uapi/linux/kvm.h  |  14 +
  8 files changed, 227 insertions(+), 12 deletions(-)

Re: [RFC PATCH 0/5] KVM: x86: exit to user space on unhandled MSR accesses

2015-08-19 Thread Bandan Das
Peter Hornyack <peterhorny...@google.com> writes:

 There are numerous MSRs that kvm does not currently handle. On Intel
 platforms we have observed guest VMs accessing some of these MSRs (for
 example, MSR_PLATFORM_INFO) and behaving poorly (to the point of guest OS
 crashes) when they receive a GP fault because the MSR is not emulated. This
 patchset adds a new kvm exit path for unhandled MSR accesses that allows
 user space to emulate additional MSRs without having to implement them in
 kvm.

^^ So, I am trying to understand this motivation. A while back, when
a patch was posted to emulate MSR_PLATFORM_INFO, it was rejected.
Why? Because it seemed impossible to emulate it correctly (most concerns were
related to migration, iirc). Although I haven't reviewed all patches in this
series yet, what I understand from the above message is: it's ok to emulate
MSR_PLATFORM_INFO *incorrectly* as long as we are doing it in user space.

I understand the part where it makes sense to move stuff to userspace.
But if kvm isn't emulating certain msrs yet, either we should add support,
or they haven't been added because it's not possible to emulate them
correctly. The logic that it's probably ok to let userspace do the (incorrect)
emulation is something I don't understand. It seems like the next in line
is to let userspace emulate their own version of unimplemented x86 instructions.


 The core of the patchset modifies the vmx handle_rdmsr and handle_wrmsr
 functions to exit to user space on MSR reads/writes that kvm can't handle
 itself. Then, on the return path into kvm we check for outstanding user
 space MSR completions and either complete the MSR access successfully or
 inject a GP fault as kvm would do by default. This new exit path must be
 enabled for the vm via the KVM_CAP_UNHANDLED_MSR_EXITS capability.

 In the future we plan to extend this functionality to allow user space to
 register the MSRs that it would like to handle itself, even if kvm already
 provides an implementation. In the long-term we will move the

I seriously hope we don't do this!

Bandan
 implementation of all non-performance-sensitive MSRs to user space,
 reducing the potential attack surface of kvm and allowing us to respond to
 bugs more quickly.

 This patchset has been tested with our non-qemu user space hypervisor on
 vmx platforms; svm support is not implemented.

 Peter Hornyack (5):
   KVM: x86: refactor vmx rdmsr/wrmsr completion into new functions
   KVM: add KVM_EXIT_MSR exit reason and capability.
   KVM: x86: add msr_exits_supported to kvm_x86_ops
   KVM: x86: enable unhandled MSR exits for vmx
   KVM: x86: add trace events for unhandled MSR exits

  Documentation/virtual/kvm/api.txt |  48 +++
  arch/x86/include/asm/kvm_host.h   |   2 +
  arch/x86/kvm/svm.c|   6 ++
  arch/x86/kvm/trace.h  |  28 +
  arch/x86/kvm/vmx.c| 126 
 ++
  arch/x86/kvm/x86.c|  13 
  include/trace/events/kvm.h|   2 +-
  include/uapi/linux/kvm.h  |  14 +
  8 files changed, 227 insertions(+), 12 deletions(-)


Re: One off question: Hot vertical scaling of a KVM?

2015-08-19 Thread Bandan Das
ROZUMNY, VICTOR <vr9...@att.com> writes:

 Hello-

 Your site states that one-off questions might be answered via
 email, so here I am.

 I have limited knowledge of KVM, but it is my understanding that in
 order to vertically scale RAM for a KVM the guest needs to be shut
 down, resized, and rebooted, resulting in a 5-10 minute interruption
 of service.
What's vertical scaling? You mean increasing the amount of memory
the guest sees? You could use QEMU memory hotplug, I think, but that
requires that the guest was booted with enough DIMM slots.

Bandan

 Is this true and if so do you know of any efforts to change this
 behavior?

 Thank you!

 v.

 Vic Rozumny | Principal Technical Architect AIC - ATT Integrated
 Cloud | Complex Engineering







Re: Debian 8

2015-08-17 Thread Bandan Das
Mathieu Charrette <mathieu.charre...@phoenixsolutions.ca> writes:

 Hi,

 I'm trying to install KVM on the latest version of Debian (so Debian
 8) and it's not working. It always gives me a dependencies error. But
 when I install it on Debian 7.8, it works no problem. I was wondering
 if you would have any solution for me.

This is a distro question. Try the Debian forums.
http://forums.debian.net/


Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users

2015-08-10 Thread Bandan Das
Michael S. Tsirkin <m...@redhat.com> writes:

 On Mon, Jul 13, 2015 at 12:07:32AM -0400, Bandan Das wrote:
 vhost threads are per-device, but in most cases a single thread
 is enough. This change creates a single thread that is used to
 serve all guests.
 
 However, this complicates cgroups associations. The current policy
 is to attach the per-device thread to all cgroups of the parent process
 that the device is associated with. This is no longer possible if we
 have a single thread. So, we end up moving the thread around to
 cgroups of whichever device that needs servicing. This is a very
 inefficient protocol but seems to be the only way to integrate
 cgroups support.
 
 Signed-off-by: Razya Ladelsky ra...@il.ibm.com
 Signed-off-by: Bandan Das b...@redhat.com

 BTW, how does this interact with virtio net MQ?
 It would seem that MQ gains from more parallelism and
 CPU locality.

Hm.. Good point. As of this version, this design will always have
one worker thread servicing a guest. Now suppose we have 10 virtio
queues for a guest, surely, we could benefit from spawning off another
worker just like we are doing in case of a new guest/device with
the devs_per_worker parameter.

 ---
  drivers/vhost/scsi.c  |  15 +++--
  drivers/vhost/vhost.c | 150 
 --
  drivers/vhost/vhost.h |  19 +--
  3 files changed, 97 insertions(+), 87 deletions(-)
 
 diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
 index ea32b38..6c42936 100644
 --- a/drivers/vhost/scsi.c
 +++ b/drivers/vhost/scsi.c
 @@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct 
 vhost_scsi_cmd *cmd)
  
  llist_add(cmd-tvc_completion_list, vs-vs_completion_list);
  
 -vhost_work_queue(vs-dev, vs-vs_completion_work);
 +vhost_work_queue(vs-dev.worker, vs-vs_completion_work);
  }
  
  static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd)
 @@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
  }
  
  llist_add(evt-list, vs-vs_event_list);
 -vhost_work_queue(vs-dev, vs-vs_event_work);
 +vhost_work_queue(vs-dev.worker, vs-vs_event_work);
  }
  
  static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
 @@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
  /* Flush both the vhost poll and vhost work */
  for (i = 0; i  VHOST_SCSI_MAX_VQ; i++)
  vhost_scsi_flush_vq(vs, i);
 -vhost_work_flush(vs-dev, vs-vs_completion_work);
 -vhost_work_flush(vs-dev, vs-vs_event_work);
 +vhost_work_flush(vs-dev.worker, vs-vs_completion_work);
 +vhost_work_flush(vs-dev.worker, vs-vs_event_work);
  
  /* Wait for all reqs issued before the flush to be finished */
  for (i = 0; i  VHOST_SCSI_MAX_VQ; i++)
 @@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, 
 struct file *f)
  if (!vqs)
  goto err_vqs;
  
 -vhost_work_init(vs-vs_completion_work, vhost_scsi_complete_cmd_work);
 -vhost_work_init(vs-vs_event_work, vhost_scsi_evt_work);
 +vhost_work_init(vs-dev, vs-vs_completion_work,
 +vhost_scsi_complete_cmd_work);
 +
 +vhost_work_init(vs-dev, vs-vs_event_work,
 +vhost_scsi_evt_work);
  
  vs-vs_events_nr = 0;
  vs-vs_events_missed = false;
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 2ee2826..951c96b 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -11,6 +11,8 @@
   * Generic code for virtio server in host kernel.
   */
  
 +#define pr_fmt(fmt) KBUILD_MODNAME :  fmt
 +
  #include linux/eventfd.h
  #include linux/vhost.h
  #include linux/uio.h
 @@ -28,6 +30,9 @@
  
  #include vhost.h
  
 +/* Just one worker thread to service all devices */
 +static struct vhost_worker *worker;
 +
  enum {
  VHOST_MEMORY_MAX_NREGIONS = 64,
  VHOST_MEMORY_F_LOG = 0x1,
 @@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, 
 unsigned mode, int sync,
  return 0;
  }
  
 -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
 +void vhost_work_init(struct vhost_dev *dev,
 + struct vhost_work *work, vhost_work_fn_t fn)
  {
  INIT_LIST_HEAD(work-node);
  work-fn = fn;
  init_waitqueue_head(work-done);
  work-flushing = 0;
  work-queue_seq = work-done_seq = 0;
 +work-dev = dev;
  }
  EXPORT_SYMBOL_GPL(vhost_work_init);
  
 @@ -78,7 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll, 
 vhost_work_fn_t fn,
  poll-dev = dev;
  poll-wqh = NULL;
  
 -vhost_work_init(poll-work, fn);
 +vhost_work_init(dev, poll-work, fn);
  }
  EXPORT_SYMBOL_GPL(vhost_poll_init);
  
 @@ -116,30 +123,30 @@ void vhost_poll_stop(struct vhost_poll *poll)
  }
  EXPORT_SYMBOL_GPL(vhost_poll_stop);
  
 -static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work 
 *work,
 -unsigned seq)
 +static bool vhost_work_seq_done(struct vhost_worker *worker,
 +struct

Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users

2015-08-10 Thread Bandan Das
Bandan Das b...@redhat.com writes:

 Michael S. Tsirkin m...@redhat.com writes:

 On Mon, Jul 13, 2015 at 12:07:32AM -0400, Bandan Das wrote:
 vhost threads are per-device, but in most cases a single thread
 is enough. This change creates a single thread that is used to
 serve all guests.
 
 However, this complicates cgroups associations. The current policy
 is to attach the per-device thread to all cgroups of the parent process
 that the device is associated with. This is no longer possible if we
 have a single thread. So, we end up moving the thread around to
 cgroups of whichever device needs servicing. This is a very
 inefficient protocol but seems to be the only way to integrate
 cgroups support.
 
 Signed-off-by: Razya Ladelsky ra...@il.ibm.com
 Signed-off-by: Bandan Das b...@redhat.com

 BTW, how does this interact with virtio net MQ?
 It would seem that MQ gains from more parallelism and
 CPU locality.

 Hm.. Good point. As of this version, this design will always have
 one worker thread servicing a guest. Now suppose we have 10 virtio
 queues for a guest, surely, we could benefit from spawning off another
 worker just like we are doing in case of a new guest/device with
 the devs_per_worker parameter.

So, I did a quick smoke test with virtio-net and the Elvis patches.
virtio net MQ already spawns a new worker thread for every queue,
it seems ? So, the above setup already works! :) I will run some tests and
post back the results.

 ---
  drivers/vhost/scsi.c  |  15 +++--
  drivers/vhost/vhost.c | 150 
 --
  drivers/vhost/vhost.h |  19 +--
  3 files changed, 97 insertions(+), 87 deletions(-)
 
 diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
 index ea32b38..6c42936 100644
 --- a/drivers/vhost/scsi.c
 +++ b/drivers/vhost/scsi.c
 @@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct 
 vhost_scsi_cmd *cmd)
  
 llist_add(cmd-tvc_completion_list, vs-vs_completion_list);
  
 -   vhost_work_queue(vs-dev, vs-vs_completion_work);
 +   vhost_work_queue(vs-dev.worker, vs-vs_completion_work);
  }
  
  static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd)
 @@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
 }
  
 llist_add(evt-list, vs-vs_event_list);
 -   vhost_work_queue(vs-dev, vs-vs_event_work);
 +   vhost_work_queue(vs-dev.worker, vs-vs_event_work);
  }
  
  static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
 @@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
 /* Flush both the vhost poll and vhost work */
 for (i = 0; i  VHOST_SCSI_MAX_VQ; i++)
 vhost_scsi_flush_vq(vs, i);
 -   vhost_work_flush(vs-dev, vs-vs_completion_work);
 -   vhost_work_flush(vs-dev, vs-vs_event_work);
 +   vhost_work_flush(vs-dev.worker, vs-vs_completion_work);
 +   vhost_work_flush(vs-dev.worker, vs-vs_event_work);
  
 /* Wait for all reqs issued before the flush to be finished */
 for (i = 0; i  VHOST_SCSI_MAX_VQ; i++)
 @@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, 
 struct file *f)
 if (!vqs)
 goto err_vqs;
  
 -   vhost_work_init(vs-vs_completion_work, vhost_scsi_complete_cmd_work);
 -   vhost_work_init(vs-vs_event_work, vhost_scsi_evt_work);
 +   vhost_work_init(vs-dev, vs-vs_completion_work,
 +   vhost_scsi_complete_cmd_work);
 +
 +   vhost_work_init(vs-dev, vs-vs_event_work,
 +   vhost_scsi_evt_work);
  
 vs-vs_events_nr = 0;
 vs-vs_events_missed = false;
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 2ee2826..951c96b 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -11,6 +11,8 @@
   * Generic code for virtio server in host kernel.
   */
  
 +#define pr_fmt(fmt) KBUILD_MODNAME :  fmt
 +
  #include linux/eventfd.h
  #include linux/vhost.h
  #include linux/uio.h
 @@ -28,6 +30,9 @@
  
  #include vhost.h
  
 +/* Just one worker thread to service all devices */
 +static struct vhost_worker *worker;
 +
  enum {
 VHOST_MEMORY_MAX_NREGIONS = 64,
 VHOST_MEMORY_F_LOG = 0x1,
 @@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, 
 unsigned mode, int sync,
 return 0;
  }
  
 -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
 +void vhost_work_init(struct vhost_dev *dev,
 +struct vhost_work *work, vhost_work_fn_t fn)
  {
 INIT_LIST_HEAD(work-node);
 work-fn = fn;
 init_waitqueue_head(work-done);
 work-flushing = 0;
 work-queue_seq = work-done_seq = 0;
 +   work-dev = dev;
  }
  EXPORT_SYMBOL_GPL(vhost_work_init);
  
 @@ -78,7 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll, 
 vhost_work_fn_t fn,
 poll-dev = dev;
 poll-wqh = NULL;
  
 -   vhost_work_init(poll-work, fn);
 +   vhost_work_init(dev, poll-work, fn);
  }
  EXPORT_SYMBOL_GPL(vhost_poll_init);
  
 @@ -116,30 +123,30 @@ void vhost_poll_stop(struct vhost_poll *poll)
  }
  EXPORT_SYMBOL_GPL(vhost_poll_stop

Re: [RFC PATCH 0/4] Shared vhost design

2015-08-10 Thread Bandan Das
Michael S. Tsirkin m...@redhat.com writes:

 On Sat, Aug 08, 2015 at 07:06:38PM -0400, Bandan Das wrote:
 Hi Michael,
...

  - does the design address the issue of VM 1 being blocked
(e.g. because it hits swap) and blocking VM 2?
 Good question. I haven't thought of this yet. But IIUC,
 the worker thread will complete VM1's job and then move on to
 executing VM2's scheduled work.
 It doesn't matter if VM1 is
 blocked currently. I think it would be a problem though if/when
 polling is introduced.

 Sorry, I wasn't clear. If VM1's memory is in swap, attempts to
 access it might block the service thread, so it won't
 complete VM2's job.

Ah ok, I understand now. I am pretty sure the current RFC doesn't
take care of this :) I will add this to my todo list for v2.

Bandan



 
  
  #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit 
  imposed.
  #  I/O thread runs on CPU 14 or 15 depending on which guest it's serving
  
  There's a simple graph at
  http://people.redhat.com/~bdas/elvis/data/results.png
  that shows how task affinity results in a jump and even without it,
  as the number of guests increase, the shared vhost design performs
  slightly better.
  
  Observations:
  1. In terms of stock performance, the results are comparable.
  2. However, with a tuned setup, even without polling, we see an 
  improvement
  with the new design.
  3. Making the new design simulate old behavior would be a matter of 
  setting
  the number of guests per vhost threads to 1.
  4. Maybe, setting a per guest limit on the work being done by a specific 
  vhost
  thread is needed for it to be fair.
  5. cgroup associations needs to be figured out. I just slightly hacked the
  current cgroup association mechanism to work with the new model. Ccing 
  cgroups
  for input/comments.
  
  Many thanks to Razya Ladelsky and Eyal Moscovici, IBM for the initial
  patches, the helpful testing suggestions and discussions.
  
  Bandan Das (4):
vhost: Introduce a universal thread to serve all users
vhost: Limit the number of devices served by a single worker thread
cgroup: Introduce a function to compare cgroups
vhost: Add cgroup-aware creation of worker threads
  
   drivers/vhost/net.c|   6 +-
   drivers/vhost/scsi.c   |  18 ++--
   drivers/vhost/vhost.c  | 272 
  +++--
   drivers/vhost/vhost.h  |  32 +-
   include/linux/cgroup.h |   1 +
   kernel/cgroup.c|  40 
   6 files changed, 275 insertions(+), 94 deletions(-)
  
  -- 
  2.4.3


Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users

2015-08-08 Thread Bandan Das
Eyal Moscovici eya...@il.ibm.com writes:

 Hi, 

 Do you know what is the overhead of switching the vhost thread from one 
 cgroup to another?

I misinterpreted this question earlier. I think what you are asking is:
when the VM process is moved from one cgroup to another, what is the
overhead of moving the vhost thread to the new cgroup?

This design does not provide any hooks for the vhost thread to move to
a new cgroup. Rather, I think a better approach is to create a new vhost
thread and bind the process to it if the process is migrated to a new
cgroup. This is much less complicated, and there's a good chance that
it's impossible to migrate the vhost thread since it's serving other guests.
I will address this in v2.

 Eyal Moscovici
 HL-Cloud Infrastructure Solutions
 IBM Haifa Research Lab



 From:   Bandan Das b...@redhat.com
 To: kvm@vger.kernel.org
 Cc: net...@vger.kernel.org, linux-ker...@vger.kernel.org, 
 m...@redhat.com, Eyal Moscovici/Haifa/IBM@IBMIL, Razya 
 Ladelsky/Haifa/IBM@IBMIL, cgro...@vger.kernel.org, jasow...@redhat.com
 Date:   07/13/2015 07:08 AM
 Subject:[RFC PATCH 1/4] vhost: Introduce a universal thread to 
 serve all users



 vhost threads are per-device, but in most cases a single thread
 is enough. This change creates a single thread that is used to
 serve all guests.

 However, this complicates cgroups associations. The current policy
 is to attach the per-device thread to all cgroups of the parent process
 that the device is associated with. This is no longer possible if we
 have a single thread. So, we end up moving the thread around to
 cgroups of whichever device needs servicing. This is a very
 inefficient protocol but seems to be the only way to integrate
 cgroups support.

 Signed-off-by: Razya Ladelsky ra...@il.ibm.com
 Signed-off-by: Bandan Das b...@redhat.com
 ---
  drivers/vhost/scsi.c  |  15 +++--
  drivers/vhost/vhost.c | 150 
 --
  drivers/vhost/vhost.h |  19 +--
  3 files changed, 97 insertions(+), 87 deletions(-)

 diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
 index ea32b38..6c42936 100644
 --- a/drivers/vhost/scsi.c
 +++ b/drivers/vhost/scsi.c
 @@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct 
 vhost_scsi_cmd *cmd)
  
  llist_add(cmd-tvc_completion_list, 
 vs-vs_completion_list);
  
 -vhost_work_queue(vs-dev, vs-vs_completion_work);
 +vhost_work_queue(vs-dev.worker, 
 vs-vs_completion_work);
  }
  
  static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd)
 @@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
  }
  
  llist_add(evt-list, vs-vs_event_list);
 -vhost_work_queue(vs-dev, vs-vs_event_work);
 +vhost_work_queue(vs-dev.worker, vs-vs_event_work);
  }
  
  static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
 @@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
  /* Flush both the vhost poll and vhost work */
  for (i = 0; i  VHOST_SCSI_MAX_VQ; i++)
  vhost_scsi_flush_vq(vs, i);
 -vhost_work_flush(vs-dev, vs-vs_completion_work);
 -vhost_work_flush(vs-dev, vs-vs_event_work);
 +vhost_work_flush(vs-dev.worker, 
 vs-vs_completion_work);
 +vhost_work_flush(vs-dev.worker, vs-vs_event_work);
  
  /* Wait for all reqs issued before the flush to be 
 finished */
  for (i = 0; i  VHOST_SCSI_MAX_VQ; i++)
 @@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, 
 struct file *f)
  if (!vqs)
  goto err_vqs;
  
 -vhost_work_init(vs-vs_completion_work, 
 vhost_scsi_complete_cmd_work);
 -vhost_work_init(vs-vs_event_work, vhost_scsi_evt_work);
 +vhost_work_init(vs-dev, vs-vs_completion_work,
 + vhost_scsi_complete_cmd_work);
 +
 +vhost_work_init(vs-dev, vs-vs_event_work,
 +vhost_scsi_evt_work);
  
  vs-vs_events_nr = 0;
  vs-vs_events_missed = false;
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 2ee2826..951c96b 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -11,6 +11,8 @@
   * Generic code for virtio server in host kernel.
   */
  
 +#define pr_fmt(fmt) KBUILD_MODNAME :  fmt
 +
  #include linux/eventfd.h
  #include linux/vhost.h
  #include linux/uio.h
 @@ -28,6 +30,9 @@
  
  #include vhost.h
  
 +/* Just one worker thread to service all devices */
 +static struct vhost_worker *worker;
 +
  enum {
  VHOST_MEMORY_MAX_NREGIONS = 64,
  VHOST_MEMORY_F_LOG = 0x1,
 @@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, 
 unsigned mode, int sync

Re: [RFC PATCH 0/4] Shared vhost design

2015-08-08 Thread Bandan Das
Hi Michael,

Michael S. Tsirkin m...@redhat.com writes:

 On Mon, Jul 13, 2015 at 12:07:31AM -0400, Bandan Das wrote:
 Hello,
 
 There have been discussions on improving the current vhost design. The first
 attempt, to my knowledge was Shirley Ma's patch to create a dedicated vhost
 worker per cgroup.
 
 http://comments.gmane.org/gmane.linux.network/224730
 
 Later, I posted a cmwq based approach for performance comparisions
 http://comments.gmane.org/gmane.linux.network/286858
 
 More recently was the Elvis work that was presented in KVM Forum 2013
 http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf
 
 The Elvis patches rely on common vhost thread design for scalability
 along with polling for performance. Since there are two major changes
 being proposed, we decided to split up the work. The first (this RFC),
 proposing a re-design of the vhost threading model and the second part
 (not posted yet) to focus more on improving performance. 
 
 I am posting this with the hope that we can have a meaningful discussion
 on the proposed new architecture. We have run some tests to show that the new
 design is scalable and in terms of performance, is comparable to the current
 stable design. 
 
 Test Setup:
 The testing is based on the setup described in the Elvis proposal.
 The initial tests are just an aggregate of Netperf STREAM and MAERTS but
 as we progress, I am happy to run more tests. The hosts are two identical
 16 core Haswell systems with point to point network links. For the first 10 
 runs,
 with n=1 upto n=10 guests running in parallel, I booted the target system 
 with nr_cpus=8
 and mem=12G. The purpose was to do a comparision of resource utilization
 and how it affects performance. Finally, with the number of guests set at 14,
 I didn't limit the number of CPUs booted on the host or limit memory seen by
 the kernel but boot the kernel with isolcpus=14,15 that will be used to run
 the vhost threads. The guests are pinned to cpus 0-13 and based on which
 cpu the guest is running on, the corresponding I/O thread is either pinned
 to cpu 14 or 15.
 Results
 # X axis is number of guests
 # Y axis is netperf number
 # nr_cpus=8 and mem=12G
 #Number of Guests#Baseline#ELVIS
 11119.3.0
 2 1135.6   1130.2
 3 1135.5   1131.6
 4 1136.0   1127.1
 5 1118.6   1129.3
 6 1123.4   1129.8
 7 1128.7   1135.4
 8 1129.9   1137.5
 9 1130.6   1135.1
 101129.3   1138.9
 14*   1173.8   1216.9

 I'm a bit too busy now, with 2.4 and related stuff, will review once we
 finish 2.4.  But I'd like to ask two things:
 - did you actually test a config where cgroups were used?

Here are some numbers with a simple cgroup setup.

Three cgroups with cpusets cpu=0,2,4 for cgroup1, cpu=1,3,5 for cgroup2 and 
cpu=6,7
for cgroup3 (even though 6,7 have different numa nodes)

I ran netperf for 1 to 9 guests, starting by assigning the first guest
to cgroup1, the second to cgroup2, the third to cgroup3, and repeating
this sequence up to 9 guests.

The numbers  - (TCP_STREAM + TCP_MAERTS)/2

 #Number of Guests    #ELVIS (Mbps)
 1                    1056.9
 2                    1122.5
 3                    1122.8
 4                    1123.2
 5                    1122.6
 6                    1110.3
 7                    1116.3
 8                    1121.8
 9                    1118.5

Maybe, my cgroup setup was too simple but these numbers are comparable
to the no cgroups results above. I wrote some tracing code to trace
cgroup_match_groups() and find cgroup search overhead but it seemed
unnecessary for this particular test.
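
For reference, here is a minimal sketch of how one of these cpuset groups
can be created and a guest's QEMU process moved into it. This is only an
illustration, assuming cgroup v1 with the cpuset controller mounted at
/sys/fs/cgroup/cpuset; the group name and CPU list are the ones used above,
and the PID is just a command-line argument:

/* cpuset-setup.c - create a v1 cpuset group and move a task into it */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(int argc, char **argv)
{
	const char *base = "/sys/fs/cgroup/cpuset/cgroup1";
	char path[256];

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	mkdir(base, 0755);                      /* create the cpuset group */

	snprintf(path, sizeof(path), "%s/cpuset.cpus", base);
	write_file(path, "0,2,4");              /* CPUs for cgroup1 */

	snprintf(path, sizeof(path), "%s/cpuset.mems", base);
	write_file(path, "0");                  /* allowed memory node(s) */

	snprintf(path, sizeof(path), "%s/tasks", base);
	write_file(path, argv[1]);              /* move the QEMU pid in */

	return 0;
}

cgroup2 and cgroup3 would follow the same pattern with their own CPU lists.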


 - does the design address the issue of VM 1 being blocked
   (e.g. because it hits swap) and blocking VM 2?
Good question. I haven't thought of this yet. But IIUC,
the worker thread will complete VM1's job and then move on to
executing VM2's scheduled work. It doesn't matter if VM1 is
blocked currently. I think it would be a problem though if/when
polling is introduced.

 
 #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit 
 imposed.
 #  I/O thread runs on CPU 14 or 15 depending on which guest it's serving
 
 There's a simple graph at
 http://people.redhat.com/~bdas/elvis/data/results.png
 that shows how task affinity results in a jump and even without it,
 as the number of guests increase, the shared vhost design performs
 slightly better.
 
 Observations:
 1. In terms of stock performance, the results are comparable.
 2. However, with a tuned setup, even without polling, we see an improvement
 with the new design.
 3. Making the new design

Re: [PATCH kvm-unit-tests] x86: fix last commit

2015-08-01 Thread Bandan Das
Shih-Wei Li shih...@cs.columbia.edu writes:

 Hi Paolo,

 I've tried to apply the patch, and found that it passed most of the
 problematic tests I mentioned earlier (IPI related, kvmclock_test).
 However, it stopped still at s3 and couldn't finish it. Do you know
 what might go wrong?

Nothing is wrong, that's the way the test is. You need to resume the guest
from qemu for it to proceed (for example, with the system_wakeup monitor
command), and it should then quit with 1 for error or 0 for success.

 Thanks,
 Shih-Wei

 On Thu, Jul 30, 2015 at 9:38 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Keeping the selector that was loaded from the 32-bit GDT is okay, because
 only code segment descriptors differ between 32- and 64-bit mode.  In
 fact the same is true for %ss as well, so let's just remove the whole
 segment loading from load_tss.

 Thanks to Bandan Das for debugging.

 Reported-by: Shih-Wei Li shih...@cs.columbia.edu
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  x86/cstart64.S | 6 --
  1 file changed, 6 deletions(-)

 diff --git a/x86/cstart64.S b/x86/cstart64.S
 index 8d5ee2d..e947888 100644
 --- a/x86/cstart64.S
 +++ b/x86/cstart64.S
 @@ -213,12 +213,6 @@ idt_descr:

  load_tss:
 lidtq idt_descr
 -   mov $0x10, %eax
 -   mov %ax, %ds
 -   mov %ax, %es
 -   mov %ax, %fs
 -   mov %ax, %gs
 -   mov %ax, %ss
 mov $(APIC_DEFAULT_PHYS_BASE + APIC_ID), %eax
 mov (%rax), %eax
 shr $24, %eax
 --
 1.8.3.1



Re: [RFC PATCH 0/4] Shared vhost design

2015-08-01 Thread Bandan Das
Eyal Moscovici eya...@il.ibm.com writes:
...

 We can start to implement polling, but I am unsure if the cgroups 
 integration 
 will be sufficient. The polling vhost thread should be scheduled all
 the time and just moving it from one cgroup to the other wont be 
 sufficient. 
 I think it needs a deeper integration to the point where either we have a 
 vhost thread for each cgroup or the vhost thread enforces the cgroup 
 policies over its polled VM guests.

Agreed, what we have with cgroups is not sufficient. I am waiting
for Tejun et al to comment on our approach :) Michael asked whether
it's possible to integrate cgroups into workgroups, which I think is a more
generic and preferable solution. I just don't know yet how easy/difficult
it is to implement this with the new cgroups unified hierarchy.


BTW, I am working on the numbers you had asked for. Honestly, I think the
cost of cgroups could be similar to running a vhost thread/guest since
that is how cgroups integration currently works. But it's good to have the
numbers before us. 

 
 So simple polling by vhost is kind of ok for some guests, but I think to
 really make it work for a reasonably wide selection of guests/workloads
 you need to combine it with 1. polling the NIC - it makes no sense to me
 to only poll one side of the equation; and probably 2. - polling in
 guest.
 

 I agree that we need polling on the NIC which could probably be achieved 
 by using
 the polling interface introduced in kernel 3.11: 
 http://lwn.net/Articles/551284/
 although I never tried using it myself.
 About your point about polling inside the guest, I think it is orthogonal 
 to polling
 in the host.


 Eyal Moscovici
 HL-Cloud Infrastructure Solutions
 IBM Haifa Research Lab


Re: [PATCH kvm-unit-tests] x86: fix last commit

2015-08-01 Thread Bandan Das
Paolo Bonzini pbonz...@redhat.com writes:

 On 01/08/2015 21:05, Bandan Das wrote:
 Shih-Wei Li shih...@cs.columbia.edu writes:
 
 Hi Paolo,

 I've tried to apply the patch, and found that it passed most of the
 problematic tests I mentioned earlier (IPI related, kvmclock_test).
 However, it stopped still at s3 and couldn't finish it. Do you know
 what might go wrong?
 
 Nothing is wrong, that's the way the test is. You need to resume from
 qemu for it to proceed and it should quit with 1 for error or 0 for
 success.

 Actually it should be using the RTC alarm to wake itself up.  But the
 firmware changed recently and the ACPI PMBASE moved from 0xb000 to
 0x600.  Try this (untested):

Ah thanks! Your patch works for me. Is this one of the static entries in
the ACPI tables? I am wondering if we can read this value at runtime so the
test works for everybody.
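
If it helps, something along these lines could be used to look the port up
instead of hard-coding it. This is only a rough sketch, assuming
identity-mapped low memory and the legacy ACPI 1.0 RSDP in the
0xe0000-0xfffff BIOS area; PM1a_CNT_BLK sits at offset 64 of the FADT and
is exactly the 0xb004/0x604 port the test pokes:

/* Find the FADT via the RSDP/RSDT and return the PM1a control port. */
#include <stdint.h>
#include <string.h>

static uint32_t find_pm1a_cnt(void)
{
	uint8_t *p;

	/* the RSDP signature is on a 16-byte boundary in the BIOS ROM area */
	for (p = (uint8_t *)0xe0000; p < (uint8_t *)0x100000; p += 16) {
		uint32_t *rsdt, nr, i;

		if (memcmp(p, "RSD PTR ", 8))
			continue;

		/* 32-bit RSDT address is at offset 16 of the RSDP */
		rsdt = (uint32_t *)(uintptr_t)*(uint32_t *)(p + 16);

		/* table pointers follow the 36-byte header, 4 bytes each */
		nr = (rsdt[1] - 36) / 4;
		for (i = 0; i < nr; i++) {
			uint8_t *tbl = (uint8_t *)(uintptr_t)rsdt[9 + i];

			if (!memcmp(tbl, "FACP", 4))
				/* PM1a_CNT_BLK is at offset 64 of the FADT */
				return *(uint32_t *)(tbl + 64);
		}
	}
	return 0;
}

The outw in the test would then target the returned port directly; the FADT
field already points at pmbase + 4.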

 diff --git a/x86/s3.c b/x86/s3.c
 index d568aa7..d6cfef3 100644
 --- a/x86/s3.c
 +++ b/x86/s3.c
 @@ -177,7 +177,7 @@ int main(int argc, char **argv)
   rtc_out(RTC_REG_B, rtc_in(RTC_REG_B) | REG_B_AIE);
  
   *(volatile int*)0 = 0;
 - asm volatile(outw %0, %1 :: a((short)0x2400), 
 d((short)0xb004):memory);
 + asm volatile(outw %0, %1 :: a((short)0x2400), 
 d((short)0x604):memory);
   while(1)
   *(volatile int*)0 = 1;
  

 It's on my todo list to fix a very similar issue in vmexit.flat.

 Paolo


Re: kvm-unit-test failed to complete

2015-07-29 Thread Bandan Das
Shih-Wei Li shih...@cs.columbia.edu writes:

 On Wed, Jul 29, 2015 at 2:38 PM, Shih-Wei Li shih...@cs.columbia.edu wrote:
 On Wed, Jul 29, 2015 at 2:30 PM, Christoffer Dall cd...@cs.columbia.edu 
 wrote:
 Hi Shih-Wei,

 [Something weird happened when sending these e-mails, you sent two where
 one seems to be a slight modification of the other?]

 yes, the previous one just got rejected by the mailing list. sorry
 about the spam.


 On Wed, Jul 29, 2015 at 02:18:23PM -0400, Shih-Wei Li wrote:
 Hi all,

 This is Shih-Wei, I'm Christoffer's colleague at Columbia University.
 We have experienced some problems in running kvm-unit-tests in our
 environment.
 Here's what we did:
 ./configure
 make
 ./run_test.sh

 run_test.sh halted in some specific test items and couldn't finish the
 run. I managed to get it finish running after removing the following
 items:
 x86/apic.c:
 - removed:
 test_sti_nmi();
 test_multiple_nmi();

 I'm wondering if there's a dependency on the version of QEMU?  Have you
 tried with the most recent upstream QEMU?

 I haven't, but I will try. We are using qemu 2.2.50 now.

 Here's the update. I just used the upstream QEMU to run kvm-unit-test,
 but it still failed to finish the same test items I mentioned earlier.

(Just a cursory look) It seems this commit introduced it -

commit 402d4596789335f540a0697955f6dcf43e2bb796
Author: Paolo Bonzini pbonz...@redhat.com
Date:   Thu Jul 23 09:16:45 2015 +0200

x86: load 64-bit segments into the segment registers

kvm-unit-tests was keeping DS/ES/FS/GS loaded with the segment descriptors
provided by the multiboot boot loader (which are 32-bit), and instead 
loading
SS with 0.  The vmx.flat test failed because KVM did not like doing writes
into such an SS.

Load again the segment registers after entering 64-bit mode, for both
the BSP and the APs.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com

diff --git a/x86/cstart64.S b/x86/cstart64.S
index 8d0d95d..8d5ee2d 100644
--- a/x86/cstart64.S
+++ b/x86/cstart64.S
@@ -213,7 +213,11 @@ idt_descr:
 
 load_tss:
lidtq idt_descr
-   mov $0, %eax
+   mov $0x10, %eax
+   mov %ax, %ds
+   mov %ax, %es
+   mov %ax, %fs
+   mov %ax, %gs
mov %ax, %ss
mov $(APIC_DEFAULT_PHYS_BASE + APIC_ID), %eax
mov (%rax), %eax

Paolo, what is the purpose of initializing %gs and %fs? It seems that if I
comment out "mov %ax, %gs", the test works.

handle_irq() does seem to trigger but the handler doesn't seem
to get called if %gs is initialized. Weird.



Re: [RFC PATCH 0/4] Shared vhost design

2015-07-27 Thread Bandan Das
Eyal Moscovici eya...@il.ibm.com writes:

 Hi, 

 The test showed the same relative numbers as we got in our internal 
 testing. I was wondering about the configuration in regards to NUMA. From
Thanks for confirming.

 our testing we saw that if the VMs are spread across 2 NUMA nodes then 
 having a shared vhost thread per node performs better than having the two 
 threads in the same core.

IIUC, this is similar to my test setup and observations, i.e.
 14*   1173.8  1216.9

In this case, there's a shared vhost thread on CPU 14 for numa node 0
and another on CPU 15 for numa node 1. Guests running on CPUs 0,2,4,6,8,10,12
are serviced by vhost-0 that runs on CPU 14 and guests running on CPUs 
1,3,5,7,9,11,13
get serviced by vhost-1 (Numa node 1). I tried some other configurations but
this configuration gave me the best results.
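
For reference, the per-thread pinning can be done with something as simple
as the sketch below (the equivalent of taskset -pc <cpu> <tid>; the vhost
worker TID and target CPU are just command-line arguments here, not
anything the patches set up automatically):

/* pin-thread.c - pin a single thread (by TID) to one CPU */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
	cpu_set_t set;
	pid_t tid;
	int cpu;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <tid> <cpu>\n", argv[0]);
		return 1;
	}
	tid = atoi(argv[1]);
	cpu = atoi(argv[2]);

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	/* a TID as the first argument affects just that one thread */
	if (sched_setaffinity(tid, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}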


Eyal, I think it makes sense to add polling on top of these patches and
get numbers for them too. Thoughts?

Bandan

 Eyal Moscovici
 HL-Cloud Infrastructure Solutions
 IBM Haifa Research Lab



 From:   Bandan Das b...@redhat.com
 To: kvm@vger.kernel.org
 Cc: net...@vger.kernel.org, linux-ker...@vger.kernel.org, 
 m...@redhat.com, Eyal Moscovici/Haifa/IBM@IBMIL, Razya 
 Ladelsky/Haifa/IBM@IBMIL, cgro...@vger.kernel.org, jasow...@redhat.com
 Date:   07/13/2015 07:08 AM
 Subject:[RFC PATCH 0/4] Shared vhost design



 Hello,

 There have been discussions on improving the current vhost design. The 
 first
 attempt, to my knowledge was Shirley Ma's patch to create a dedicated 
 vhost
 worker per cgroup.

 http://comments.gmane.org/gmane.linux.network/224730

 Later, I posted a cmwq based approach for performance comparisions
 http://comments.gmane.org/gmane.linux.network/286858

 More recently was the Elvis work that was presented in KVM Forum 2013
 http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf

 The Elvis patches rely on common vhost thread design for scalability
 along with polling for performance. Since there are two major changes
 being proposed, we decided to split up the work. The first (this RFC),
 proposing a re-design of the vhost threading model and the second part
 (not posted yet) to focus more on improving performance. 

 I am posting this with the hope that we can have a meaningful discussion
 on the proposed new architecture. We have run some tests to show that the 
 new
 design is scalable and in terms of performance, is comparable to the 
 current
 stable design. 

 Test Setup:
 The testing is based on the setup described in the Elvis proposal.
 The initial tests are just an aggregate of Netperf STREAM and MAERTS but
 as we progress, I am happy to run more tests. The hosts are two identical
 16 core Haswell systems with point to point network links. For the first 
 10 runs,
 with n=1 upto n=10 guests running in parallel, I booted the target system 
 with nr_cpus=8
 and mem=12G. The purpose was to do a comparision of resource utilization
 and how it affects performance. Finally, with the number of guests set at 
 14,
 I didn't limit the number of CPUs booted on the host or limit memory seen 
 by
 the kernel but boot the kernel with isolcpus=14,15 that will be used to 
 run
 the vhost threads. The guests are pinned to cpus 0-13 and based on which
 cpu the guest is running on, the corresponding I/O thread is either pinned
 to cpu 14 or 15.

 Results
 # X axis is number of guests
 # Y axis is netperf number
 # nr_cpus=8 and mem=12G
 #Number of Guests#Baseline#ELVIS
 11119.3.0
 2 1135.6  1130.2
 3 1135.5  1131.6
 4 1136.0  1127.1
 5 1118.6  1129.3
 6 1123.4  1129.8
 7 1128.7  1135.4
 8 1129.9  1137.5
 9 1130.6  1135.1
 101129.3  1138.9
 14*   1173.8  1216.9

 #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit 
 imposed.
 #  I/O thread runs on CPU 14 or 15 depending on which guest it's serving

 There's a simple graph at
 http://people.redhat.com/~bdas/elvis/data/results.png
 that shows how task affinity results in a jump and even without it,
 as the number of guests increase, the shared vhost design performs
 slightly better.

 Observations:
 1. In terms of stock performance, the results are comparable.
 2. However, with a tuned setup, even without polling, we see an 
 improvement
 with the new design.
 3. Making the new design simulate old behavior would be a matter of 
 setting
 the number of guests per vhost threads to 1.
 4

Re: max number of VCPUs

2015-07-22 Thread Bandan Das
Ozgur O Kilic okil...@binghamton.edu writes:

 Hi,

 I was trying to find out the max number of vcpus and the reason for
 that limit. I found 2 numbers: one is the recommended max by KVM, which
 is 160, and the other number was 255. Can someone tell me the reasons
 for these limitations? Why is 160 the recommended max and why is the
 max number 255?

255 comes from the APIC limitation, and 160 because that's the max someone
has ever tested?
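
Both numbers can actually be queried on a given host instead of guessed; a
small sketch (KVM_CAP_NR_VCPUS is documented as the recommended maximum,
KVM_CAP_MAX_VCPUS as the hard limit, which should be where the 160 and 255
figures come from):

/* query-vcpus.c - ask KVM for the recommended and maximum vCPU counts */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int soft, hard;

	if (kvm < 0) {
		perror("/dev/kvm");
		return 1;
	}

	/* recommended maximum number of vCPUs */
	soft = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);

	/* absolute maximum supported by this kernel */
	hard = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);

	printf("recommended max vcpus: %d\nhard max vcpus: %d\n", soft, hard);
	return 0;
}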

Bandan

 Thank you.
 Ozgur Ozan Kilic


Re: [PATCH v3 01/10] KVM: MMU: fix decoding cache type from MTRR

2015-07-12 Thread Bandan Das
Alex Williamson alex.william...@redhat.com writes:

 On Wed, 2015-05-13 at 14:42 +0800, Xiao Guangrong wrote:
 There are some bugs in the current get_mtrr_type():
 1: bit 1 of mtrr_state->enabled corresponds to bit 11 of the
    IA32_MTRR_DEF_TYPE MSR, which completely controls MTRR's enablement;
    that means the other bits are ignored if it is cleared
 
 2: the fixed MTRR ranges are controlled by bit 0 of
    mtrr_state->enabled (bit 10 of IA32_MTRR_DEF_TYPE)
 
 3: if MTRR is disabled, UC is applied to all of physical memory rather
    than mtrr_state->def_type
 
 Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com
 ---
  arch/x86/kvm/mmu.c | 14 ++
  1 file changed, 6 insertions(+), 8 deletions(-)


 I'm seeing a significant regression in boot performance on Intel
 hardware with assigned devices that bisects back to this patch.  There's
 a long delay with Seabios between the version splash and execution of
 option ROMs, and a _very_ long delay with OVMF before the display is
 initialized.  The delay is long enough that users are reporting their
 previously working VM is hung with 100% CPU usage on v4.2-rc1.  Thanks,

 Alex

 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index b78e83f..d00cebd 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -2393,19 +2393,20 @@ EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page);
  static int get_mtrr_type(struct mtrr_state_type *mtrr_state,
   u64 start, u64 end)
  {
 -int i;
  u64 base, mask;
  u8 prev_match, curr_match;
 -int num_var_ranges = KVM_NR_VAR_MTRR;
 +int i, num_var_ranges = KVM_NR_VAR_MTRR;
  
  -	if (!mtrr_state->enabled)
  -		return 0xFF;
  +	/* MTRR is completely disabled, use UC for all of physical memory. */
  +	if (!(mtrr_state->enabled & 0x2))
  +		return MTRR_TYPE_UNCACHABLE;

Setting this to UC if MTRR is disabled is too restrictive maybe..

Bandan

  /* Make end inclusive end, instead of exclusive */
  end--;
  
  /* Look in fixed ranges. Just return the type as per start */
  -	if (mtrr_state->have_fixed && (start < 0x100000)) {
  +	if (mtrr_state->have_fixed && (mtrr_state->enabled & 0x1) &&
  +	      (start < 0x100000)) {
  		int idx;
  
  		if (start < 0x80000) {
 @@ -2428,9 +2429,6 @@ static int get_mtrr_type(struct mtrr_state_type 
 *mtrr_state,
   * Look of multiple ranges matching this address and pick type
   * as per MTRR precedence
   */
  -	if (!(mtrr_state->enabled & 2))
  -		return mtrr_state->def_type;
  -
  	prev_match = 0xFF;
  	for (i = 0; i < num_var_ranges; ++i) {
  unsigned short start_state, end_state;





[RFC PATCH 2/4] vhost: Limit the number of devices served by a single worker thread

2015-07-12 Thread Bandan Das
When the number of devices increases, the universal thread model
(introduced in the preceding patch) may end up being the bottleneck.
Moreover, a single worker thread also forces us to change cgroups
based on the device we are serving.

We introduce a worker pool struct that starts with one worker
thread, and we keep adding more threads when the number of devices
reaches a certain threshold. The default threshold is 7 but is
not based on any empirical data. The value can also be changed by
the user with the devs_per_worker module parameter.

Note that this patch doesn't change how cgroups work. We still
keep moving around the worker thread to the cgroups of the
device we are serving at the moment.

Signed-off-by: Razya Ladelsky ra...@il.ibm.com
Signed-off-by: Bandan Das b...@redhat.com
---
 drivers/vhost/net.c   |   6 +--
 drivers/vhost/scsi.c  |   3 +-
 drivers/vhost/vhost.c | 135 +-
 drivers/vhost/vhost.h |  13 -
 4 files changed, 128 insertions(+), 29 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7d137a4..7bfa019 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -705,7 +705,8 @@ static int vhost_net_open(struct inode *inode, struct file 
*f)
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
 	}
-	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
+	if (vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX))
+		return dev->err;
 
 	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
 	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
@@ -801,9 +802,6 @@ static int vhost_net_release(struct inode *inode, struct 
file *f)
sockfd_put(rx_sock);
/* Make sure no callbacks are outstanding */
synchronize_rcu_bh();
-   /* We do an extra flush before freeing memory,
-* since jobs can re-queue themselves. */
-   vhost_net_flush(n);
 	kfree(n->dev.vqs);
kvfree(n);
return 0;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 6c42936..97de2db 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1601,7 +1601,8 @@ static int vhost_scsi_open(struct inode *inode, struct 
file *f)
 		vqs[i] = &vs->vqs[i].vq;
 		vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
 	}
-	vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ);
+	if (vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ))
+		return vs->dev.err;
 
vhost_scsi_init_inflight(vs, NULL);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 951c96b..6a5d4c0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -27,11 +27,19 @@
 #include linux/kthread.h
 #include linux/cgroup.h
 #include linux/module.h
+#include linux/moduleparam.h
 
 #include vhost.h
 
-/* Just one worker thread to service all devices */
-static struct vhost_worker *worker;
+static int __read_mostly devs_per_worker = 7;
+module_param(devs_per_worker, int, S_IRUGO);
+MODULE_PARM_DESC(devs_per_worker, "Setup the number of devices being served by a worker thread");
+
+/* Only used to give a unique id to a vhost thread at the moment */
+static unsigned int total_vhost_workers;
+
+/* Pool of vhost threads */
+static struct vhost_pool *vhost_pool;
 
 enum {
VHOST_MEMORY_MAX_NREGIONS = 64,
@@ -270,6 +278,63 @@ static int vhost_worker(void *data)
return 0;
 }
 
+static void vhost_create_worker(struct vhost_dev *dev)
+{
+   struct vhost_worker *worker;
+   struct vhost_pool *pool = vhost_pool;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+	if (!worker) {
+		dev->err = -ENOMEM;
+		return;
+	}
+
+	worker->thread = kthread_create(vhost_worker,
+					worker,
+					"vhost-%d",
+					total_vhost_workers);
+	if (IS_ERR(worker->thread)) {
+		dev->err = PTR_ERR(worker->thread);
+		goto therror;
+	}
+
+	spin_lock_init(&worker->work_lock);
+	INIT_LIST_HEAD(&worker->work_list);
+	list_add(&worker->node, &pool->workers);
+	worker->owner = NULL;
+	worker->num_devices++;
+	total_vhost_workers++;
+	dev->worker = worker;
+	dev->worker_assigned = true;
+	return;
+
+therror:
+	if (worker->thread)
+		kthread_stop(worker->thread);
+	kfree(worker);
+}
+
+static int vhost_dev_assign_worker(struct vhost_dev *dev)
+{
+   struct vhost_worker *worker;
+
+	mutex_lock(&vhost_pool->pool_lock);
+	list_for_each_entry(worker, &vhost_pool->workers, node) {
+		if (worker->num_devices < devs_per_worker) {
+			dev->worker = worker;
+			dev->worker_assigned = true;
+			worker->num_devices++;
+   break

[RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users

2015-07-12 Thread Bandan Das
vhost threads are per-device, but in most cases a single thread
is enough. This change creates a single thread that is used to
serve all guests.

However, this complicates cgroups associations. The current policy
is to attach the per-device thread to all cgroups of the parent process
that the device is associated with. This is no longer possible if we
have a single thread. So, we end up moving the thread around to
cgroups of whichever device needs servicing. This is a very
inefficient protocol but seems to be the only way to integrate
cgroups support.

Signed-off-by: Razya Ladelsky ra...@il.ibm.com
Signed-off-by: Bandan Das b...@redhat.com
---
 drivers/vhost/scsi.c  |  15 +++--
 drivers/vhost/vhost.c | 150 --
 drivers/vhost/vhost.h |  19 +--
 3 files changed, 97 insertions(+), 87 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index ea32b38..6c42936 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct vhost_scsi_cmd 
*cmd)
 
llist_add(cmd-tvc_completion_list, vs-vs_completion_list);
 
-   vhost_work_queue(vs-dev, vs-vs_completion_work);
+   vhost_work_queue(vs-dev.worker, vs-vs_completion_work);
 }
 
 static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd)
@@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
}
 
llist_add(evt-list, vs-vs_event_list);
-   vhost_work_queue(vs-dev, vs-vs_event_work);
+   vhost_work_queue(vs-dev.worker, vs-vs_event_work);
 }
 
 static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
@@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
/* Flush both the vhost poll and vhost work */
for (i = 0; i  VHOST_SCSI_MAX_VQ; i++)
vhost_scsi_flush_vq(vs, i);
-   vhost_work_flush(vs-dev, vs-vs_completion_work);
-   vhost_work_flush(vs-dev, vs-vs_event_work);
+   vhost_work_flush(vs-dev.worker, vs-vs_completion_work);
+   vhost_work_flush(vs-dev.worker, vs-vs_event_work);
 
/* Wait for all reqs issued before the flush to be finished */
for (i = 0; i  VHOST_SCSI_MAX_VQ; i++)
@@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, struct 
file *f)
if (!vqs)
goto err_vqs;
 
-   vhost_work_init(vs-vs_completion_work, vhost_scsi_complete_cmd_work);
-   vhost_work_init(vs-vs_event_work, vhost_scsi_evt_work);
+   vhost_work_init(vs-dev, vs-vs_completion_work,
+   vhost_scsi_complete_cmd_work);
+
+   vhost_work_init(vs-dev, vs-vs_event_work,
+   vhost_scsi_evt_work);
 
vs-vs_events_nr = 0;
vs-vs_events_missed = false;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..951c96b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -11,6 +11,8 @@
  * Generic code for virtio server in host kernel.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/eventfd.h>
 #include <linux/vhost.h>
 #include <linux/uio.h>
@@ -28,6 +30,9 @@
 
 #include "vhost.h"
 
+/* Just one worker thread to service all devices */
+static struct vhost_worker *worker;
+
 enum {
VHOST_MEMORY_MAX_NREGIONS = 64,
VHOST_MEMORY_F_LOG = 0x1,
@@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned 
mode, int sync,
return 0;
 }
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_dev *dev,
+		     struct vhost_work *work, vhost_work_fn_t fn)
 {
 	INIT_LIST_HEAD(&work->node);
 	work->fn = fn;
 	init_waitqueue_head(&work->done);
 	work->flushing = 0;
 	work->queue_seq = work->done_seq = 0;
+	work->dev = dev;
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
@@ -78,7 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t 
fn,
 	poll->dev = dev;
 	poll->wqh = NULL;
 
-	vhost_work_init(&poll->work, fn);
+	vhost_work_init(dev, &poll->work, fn);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
@@ -116,30 +123,30 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
-   unsigned seq)
+static bool vhost_work_seq_done(struct vhost_worker *worker,
+   struct vhost_work *work, unsigned seq)
 {
int left;
 
-	spin_lock_irq(&dev->work_lock);
+	spin_lock_irq(&worker->work_lock);
 	left = seq - work->done_seq;
-	spin_unlock_irq(&dev->work_lock);
+	spin_unlock_irq(&worker->work_lock);
 	return left <= 0;
 }
 
-void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+void vhost_work_flush(struct vhost_worker *worker, struct vhost_work *work)
 {
unsigned seq;
int flushing;
 
-   spin_lock_irq(dev-work_lock

[RFC PATCH 4/4] vhost: Add cgroup-aware creation of worker threads

2015-07-12 Thread Bandan Das
With the help of the cgroup function to compare groups introduced
in the previous patch, this changes the worker creation policy.
If the new device belongs to different cgroups than any of the
devices we are currently serving, we create a new worker
thread even if we haven't reached the devs_per_worker threshold.

Signed-off-by: Bandan Das b...@redhat.com
---
 drivers/vhost/vhost.c | 47 +++
 1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6a5d4c0..dc0fa37 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -261,12 +261,6 @@ static int vhost_worker(void *data)
 		use_mm(dev->mm);
 	}
 
-	/* TODO: Consider a more elegant solution */
-	if (worker->owner != dev->owner) {
-		/* Should check for return value */
-		cgroup_attach_task_all(dev->owner, current);
-		worker->owner = dev->owner;
-	}
 	work->fn(work);
if (need_resched())
schedule();
@@ -278,6 +272,36 @@ static int vhost_worker(void *data)
return 0;
 }
 
+struct vhost_attach_cgroups_struct {
+   struct vhost_work work;
+   struct task_struct *owner;
+   int ret;
+};
+
+static void vhost_attach_cgroups_work(struct vhost_work *work)
+{
+   struct vhost_attach_cgroups_struct *s;
+
+   s = container_of(work, struct vhost_attach_cgroups_struct, work);
+	s->ret = cgroup_attach_task_all(s->owner, current);
+}
+
+static void vhost_attach_cgroups(struct vhost_dev *dev,
+   struct vhost_worker *worker)
+{
+   struct vhost_attach_cgroups_struct attach;
+
+	attach.owner = dev->owner;
+	vhost_work_init(dev, &attach.work, vhost_attach_cgroups_work);
+	vhost_work_queue(worker, &attach.work);
+	vhost_work_flush(worker, &attach.work);
+
+	if (!attach.ret)
+		worker->owner = dev->owner;
+
+	dev->err = attach.ret;
+}
+
 static void vhost_create_worker(struct vhost_dev *dev)
 {
struct vhost_worker *worker;
@@ -300,8 +324,14 @@ static void vhost_create_worker(struct vhost_dev *dev)
 
 	spin_lock_init(&worker->work_lock);
 	INIT_LIST_HEAD(&worker->work_list);
+
+	/* attach to the cgroups of the process that created us */
+	vhost_attach_cgroups(dev, worker);
+	if (dev->err)
+		goto therror;
+	worker->owner = dev->owner;
+
 	list_add(&worker->node, &pool->workers);
-	worker->owner = NULL;
 	worker->num_devices++;
 	total_vhost_workers++;
 	dev->worker = worker;
@@ -320,7 +350,8 @@ static int vhost_dev_assign_worker(struct vhost_dev *dev)
 
 	mutex_lock(&vhost_pool->pool_lock);
 	list_for_each_entry(worker, &vhost_pool->workers, node) {
-		if (worker->num_devices < devs_per_worker) {
+		if (worker->num_devices < devs_per_worker &&
+		    (!cgroup_match_groups(dev->owner, worker->owner))) {
 			dev->worker = worker;
 			dev->worker_assigned = true;
 			worker->num_devices++;
-- 
2.4.3



[RFC PATCH 3/4] cgroup: Introduce a function to compare cgroups

2015-07-12 Thread Bandan Das
This function takes two tasks and iterates through all
hierarchies to check if they belong to the same cgroups.
It ignores the check for default hierarchies or for
hierarchies with no subsystems attached. This function
will be used by the next patch to add rudimentary cgroup support
with vhost workers.

Signed-off-by: Bandan Das b...@redhat.com
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c| 40 
 2 files changed, 41 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b9cb94c..606fb5b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -933,6 +933,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
 void css_task_iter_end(struct css_task_iter *it);
 
+int cgroup_match_groups(struct task_struct *tsk1, struct task_struct *tsk2);
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
 int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 469dd54..ba4121e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2465,6 +2465,46 @@ out_unlock_cgroup:
 }
 
 /**
+ * cgroup_match_groups - check if tsk1 and tsk2 belong to
+ * same cgroups in all hierarchies
+ * Returns 0 on success
+ */
+int cgroup_match_groups(struct task_struct *tsk1, struct task_struct *tsk2)
+{
+   struct cgroup_root *root;
+   int retval = 0;
+
+   WARN_ON(!tsk1 || !tsk2);
+
+	mutex_lock(&cgroup_mutex);
+   for_each_root(root) {
+   struct cgroup *cg_tsk1;
+   struct cgroup *cg_tsk2;
+
+   /* Default hierarchy */
+   if (root == cgrp_dfl_root)
+   continue;
+   /* No subsystems attached */
+		if (!root->subsys_mask)
+   continue;
+
+		down_read(&css_set_rwsem);
+		cg_tsk1 = task_cgroup_from_root(tsk1, root);
+		cg_tsk2 = task_cgroup_from_root(tsk2, root);
+		up_read(&css_set_rwsem);
+
+   if (cg_tsk1 != cg_tsk2) {
+   retval = 1;
+   break;
+   }
+   }
+	mutex_unlock(&cgroup_mutex);
+
+   return retval;
+}
+EXPORT_SYMBOL_GPL(cgroup_match_groups);
+
+/**
  * cgroup_attach_task_all - attach task 'tsk' to all cgroups of task 'from'
  * @from: attach to all cgroups of a given task
  * @tsk: the task to be attached
-- 
2.4.3



[RFC PATCH 0/4] Shared vhost design

2015-07-12 Thread Bandan Das
Hello,

There have been discussions on improving the current vhost design. The first
attempt, to my knowledge, was Shirley Ma's patch to create a dedicated vhost
worker per cgroup.

http://comments.gmane.org/gmane.linux.network/224730

Later, I posted a cmwq-based approach for performance comparisons
http://comments.gmane.org/gmane.linux.network/286858

More recently was the Elvis work that was presented in KVM Forum 2013
http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf

The Elvis patches rely on a common vhost thread design for scalability
along with polling for performance. Since there are two major changes
being proposed, we decided to split up the work. The first (this RFC),
proposing a re-design of the vhost threading model and the second part
(not posted yet) to focus more on improving performance. 

I am posting this with the hope that we can have a meaningful discussion
on the proposed new architecture. We have run some tests to show that the new
design is scalable and in terms of performance, is comparable to the current
stable design. 

Test Setup:
The testing is based on the setup described in the Elvis proposal.
The initial tests are just an aggregate of Netperf STREAM and MAERTS but
as we progress, I am happy to run more tests. The hosts are two identical
16-core Haswell systems with point-to-point network links. For the first 10
runs, with n=1 up to n=10 guests running in parallel, I booted the target
system with nr_cpus=8 and mem=12G. The purpose was to do a comparison of
resource utilization and how it affects performance. Finally, with the
number of guests set at 14, I didn't limit the number of CPUs booted on the
host or limit the memory seen by the kernel, but booted the kernel with
isolcpus=14,15, which will be used to run the vhost threads. The guests are
pinned to cpus 0-13 and, based on which cpu a guest is running on, the
corresponding I/O thread is pinned to either cpu 14 or 15.

Results
# X axis is number of guests
# Y axis is netperf number
# nr_cpus=8 and mem=12G
#Number of Guests    #Baseline    #ELVIS
1                    1119.3       .0
2                    1135.6       1130.2
3                    1135.5       1131.6
4                    1136.0       1127.1
5                    1118.6       1129.3
6                    1123.4       1129.8
7                    1128.7       1135.4
8                    1129.9       1137.5
9                    1130.6       1135.1
10                   1129.3       1138.9
14*                  1173.8       1216.9

#* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed.
#  I/O thread runs on CPU 14 or 15 depending on which guest it's serving

There's a simple graph at
http://people.redhat.com/~bdas/elvis/data/results.png
that shows how task affinity results in a jump and even without it,
as the number of guests increase, the shared vhost design performs
slightly better.

Observations:
1. In terms of stock performance, the results are comparable.
2. However, with a tuned setup, even without polling, we see an improvement
with the new design.
3. Making the new design simulate the old behavior would be a matter of setting
the number of guests per vhost thread to 1 (e.g. loading vhost with
devs_per_worker=1).
4. Maybe setting a per-guest limit on the work being done by a specific vhost
thread is needed for it to be fair.
5. cgroup associations need to be figured out. I just slightly hacked the
current cgroup association mechanism to work with the new model. Ccing cgroups
for input/comments.

Many thanks to Razya Ladelsky and Eyal Moscovici, IBM for the initial
patches, the helpful testing suggestions and discussions.

Bandan Das (4):
  vhost: Introduce a universal thread to serve all users
  vhost: Limit the number of devices served by a single worker thread
  cgroup: Introduce a function to compare cgroups
  vhost: Add cgroup-aware creation of worker threads

 drivers/vhost/net.c|   6 +-
 drivers/vhost/scsi.c   |  18 ++--
 drivers/vhost/vhost.c  | 272 +++--
 drivers/vhost/vhost.h  |  32 +-
 include/linux/cgroup.h |   1 +
 kernel/cgroup.c|  40 
 6 files changed, 275 insertions(+), 94 deletions(-)

-- 
2.4.3



Re: [PATCH] KVM: x86: Add host physical address width capability

2015-07-10 Thread Bandan Das
Laszlo Ersek ler...@redhat.com writes:

 On 07/10/15 16:59, Paolo Bonzini wrote:
 
 
 On 10/07/2015 16:57, Laszlo Ersek wrote:
 ... In any case, please understand that I'm not campaigning for this
 warning :) IIRC the warning was your (very welcome!) idea after I
 reported the problem; I'm just trying to ensure that the warning match
 the exact issue I encountered.

 Yup.  I think the right thing to do would be to hide memory above the
 limit.
 How so?

 - The stack would not be doing what the user asks for. Pass -m a_lot,
 and the guest would silently see less memory. If the user found out,
 he'd immediately ask (or set out debugging) why. I think if the user's
 request cannot be satisfied, the stack should fail hard.
 
 That's another possibility.  I think both of them are wrong depending on
 _why_ you're using -m a lot in the first place.
 
 Considering that this really happens (on Xeons) only for 1TB+ guests,

 I reported this issue because I ran into it with a ~64GB guest. From my
 /proc/cpuinfo:

 model name  : Intel(R) Core(TM) i7 CPU   M 620  @ 2.67GHz
 address sizes   : 36 bits physical, 48 bits virtual

 I was specifically developing 64GB+ support for OVMF, and this
 limitation caused me to think that there was a bug in my OVMF patches.
 (There wasn't.) An error message from QEMU, advising me to turn off EPT,
 would have saved me many hours.

Right, I specifically reserved a system with 36 physical address bits and
this was very easy to reproduce. If it's a hardware bug, I would say it's a
very annoying one (if not serious). I wonder if the Intel folks can
chime in.
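
(For anyone else poking at this: on such a host the guest boots again if
kvm_intel is reloaded with EPT off, i.e. something like
	rmmod kvm_intel; modprobe kvm_intel ept=0
assuming no other guests are running.)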

 Thanks
 Laszlo

 it's probably just for debugging and then hiding the memory makes some
 sense.
Actually, I agree with Laszlo here. Hiding memory is synonymous with forcing
the user to pass a smaller -m argument, just as failing is. But failing and
letting the user do it himself can save hours of debugging.

Regards,
The confused teenager who can't make up his mind.

 Paolo
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Add host physical address width capability

2015-07-09 Thread Bandan Das
Paolo Bonzini pbonz...@redhat.com writes:

 On 09/07/2015 08:43, Laszlo Ersek wrote:
 On 07/09/15 08:09, Paolo Bonzini wrote:


 On 09/07/2015 00:36, Bandan Das wrote:
 Let userspace inquire the maximum physical address width
 of the host processors; this can be used to identify maximum
 memory that can be assigned to the guest.

 Reported-by: Laszlo Ersek ler...@redhat.com
 Signed-off-by: Bandan Das b...@redhat.com
 ---
  arch/x86/kvm/x86.c   | 3 +++
  include/uapi/linux/kvm.h | 1 +
  2 files changed, 4 insertions(+)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index bbaf44e..97d6746 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, 
 long ext)
case KVM_CAP_NR_MEMSLOTS:
r = KVM_USER_MEM_SLOTS;
break;
 +  case KVM_CAP_PHY_ADDR_WIDTH:
 +  r = boot_cpu_data.x86_phys_bits;
 +  break;

 Userspace can just use CPUID, can't it?
 
 I believe KVM's cooperation is necessary, for the following reason:
 
 The truncation only occurs when the guest-phys - host-phys translation
 is done in hardware, *and* the phys bits of the host processor are
 insufficient to represent the highest guest-phys address that the guest
 will ever face.
 
 The first condition (of course) means that the truncation depends on EPT
 being enabled. (I didn't test on AMD so I don't know if RVI has the same
 issue.) If EPT is disabled, either because the host processor lacks it,
 or because the respective kvm_intel module parameter is set so, then the
 issue cannot be experienced.
 
 Therefore I believe a KVM patch is necessary.
 
 However, this specific patch doesn't seem sufficient; it should also
 consider whether EPT is enabled. (And the ioctl should be perhaps
 renamed to reflect that -- what QEMU needs to know is not the raw
 physical address width of the host processor, but whether that width
 will cause EPT to silently truncate high guest-phys addresses.)

 Right; if you want to consider whether EPT is enabled (which is the
 right thing to do, albeit it makes for a much bigger patch) a KVM patch
 is necessary.  In that case you also need to patch the API documentation.

Note that this patch really doesn't do anything except for printing a
message that something might potentially go wrong. Without EPT, you don't
hit the processor limitation with your setup, but the user should nevertheless
still be notified. In fact, I think shadow paging code should also emulate
this behavior if the gpa is out of range.
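
As an aside, the raw width itself is trivial to get from userspace, which is
what Paolo's CPUID remark refers to; leaf 0x80000008 reports it in EAX[7:0].
A minimal check could look like this (sketch, x86 only, assumes the leaf is
present):

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
			return 1;
		printf("physical address bits: %u\n", eax & 0xff);
		return 0;
	}

The point of a KVM capability would really be the EPT dependency, not the
number itself.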

 Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] target-i386: Sanity check host processor physical address width

2015-07-09 Thread Bandan Das
Laszlo Ersek ler...@redhat.com writes:
...

 First, see my comments on the KVM patch.

 Second, ram_size is not the right thing to compare. What should be
 checked is whether the highest guest-physical address that maps to RAM
 can be represented in the address width of the host processor (and only
 if EPT is enabled, but that sub-condition belongs to the KVM patch).

 Note that this is not the same as the check written in the patch. For
 example, if you assume a 32-bit PCI hole with size 1 GB, then a total
 guest RAM of size 63 GB will result in the highest guest-phys memory
 address being 0xF__, which just fits into 36 bits.

 Correspondingly, the above code would not print the warning for

   -m $((63 * 1024 + 1))

 on my laptop (which has address sizes   : 36 bits physical, ...), even
 though such a guest would not boot for me (with EPT enabled).

 Please see

 http://thread.gmane.org/gmane.comp.bios.tianocore.devel/15418/focus=15447

 So, ram_size in the controlling expression should be replaced with
 maximum_guest_ram_address (which should be inclusive, and the = relop
 should be preserved).
 also with memory hotplug tuned on we should check if the end of
 hotplug memory area is less then limit, i.e.:
 
   pcms->hotplug_memory.base + hotplug_mem_size < (1ULL << max_phys_bits)

 Seems reasonable, thanks for the hint!

Thanks Igor and Laszlo, makes sense. I am wondering if this 1GB PCI
hole is always fixed, so that I can simply include it when calculating the
maximum guest ram address ? Or do we have to figure that out every time ?
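
Just so we are talking about the same check, something along these lines is
what I had in mind (a sketch only; the 1G hole size and the names below are
assumptions, not what the board code actually uses):

	/* exclusive end of the highest RAM region the guest will ever see */
	static uint64_t max_guest_phys_end(uint64_t ram_size, uint64_t hole,
					   uint64_t hp_base, uint64_t hp_size)
	{
		uint64_t limit_4g = (4ULL << 30) - hole;
		uint64_t below_4g = ram_size < limit_4g ? ram_size : limit_4g;
		uint64_t end = ram_size > below_4g ?
			       (4ULL << 30) + (ram_size - below_4g) : below_4g;

		if (hp_size && hp_base + hp_size > end)
			end = hp_base + hp_size;
		return end;	/* warn if end > (1ULL << max_phys_bits) */
	}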

 (The LHS in this instance is exclusive though, so equality should *not*
 trigger the warning. maximum_guest_ram_address is inclusive, and
 equality should trigger the warning. (Although equality seems quite
 impossible in practice.))

 Thanks!
 Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Add host physical address width capability

2015-07-09 Thread Bandan Das
Bandan Das b...@redhat.com writes:

 Laszlo Ersek ler...@redhat.com writes:
 ...
 Yes.

 Without EPT, you don't
 hit the processor limitation with your setup, but the user should 
 nevertheless
 still be notified.

 I disagree.

 In fact, I think shadow paging code should also emulate
 this behavior if the gpa is out of range.

 I disagree.

 There is no out of range gpa. QEMU allocates enough memory, and it
 should be completely transparent to the guest. The fact that it silently
 breaks with nested paging if the host processor doesn't have enough
 address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
 sure, but I suspect it's a hardware bug). In any case the guest
 shouldn't care at all. It is a *virtual* machine, and the VMM should lie
 to it plausibly enough. How much RAM, and how many phys address bits the
 host has, is a performance question, but it should not be a correctness
 question. A 256 GB guest should run (slowly, but correctly) on a laptop
 that has only 4 GB of RAM and only 36 phys addr bits, but plenty of swap
 space.

 Because otherwise your argument could be extrapolated as TCG should
 break too if the gpa is 'out of range'.

 So, I disagree. Whatever memory you give to the guest should just work
 (unless of course you want to emulate a small address width for the
 *VCPU*, but that's absolutely not the use case here). What we have here
 is a leaky abstraction: a PCPU limitation giving away a lie that the
 guest should never notice. The guest should be able to use all memory
 that was specified with QEMU's -m, regardless of TCG vs. KVM-without-EPT
 vs. KVM-with-EPT. If the last case cannot work (due to hardware
 limitations), that's fine, but then (and only then) a warning should be
 printed.

 Hmm... Ok, I understand your point. So, this is more like a EPT
 limitation/bug in that Qemu isn't complaining about the memory assigned
 to the guest but EPT code is breaking owing to the processor physical
 address width. And honestly, I now think that this patch just makes the whole
 situation more confusing :) I am wondering if it's just possible for kvm to
 simply throw an error like a EPT misconfiguration or something ..

 Or in other words, if using a hardware assisted mechanism is just not
 possible, KVM will simply not let it run instead of letting a guest
 stuck in boot.

I noticed that when the guest gets stuck, trace shows an endless loop of
EXTERNAL_INTERRUPT exits with code 14 (PF).

There's a note in 28.2.2 of the spec that "No processors supporting
the Intel 64 architecture support more than 48 physical-address bits".
An attempt to use such an address causes a page fault. So, my first
guess was to print out the guest physical address. That seems to be
well beyond the range and is always 0xff000 (when the guest is stuck).

The other thing I can think of is the EPT entries have bits in the
51:N range set which is reserved and always 0. I haven't verified
but it looks like there's ept_misconfig_inspect_spte() that should
already catch this condition. I am out of ideas for today :)
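
For anyone who wants to look at the same thing, the stock kvm tracepoints
are enough to see the exit storm, e.g.:

	trace-cmd record -e kvm
	(start the guest, let it get stuck, then stop the recording)
	trace-cmd report | grep kvm_exit | less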


 ... In any case, please understand that I'm not campaigning for this
 warning :) IIRC the warning was your (very welcome!) idea after I
 reported the problem; I'm just trying to ensure that the warning match
 the exact issue I encountered.

 Thanks!
 Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] target-i386: Sanity check host processor physical address width

2015-07-09 Thread Bandan Das
Paolo Bonzini pbonz...@redhat.com writes:

 On 09/07/2015 10:26, Laszlo Ersek wrote:
  
  Perhaps KVM could simply hide memory above the limit (i.e. treat it as
  MMIO), and the BIOS could remove RAM above the limit from the e820
  memory map?
 I'd prefer to leave the guest firmware*s* out of this... :)
 
 E820 is a legacy BIOS concept. In OVMF we'd have to hack the memory
 resource descriptor HOBs (which in turn control the DXE memory space
 map, which in turn controls the UEFI memory map). Those HOBs are
 currently based on what the CMOS reports about the RAM available under
 and above 4GB.
 
 It's pretty complex already (will get more complex with SMM support),
 and TBH, for working around such an obscure issue, I wouldn't like to
 complicate it even further...
 
 After all, this is a host platform limitation. The solution should be to
 either move to a more capable host, or do it in software (disable EPT).

 The reason I mentioned the firmware is because you could in principle
 have the same issue on real hardware - say putting 128 GB on your
 laptop.  The firmware should cope with it.

Agreed, it's probably not a good idea to deviate too much from how real
hardware would behave IMO. As a simplification of Paolo's idea, is it
possible for qemu to completely ignore memory above the limit ? Will
that break anything ? :)

 If OVMF does not use etc/e820, it can instead hack the values it reads
 from CMOS, bounding them according to the CPUID value.

 Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Add host physical address width capability

2015-07-09 Thread Bandan Das
Laszlo Ersek ler...@redhat.com writes:
...
 Yes.

 Without EPT, you don't
 hit the processor limitation with your setup, but the user should 
 nevertheless
 still be notified.

 I disagree.

 In fact, I think shadow paging code should also emulate
 this behavior if the gpa is out of range.

 I disagree.

 There is no out of range gpa. QEMU allocates enough memory, and it
 should be completely transparent to the guest. The fact that it silently
 breaks with nested paging if the host processor doesn't have enough
 address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
 sure, but I suspect it's a hardware bug). In any case the guest
 shouldn't care at all. It is a *virtual* machine, and the VMM should lie
 to it plausibly enough. How much RAM, and how many phys address bits the
 host has, is a performance question, but it should not be a correctness
 question. A 256 GB guest should run (slowly, but correctly) on a laptop
 that has only 4 GB of RAM and only 36 phys addr bits, but plenty of swap
 space.

 Because otherwise your argument could be extrapolated as TCG should
 break too if the gpa is 'out of range'.

 So, I disagree. Whatever memory you give to the guest should just work
 (unless of course you want to emulate a small address width for the
 *VCPU*, but that's absolutely not the use case here). What we have here
 is a leaky abstraction: a PCPU limitation giving away a lie that the
 guest should never notice. The guest should be able to use all memory
 that was specified with QEMU's -m, regardless of TCG vs. KVM-without-EPT
 vs. KVM-with-EPT. If the last case cannot work (due to hardware
 limitations), that's fine, but then (and only then) a warning should be
 printed.

Hmm... Ok, I understand your point. So, this is more like a EPT
limitation/bug in that Qemu isn't complaining about the memory assigned
to the guest but EPT code is breaking owing to the processor physical
address width. And honestly, I now think that this patch just makes the whole
situation more confusing :) I am wondering if it's just possible for kvm to
simply throw an error like a EPT misconfiguration or something ..

Or in other words, if using a hardware assisted mechanism is just not
possible, KVM will simply not let it run instead of letting a guest
stuck in boot.


 ... In any case, please understand that I'm not campaigning for this
 warning :) IIRC the warning was your (very welcome!) idea after I
 reported the problem; I'm just trying to ensure that the warning match
 the exact issue I encountered.

 Thanks!
 Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] target-i386: Sanity check host processor physical address width

2015-07-09 Thread Bandan Das
Igor Mammedov imamm...@redhat.com writes:

 On Wed, 08 Jul 2015 18:42:01 -0400
 Bandan Das b...@redhat.com wrote:

 
 If a Linux guest is assigned more memory than is supported
 by the host processor, the guest is unable to boot. That
 is expected, however, there's no message indicating the user
 what went wrong. This change prints a message to stderr if
 KVM has the corresponding capability.
 
 Reported-by: Laszlo Ersek ler...@redhat.com
 Signed-off-by: Bandan Das b...@redhat.com
 ---
  linux-headers/linux/kvm.h | 1 +
  target-i386/kvm.c | 6 ++
  2 files changed, 7 insertions(+)
 
 diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
 index 3bac873..6afad49 100644
 --- a/linux-headers/linux/kvm.h
 +++ b/linux-headers/linux/kvm.h
 @@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_DISABLE_QUIRKS 116
  #define KVM_CAP_X86_SMM 117
  #define KVM_CAP_MULTI_ADDRESS_SPACE 118
 +#define KVM_CAP_PHY_ADDR_WIDTH 119
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 066d03d..66e3448 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -892,6 +892,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  uint64_t shadow_mem;
  int ret;
  struct utsname utsname;
 +int max_phys_bits;
  
  ret = kvm_get_supported_msrs(s);
  if (ret < 0) {
 @@ -945,6 +946,11 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  }
  }
  
 +max_phys_bits = kvm_check_extension(s, KVM_CAP_PHY_ADDR_WIDTH);
 max_phys_bits seems generic enough and could be applied to other targets
 as well.

I am a little clueless about other targets, but is figuring this out from
userspace as simple as it is on x86 (cpuid)? If not, then I agree, this
could be made a generic value.

Bandan

 making it a property of machine, would make accessing/manipulating it easier.
 define default value for machine/TCG mode and when KVM is enabled
 it would override/set its own limit.

 then any board could easily access machine-max_gpa to make board specific
 checks.

 +    if (max_phys_bits && (1ULL << max_phys_bits) <= ram_size)
 +        fprintf(stderr, "Warning: The amount of memory assigned to the guest "
 +                "is more than that supported by the host CPU(s). Guest may be unstable.\n");
 +
  if (kvm_check_extension(s, KVM_CAP_X86_SMM)) {
  smram_machine_done.notify = register_smram_listener;
  qemu_add_machine_init_done_notifier(smram_machine_done);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: x86: Add host physical address width capability

2015-07-08 Thread Bandan Das

Let userspace inquire the maximum physical address width
of the host processors; this can be used to identify maximum
memory that can be assigned to the guest.

Reported-by: Laszlo Ersek ler...@redhat.com
Signed-off-by: Bandan Das b...@redhat.com
---
 arch/x86/kvm/x86.c   | 3 +++
 include/uapi/linux/kvm.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bbaf44e..97d6746 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
case KVM_CAP_NR_MEMSLOTS:
r = KVM_USER_MEM_SLOTS;
break;
+   case KVM_CAP_PHY_ADDR_WIDTH:
+   r = boot_cpu_data.x86_phys_bits;
+   break;
case KVM_CAP_PV_MMU:/* obsolete */
r = 0;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 716ad4a..e7949a1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_DISABLE_QUIRKS 116
 #define KVM_CAP_X86_SMM 117
 #define KVM_CAP_MULTI_ADDRESS_SPACE 118
+#define KVM_CAP_PHY_ADDR_WIDTH 119
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.4.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] target-i386: Sanity check host processor physical address width

2015-07-08 Thread Bandan Das

If a Linux guest is assigned more memory than is supported
by the host processor, the guest is unable to boot. That
is expected, however, there's no message indicating the user
what went wrong. This change prints a message to stderr if
KVM has the corresponding capability.

Reported-by: Laszlo Ersek ler...@redhat.com
Signed-off-by: Bandan Das b...@redhat.com
---
 linux-headers/linux/kvm.h | 1 +
 target-i386/kvm.c | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 3bac873..6afad49 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_DISABLE_QUIRKS 116
 #define KVM_CAP_X86_SMM 117
 #define KVM_CAP_MULTI_ADDRESS_SPACE 118
+#define KVM_CAP_PHY_ADDR_WIDTH 119
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 066d03d..66e3448 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -892,6 +892,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 uint64_t shadow_mem;
 int ret;
 struct utsname utsname;
+int max_phys_bits;
 
 ret = kvm_get_supported_msrs(s);
  if (ret < 0) {
@@ -945,6 +946,11 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 }
 }
 
+    max_phys_bits = kvm_check_extension(s, KVM_CAP_PHY_ADDR_WIDTH);
+    if (max_phys_bits && (1ULL << max_phys_bits) <= ram_size)
+        fprintf(stderr, "Warning: The amount of memory assigned to the guest "
+                "is more than that supported by the host CPU(s). Guest may be unstable.\n");
+
 if (kvm_check_extension(s, KVM_CAP_X86_SMM)) {
 smram_machine_done.notify = register_smram_listener;
 qemu_add_machine_init_done_notifier(smram_machine_done);
-- 
2.4.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 100671] New: vmwrite error in vmx_vcpu_run

2015-06-29 Thread Bandan Das
bugzilla-dae...@bugzilla.kernel.org writes:

Can you please provide a little bit more information ?
When does the write error happen and what guest/host are you running ?
If it's a regression, would it be possible for you to bisect it ?
Is Bug 100661 related to the same hardware ?

Bandan

 https://bugzilla.kernel.org/show_bug.cgi?id=100671

 Bug ID: 100671
Summary: vmwrite error in vmx_vcpu_run
Product: Virtualization
Version: unspecified
 Kernel Version: 4.1.0+
   Hardware: Intel
 OS: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: kvm
   Assignee: virtualization_...@kernel-bugs.osdl.org
   Reporter: jemmy858...@gmail.com
 Regression: No

 Created attachment 181321
   -- https://bugzilla.kernel.org/attachment.cgi?id=181321&action=edit
 dmesg.txt

 [ 1614.416934] vmwrite error: reg 6820 value 6 (err 1)
 [ 1614.416939] CPU: 1 PID: 15701 Comm: qemu-system-x86 Not tainted 4.1.0+ #7
 [ 1614.416940] Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET94WW (2.54 )
 04/30/2013
 [ 1614.416941]   6138f41b 880007d1fc98
 8166ab0d
 [ 1614.416943]   880007c78000 880007d1fca8
 a082121a
 [ 1614.416945]  880007d1fcb8 a082123f 880007d1fcc8
 a08215ba
 [ 1614.416946] Call Trace:
 [ 1614.416952]  [8166ab0d] dump_stack+0x45/0x57
 [ 1614.416961]  [a082121a] vmwrite_error+0x2a/0x30 [kvm_intel]
 [ 1614.416963]  [a082123f] vmcs_writel+0x1f/0x30 [kvm_intel]
 [ 1614.416966]  [a08215ba] vmx_set_rflags+0x3a/0x40 [kvm_intel]
 [ 1614.416978]  [a07a97fa] __kvm_set_rflags+0x4a/0x60 [kvm]
 [ 1614.416985]  [a07af7a0] x86_emulate_instruction+0x590/0x740 [kvm]
 [ 1614.416988]  [810863b1] ? __set_task_blocked+0x41/0xa0
 [ 1614.416995]  [a07afc6c] complete_emulated_pio+0x3c/0x60 [kvm]
 [ 1614.417003]  [a07b34bc] kvm_arch_vcpu_ioctl_run+0x3cc/0x420 [kvm]
 [ 1614.417009]  [a079b81e] kvm_vcpu_ioctl+0x33e/0x600 [kvm]
 [ 1614.417011]  [81204fa4] do_vfs_ioctl+0x2c4/0x4a0
 [ 1614.417013]  [8129e05d] ? selinux_file_ioctl+0x4d/0xc0
 [ 1614.417015]  [812051f9] SyS_ioctl+0x79/0x90
 [ 1614.417017]  [8167126e] entry_SYSCALL_64_fastpath+0x12/0x71
 [ 1614.417020] vmwrite error: reg c08 value 0 (err 1)
 [ 1614.417021] CPU: 1 PID: 15701 Comm: qemu-system-x86 Not tainted 4.1.0+ #7
 [ 1614.417022] Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET94WW (2.54 )
 04/30/2013
 [ 1614.417023]   6138f41b 880007d1fc78
 8166ab0d
 [ 1614.417024]   880007c78000 880007d1fc88
 a082121a
 [ 1614.417026]  880007d1fc98 a082123f 880007d1fcd8
 a0823bdc
 [ 1614.417027] Call Trace:
 [ 1614.417029]  [8166ab0d] dump_stack+0x45/0x57
 [ 1614.417032]  [a082121a] vmwrite_error+0x2a/0x30 [kvm_intel]
 [ 1614.417034]  [a082123f] vmcs_writel+0x1f/0x30 [kvm_intel]
 [ 1614.417036]  [a0823bdc] vmx_save_host_state+0x1ac/0x1e0
 [kvm_intel]
 [ 1614.417043]  [a07ac577] vcpu_enter_guest+0x4f7/0xdb0 [kvm]
 [ 1614.417050]  [a07af7a0] ? x86_emulate_instruction+0x590/0x740
 [kvm]
 [ 1614.417051]  [810863b1] ? __set_task_blocked+0x41/0xa0
 [ 1614.417058]  [a07b33e5] kvm_arch_vcpu_ioctl_run+0x2f5/0x420 [kvm]
 [ 1614.417063]  [a079b81e] kvm_vcpu_ioctl+0x33e/0x600 [kvm]
 [ 1614.417065]  [81204fa4] do_vfs_ioctl+0x2c4/0x4a0
 [ 1614.417067]  [8129e05d] ? selinux_file_ioctl+0x4d/0xc0
 [ 1614.417069]  [812051f9] SyS_ioctl+0x79/0x90
 [ 1614.417070]  [8167126e] entry_SYSCALL_64_fastpath+0x12/0x71
 [ 1614.417071] vmwrite error: reg c0a value 0 (err 1)
 [ 1614.417073] CPU: 1 PID: 15701 Comm: qemu-system-x86 Not tainted 4.1.0+ #7
 [ 1614.417073] Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET94WW (2.54 )
 04/30/2013
 [ 1614.417074]   6138f41b 880007d1fc78
 8166ab0d
 [ 1614.417075]   880007c78000 880007d1fc88
 a082121a
 [ 1614.417077]  880007d1fc98 a082123f 880007d1fcd8
 a0823acb
 [ 1614.417078] Call Trace:
 [ 1614.417080]  [8166ab0d] dump_stack+0x45/0x57
 [ 1614.417082]  [a082121a] vmwrite_error+0x2a/0x30 [kvm_intel]
 [ 1614.417084]  [a082123f] vmcs_writel+0x1f/0x30 [kvm_intel]
 [ 1614.417087]  [a0823acb] vmx_save_host_state+0x9b/0x1e0 
 [kvm_intel]
 [ 1614.417093]  [a07ac577] vcpu_enter_guest+0x4f7/0xdb0 [kvm]
 [ 1614.417099]  [a07af7a0] ? x86_emulate_instruction+0x590/0x740
 [kvm]
 [ 1614.417100]  [810863b1] ? __set_task_blocked+0x41/0xa0
 [ 1614.417107]  [a07b33e5] kvm_arch_vcpu_ioctl_run+0x2f5/0x420 [kvm]
 [ 1614.417112]  [a079b81e] kvm_vcpu_ioctl+0x33e/0x600 [kvm]
 [ 1614.417114]  

Re: Hang on reboot in FreeBSD guest on Linux KVM host

2015-06-24 Thread Bandan Das
Paolo Bonzini pbonz...@redhat.com writes:
...

 I did this:

 1) download
 http://download.pcbsd.org/iso/10.1-RELEASE/amd64/PCBSD10.1.2-x64-trueos-server.raw.xz
 and unpack it

 2) run it with qemu-kvm -drive
 if=virtio,PCBSD10.1.2-x64-trueos-server.raw -smp 2

 3) login as root/pcbsd, type reboot

 I would like to know if I'm doing anything wrong.  My machine is a Xeon
 E5 v3 (Haswell).  My SeaBIOS build doesn't have the atkbd0 bug, but just

I thought (from the bug report) that the bug is only reproducible
on v2 Xeons ?

 to rule that out, can you send me your Seabios binary
 (/usr/share/qemu/bios*.bin) as well?

 Paolo
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hang on reboot in FreeBSD guest on Linux KVM host

2015-06-22 Thread Bandan Das
John Nielsen li...@jnielsen.net writes:

 On Jun 17, 2014, at 10:48 AM, John Nielsen li...@jnielsen.net wrote:

 On Jun 17, 2014, at 12:05 AM, Gleb Natapov g...@kernel.org wrote:
 
 On Tue, Jun 17, 2014 at 06:21:23AM +0200, Paolo Bonzini wrote:
 Il 16/06/2014 18:47, John Nielsen ha scritto:
 On Jun 16, 2014, at 10:39 AM, Paolo Bonzini pbonz...@redhat.com
 wrote:
 
 Il 16/06/2014 18:09, John Nielsen ha scritto:
 The only substantial difference on the hardware side is the
 CPU.  The hosts where the problem occurs use Intel(R)
 Xeon(R) CPU E5-2650 v2 @ 2.60GHz, while the hosts that don't
 show the problem use the prior revision, Intel(R) Xeon(R)
 CPU E5-2650 0 @ 2.00GHz.
  Can you do grep . /sys/module/kvm_intel/parameters/* on both
 hosts please?
  No differences that I can see. Output below.
  Not really:
 
 Working host: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz # grep
 . /sys/module/kvm_intel/parameters/*
 /sys/module/kvm_intel/parameters/enable_apicv:N
 
 Problem host: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz # grep
 . /sys/module/kvm_intel/parameters/*
 /sys/module/kvm_intel/parameters/enable_apicv:Y
  So we have a clue.  Let me study the code more, I'll try to get
 back with a suggestion.
  Wow, can't believe I missed that. Good catch!
 
 Does disabling apicv on E5-2650 v2 make reboot problem go away?
  Yes it does!
 
 # modprobe kvm_intel /sys/module/kvm_intel/parameters/enable_apicv:Y
 # /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m
 512 -smp 2,sockets=1,cores=1,threads=2 -drive
 file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2
 -device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net
 none
 
 [problem occurs]
 
 # rmmod kvm_intel # modprobe kvm_intel enable_apicv=N
 /sys/module/kvm_intel/parameters/enable_apicv:N #
 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512
 -smp 2,sockets=1,cores=1,threads=2 -drive
 file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2
 -device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net
 none
 
 [problem does not occur]
 
 Thank you. This both narrows the problem considerably and provides
 an acceptable workaround. It would still be nice to see it fixed, of
 course. Keep me CC'ed as I'm not on the KVM list.

 I’m resurrecting an old thread since I haven’t heard anything in a
 while. Has anyone looked in to the KVM+apicv bug documented above as
 well as here:

 https://bugs.launchpad.net/qemu/+bug/1329956 ?

 If appropriate, where should I go to file a KVM bug (since this isn’t
 really Qemu’s problem)?

Hi John, does this happen with the latest upstream kernel version ?

Bandan

 Thanks,

  JN


[PATCH] KVM: nSVM: Check for NRIPS support before updating control field

2015-06-11 Thread Bandan Das

If hardware doesn't support DecodeAssist - a feature that provides
more information about the intercept in the VMCB, KVM decodes the
instruction and then updates the next_rip vmcb control field.
However, NRIP support itself depends on cpuid Fn8000_000A_EDX[NRIPS].
Since skip_emulated_instruction() doesn't verify nrip support
before accepting control.next_rip as valid, avoid writing this
field if support isn't present.
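
(For reference, whether the host actually has NRIPS can be checked from the
svm feature flags; on Linux the flag should show up as "nrip_save" in
/proc/cpuinfo, e.g. grep -o nrip_save /proc/cpuinfo | head -1. That is my
reading of the cpufeature tables, worth double checking on an affected box.)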

Signed-off-by: Bandan Das b...@redhat.com
---
 arch/x86/kvm/svm.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 9afa233..4911bf1 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -511,8 +511,10 @@ static void skip_emulated_instruction(struct kvm_vcpu 
*vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
-   if (svm->vmcb->control.next_rip != 0)
+   if (svm->vmcb->control.next_rip != 0) {
+   WARN_ON(!static_cpu_has(X86_FEATURE_NRIPS));
    svm->next_rip = svm->vmcb->control.next_rip;
+   }
 
    if (!svm->next_rip) {
if (emulate_instruction(vcpu, EMULTYPE_SKIP) !=
@@ -4317,7 +4319,9 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
break;
}
 
-   vmcb->control.next_rip  = info->next_rip;
+   /* TODO: Advertise NRIPS to guest hypervisor unconditionally */
+   if (static_cpu_has(X86_FEATURE_NRIPS))
+   vmcb->control.next_rip  = info->next_rip;
    vmcb->control.exit_code = icpt_info.exit_code;
vmexit = nested_svm_exit_handled(svm);
 
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] [x86/kvm] emulate.c: Fix comppilation warning

2015-06-09 Thread Bandan Das
Rajat Jain raja...@google.com writes:

 Fix the following warning:

 arch/x86/kvm/emulate.c: In function '__do_insn_fetch_bytes':
 arch/x86/kvm/emulate.c:814:47: warning: 'linear' may be used uninitialized in 
 this function [-Wmaybe-uninitialized]
 arch/x86/kvm/emulate.c:793:16: note: 'linear' was declared here

Is this recent gcc behavior ? If yes, please mention that in the commit
message.

Please follow convention for the subject line and fix the typo -
KVM: emulate: Fix compilation warning

Thanks,
Bandan


 Signed-off-by: Rajat Jain raja...@google.com
 Signed-off-by: Rajat Jain rajatxj...@gmail.com
 ---
 v2: Fix the wrong email address.

  arch/x86/kvm/emulate.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index 630bcb0..f06e0ca 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -790,7 +790,7 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt 
 *ctxt, int op_size)
  {
   int rc;
   unsigned size, max_size;
 - unsigned long linear;
 + unsigned long linear = 0;
   int cur_size = ctxt->fetch.end - ctxt->fetch.data;
   struct segmented_address addr = { .seg = VCPU_SREG_CS,
  .ea = ctxt->eip + cur_size };
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: x86: default legacy PCI device assignment support to n

2015-06-04 Thread Bandan Das
Paolo Bonzini pbonz...@redhat.com writes:

 VFIO has proved itself a much better option than KVM's built-in
 device assignment.  It is mature, provides better isolation because
 it enforces ACS, and even the userspace code is being tested on
 a wider variety of hardware these days than the legacy support.

 Disable legacy device assignment by default.

Shouldn't we mark it as Deprecated then ?

Bandan
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  arch/x86/kvm/Kconfig | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

 diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
 index 413a7bf9efbb..a0f06a5947c5 100644
 --- a/arch/x86/kvm/Kconfig
 +++ b/arch/x86/kvm/Kconfig
 @@ -88,13 +88,14 @@ config KVM_MMU_AUDIT
  config KVM_DEVICE_ASSIGNMENT
   bool KVM legacy PCI device assignment support
   depends on KVM  PCI  IOMMU_API
 - default y
 + default n
   ---help---
 Provide support for legacy PCI device assignment through KVM.  The
 kernel now also supports a full featured userspace device driver
 -   framework through VFIO, which supersedes much of this support.
 +   framework through VFIO, which supersedes this support and provides
 +   better security.
  
 -   If unsure, say Y.
 +   If unsure, say N.
  
  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
  # the virtualization menu.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: x86: default legacy PCI device assignment support to n

2015-06-04 Thread Bandan Das
Paolo Bonzini pbonz...@redhat.com writes:

 On 04/06/2015 16:31, Bandan Das wrote:
  VFIO has proved itself a much better option than KVM's built-in
  device assignment.  It is mature, provides better isolation because
  it enforces ACS, and even the userspace code is being tested on
  a wider variety of hardware these days than the legacy support.
 
  Disable legacy device assignment by default.

 Shouldn't we mark it as Deprecated then ?

 Yes, good idea!  You mean just  (DEPRECATED) after the string, right?

Yeah, that is what I meant. The Help message does mention VFIO as the
newer option. Sometimes, seeing the word DEPRECATED makes one rethink if they
really need to enable it.

 Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 98841] New: android emulator broken since d7a2a24 kernel commit

2015-05-27 Thread Bandan Das
bugzilla-dae...@bugzilla.kernel.org writes:

 https://bugzilla.kernel.org/show_bug.cgi?id=98841

 Bug ID: 98841
Summary: android emulator broken since d7a2a24 kernel commit
Product: Virtualization
Version: unspecified
 Kernel Version: 3.17
   Hardware: All
 OS: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: kvm
   Assignee: virtualization_...@kernel-bugs.osdl.org
   Reporter: adrian.sa...@asandu.eu
 Regression: No

 Hello,

 For a while now I've noticed the android emulator not wanting to work anymore
 ..
 With a lil' bit of git bisect I've found this kernel commit
 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d7a2a246a1b5a0b0c803e800019600051e1e6f1a
 that seems to break it.

 I've mailed the author .. submited a report to the google guys ( not sure if I
 did it right ) https://code.google.com/p/android/issues/detail?id=174557 ...

 The problem was first reported by someone @ fedora .. 
 https://bugzilla.redhat.com/show_bug.cgi?id=1187982

 Alot of details ( run logs ) are there.

 Let me know if I can help test stuff.

Ok, a quick look and it seems the patch is doing the right thing
(as the spec says). For PHYSBASE it's masking out bits 8-11 and for PHYSMASK
bits 0-10. Are you sure the Android Emulator isn't attempting to
write to these reserved regions ?

What happens if you enable ignore_msrs ?
echo Y > /sys/module/kvm/parameters/ignore_msrs

Bandan

 Thanks in advance.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: MMU: fix SMAP virtualization

2015-05-22 Thread Bandan Das
Boris Ostrovsky boris.ostrov...@oracle.com writes:

 On 05/11/2015 10:55 AM, Xiao Guangrong wrote:
 KVM may turn a user page to a kernel page when kernel writes a readonly
 user page if CR0.WP = 1. This shadow page entry will be reused after
 SMAP is enabled so that kernel is allowed to access this user page

 Fix it by setting SMAP  !CR0.WP into shadow page's role and reset mmu
 once CR4.SMAP is updated

 Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com
 ---



 @@ -4208,12 +4211,18 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t 
 gpa,
 const u8 *new, int bytes)
   {
   gfn_t gfn = gpa >> PAGE_SHIFT;
 -union kvm_mmu_page_role mask = { .word = 0 };
  struct kvm_mmu_page *sp;
  LIST_HEAD(invalid_list);
  u64 entry, gentry, *spte;
  int npte;
  bool remote_flush, local_flush, zap_page;
 +union kvm_mmu_page_role mask = (union kvm_mmu_page_role) {
 +.cr0_wp = 1,
 +.cr4_pae = 1,
 +.nxe = 1,
 +.smep_andnot_wp = 1,
 +.smap_andnot_wp = 1,
 +};




 This breaks older compilers that can't initialize anon structures.

How old ? Even gcc 3.1 says you can use unnamed struct/union fields and
3.2 is the minimum version required to compile the kernel as mentioned
in the README.

We could simply just name the structure, but I doubt this is the
only place in the kernel code where it's being used this way :)
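
If it does turn out to matter for some ancient toolchain, the fallback that
works everywhere is to drop the compound literal and set the fields by hand,
roughly (untested):

	union kvm_mmu_page_role mask;

	mask.word = 0;
	mask.cr0_wp = 1;
	mask.cr4_pae = 1;
	mask.nxe = 1;
	mask.smep_andnot_wp = 1;
	mask.smap_andnot_wp = 1;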

Bandan

 -boris
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: APIC_ID in apic_reg_write()

2015-04-30 Thread Bandan Das
Jan Kiszka jan.kis...@siemens.com writes:
...

 
 Setting r/w permissions on a per-model basis is a little overkill, don't you think ?

 If we want accurate behaviour, we should do this. If not, we probably
 better leave the code alone to avoid surprises for preexisting
 host/guest setups. Modern OSes do not care anyway, but special ones may
 get unhappy.

Agreed, better to leave it alone.

 Jan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: nVMX: Don't return error on nested bitmap memory allocation failure

2015-04-29 Thread Bandan Das
Jan Kiszka jan.kis...@siemens.com writes:

 Am 2015-04-28 um 21:55 schrieb Bandan Das:
 
 If get_free_page() fails for nested bitmap area, it's evident that
 we are gonna get screwed anyway but returning failure because we failed
 allocating memory for a nested structure seems like an unnecessary big
 hammer. Also, save the call for later; after we are done with other
 non-nested allocations.

 Frankly, I prefer failures over automatic degradations. And, as you
 noted, the whole system will probably explode anyway if allocation of a
 single page already fails. So what does this buy us?

Yeah... I hear you. Ok, let me put it this way - assume that we can
defer this allocation up until the point that the nested subsystem is
actually used, i.e. L1 tries running a guest and we try to allocate this
area. If get_free_page() failed in that case, would we still want to
kill L1 too ? I guess no.

Also, assume we had a printk in there - "Failed allocating memory for
nested bitmap" - the novice user is going to get confused about why he's
getting an error about nested virtualization (for the not so distant
future when nested is enabled by default :))

 What could makes sense is making the allocation of the vmread/write
 bitmap depend on enable_shadow_vmcs, and that again depend on nested.

Thanks for the suggestion. I will take a look at this one.

 Jan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RESEND kvm-unit-tests] x86/run: Rearrange the valid binary and testdev support checks

2015-04-28 Thread Bandan Das

This extends the sanity checks done on known common Qemu binary
paths when the user supplies a QEMU= on the command line

Fixes: b895b967db94937d5b593c51b95eb32d2889a764

Signed-off-by: Bandan Das b...@redhat.com
---
 x86/run | 43 +++
 1 file changed, 19 insertions(+), 24 deletions(-)

diff --git a/x86/run b/x86/run
index 219a93b..5281fca 100755
--- a/x86/run
+++ b/x86/run
@@ -2,33 +2,28 @@
 NOTFOUND=1
 TESTDEVNOTSUPP=2
 
-qemukvm=${QEMU:-qemu-kvm}
-qemusystem=${QEMU:-qemu-system-x86_64}
+qemubinarysearch="${QEMU:-qemu-kvm qemu-system-x86_64}"
 
-if	! [ -z "${QEMU}" ]
-then
-	qemu=${QEMU}
-else
-	for qemucmds in ${qemukvm} ${qemusystem}
-	do
-		unset QEMUFOUND
-		unset qemu
-		if ! [ -z "${QEMUFOUND=$(${qemucmds} --help 2>/dev/null | grep QEMU)}" ] &&
-		   ${qemucmds} -device '?' 2>&1 | grep -F -e \"testdev\" -e \"pc-testdev\" > /dev/null;
-		then
-			qemu=${qemucmds}
-			break
-		fi
-	done
-	if	[ -z "${QEMUFOUND}" ]
-	then
-		echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
-		exit ${NOTFOUND}
-	elif	[ -z "${qemu}" ]
+for qemucmd in ${qemubinarysearch}
+do
+	unset QEMUFOUND
+	unset qemu
+	if ! [ -z "${QEMUFOUND=$(${qemucmd} --help 2>/dev/null | grep QEMU)}" ] &&
+	   ${qemucmd} -device '?' 2>&1 | grep -F -e \"testdev\" -e \"pc-testdev\" > /dev/null;
 	then
-		echo "No Qemu test device support found"
-		exit ${TESTDEVNOTSUPP}
+		qemu=${qemucmd}
+		break
 	fi
+done
+
+if	[ -z "${QEMUFOUND}" ]
+then
+	echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
+	exit ${NOTFOUND}
+elif	[ -z "${qemu}" ]
+then
+	echo "No Qemu test device support found"
+	exit ${TESTDEVNOTSUPP}
 fi
 
 if
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: nVMX: Don't return error on nested bitmap memory allocation failure

2015-04-28 Thread Bandan Das

If get_free_page() fails for the nested bitmap area, it's evident that
we are gonna get screwed anyway, but returning failure because we failed
allocating memory for a nested structure seems like an unnecessarily big
hammer. Also, save the call for later; after we are done with other
non-nested allocations.

Signed-off-by: Bandan Das b...@redhat.com
---
 arch/x86/kvm/vmx.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f7b6168..200bc5c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6039,20 +6039,22 @@ static __init int hardware_setup(void)
if (!vmx_msr_bitmap_longmode_x2apic)
goto out4;
 
-   if (nested) {
-   vmx_msr_bitmap_nested =
-   (unsigned long *)__get_free_page(GFP_KERNEL);
-   if (!vmx_msr_bitmap_nested)
-   goto out5;
-   }
-
vmx_vmread_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
if (!vmx_vmread_bitmap)
-   goto out6;
+   goto out5;
 
vmx_vmwrite_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
if (!vmx_vmwrite_bitmap)
-   goto out7;
+   goto out6;
+
+   if (nested) {
+   vmx_msr_bitmap_nested =
+   (unsigned long *)__get_free_page(GFP_KERNEL);
+   if (!vmx_msr_bitmap_nested) {
+   printk(KERN_WARNING
+  "vmx: Failed getting memory for nested bitmap\n");
+   nested = 0;
+   }
 
memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE);
memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE);
@@ -6073,7 +6075,7 @@ static __init int hardware_setup(void)
 
   if (setup_vmcs_config(&vmcs_config) < 0) {
r = -EIO;
-   goto out8;
+   goto out7;
}
 
if (boot_cpu_has(X86_FEATURE_NX))
@@ -6190,13 +6192,12 @@ static __init int hardware_setup(void)
 
return alloc_kvm_area();
 
-out8:
-   free_page((unsigned long)vmx_vmwrite_bitmap);
 out7:
-   free_page((unsigned long)vmx_vmread_bitmap);
-out6:
if (nested)
free_page((unsigned long)vmx_msr_bitmap_nested);
+   free_page((unsigned long)vmx_vmwrite_bitmap);
+out6:
+   free_page((unsigned long)vmx_vmread_bitmap);
 out5:
free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic);
 out4:
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH kvm-unit-tests] x86/run: Rearrange the valid binary and testdev support checks

2015-04-24 Thread Bandan Das

Ping, Paolo, did this slip through the cracks ?

Bandan Das b...@redhat.com writes:

 This extends the sanity checks done on known common Qemu binary
 paths when the user supplies a QEMU= on the command line

 Fixes: b895b967db94937d5b593c51b95eb32d2889a764

 Signed-off-by: Bandan Das b...@redhat.com
 ---
  x86/run | 43 +++
  1 file changed, 19 insertions(+), 24 deletions(-)

 diff --git a/x86/run b/x86/run
 index 219a93b..5281fca 100755
 --- a/x86/run
 +++ b/x86/run
 @@ -2,33 +2,28 @@
  NOTFOUND=1
  TESTDEVNOTSUPP=2
  
 -qemukvm=${QEMU:-qemu-kvm}
 -qemusystem=${QEMU:-qemu-system-x86_64}
 +qemubinarysearch="${QEMU:-qemu-kvm qemu-system-x86_64}"
  
 -if	! [ -z "${QEMU}" ]
 -then
 -	qemu=${QEMU}
 -else
 -	for qemucmds in ${qemukvm} ${qemusystem}
 -	do
 -		unset QEMUFOUND
 -		unset qemu
 -		if ! [ -z "${QEMUFOUND=$(${qemucmds} --help 2>/dev/null | grep QEMU)}" ] &&
 -		   ${qemucmds} -device '?' 2>&1 | grep -F -e \"testdev\" -e \"pc-testdev\" > /dev/null;
 -		then
 -			qemu=${qemucmds}
 -			break
 -		fi
 -	done
 -	if	[ -z "${QEMUFOUND}" ]
 -	then
 -		echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
 -		exit ${NOTFOUND}
 -	elif	[ -z "${qemu}" ]
 +for qemucmd in ${qemubinarysearch}
 +do
 +	unset QEMUFOUND
 +	unset qemu
 +	if ! [ -z "${QEMUFOUND=$(${qemucmd} --help 2>/dev/null | grep QEMU)}" ] &&
 +	   ${qemucmd} -device '?' 2>&1 | grep -F -e \"testdev\" -e \"pc-testdev\" > /dev/null;
  	then
 -		echo "No Qemu test device support found"
 -		exit ${TESTDEVNOTSUPP}
 +		qemu=${qemucmd}
 +		break
  	fi
 +done
 +
 +if	[ -z "${QEMUFOUND}" ]
 +then
 +	echo "A QEMU binary was not found, You can set a custom location by using the QEMU=<path> environment variable"
 +	exit ${NOTFOUND}
 +elif	[ -z "${qemu}" ]
 +then
 +	echo "No Qemu test device support found"
 +	exit ${TESTDEVNOTSUPP}
  fi
  
  if
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: x2apic issues with Solaris and Xen guests

2015-04-20 Thread Bandan Das
Michael Tokarev m...@tls.msk.ru writes:

 20.04.2015 20:29, Jan Kiszka wrote:
 On 2015-04-20 19:07, Stefan Hajnoczi wrote:
 I wonder whether the following two x2apic issues are related:

 Solaris 10 U11 network doesn't work
 https://bugzilla.redhat.com/show_bug.cgi?id=1040500

 kvm - fails to setup timer interrupt via io-apic
 (Thanks to Michael Tokarev for posting this link)
 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=528077#68
 []
 Has anyone looked into this?
 
 Not yet. Is there a handy reproduction guest image? Or maybe someone
 would like to start with tracing what the guest and the host do.

 The second link gives a trivial reproducer, you need just the
 xen hypervisor binary and some kernel.  This should be easy
 too, because it happens right on boot.  But I guess it requires
 some inner knowlege of xen early boot machinery.

Have you tried Radim's patch ?

commit c806a6ad35bfa6c92249cd0ca4772d5ac3f8cb68
Author: Radim Krčmář rkrc...@redhat.com
Date:   Wed Mar 18 19:38:22 2015 +0100

KVM: x86: call irq notifiers with directed EOI


 Thanks,

 /mjt

 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vt-x: Preserve host CR4.MCE value while in guest mode.

2015-04-16 Thread Bandan Das
Ben Serebrin sereb...@google.com writes:

I suggest changing the subject to
KVM: VMX: Preserve host CR4.MCE value while in guest mode
That keeps it consistent with the $SUBJECT naming convention KVM
follows.

If Paolo is ok with changing it in his tree directly, that's fine too.

Bandan

 The host's decision to enable machine check exceptions should remain
 in force during non-root mode.  KVM was writing 0 to cr4 on VCPU reset
 and passed a slightly-modified 0 to the vmcs.guest_cr4 value.

 Tested: Built.
 On earlier version, tested by injecting machine check
 while a guest is spinning.

 Before the change, if guest CR4.MCE==0, then the machine check is
 escalated to Catastrophic Error (CATERR) and the machine dies.
 If guest CR4.MCE==1, then the machine check causes VMEXIT and is
 handled normally by host Linux. After the change, injecting a machine
 check causes normal Linux machine check handling.

 Signed-off-by: Ben Serebrin sereb...@google.com
 ---
  arch/x86/kvm/vmx.c | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index f5e8dce..f7b6168 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -3622,8 +3622,16 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, 
 unsigned long cr3)
  
  static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
  {
  - unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
 - KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 + /*
 +  * Pass through host's Machine Check Enable value to hw_cr4, which
 +  * is in force while we are in guest mode.  Do not let guests control
 +  * this bit, even if host CR4.MCE == 0.
 +  */
 + unsigned long hw_cr4 =
  + (cr4_read_shadow() & X86_CR4_MCE) |
  + (cr4 & ~X86_CR4_MCE) |
  + (to_vmx(vcpu)->rmode.vm86_active ?
 +  KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
  
   if (cr4 & X86_CR4_VMXE) {
   /*
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] vhost: use kthread_run macro

2015-04-14 Thread Bandan Das

Code cleanup: kthread_run() is a convenient wrapper
around kthread_create() that also wakes up the process
for us. Use it, and remove the no-longer-needed temporary
task_struct variable.
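
In other words, the two forms below should be equivalent (modulo the error
handling), which is all this patch relies on:

	/* before: create, then wake explicitly */
	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
	if (!IS_ERR(worker))
		wake_up_process(worker);

	/* after: kthread_run() does both in one go */
	dev->worker = kthread_run(vhost_worker, dev, "vhost-%d", current->pid);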

Signed-off-by: Bandan Das b...@redhat.com
---
 drivers/vhost/vhost.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..9ac66f7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -366,7 +366,6 @@ EXPORT_SYMBOL_GPL(vhost_dev_has_owner);
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-   struct task_struct *worker;
int err;
 
/* Is there an owner already? */
@@ -377,15 +376,12 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
/* No owner, become one */
    dev->mm = get_task_mm(current);
-   worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-   if (IS_ERR(worker)) {
-   err = PTR_ERR(worker);
+   dev->worker = kthread_run(vhost_worker, dev, "vhost-%d", current->pid);
+   if (IS_ERR(dev->worker)) {
+   err = PTR_ERR(dev->worker);
    goto err_worker;
    }
 
-   dev->worker = worker;
-   wake_up_process(worker);/* avoid contributing to loadavg */
-
err = vhost_attach_cgroups(dev);
if (err)
goto err_cgroup;
@@ -396,7 +392,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
return 0;
 err_cgroup:
-   kthread_stop(worker);
+   kthread_stop(dev->worker);
    dev->worker = NULL;
 err_worker:
    if (dev->mm)
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: x86: Question regarding the reset value of LINT0

2015-04-09 Thread Bandan Das
Nadav Amit nadav.a...@gmail.com writes:

 Avi Kivity avi.kiv...@gmail.com wrote:

 On 04/09/2015 09:21 PM, Nadav Amit wrote:
 Bandan Das b...@redhat.com wrote:
 
 Nadav Amit nadav.a...@gmail.com writes:
 
 Jan Kiszka jan.kis...@siemens.com wrote:
 
 On 2015-04-08 19:40, Nadav Amit wrote:
 Jan Kiszka jan.kis...@siemens.com wrote:
 
 On 2015-04-08 18:59, Nadav Amit wrote:
 Jan Kiszka jan.kis...@siemens.com wrote:
 
 On 2015-04-08 18:40, Nadav Amit wrote:
 Hi,
 
 I would appreciate if someone explains the reason for enabling 
 LINT0 during
 APIC reset. This does not correspond with Intel SDM Figure 10-8: 
 “Local
 Vector Table” that says all LVT registers are reset to 0x1.
 
 In kvm_lapic_reset, I see:
 
 apic_set_reg(apic, APIC_LVT0,
 SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));
 
 Which is actually pretty similar to QEMU’s apic_reset_common:
 
 if (bsp) {
 /*
  * LINT0 delivery mode on CPU #0 is set to ExtInt at 
 initialization
  * time typically by BIOS, so PIC interrupt can be delivered to 
 the
  * processor when local APIC is enabled.
  */
  s->lvt[APIC_LVT_LINT0] = 0x700;
 }
 
 Yet, in both cases, I miss the point - if it is typically done by 
 the BIOS,
 why does QEMU or KVM enable it?
 
 BTW: KVM seems to run fine without it, and I think setting it 
 causes me
 problems in certain cases.
 I suspect it has some historic BIOS backgrounds. Already tried to 
 find
 more information in the git logs of both code bases? Or something 
 that
 indicates of SeaBIOS or BochsBIOS once didn't do this initialization?
 Thanks. I found no indication of such thing.
 
 QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says:
 
  Don't route PIC interrupts through the local APIC if the local APIC
  config says so. By Ari Kivity.
 
 Maybe Avi Kivity knows this guy.
 ths? That should have been Thiemo Seufer (IIRC), but he just committed
 the code back then (and is no longer with us, sadly).
 Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny 
 joke
 about Avi knowing “Ari”).
 Ah. No problem. My brain apparently fixed that typo up unnoticed.
 
 But if that commit went in without any BIOS changes around it, QEMU
 simply had to do the job of the latter to keep things working.
 So should I leave it as is? Can I at least disable in KVM during INIT 
 (and
 leave it as is for RESET)?
 No, I don't think there is a need to leave this inaccurate for QEMU if
 our included BIOS gets it right. I don't know what the backward
 bug-compatibility of KVM is, though. Maybe you can identify since when
 our BIOS is fine so that we can discuss time frames.
 I think that it was addressed in commit
 19c1a7692bf65fc40e56f93ad00cc3eefaad22a4 (Initialize the LINT LVTs on the
 local APIC of the BSP.”) So it should be included in seabios 0.5.0, which
 means qemu 0.12 - so we are talking about the end of 2009 or start of 
 2010.
 The probability that someone will use a newer version of kernel with 
 something
 as old as 0.12 is probably minimal. I think it's ok to change it with a 
 comment
 indicating the reason. To be on the safe side, however, a user changeable 
 switch
 is something worth considering.
 I don’t see any existing mechanism for KVM to be aware of its user type and
 version. I do see another case of KVM hacks that are intended for fixing
 very old QEMU bugs (see 3a624e29c75 changes in vmx_set_segment, which are
 from pretty much the same time-frame of the issue I try to fix).
 
 Since this is something which would follow around, please advise what would
 be the format. A new ioctl that would supply the userspace “type” (according
 to predefined constants) and version?
 
 That would be madness. KVM shouldn't even know that qemu exists, let alone
 track its versions.
 
 Simply add a new toggle KVM_USE_STANDARD_LAPIC_LVT_INIT and document that
 userspace MUST use it. Old userspace won't, and will get the old buggy
 behavior.

 I fully agree it would be madness. Yet it appears to be a recurring problem.
 Here are similar problems  found from a short search:

 1. vmx_set_segment setting segment accessed (3a624e29c75)
 2. svm_set_cr0 clearing CD and NW (709ddebf81c)
 3. Limited number of MTRRs due to Seabios bus (0d234daf7e0a)

 Excluding (1) all of the other issues are related to the VM BIOS. Perhaps
 KVM should somehow realize which VM BIOS runs? (yes, it sounds just as bad.)

How about renaming the toggle Avi mentioned above to something more generic
(KVM_DISABLE_LEGACY_QUIRKS ?) and grouping all the issues together ? Modern 
userspace
will always enable it and get the new correct behavior. When more cases are 
discovered,
KVM can just add them to the list.
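To make that concrete, userspace opting in could look roughly like the sketch
below. Only the KVM_ENABLE_CAP plumbing is the existing mechanism here; the
capability name and number are placeholders for whatever this toggle ends up
being called:

#include <linux/kvm.h>
#include <sys/ioctl.h>

#ifndef KVM_CAP_DISABLE_LEGACY_QUIRKS
#define KVM_CAP_DISABLE_LEGACY_QUIRKS 9999    /* placeholder, not an allocated number */
#endif

static int disable_legacy_quirks(int vm_fd)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_DISABLE_LEGACY_QUIRKS,
        };

        /* old userspace simply never issues this and keeps the quirky behavior */
        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}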


 Nadav


Re: [PATCH] x86-run: Print a meaningful message if the qemu binary isn't found

2015-04-09 Thread Bandan Das
Bandan Das b...@redhat.com writes:
...

 I think we still want to print this if the user supplied qemu has no test
 device support. I think this is better -
 --- a/x86/run
 +++ b/x86/run
 @@ -8,6 +8,7 @@ qemusystem=${QEMU:-qemu-system-x86_64}
  if  ! [ -z ${QEMU} ]
  then
 qemu=${QEMU}
 +   QEMUFOUND=1
  else

Damn, please ignore the previous message. This does nothing and I
need more sleep.

Bandan

 for qemucmds in ${qemukvm} ${qemusystem}
 do

 +exit ${TESTDEVNOTSUPP}
 +fi
  fi
  
  if

 which apart from reindentation is just moving a fi down.
 Paolo


Re: [PATCH] x86-run: Print a meaningful message if the qemu binary isn't found

2015-04-09 Thread Bandan Das
Paolo Bonzini pbonz...@redhat.com writes:
...
 This breaks QEMU=... ./x86-run ./x86/msr.flat

 Fixed as follows:

 diff --git a/x86/run b/x86/run
 index d5adf8d..219a93b 100755
 --- a/x86/run
 +++ b/x86/run
 @@ -20,16 +20,15 @@ else
   break
   fi
   done
 -fi
 -
 -if  [ -z ${QEMUFOUND} ]
 -then
 - echo A QEMU binary was not found, You can set a custom location by using the QEMU=path environment variable 
 - exit ${NOTFOUND}
 -elif[ -z ${qemu} ]
 -then
 - echo No Qemu test device support found
 - exit ${TESTDEVNOTSUPP}
 + if  [ -z ${QEMUFOUND} ]
 + then
 + echo A QEMU binary was not found, You can set a custom location by using the QEMU=path environment variable 
 + exit ${NOTFOUND}
 + elif[ -z ${qemu} ]
 + then
 + echo No Qemu test device support found

I think we still want to print this if the user supplied qemu has no test
device support. I think this is better -
--- a/x86/run
+++ b/x86/run
@@ -8,6 +8,7 @@ qemusystem=${QEMU:-qemu-system-x86_64}
 if  ! [ -z ${QEMU} ]
 then
qemu=${QEMU}
+   QEMUFOUND=1
 else
for qemucmds in ${qemukvm} ${qemusystem}
do

 + exit ${TESTDEVNOTSUPP}
 + fi
  fi
  
  if

 which apart from reindentation is just moving a fi down.
 Paolo


Re: x86: Question regarding the reset value of LINT0

2015-04-08 Thread Bandan Das
Nadav Amit nadav.a...@gmail.com writes:

 Jan Kiszka jan.kis...@siemens.com wrote:

 On 2015-04-08 19:40, Nadav Amit wrote:
 Jan Kiszka jan.kis...@siemens.com wrote:
 
 On 2015-04-08 18:59, Nadav Amit wrote:
 Jan Kiszka jan.kis...@siemens.com wrote:
 
 On 2015-04-08 18:40, Nadav Amit wrote:
 Hi,
 
 I would appreciate if someone explains the reason for enabling LINT0 
 during
 APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local
 Vector Table” that says all LVT registers are reset to 0x1.
 
 In kvm_lapic_reset, I see:
 
 apic_set_reg(apic, APIC_LVT0,
 SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));
 
 Which is actually pretty similar to QEMU’s apic_reset_common:
 
  if (bsp) {
  /*
   * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization
   * time typically by BIOS, so PIC interrupt can be delivered to the
   * processor when local APIC is enabled.
   */
  s->lvt[APIC_LVT_LINT0] = 0x700;
  }
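For reference, the KVM line above programs the same value: the Linux APIC macros
expand SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT) to 0x700 (ExtINT delivery
mode). A tiny compile-and-print sketch, with the macro values copied from
apicdef.h purely for illustration:

#include <stdio.h>

#define APIC_MODE_EXTINT                0x7
#define SET_APIC_DELIVERY_MODE(x, y)    (((x) & ~0x700) | ((y) << 8))

int main(void)
{
        /* prints 0x700, i.e. the same LVT LINT0 value QEMU writes above */
        printf("0x%x\n", SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));
        return 0;
}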
 
 Yet, in both cases, I miss the point - if it is typically done by the 
 BIOS,
 why does QEMU or KVM enable it?
 
 BTW: KVM seems to run fine without it, and I think setting it causes me
 problems in certain cases.
 
 I suspect it has some historic BIOS backgrounds. Already tried to find
 more information in the git logs of both code bases? Or something that
 indicates of SeaBIOS or BochsBIOS once didn't do this initialization?
 Thanks. I found no indication of such thing.
 
 QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says:
 
   Don't route PIC interrupts through the local APIC if the local APIC
   config says so. By Ari Kivity.
 
 Maybe Avi Kivity knows this guy.
 
 ths? That should have been Thiemo Seufer (IIRC), but he just committed
 the code back then (and is no longer with us, sadly).
 Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny joke
 about Avi knowing “Ari”).
 
 Ah. No problem. My brain apparently fixed that typo up unnoticed.
 
 But if that commit went in without any BIOS changes around it, QEMU
 simply had to do the job of the latter to keep things working.
 So should I leave it as is? Can I at least disable in KVM during INIT (and
 leave it as is for RESET)?
 
 No, I don't think there is a need to leave this inaccurate for QEMU if
 our included BIOS gets it right. I don't know what the backward
 bug-compatibility of KVM is, though. Maybe you can identify since when
 our BIOS is fine so that we can discuss time frames.

 I think that it was addressed in commit
 19c1a7692bf65fc40e56f93ad00cc3eefaad22a4 (“Initialize the LINT LVTs on the
 local APIC of the BSP.”) So it should be included in seabios 0.5.0, which
 means qemu 0.12 - so we are talking about the end of 2009 or start of 2010.

The probability that someone will use a newer version of kernel with something
as old as 0.12 is probably minimal. I think it's ok to change it with a comment
indicating the reason. To be on the safe side, however, a user changeable switch
is something worth considering.

 What is the verdict?

 Nadav


Re: [PATCH v2 1/4] KVM: x86: INIT and reset sequences are different

2015-04-07 Thread Bandan Das
Paolo Bonzini pbonz...@redhat.com writes:

 On 07/04/2015 18:17, Bandan Das wrote:
  I think a bool argument is good enough.  QEMU has different functions,
  and init ends up doing save/reset/restore which is pretty ugly.
 
 Right, I meant that init could just be a wrapper so that it at least shows up in
 a backtrace - could be helpful for debugging.

 I suspect that the compiler would inline any sensible implementation and
 it wouldn't show up in the backtraces. :(

noinline ? :) Anyway, it's probably not worth the trouble, that could be easily
figured out.
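For completeness, the wrapper idea amounts to something like this (a user-space
sketch with made-up names, not the actual KVM symbols):

#include <stdbool.h>
#include <stdio.h>

struct vcpu { int id; };

/* the common worker takes a flag, as in the patch under discussion */
static void vcpu_do_reset(struct vcpu *v, bool init_event)
{
        printf("vcpu%d: %s\n", v->id, init_event ? "INIT" : "RESET");
}

/* noinline keeps the wrapper as its own frame in a backtrace */
__attribute__((noinline)) void vcpu_init_event(struct vcpu *v)
{
        vcpu_do_reset(v, true);
}

void vcpu_reset(struct vcpu *v)
{
        vcpu_do_reset(v, false);
}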

Thanks,
Bandan

 Paolo


Re: [PATCH v2 1/4] KVM: x86: INIT and reset sequences are different

2015-04-07 Thread Bandan Das
Paolo Bonzini pbonz...@redhat.com writes:

 On 02/04/2015 04:17, Bandan Das wrote:
  x86 architecture defines differences between the reset and INIT sequences.
  INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, 
  PMU,
  MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and 
  BSP.
 
  EFER is supposed NOT to be reset according to the SDM, but leaving the 
  LMA and
  LME untouched causes failed VM-entry.  Therefore we reset EFER (although 
  it is
  unclear whether the rest of EFER bits should be reset).
 Thanks! This was actually in my todo list. #INIT and #RESET are actually 
 separate pins
 on the processor. So, shouldn't we differentiate between the two too by 
 having
 (*vcpu_init) and (*vcpu_reset) separate ?

 I think a bool argument is good enough.  QEMU has different functions,
 and init ends up doing save/reset/restore which is pretty ugly.

Right, I meant that init could just be a wrapper so that it at least shows up in
a backtrace - could be helpful for debugging.

 Paolo


Re: [PATCH v2 1/4] KVM: x86: INIT and reset sequences are different

2015-04-01 Thread Bandan Das
Nadav Amit na...@cs.technion.ac.il writes:

 x86 architecture defines differences between the reset and INIT sequences.
 INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
 MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP.

 EFER is supposed NOT to be reset according to the SDM, but leaving the LMA and
 LME untouched causes failed VM-entry.  Therefore we reset EFER (although it is
 unclear whether the rest of EFER bits should be reset).

Thanks! This was actually in my todo list. #INIT and #RESET are actually 
separate pins
on the processor. So, shouldn't we differentiate between the two too by having
(*vcpu_init) and (*vcpu_reset) separate ?

Bandan

 References (from Intel SDM):

 If the MP protocol has completed and a BSP is chosen, subsequent INITs 
 (either
 to a specific processor or system wide) do not cause the MP protocol to be
 repeated. [8.4.2: MP Initialization Protocol Requirements and Restrictions]

 [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]

 If the processor is reset by asserting the INIT# pin, the x87 FPU state is 
 not
 changed. [9.2: X87 FPU INITIALIZATION]

 The state of the local APIC following an INIT reset is the same as it is 
 after
 a power-up or hardware reset, except that the APIC ID and arbitration ID
 registers are not affected. [10.4.7.3: Local APIC State After an INIT Reset
 (“Wait-for-SIPI” State)]

 Signed-off-by: Nadav Amit na...@cs.technion.ac.il
 ---
  arch/x86/include/asm/kvm_host.h |  6 +++---
  arch/x86/kvm/lapic.c| 11 ++-
  arch/x86/kvm/lapic.h|  2 +-
  arch/x86/kvm/svm.c  |  2 +-
  arch/x86/kvm/vmx.c  | 30 +-
  arch/x86/kvm/x86.c  | 17 ++---
  6 files changed, 38 insertions(+), 30 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index bf5a160..59f4374 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -701,7 +701,7 @@ struct kvm_x86_ops {
   /* Create, but do not attach this VCPU */
   struct kvm_vcpu *(*vcpu_create)(struct kvm *kvm, unsigned id);
   void (*vcpu_free)(struct kvm_vcpu *vcpu);
 - void (*vcpu_reset)(struct kvm_vcpu *vcpu);
 + void (*vcpu_reset)(struct kvm_vcpu *vcpu, bool init_event);
  
   void (*prepare_guest_switch)(struct kvm_vcpu *vcpu);
   void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
 @@ -989,7 +989,7 @@ void kvm_pic_clear_all(struct kvm_pic *pic, int 
 irq_source_id);
  
  void kvm_inject_nmi(struct kvm_vcpu *vcpu);
  
 -int fx_init(struct kvm_vcpu *vcpu);
 +int fx_init(struct kvm_vcpu *vcpu, bool init_event);
  
  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
  const u8 *new, int bytes);
 @@ -1134,7 +1134,7 @@ int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
  int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
  int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
 -void kvm_vcpu_reset(struct kvm_vcpu *vcpu);
 +void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
  void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu);
  void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
  unsigned long address);
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index bd4e34d..17da6fc 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -1534,7 +1534,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 
 value)
  
  }
  
 -void kvm_lapic_reset(struct kvm_vcpu *vcpu)
 +void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
  {
   struct kvm_lapic *apic;
   int i;
 @@ -1548,7 +1548,8 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
   /* Stop the timer in case it's a reset to an active apic */
   hrtimer_cancel(apic-lapic_timer.timer);
  
 - kvm_apic_set_id(apic, vcpu-vcpu_id);
 + if (!init_event)
 + kvm_apic_set_id(apic, vcpu-vcpu_id);
   kvm_apic_set_version(apic-vcpu);
  
   for (i = 0; i  APIC_LVT_NUM; i++)
 @@ -1689,7 +1690,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
   APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE);
  
   static_key_slow_inc(apic_sw_disabled.key); /* sw disabled at reset */
 - kvm_lapic_reset(vcpu);
 + kvm_lapic_reset(vcpu, false);
   kvm_iodevice_init(apic-dev, apic_mmio_ops);
  
   return 0;
 @@ -2023,8 +2024,8 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu)
   pe = xchg(apic-pending_events, 0);
  
   if (test_bit(KVM_APIC_INIT, pe)) {
 - kvm_lapic_reset(vcpu);
 - kvm_vcpu_reset(vcpu);
 + kvm_lapic_reset(vcpu, true);
 + kvm_vcpu_reset(vcpu, true);
   if (kvm_vcpu_is_bsp(apic-vcpu))
   vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE;
   else
 diff --git 

[PATCH] x86-run: Print a meaningful message if the qemu binary isn't found

2015-04-01 Thread Bandan Das

Before:
./x86-run ./x86/msr.flat
QEMU binary has no support for test device. Exiting.

After:
./x86-run ./x86/msr.flat
A QEMU binary was not found, You can set a custom
location by using the QEMU=path environment variable

Signed-off-by: Bandan Das b...@redhat.com
---
 x86/run | 37 ++---
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/x86/run b/x86/run
index af37eb4..d5adf8d 100755
--- a/x86/run
+++ b/x86/run
@@ -1,20 +1,35 @@
 #!/bin/bash
+NOTFOUND=1
+TESTDEVNOTSUPP=2
 
 qemukvm=${QEMU:-qemu-kvm}
 qemusystem=${QEMU:-qemu-system-x86_64}
-if
-   ${qemukvm} -device '?' 2>&1 | grep -F -e \"testdev\" -e \"pc-testdev\" > /dev/null;
+
+if  ! [ -z ${QEMU} ]
 then
-   qemu=${qemukvm}
+   qemu=${QEMU}
 else
-   if
-   ${qemusystem} -device '?' 2>&1 | grep -F -e \"testdev\" -e \"pc-testdev\" > /dev/null;
-   then
-   qemu=${qemusystem}
-   else
-   echo QEMU binary ${QEMU} has no support for test device. Exiting.
-   exit 2
-   fi
+   for qemucmds in ${qemukvm} ${qemusystem}
+   do
+   unset QEMUFOUND
+   unset qemu
+   if ! [ -z ${QEMUFOUND=$(${qemucmds} --help 2>/dev/null | grep QEMU)} ] &&
+  ${qemucmds} -device '?' 2>&1 | grep -F -e \"testdev\" -e \"pc-testdev\" > /dev/null;
+   then
+   qemu=${qemucmds}
+   break
+   fi
+   done
+fi
+
+if  [ -z ${QEMUFOUND} ]
+then
+   echo A QEMU binary was not found, You can set a custom location by using the QEMU=path environment variable
+   exit ${NOTFOUND}
+elif[ -z ${qemu} ]
+then
+   echo No Qemu test device support found
+   exit ${TESTDEVNOTSUPP}
 fi
 
 if
-- 
2.1.0



Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-31 Thread Bandan Das
Bandan Das b...@redhat.com writes:

 Andrey Korolyov and...@xdel.ru writes:
 ...
 http://xdel.ru/downloads/kvm-e5v2-issue/another-tracepoint-fail-with-apicv.dat.gz

 Something a bit more interesting, but the mess is happening just
 *after* NMI firing.

 What happens if NMI is turned off on the host ?

Sorry, I meant the watchdog..


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-31 Thread Bandan Das
Andrey Korolyov and...@xdel.ru writes:
...
 http://xdel.ru/downloads/kvm-e5v2-issue/another-tracepoint-fail-with-apicv.dat.gz

 Something a bit more interesting, but the mess is happening just
 *after* NMI firing.

What happens if NMI is turned off on the host ?


Re: Error upgrading vm from windows 8 to 8.1

2015-03-27 Thread Bandan Das
coreys cor...@affinitygs.com writes:

 When I try to upgrade my guest Windows 8 VM to Windows 8.1, I get the
 error that the processor doesn't support CompareExchange128.  Have not been
 able to find any information about this error.

I think that's because cmpxchg16b is not emulated yet:

static int em_cmpxchg8b(struct x86_emulate_ctxt *ctxt)
{
	u64 old = ctxt->dst.orig_val64;

	if (ctxt->dst.bytes == 16)
		return X86EMUL_UNHANDLEABLE;
...
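What the Windows installer is checking for is a 16-byte compare-and-swap
(CMPXCHG16B). A rough user-space equivalent, for illustration only (needs
gcc -mcx16 on x86-64):

#include <stdbool.h>

static bool cas16(__int128 *ptr, __int128 expected, __int128 desired)
{
        /* gcc emits lock cmpxchg16b here when built with -mcx16 */
        return __sync_bool_compare_and_swap(ptr, expected, desired);
}

int main(void)
{
        __int128 v = 1;
        return cas16(&v, 1, 2) ? 0 : 1;   /* v becomes 2 on success */
}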


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Bandan Das
Radim Krčmář rkrc...@redhat.com writes:

 2015-03-26 21:24+0300, Andrey Korolyov:
 On Thu, Mar 26, 2015 at 8:40 PM, Radim Krčmář rkrc...@redhat.com wrote:
  2015-03-26 20:08+0300, Andrey Korolyov:
  KVM internal error. Suberror: 2
  extra data[0]: 80ef
  extra data[1]: 8b0d
 
  Btw. does this part ever change?
 
  I see that first report had:
 
KVM internal error. Suberror: 2
extra data[0]: 80d1
extra data[1]: 8b0d
 
  Was that a Windows guest by any chance?
 
 Yes, exactly, different extra data output was from a Windows VMs.

 Windows uses vector 0xd1 for timer interrupts.

 I second Bandan -- checking that it reproduces on other machine would be
 great for sanity :)  (Although a bug in our APICv is far more likely.)

If it's APICv related, a run without apicv enabled could give more hints.

Your devices not getting reset hypothesis makes the most sense to me,
maybe the timer vector in the error message is just one part of
the whole story. Another misbehaving interrupt from the dark comes in at the
same time and leads to a double fault.


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-25 Thread Bandan Das
Hi Andrey,

Andrey Korolyov and...@xdel.ru writes:

 On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov and...@xdel.ru wrote:
 For now, it looks like the bug has a mixed Murphy-Heisenberg nature, as
 its appearance is very rare (compared to the number of actual launches)
 and most probably bounded to the physical characteristics of my
 production nodes. As soon as I reach any reproducible path for a
 regular workstation environment, I`ll let everyone know. Also I am
 starting to think that issue can belong to the particular motherboard
 firmware revision, despite fact that the CPU microcode is the same
 everywhere.

I will take the risk and say this - could it be a processor bug ? :)


 Hello everyone, I`ve managed to reproduce this issue
 *deterministically* with latest seabios with smp fix and 3.18.3. The
 error occurs just *once* per vm until the hypervisor reboots, at least
 in my setup, this is definitely crazy...

 - launch two VMs (Centos 7 in my case),
 - wait a little while they are booting,
 - attach serial console (I am using virsh list for this exact purpose),
 - issue acpi reboot or reset, does not matter,
 - VM always hangs at boot, most times with sgabios initialization
 string printed out [1], but sometimes it hangs a bit later [2],
 - no matter how many times I try to relaunch the QEMU afterwards, the
 issue does not appear on VM which experienced problem once;
 - trace and sample args can be seen in [3] and [4] respectively.

My system is a Dell R720 dual socket which has 2620v2s. I tried your
setup but couldn't reproduce (my qemu cmdline isn't exactly the same
as yours), although, if you could simplify your command line a bit,
I can try again.

Bandan

 1)
 Google, Inc.
 Serial Graphics Adapter 06/11/14
 SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $
 (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
 Term: 211x62
 4 0

 2)
 Google, Inc.
 Serial Graphics Adapter 06/11/14
 SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $
 (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
 Term: 211x62
 4 0
 [...empty screen...]
 SeaBIOS (version 1.8.1-20150325_230423-testnode)
 Machine UUID 3c78721f-7317-4f85-bcbe-f5ad46d293a1


 iPXE (http://ipxe.org) 00:02.0 C100 PCI2.10 PnP PMM+3FF95BA0+3FEF5BA0 C10

 3)

 KVM internal error. Suberror: 2
 extra data[0]: 80ef
 extra data[1]: 8b0d
 EAX= EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6d2c
 EIP=d331 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6cb0 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd
 19 cb cd 1c cb cd 4a cb fa fc 66 ba 47 d3 0f 00 e9 ad fe f3 90 f0 0f
 ba 2d d4 fe fb 3f

 4)
 /usr/bin/qemu-system-x86_64 -name centos71 -S -machine
 pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -bios
 /usr/share/seabios/bios.bin -m 1024 -realtime mlock=off -smp
 12,sockets=1,cores=12,threads=12 -uuid
 3c78721f-7317-4f85-bcbe-f5ad46d293a1 -nographic -no-user-config
 -nodefaults -device sga -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/centos71.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc
 base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard
 -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global
 PIIX4_PM.disable_s4=1 -boot strict=on -device
 nec-usb-xhci,id=usb,bus=pci.0,addr=0x3 -device
 virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive
 file=rbd:dev-rack2/centos7-1.raw:id=qemukvm:key=XX:auth_supported=cephx\;none:mon_host=10.6.0.1\:6789\;10.6.0.3\:6789\;10.6.0.4\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback,aio=native
 -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -chardev pty,id=charserial0 -device
 isa-serial,chardev=charserial0,id=serial0 -chardev
 socket,id=charchannel0,path=/var/lib/libvirt/qemu/centos71.sock,server,nowait
 -device 
 virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.1
 -msg timestamp=on


Re: slow networking with VM virtio

2015-03-24 Thread Bandan Das
[Ccing qemu-devel and kvm]

Boylan, Ross ross.boy...@ucsf.edu writes:

 Running a Windows 7 VM under KVM, with virtio drivers, the networking was so 
 slow as to be non-functional.
 As suggested in comment 11 of  
 https://bugzilla.redhat.com/show_bug.cgi?id=855640, I did ethtool -K eth1 gro 
 off and speeds got much better, ~150Mb/s, although using speedtest.net it was 
 unable to connect to the server for the upload test.

 However, ssh in to the host machine, or to linux VM's running on the same 
 host, became extremely erratic (prone to dropped connections) and slow after 
 making these changes.  I was connected via sshuttle for VPN.

 Does anyone have any suggestions?
 I would appreciate a cc.

 Thanks.
 Ross Boylan

 Details:
 Host is running Debian wheezy on amd64.  Using virt-manager  libvirt  kvm 
 and bridged networking.
 Linux markov00 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux
 Here's one of the network specifications for libvirt:
 interface type='direct'
   mac address='52:54:00:61:'/
   source dev='eth1' mode='vepa'/
   model type='virtio'/
   address type='pci' domain='0x' bus='0x00' slot='0x03' 
 function='0x0'/
/interface

 The Windows drivers came from 
 http://alt.fedoraproject.org/pub/alt/virtio-win/latest/images/virtio-win-0.1-100.iso
  in the 32 bit windows 7 directory of the iso (host is 64 bit, as is the 
 emulated machine; Windows 7 is 32 bit).


Re: [PATCH v2] KVM: nVMX: Add support for rdtscp

2015-03-23 Thread Bandan Das
Jan Kiszka jan.kis...@web.de writes:
...
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2467,6 +2467,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
 *vmx)
   vmx-nested.nested_vmx_secondary_ctls_low = 0;
   vmx-nested.nested_vmx_secondary_ctls_high =
   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
 + SECONDARY_EXEC_RDTSCP |
   SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
   SECONDARY_EXEC_APIC_REGISTER_VIRT |
   SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
 @@ -7510,7 +7511,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu 
 *vcpu)
   return nested_cpu_has(vmcs12, CPU_BASED_INVLPG_EXITING);
   case EXIT_REASON_RDPMC:
   return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING);
 - case EXIT_REASON_RDTSC:
 + case EXIT_REASON_RDTSC: case EXIT_REASON_RDTSCP:
   return nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
   case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
   case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
 @@ -8517,6 +8518,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
   exec_control);
   }
   }
 + if (!vmx-rdtscp_enabled)
 + vmx-nested.nested_vmx_secondary_ctls_high =
 + ~SECONDARY_EXEC_RDTSCP;
No need to do this if nested is not enabled ? Or just
an if (nested) in the prior if/else block should be enough, I think.

Bandan
   }
  
   /* Exposing INVPCID only when PCID is exposed */
 @@ -9146,8 +9150,9 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
 struct vmcs12 *vmcs12)
   exec_control = ~SECONDARY_EXEC_RDTSCP;
   /* Take the following fields only from vmcs12 */
   exec_control = ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
 +   SECONDARY_EXEC_RDTSCP |
 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
 -  SECONDARY_EXEC_APIC_REGISTER_VIRT);
 +   SECONDARY_EXEC_APIC_REGISTER_VIRT);
   if (nested_cpu_has(vmcs12,
   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
   exec_control |= vmcs12-secondary_vm_exec_control;


Re: [PATCH v2] KVM: nVMX: Add support for rdtscp

2015-03-23 Thread Bandan Das
Jan Kiszka jan.kis...@web.de writes:

 On 2015-03-23 18:01, Bandan Das wrote:
 Jan Kiszka jan.kis...@web.de writes:
 ...
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2467,6 +2467,7 @@ static void nested_vmx_setup_ctls_msrs(struct 
 vcpu_vmx *vmx)
 vmx-nested.nested_vmx_secondary_ctls_low = 0;
 vmx-nested.nested_vmx_secondary_ctls_high =
 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
 +   SECONDARY_EXEC_RDTSCP |
 SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
 SECONDARY_EXEC_APIC_REGISTER_VIRT |
 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
 @@ -7510,7 +7511,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu 
 *vcpu)
 return nested_cpu_has(vmcs12, CPU_BASED_INVLPG_EXITING);
 case EXIT_REASON_RDPMC:
 return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING);
 -   case EXIT_REASON_RDTSC:
 +   case EXIT_REASON_RDTSC: case EXIT_REASON_RDTSCP:
 return nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
 case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
 case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
 @@ -8517,6 +8518,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 exec_control);
 }
 }
 +   if (!vmx-rdtscp_enabled)
 +   vmx-nested.nested_vmx_secondary_ctls_high =
 +   ~SECONDARY_EXEC_RDTSCP;
 No need to do this if nested is not enabled ? Or just
 an if (nested) in the prior if/else block should be enough, I think.

 I can add this - but this is far away from being a hotpath. What would
 be the benefit?

Right, definitely not a hotpath, just seems unnecessary if nested is not 
enabled.
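Concretely, I mean something along these lines (an untested sketch against the
hunk quoted above, just to show the intent):

	if (!vmx->rdtscp_enabled && nested)
		vmx->nested.nested_vmx_secondary_ctls_high &=
			~SECONDARY_EXEC_RDTSCP;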

Bandan
 Thanks,
 Jan


Re: [Qemu-devel] PCI passthrough of 40G ethernet interface (Openstack/KVM)

2015-03-18 Thread Bandan Das
[Ccing netdev and Stefan]
Bandan Das b...@redhat.com writes:

 jacob jacob opstk...@gmail.com writes:

 On Mon, Mar 16, 2015 at 2:12 PM, Bandan Das b...@redhat.com wrote:
 jacob jacob opstk...@gmail.com writes:

 I also see the following in dmesg in the VM.

 [0.095758] ACPI: PCI Root Bridge [PCI0] (domain  [bus 00-ff])
 [0.096006] acpi PNP0A03:00: ACPI _OSC support notification failed,
 disabling PCIe ASPM
 [0.096915] acpi PNP0A03:00: Unable to request _OSC control (_OSC
 support mask: 0x08)
 IIRC, for _OSC control, after BIOS is done (with whatever initialization
 it needs to do), it clears a bit so that the OS can take over. This message
 you are getting is a sign of a bug in the BIOS (usually). But I don't
 know if this is related to your problem. Does dmesg | grep -e DMAR -e 
 IOMMU
 give anything useful ?

 Do not see anything useful in the output..

 Ok, Thanks. Can you please post the output as well ?

 [0.097072] acpi PNP0A03:00: fail to add MMCONFIG information,
 can't access extended PCI configuration space under this bridge.

 Does this indicate any issue related to PCI passthrough?

 Would really appreciate any input on how to bebug this further.

 Did you get a chance to try a newer kernel ?
 Currently am using 3.18.7-200.fc21.x86_64 which is pretty recent.
 Are you suggesting trying the newer kernel just on the host? (or VM too?)
 Both preferably to 3.19. But it's just a wild guess. I saw i40e related fixes,
 particularly i40e: fix un-necessary Tx hangs in 3.19-rc5. This is not 
 exactly
 what you are seeing but I was still wondering if it could help.

Actually, Stefan suggests that support for this card is still sketchy
and your best bet is to try out net-next
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git

Also, could you please post more information about your hardware setup
(chipset/processor/firmware version on the card etc) ?

Thanks,
Bandan

 Meanwhile, I am trying to get hold of a card myself to try and reproduce
 it at my end.

 Thanks,
 Bandan

 On Fri, Mar 13, 2015 at 10:08 AM, jacob jacob opstk...@gmail.com wrote:
 So, it could be the i40e driver then ? Because IIUC, VFs use a separate
 driver. Just to rule out the possibility that there might be some driver 
 fixes that
 could help with this, it might be a good idea to try a 3.19 or later 
 upstream
 kernel.


 I tried with the latest DPDK release too (dpdk-1.8.0) and see the same 
 issue.
 As mentioned earlier, i do not see any issues at all when running
 tests using either i40e or dpdk on the host itself.
 This is the reason why i am suspecting if it is anything to do with 
 KVM/libvirt.
 Both with regular PCI passthrough and VF passthrough i see issues. It
 is always pointing to some issue with packet transmission. Receive
 seems to work ok.


 On Thu, Mar 12, 2015 at 8:02 PM, Bandan Das b...@redhat.com wrote:
 jacob jacob opstk...@gmail.com writes:

 On Thu, Mar 12, 2015 at 3:07 PM, Bandan Das b...@redhat.com wrote:
 jacob jacob opstk...@gmail.com writes:

  Hi,

  Seeing failures when trying to do PCI passthrough of Intel XL710 40G
 interface to KVM vm.
  0a:00.1 Ethernet controller: Intel Corporation Ethernet
 Controller XL710 for 40GbE QSFP+ (rev 01)

 You are assigning the PF right ? Does assigning VFs work or it's
 the same behavior ?

 Yes. Assigning VFs worked ok. But this had other issues while bringing 
 down VMs.
 Interested in finding out if PCI passthrough of 40G intel XL710
 interface is qualified in some specific kernel/kvm release.

 So, it could be the i40e driver then ? Because IIUC, VFs use a separate
 driver. Just to rule out the possibility that there might be some driver 
 fixes that
 could help with this, it might be a good idea to try a 3.19 or later 
 upstream
 kernel.

 From dmesg on host:

 [80326.559674] kvm: zapping shadow pages for mmio generation 
 wraparound
 [80327.271191] kvm [175994]: vcpu0 unhandled rdmsr: 0x1c9
 [80327.271689] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a6
 [80327.272201] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a7
 [80327.272681] kvm [175994]: vcpu0 unhandled rdmsr: 0x3f6
 [80327.376186] kvm [175994]: vcpu0 unhandled rdmsr: 0x606

 These are harmless and are related to unimplemented PMU msrs,
 not VFIO.

 Bandan


Re: [PATCH] KVM: x86: call irq notifiers with directed EOI

2015-03-18 Thread Bandan Das
Radim Krčmář rkrc...@redhat.com writes:

 kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled.
 We need to do that for irq notifiers.  (Like with edge interrupts.)

 Fix it by skipping EOI broadcast only.

 Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211
 Signed-off-by: Radim Krčmář rkrc...@redhat.com
 ---
  arch/x86/kvm/ioapic.c | 4 +++-
  arch/x86/kvm/lapic.c  | 3 +--
  2 files changed, 4 insertions(+), 3 deletions(-)

 diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
 index b1947e0f3e10..46d4449772bc 100644
 --- a/arch/x86/kvm/ioapic.c
 +++ b/arch/x86/kvm/ioapic.c
 @@ -422,6 +422,7 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
   struct kvm_ioapic *ioapic, int vector, int trigger_mode)
  {
   int i;
 + struct kvm_lapic *apic = vcpu-arch.apic;
  
   for (i = 0; i  IOAPIC_NUM_PINS; i++) {
   union kvm_ioapic_redirect_entry *ent = ioapic-redirtbl[i];
 @@ -443,7 +444,8 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
   kvm_notify_acked_irq(ioapic-kvm, KVM_IRQCHIP_IOAPIC, i);
   spin_lock(ioapic-lock);
  
 - if (trigger_mode != IOAPIC_LEVEL_TRIG)
 + if (trigger_mode != IOAPIC_LEVEL_TRIG ||
 + kvm_apic_get_reg(apic, APIC_SPIV)  APIC_SPIV_DIRECTED_EOI)
   continue;
  
   ASSERT(ent-fields.trig_mode == IOAPIC_LEVEL_TRIG);
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index bd4e34de24c7..4ee827d7bf36 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -833,8 +833,7 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct 
 kvm_vcpu *vcpu2)
  
  static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
  {
 - if (!(kvm_apic_get_reg(apic, APIC_SPIV)  APIC_SPIV_DIRECTED_EOI) 
 - kvm_ioapic_handles_vector(apic-vcpu-kvm, vector)) {
 + if (kvm_ioapic_handles_vector(apic-vcpu-kvm, vector)) {
   int trigger_mode;
   if (apic_test_vector(vector, apic-regs + APIC_TMR))
   trigger_mode = IOAPIC_LEVEL_TRIG;

Works on my Xen 4.4 L1 setup with Intel E5 v2 host. Without this patch,
L1 panics as reported in the bug referenced above.

Tested-by: Bandan Das b...@redhat.com


Re: [PATCH] KVM: x86: call irq notifiers with directed EOI

2015-03-18 Thread Bandan Das
Radim Krčmář rkrc...@redhat.com writes:

 kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled.
 We need to do that for irq notifiers.  (Like with edge interrupts.)

Wow! It's interesting that this path is only hit with Xen as guest.
I always thought of directed EOI as a security feature since broadcast
could lead to interrupt storms (or something like that) :)

Bandan

 Fix it by skipping EOI broadcast only.

 Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211
 Signed-off-by: Radim Krčmář rkrc...@redhat.com
 ---
  arch/x86/kvm/ioapic.c | 4 +++-
  arch/x86/kvm/lapic.c  | 3 +--
  2 files changed, 4 insertions(+), 3 deletions(-)

 diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
 index b1947e0f3e10..46d4449772bc 100644
 --- a/arch/x86/kvm/ioapic.c
 +++ b/arch/x86/kvm/ioapic.c
 @@ -422,6 +422,7 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
   struct kvm_ioapic *ioapic, int vector, int trigger_mode)
  {
   int i;
 + struct kvm_lapic *apic = vcpu-arch.apic;
  
   for (i = 0; i  IOAPIC_NUM_PINS; i++) {
   union kvm_ioapic_redirect_entry *ent = ioapic-redirtbl[i];
 @@ -443,7 +444,8 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
   kvm_notify_acked_irq(ioapic-kvm, KVM_IRQCHIP_IOAPIC, i);
   spin_lock(ioapic-lock);
  
 - if (trigger_mode != IOAPIC_LEVEL_TRIG)
 + if (trigger_mode != IOAPIC_LEVEL_TRIG ||
 + kvm_apic_get_reg(apic, APIC_SPIV)  APIC_SPIV_DIRECTED_EOI)
   continue;
  
   ASSERT(ent-fields.trig_mode == IOAPIC_LEVEL_TRIG);
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index bd4e34de24c7..4ee827d7bf36 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -833,8 +833,7 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct 
 kvm_vcpu *vcpu2)
  
  static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
  {
 - if (!(kvm_apic_get_reg(apic, APIC_SPIV)  APIC_SPIV_DIRECTED_EOI) 
 - kvm_ioapic_handles_vector(apic-vcpu-kvm, vector)) {
 + if (kvm_ioapic_handles_vector(apic-vcpu-kvm, vector)) {
   int trigger_mode;
   if (apic_test_vector(vector, apic-regs + APIC_TMR))
   trigger_mode = IOAPIC_LEVEL_TRIG;


Re: KVM emulation failure with recent kernel and QEMU Seabios

2015-03-17 Thread Bandan Das
Gerd Hoffmann kra...@redhat.com writes:

 On Mo, 2015-03-16 at 14:16 -0400, Bandan Das wrote:
 Jan Kiszka jan.kis...@siemens.com writes:
 
  Am 2015-03-12 um 09:11 schrieb Gerd Hoffmann:
  On Do, 2015-03-12 at 09:09 +0100, Jan Kiszka wrote:
  Hi,
 
  apparently since the latest QEMU updates I'm getting this once in a
  while:
  
  http://www.seabios.org/pipermail/seabios/2015-March/008897.html
 
  OK... So we are waiting on a stable release to pull in new binaries with
  this fix? A matter of days?
 
 1.8.1 is now available on the downloads page.

 I know, I've uploaded it ;)

Yep, I believe you ;) I copied this information here from one of your other
emails for Jan's benefit :)

 Pull request for the update was sent yesterday too.
 It's merged meanwhile, so go fetch qemu master and be happy.

 cheers,
   Gerd


Re: [Qemu-devel] PCI passthrough of 40G ethernet interface (Openstack/KVM)

2015-03-16 Thread Bandan Das
jacob jacob opstk...@gmail.com writes:

 I also see the following in dmesg in the VM.

 [0.095758] ACPI: PCI Root Bridge [PCI0] (domain  [bus 00-ff])
 [0.096006] acpi PNP0A03:00: ACPI _OSC support notification failed,
 disabling PCIe ASPM
 [0.096915] acpi PNP0A03:00: Unable to request _OSC control (_OSC
 support mask: 0x08)
IIRC, for _OSC control, after BIOS is done (with whatever initialization
it needs to do), it clears a bit so that the OS can take over. This message
you are getting is a sign of a bug in the BIOS (usually). But I don't
know if this is related to your problem. Does dmesg | grep -e DMAR -e IOMMU
give anything useful ?

 [0.097072] acpi PNP0A03:00: fail to add MMCONFIG information,
 can't access extended PCI configuration space under this bridge.

 Does this indicate any issue related to PCI passthrough?

 Would really appreciate any input on how to bebug this further.

Did you get a chance to try a newer kernel ?

 On Fri, Mar 13, 2015 at 10:08 AM, jacob jacob opstk...@gmail.com wrote:
 So, it could be the i40e driver then ? Because IIUC, VFs use a separate
 driver. Just to rule out the possibility that there might be some driver 
 fixes that
 could help with this, it might be a good idea to try a 3.19 or later 
 upstream
 kernel.


 I tried with the latest DPDK release too (dpdk-1.8.0) and see the same issue.
 As mentioned earlier, i do not see any issues at all when running
 tests using either i40e or dpdk on the host itself.
 This is the reason why i am suspecting if it is anything to do with 
 KVM/libvirt.
 Both with regular PCI passthrough and VF passthrough i see issues. It
 is always pointing to some issue with packet transmission. Receive
 seems to work ok.


 On Thu, Mar 12, 2015 at 8:02 PM, Bandan Das b...@redhat.com wrote:
 jacob jacob opstk...@gmail.com writes:

 On Thu, Mar 12, 2015 at 3:07 PM, Bandan Das b...@redhat.com wrote:
 jacob jacob opstk...@gmail.com writes:

  Hi,

  Seeing failures when trying to do PCI passthrough of Intel XL710 40G
 interface to KVM vm.
  0a:00.1 Ethernet controller: Intel Corporation Ethernet
 Controller XL710 for 40GbE QSFP+ (rev 01)

 You are assigning the PF right ? Does assigning VFs work or it's
 the same behavior ?

 Yes. Assigning VFs worked ok. But this had other issues while bringing down 
 VMs.
 Interested in finding out if PCI passthrough of 40G intel XL710
 interface is qualified in some specific kernel/kvm release.

 So, it could be the i40e driver then ? Because IIUC, VFs use a separate
 driver. Just to rule out the possibility that there might be some driver 
 fixes that
 could help with this, it might be a good idea to try a 3.19 or later 
 upstream
 kernel.

 From dmesg on host:

 [80326.559674] kvm: zapping shadow pages for mmio generation wraparound
 [80327.271191] kvm [175994]: vcpu0 unhandled rdmsr: 0x1c9
 [80327.271689] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a6
 [80327.272201] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a7
 [80327.272681] kvm [175994]: vcpu0 unhandled rdmsr: 0x3f6
 [80327.376186] kvm [175994]: vcpu0 unhandled rdmsr: 0x606

 These are harmless and are related to unimplemented PMU msrs,
 not VFIO.

 Bandan


Re: KVM emulation failure with recent kernel and QEMU Seabios

2015-03-16 Thread Bandan Das
Jan Kiszka jan.kis...@siemens.com writes:

 Am 2015-03-12 um 09:11 schrieb Gerd Hoffmann:
 On Do, 2015-03-12 at 09:09 +0100, Jan Kiszka wrote:
 Hi,

 apparently since the latest QEMU updates I'm getting this once in a
 while:
 
 http://www.seabios.org/pipermail/seabios/2015-March/008897.html

 OK... So we are waiting on a stable release to pull in new binaries with
 this fix? A matter of days?

1.8.1 is now available on the downloads page.

Bandan

 Jan


[PATCH] KVM: SVM: Fix confusing message if no exit handlers are installed

2015-03-16 Thread Bandan Das

I hit this path on a AMD box and thought
someone was playing a April Fool's joke on me.

Signed-off-by: Bandan Das b...@redhat.com
---
 arch/x86/kvm/svm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cc618c8..4e77110 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3555,7 +3555,7 @@ static int handle_exit(struct kvm_vcpu *vcpu)
 
 	if (exit_code >= ARRAY_SIZE(svm_exit_handlers)
 	    || !svm_exit_handlers[exit_code]) {
-		WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_code);
+		WARN_ONCE(1, "svm: unexpected exit reason 0x%x\n", exit_code);
 		kvm_queue_exception(vcpu, UD_VECTOR);
 		return 1;
 	}
-- 
2.1.0



Re: [Qemu-devel] PCI passthrough of 40G ethernet interface (Openstack/KVM)

2015-03-16 Thread Bandan Das
jacob jacob opstk...@gmail.com writes:

 On Mon, Mar 16, 2015 at 2:12 PM, Bandan Das b...@redhat.com wrote:
 jacob jacob opstk...@gmail.com writes:

 I also see the following in dmesg in the VM.

 [0.095758] ACPI: PCI Root Bridge [PCI0] (domain  [bus 00-ff])
 [0.096006] acpi PNP0A03:00: ACPI _OSC support notification failed,
 disabling PCIe ASPM
 [0.096915] acpi PNP0A03:00: Unable to request _OSC control (_OSC
 support mask: 0x08)
 IIRC, For OSC control, after BIOS is done with (whatever initialization
 it needs to do), it clears a bit so that the OS can take over. This message,
 you are getting is a sign of a bug in the BIOS (usually). But I don't
 know if this is related to your problem. Does dmesg | grep -e DMAR -e IOMMU
 give anything useful ?

 Do not see anything useful in the output..

Ok, Thanks. Can you please post the output as well ?

 [0.097072] acpi PNP0A03:00: fail to add MMCONFIG information,
 can't access extended PCI configuration space under this bridge.

 Does this indicate any issue related to PCI passthrough?

 Would really appreciate any input on how to bebug this further.

 Did you get a chance to try a newer kernel ?
 Currently am using 3.18.7-200.fc21.x86_64 which is pretty recent.
 Are you suggesting trying the newer kernel just on the host? (or VM too?)
Both preferably to 3.19. But it's just a wild guess. I saw i40e related fixes,
particularly i40e: fix un-necessary Tx hangs in 3.19-rc5. This is not exactly
what you are seeing but I was still wondering if it could help.

Meanwhile, I am trying to get hold of a card myself to try and reproduce
it at my end.

Thanks,
Bandan

 On Fri, Mar 13, 2015 at 10:08 AM, jacob jacob opstk...@gmail.com wrote:
 So, it could be the i40e driver then ? Because IIUC, VFs use a separate
 driver. Just to rule out the possibility that there might be some driver 
 fixes that
 could help with this, it might be a good idea to try a 3.19 or later 
 upstream
 kernel.


 I tried with the latest DPDK release too (dpdk-1.8.0) and see the same 
 issue.
 As mentioned earlier, i do not see any issues at all when running
 tests using either i40e or dpdk on the host itself.
 This is the reason why i am suspecting if it is anything to do with 
 KVM/libvirt.
 Both with regular PCI passthrough and VF passthrough i see issues. It
 is always pointing to some issue with packet transmission. Receive
 seems to work ok.


 On Thu, Mar 12, 2015 at 8:02 PM, Bandan Das b...@redhat.com wrote:
 jacob jacob opstk...@gmail.com writes:

 On Thu, Mar 12, 2015 at 3:07 PM, Bandan Das b...@redhat.com wrote:
 jacob jacob opstk...@gmail.com writes:

  Hi,

  Seeing failures when trying to do PCI passthrough of Intel XL710 40G
 interface to KVM vm.
  0a:00.1 Ethernet controller: Intel Corporation Ethernet
 Controller XL710 for 40GbE QSFP+ (rev 01)

 You are assigning the PF right ? Does assigning VFs work or it's
 the same behavior ?

 Yes. Assigning VFs worked ok. But this had other issues while bringing 
 down VMs.
 Interested in finding out if PCI passthrough of 40G intel XL710
 interface is qualified in some specific kernel/kvm release.

 So, it could be the i40e driver then ? Because IIUC, VFs use a separate
 driver. Just to rule out the possibility that there might be some driver 
 fixes that
 could help with this, it might be a good idea to try a 3.19 or later 
 upstream
 kernel.

 From dmesg on host:

 [80326.559674] kvm: zapping shadow pages for mmio generation 
 wraparound
 [80327.271191] kvm [175994]: vcpu0 unhandled rdmsr: 0x1c9
 [80327.271689] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a6
 [80327.272201] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a7
 [80327.272681] kvm [175994]: vcpu0 unhandled rdmsr: 0x3f6
 [80327.376186] kvm [175994]: vcpu0 unhandled rdmsr: 0x606

 These are harmless and are related to unimplemented PMU msrs,
 not VFIO.

 Bandan


Re: [PATCH] kvm: x86: svm: remove SVM_EXIT_READ_CR* intercepts

2015-03-12 Thread Bandan Das

Joel Schopp joel.sch...@amd.com writes:

 There isn't really a valid reason for kvm to intercept cr* reads
 on svm hardware.  The current kvm code just ends up returning
 the register

 Signed-off-by: Joel Schopp joel.sch...@amd.com
 ---
  arch/x86/kvm/svm.c |   41 -
  1 file changed, 4 insertions(+), 37 deletions(-)

 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index cc618c8..c3d10e6 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -1090,9 +1090,6 @@ static void init_vmcb(struct vcpu_svm *svm)
   svm-vcpu.fpu_active = 1;
   svm-vcpu.arch.hflags = 0;
  
 - set_cr_intercept(svm, INTERCEPT_CR0_READ);
 - set_cr_intercept(svm, INTERCEPT_CR3_READ);
 - set_cr_intercept(svm, INTERCEPT_CR4_READ);
   set_cr_intercept(svm, INTERCEPT_CR0_WRITE);
   set_cr_intercept(svm, INTERCEPT_CR3_WRITE);
   set_cr_intercept(svm, INTERCEPT_CR4_WRITE);
 @@ -1174,7 +1171,6 @@ static void init_vmcb(struct vcpu_svm *svm)
   control-nested_ctl = 1;
   clr_intercept(svm, INTERCEPT_INVLPG);
   clr_exception_intercept(svm, PF_VECTOR);
 - clr_cr_intercept(svm, INTERCEPT_CR3_READ);
   clr_cr_intercept(svm, INTERCEPT_CR3_WRITE);
   save-g_pat = 0x0007040600070406ULL;
   save-cr3 = 0;
 @@ -2968,29 +2964,10 @@ static int cr_interception(struct vcpu_svm *svm)
   kvm_queue_exception(svm-vcpu, UD_VECTOR);
   return 1;
   }
 - } else { /* mov from cr */
 - switch (cr) {
 - case 0:
 - val = kvm_read_cr0(svm-vcpu);
 - break;
 - case 2:
 - val = svm-vcpu.arch.cr2;
 - break;
 - case 3:
 - val = kvm_read_cr3(svm-vcpu);
 - break;
 - case 4:
 - val = kvm_read_cr4(svm-vcpu);
 - break;
 - case 8:
 - val = kvm_get_cr8(svm-vcpu);
 - break;
 - default:
 - WARN(1, unhandled read from CR%d, cr);
 - kvm_queue_exception(svm-vcpu, UD_VECTOR);
 - return 1;
 - }
 - kvm_register_write(svm-vcpu, reg, val);
 + } else { /* mov from cr, should never trap in svm */
 + WARN(1, unhandled read from CR%d, cr);
 + kvm_queue_exception(svm-vcpu, UD_VECTOR);
 + return 1;

Can we end up here if a nested hypervisor sets cr read interception ?

Bandan

   }
   kvm_complete_insn_gp(svm-vcpu, err);
  
 @@ -3321,10 +3298,6 @@ static int mwait_interception(struct vcpu_svm *svm)
  }
  
  static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 - [SVM_EXIT_READ_CR0] = cr_interception,
 - [SVM_EXIT_READ_CR3] = cr_interception,
 - [SVM_EXIT_READ_CR4] = cr_interception,
 - [SVM_EXIT_READ_CR8] = cr_interception,
   [SVM_EXIT_CR0_SEL_WRITE]= emulate_on_interception,
   [SVM_EXIT_WRITE_CR0]= cr_interception,
   [SVM_EXIT_WRITE_CR3]= cr_interception,
 @@ -4151,11 +4124,9 @@ static const struct __x86_intercept {
   u32 exit_code;
   enum x86_intercept_stage stage;
  } x86_intercept_map[] = {
 - [x86_intercept_cr_read] = POST_EX(SVM_EXIT_READ_CR0),
   [x86_intercept_cr_write]= POST_EX(SVM_EXIT_WRITE_CR0),
   [x86_intercept_clts]= POST_EX(SVM_EXIT_WRITE_CR0),
   [x86_intercept_lmsw]= POST_EX(SVM_EXIT_WRITE_CR0),
 - [x86_intercept_smsw]= POST_EX(SVM_EXIT_READ_CR0),
   [x86_intercept_dr_read] = POST_EX(SVM_EXIT_READ_DR0),
   [x86_intercept_dr_write]= POST_EX(SVM_EXIT_WRITE_DR0),
   [x86_intercept_sldt]= POST_EX(SVM_EXIT_LDTR_READ),
 @@ -4221,10 +4192,6 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
   goto out;
  
   switch (icpt_info.exit_code) {
 - case SVM_EXIT_READ_CR0:
 - if (info-intercept == x86_intercept_cr_read)
 - icpt_info.exit_code += info-modrm_reg;
 - break;
   case SVM_EXIT_WRITE_CR0: {
   unsigned long cr0, val;
   u64 intercept;



Re: PCI passthrough of 40G ethernet interface (Openstack/KVM)

2015-03-12 Thread Bandan Das
jacob jacob opstk...@gmail.com writes:

  Hi,

  Seeing failures when trying to do PCI passthrough of Intel XL710 40G
 interface to KVM vm.
  0a:00.1 Ethernet controller: Intel Corporation Ethernet
 Controller XL710 for 40GbE QSFP+ (rev 01)

You are assigning the PF right ? Does assigning VFs work or it's
the same behavior ?

 From dmesg on host:

 [80326.559674] kvm: zapping shadow pages for mmio generation wraparound
 [80327.271191] kvm [175994]: vcpu0 unhandled rdmsr: 0x1c9
 [80327.271689] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a6
 [80327.272201] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a7
 [80327.272681] kvm [175994]: vcpu0 unhandled rdmsr: 0x3f6
 [80327.376186] kvm [175994]: vcpu0 unhandled rdmsr: 0x606

These are harmless and are related to unimplemented PMU msrs,
not VFIO.

Bandan


Re: PCI passthrough of 40G ethernet interface (Openstack/KVM)

2015-03-12 Thread Bandan Das
jacob jacob opstk...@gmail.com writes:

 On Thu, Mar 12, 2015 at 3:07 PM, Bandan Das b...@redhat.com wrote:
 jacob jacob opstk...@gmail.com writes:

  Hi,

  Seeing failures when trying to do PCI passthrough of Intel XL710 40G
 interface to KVM vm.
  0a:00.1 Ethernet controller: Intel Corporation Ethernet
 Controller XL710 for 40GbE QSFP+ (rev 01)

 You are assigning the PF right ? Does assigning VFs work or it's
 the same behavior ?

 Yes. Assigning VFs worked ok. But this had other issues while bringing down VMs.
 Interested in finding out if PCI passthrough of 40G intel XL710
 interface is qualified in some specific kernel/kvm release.

So, it could be the i40e driver then ? Because IIUC, VFs use a separate
driver. Just to rule out the possibility that there might be some driver fixes 
that
could help with this, it might be a good idea to try a 3.19 or later upstream
kernel.

 From dmesg on host:

 [80326.559674] kvm: zapping shadow pages for mmio generation wraparound
 [80327.271191] kvm [175994]: vcpu0 unhandled rdmsr: 0x1c9
 [80327.271689] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a6
 [80327.272201] kvm [175994]: vcpu0 unhandled rdmsr: 0x1a7
 [80327.272681] kvm [175994]: vcpu0 unhandled rdmsr: 0x3f6
 [80327.376186] kvm [175994]: vcpu0 unhandled rdmsr: 0x606

 These are harmless and are related to unimplemented PMU msrs,
 not VFIO.

 Bandan


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Bandan Das
Kevin O'Connor ke...@koconnor.net writes:
...

 Something is very odd here.  When I run the above command (on an older
 AMD machine) I get:

 Found 128 cpu(s) max supported 128 cpu(s)

 That first value (1 vs 128) comes from QEMU (via cmos index 0x5f).
 That is, during smp init, SeaBIOS expects QEMU to tell it how many
 cpus are active, and SeaBIOS waits until that many CPUs check in from
 its SIPI request before proceeding.
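For the record, that handshake is just a byte read from the RTC/CMOS ports; a
user-space sketch of the same read, illustrative only and needing root/ioperm
if actually run on Linux:

#include <stdio.h>
#include <sys/io.h>

#define CMOS_BIOS_SMP_COUNT 0x5f

static unsigned char cmos_read(unsigned char idx)
{
        outb(idx, 0x70);        /* RTC/CMOS index port */
        return inb(0x71);       /* RTC/CMOS data port  */
}

int main(void)
{
        if (ioperm(0x70, 2, 1))
                return 1;
        /* SeaBIOS does rtc_read(0x5f) + 1 to get the total CPU count */
        printf("cpus: %d\n", cmos_read(CMOS_BIOS_SMP_COUNT) + 1);
        return 0;
}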

 I wonder if QEMU reported only 1 active cpu via that cmos register,
 but more were actually active.  If that was the case, it could

I was daring enough to try this and I don't see the crash :)

diff --git a/src/fw/smp.c b/src/fw/smp.c
index a466ea6..a346d46 100644
--- a/src/fw/smp.c
+++ b/src/fw/smp.c
@@ -49,6 +49,7 @@ int apic_id_is_present(u8 apic_id)
 void VISIBLE32FLAT
 handle_smp(void)
 {
+  dprintf(DEBUG_HDL_smp, "Calling handle_smp\n");
 if (!CONFIG_QEMU)
 return;
 
@@ -128,6 +129,8 @@ smp_setup(void)
 
 // Wait for other CPUs to process the SIPI.
 u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
+while (cmos_smp_count == 1)
+cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
 while (cmos_smp_count != CountCPUs)
 asm volatile(
 // Release lock and allow other processors to use the stack.

So, the while loop results in a race somehow?

Bandan
 certainly explain the failure - as multiple cpus could be running
 without the SIPI trampoline in place.

 What does the log look like on a non-failure case?

 -Kevin
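
To make the handshake Kevin describes concrete, here is a compilable sketch of
the two halves. It is an illustration, not QEMU or SeaBIOS source: the CMOS
register (0x5f) and the "store count minus one, read back plus one" convention
come from the thread, the function names are stand-ins for QEMU's
pc_cmos_init() and SeaBIOS's smp_setup(), and everything runs in one process.

#include <stdio.h>
#include <stdint.h>

static uint8_t cmos[128];                   /* the RTC/CMOS register file      */
static volatile unsigned CountCPUs;         /* in SeaBIOS each AP bumps this   */
                                            /* from handle_smp()               */
static void qemu_side_cmos_init(unsigned smp_cpus)
{
    cmos[0x5f] = (uint8_t)(smp_cpus - 1);   /* QEMU tells the BIOS how many    */
}                                           /* CPUs it started                 */

static void seabios_side_smp_setup(void)
{
    unsigned expected = cmos[0x5f] + 1;     /* rtc_read(CMOS_BIOS_SMP_COUNT)+1 */
    while (CountCPUs != expected)           /* wait until that many CPUs have  */
        ;                                   /* checked in after the SIPI       */
    printf("Found %u cpu(s)\n", CountCPUs);
}

int main(void)
{
    qemu_side_cmos_init(20);
    CountCPUs = 20;                         /* pretend all APs already ran     */
    seabios_side_smp_setup();               /* prints "Found 20 cpu(s)"        */
    return 0;
}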


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Bandan Das
Kevin O'Connor ke...@koconnor.net writes:

 On Wed, Mar 11, 2015 at 01:09:42PM -0400, Bandan Das wrote:
 Kevin O'Connor ke...@koconnor.net writes:
 ...
 
  Something is very odd here.  When I run the above command (on an older
  AMD machine) I get:
 
  Found 128 cpu(s) max supported 128 cpu(s)
 
  That first value (1 vs 128) comes from QEMU (via cmos index 0x5f).
  That is, during smp init, SeaBIOS expects QEMU to tell it how many
  cpus are active, and SeaBIOS waits until that many CPUs check in from
  its SIPI request before proceeding.
 
  I wonder if QEMU reported only 1 active cpu via that cmos register,
  but more were actually active.  If that was the case, it could
 
 I was daring enough to try this and I don't see the crash :)
 
 diff --git a/src/fw/smp.c b/src/fw/smp.c
 index a466ea6..a346d46 100644
 --- a/src/fw/smp.c
 +++ b/src/fw/smp.c
 @@ -49,6 +49,7 @@ int apic_id_is_present(u8 apic_id)
  void VISIBLE32FLAT
  handle_smp(void)
  {
 +  dprintf(DEBUG_HDL_smp, "Calling handle_smp\n");
  if (!CONFIG_QEMU)
  return;
  
 @@ -128,6 +129,8 @@ smp_setup(void)
  
  // Wait for other CPUs to process the SIPI.
  u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
 +while (cmos_smp_count == 1)
 +cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;

 That would loop forever if you had -smp 1.

Sorry, I should have been clearer. What I meant is that if I run
with -smp 128 (from Dave's original reproducer), adding this while loop
avoids the crash. So, the rtc_read eventually returns the right number (127),
as the above while loop keeps polling.

Bandan

  while (cmos_smp_count != CountCPUs)
  asm volatile(
  // Release lock and allow other processors to use the stack.
 
 So, the while loop results in a race somehow?

 No, the problem is that loop doesn't run at all, and as a result the
 other cpus end up running random code.  SeaBIOS sets up an entry
 vector for multiple cpus, wakes up the cpus, then tears down the entry
 vector.  If it tears down the entry vector before all the cpus have
 run through it, then the other cpus can end up running random code.

 -Kevin
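
The window Kevin describes can be spelled out as a sketch of the boot-CPU
sequence (stand-in names, not the real SeaBIOS smp_setup(); the stubs exist
only so the sketch compiles, the interesting part is the ordering):

#include <stdio.h>

static volatile unsigned CountCPUs = 1;          /* the BSP counts itself      */
static void install_sipi_entry_vector(void) { }  /* APs will jump through this */
static void wake_aps_with_init_sipi(void)   { }  /* each AP bumps CountCPUs    */
static unsigned read_cmos_cpu_count(void)   { return 20; } /* rtc_read(0x5f)+1 */
static void restore_original_vector(void)   { }  /* tear the trampoline down   */

static void smp_setup_sketch(void)
{
    install_sipi_entry_vector();                 /* 1 */
    wake_aps_with_init_sipi();                   /* 2 */
    unsigned expected = read_cmos_cpu_count();   /* 3: should be 20, but the   */
                                                 /*    failing runs saw 1      */
    while (CountCPUs != expected)                /* 4: with expected == 1 this */
        ;                                        /*    exits immediately       */
    restore_original_vector();                   /* 5: APs arriving after this */
}                                                /*    run whatever is there   */

int main(void)
{
    (void)smp_setup_sketch;                      /* the ordering, not the run, */
    return 0;                                    /* is the point of the sketch */
}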


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Bandan Das
Dr. David Alan Gilbert dgilb...@redhat.com writes:

 * Kevin O'Connor (ke...@koconnor.net) wrote:
 On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:
  On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
   For what it's worth, I can't seem to trigger the problem if I move the
   cmos read above the SIPI/LAPIC code (see patch below).
  
  Ugh!
  
  That's a seabios bug.  Main processor modifies the rtc index
  (rtc_read()) while APs try to clear the NMI bit by modifying the rtc
  index (romlayout.S:transition32).
  
  I'll put together a fix.
 
 The seabios patch below resolves the issue for me.

 Thanks! Looks good here.

 Andrey, Paolo, Bandan: Does it fix it for you as well?

Works for me too, thanks Kevin!

Bandan

 Dave

 -Kevin
 
 
 --- a/src/romlayout.S
 +++ b/src/romlayout.S
 @@ -22,7 +22,8 @@
  // %edx = return location (in 32bit mode)
  // Clobbers: ecx, flags, segment registers, cr0, idt/gdt
  DECLFUNC transition32
 -transition32_for_smi:
 +transition32_nmi_off:
 +// transition32 when NMI and A20 are already initialized
  movl %eax, %ecx
  jmp 1f
  transition32:
 @@ -205,7 +206,7 @@ __farcall16:
  entry_smi:
  // Transition to 32bit mode.
  movl $1f + BUILD_BIOS_ADDR, %edx
 -jmp transition32_for_smi
 +jmp transition32_nmi_off
  .code32
  1:  movl $BUILD_SMM_ADDR + 0x8000, %esp
  calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
 @@ -216,8 +217,10 @@ entry_smi:
  DECLFUNC entry_smp
  entry_smp:
  // Transition to 32bit mode.
 +cli
 +cld
  movl $2f + BUILD_BIOS_ADDR, %edx
 -jmp transition32
 +jmp transition32_nmi_off
  .code32
  // Acquire lock and take ownership of shared stack
  1:  rep ; nop
 --
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
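
In C-like terms, the change to the AP startup path looks roughly like this.
It is a paraphrase of the assembly above with stand-in function names; the
real code is romlayout.S.

#include <stdio.h>

static void mask_nmi_via_cmos_index_port(void) { }  /* the write that races    */
                                                    /* with the BSP's rtc_read */
static void enable_a20(void)                   { }
static void switch_cpu_to_32bit_mode(void)     { }

static void transition32_sketch(void)          /* old AP path                  */
{
    mask_nmi_via_cmos_index_port();
    enable_a20();
    switch_cpu_to_32bit_mode();
}

static void transition32_nmi_off_sketch(void)  /* new AP path: NMI masking and */
{                                              /* A20 were already done by the */
    switch_cpu_to_32bit_mode();                /* BSP, so skip straight to the */
}                                              /* mode switch                  */

int main(void)
{
    (void)transition32_sketch;                 /* pre-patch behaviour          */
    transition32_nmi_off_sketch();             /* post-patch behaviour         */
    puts("APs leave port 0x70 alone during startup");
    return 0;
}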


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Bandan Das
Dr. David Alan Gilbert dgilb...@redhat.com writes:

 * Kevin O'Connor (ke...@koconnor.net) wrote:
 On Wed, Mar 11, 2015 at 04:52:03PM +, Dr. David Alan Gilbert wrote:
  * Kevin O'Connor (ke...@koconnor.net) wrote:
   So, I couldn't get this to fail on my older AMD machine at all with
   the default SeaBIOS code.  But, when I change the code with the patch
   below, it failed right away.
 [...]
   And the failed debug output looks like:
   
   SeaBIOS (version 
   rel-1.8.0-7-gd23eba6-dirty-20150311_121819-morn.localdomain)
   [...]
   cmos_smp_count0=20
   [...]
   cmos_smp_count=1
   cmos_smp_count2=1/20
   Found 1 cpu(s) max supported 20 cpu(s)
   
   I'm going to check the assembly for a compiler error, but is it
   possible QEMU is returning incorrect data in cmos index 0x5f?
 
 I checked the SeaBIOS assembler and it looks sane.  So, I think the
 question is, why is QEMU sometimes returning a 0 instead of 127 from
 cmos 0x5f.

 My reading of the logs I've just created is that qemu doesn't think
 it's ever being asked to read 5f in the failed case:

 good:

 pc_cmos_init 5f setting smp_cpus=20
 cmos: read index=0x0f val=0x00
 cmos: read index=0x34 val=0x00
 cmos: read index=0x35 val=0x3f
 cmos: read index=0x38 val=0x30
 cmos: read index=0x3d val=0x12
 cmos: read index=0x38 val=0x30
 cmos: read index=0x0b val=0x02
 cmos: read index=0x0d val=0x80
 cmos: read index=0x5f val=0x13  <-- Yeh!
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00

 bad:
 pc_cmos_init 5f setting smp_cpus=20
 cmos: read index=0x0f val=0x00
 cmos: read index=0x34 val=0x00
 cmos: read index=0x35 val=0x3f
 cmos: read index=0x38 val=0x30
 cmos: read index=0x3d val=0x12
 cmos: read index=0x38 val=0x30
 cmos: read index=0x0b val=0x02
 cmos: read index=0x0d val=0x80  <-- Oh!
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00

rtc_read in hw/rtc.c is:
u8
rtc_read(u8 index)
{
index |= NMI_DISABLE_BIT;
outb(index, PORT_CMOS_INDEX);
return inb(PORT_CMOS_DATA);
}

Is it possible that the read would happen before the index has been committed?
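
As it turned out elsewhere in this thread (see Kevin's analysis further up the
page), the outb and the inb are ordered fine on one CPU; the problem is another
CPU re-writing the index port in between. A small stand-alone illustration of
that interleaving (not SeaBIOS or QEMU code; 0x5f is CMOS_BIOS_SMP_COUNT as
above, and 0x0f is only an assumption about which register the AP's
NMI-disable write selects, the exact value does not matter for the effect):

#include <stdio.h>
#include <stdint.h>

static uint8_t cmos[128];
static uint8_t cmos_index;                    /* the single latch behind port 0x70 */

static void outb_port_0x70(uint8_t v) { cmos_index = v & 0x7f; }
static uint8_t inb_port_0x71(void)    { return cmos[cmos_index]; }

int main(void)
{
    cmos[0x5f] = 19;                          /* QEMU: smp_cpus(20) - 1            */

    outb_port_0x70(0x5f | 0x80);              /* BSP, first half of rtc_read():    */
                                              /* select 0x5f, keep NMIs masked     */
    outb_port_0x70(0x0f | 0x80);              /* an AP in transition32 writes the  */
                                              /* same port right now               */
    uint8_t count = inb_port_0x71();          /* BSP, second half of rtc_read():   */
                                              /* reads cmos[0x0f] == 0, not 0x5f   */
    printf("SeaBIOS would wait for %d cpu(s) instead of 20\n", count + 1);
    return 0;
}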

 Dave
