* On Thursday 13 Nov 2008 19:08:14 Alexander Graf wrote:
> On 13.11.2008, at 05:35, Amit Shah wrote:
> > * On Wednesday 12 Nov 2008 22:49:16 Alexander Graf wrote:
> >> On 12.11.2008, at 17:52, Amit Shah wrote:
> >>> Hi Alex,
> >>>
> >>> * On Wednesday 12 Nov 2008 21:09:43 Alexander Graf wrote:
> >>>> Hi,
> >>>>
> >>>> I was thinking a bit about cross vendor migration recently and
> >>>> since
> >>>> we're doing open source development, I figured it might be a good
> >>>> idea
> >>>> to talk to everyone about this.
> >>>>
> >>>> So why are we having a problem?
> >>>>
> >>>> In normal operation we don't. If we're running a 32-bit kernel, we
> >>>> can
> >>>> use SYSENTER to jump from kernel<->userspace. If we're on a 64-bit
> >>>> kernel with 64-bit userspace, every CPU supports SYSCALL. At least
> >>>> Linux is being smart on this and does use exactly these two
> >>>> capabilities in these two cases.
> >>>> But if we're running in compat mode (64-bit kernel with 32-bit
> >>>> userspace), things differ. Intel supports only SYSENTER here, while
> >>>> AMD only supports SYSCALL. Both can still use int80.
> >>>>
> >>>> Operating systems detect usage of SYSCALL or SYSENTER pretty
> >>>> early on
> >>>> (Linux does this on vdso). So when we boot up on an Intel machine,
> >>>> Linux assumes that using SYSENTER in compat mode is fine. Migrating
> >>>> that machine to an AMD machine breaks this assumption though, since
> >>>> SYSENTER can't be used in compat mode.
> >>>> On Linux, this detection is based on the CPU vendor string. If
> >>>> Linux
> >>>> finds a "GenuineIntel", SYSENTER is used in compat mode, if it's
> >>>> "AuthenticAMD", SYSCALL is used and if none of these two is found,
> >>>> int80 is used.
> >>>>
> >>>> I tried modifying the vendor string, removed the "overwrite the
> >>>> vendor
> >>>> string with the native string" hack and things look like they work
> >>>> just fine with Linux.
> >>>>
> >>>> Unfortunately right now I don't have a 64-bit Windows installation
> >>>> around to check if that approach works there too, but if it does
> >>>> and
> >>>> no known OS breaks due to the invalid vendor string, we can just
> >>>> create our own virtual CPU string, no?
> >>>
> >>> qemu has an option for that, -cpu qemu64 IIRC. As long as we expose
> >>> practically correct cpuids and MSRs, this should be fine. I've not
> >>> tested
> >>> qemu64 with winxp x64 though. Also, last I knew, winxp x64
> >>> installation
> >>> didn't succeed with --no-kvm. qemu by default exposes an AMD CPU
> >>> type.
> >>
> >> I wasn't talking about CPUID features, but the vendor string. Qemu64
> >> provides the AuthenticAMD string, so we don't run into any issues I'm
> >> presuming.
> >
> > Right -- the thing is, with the default AuthenticAMD string, winxp x64
> > installation fails. That has to be because of some missing cpuids.
> > That's one
> > of the drawbacks of exposing a well-known CPU type. I was suggesting
> > we
> > should try out the -cpu qemu64 CPU type since it exposes a non-
> > standard CPU
> > to see if guests and most userspace programs work fine without any
> > further
> > tweaking -- see the 'cons' below for why this might be a problem.
>
> I still don't really understand what you're trying to say - qemu64 is
> the default in KVM right now. You mean winxp64 installation doesn't

No, the default for KVM is the host CPU type.

> work as is and we should fix it? This has nothing to do with the
> migration problems, right?

Solutions shouldn't involve adding known regressions. If our default cpu type 
changes to one that renders some of the OSes we currently support 
nonfunctional, such a change won't be accepted. Of course, we can improve the 
qemu64 cpu type to at least ensure the popular OS types work properly.

> >>> There are pros and cons to expose a custom vendor ID:
> >>>
> >>> pros:
> >>> - We don't need to have all the cpuid features exposed which are
> >>> expected of a
> >>> physically available CPU in the market, for example, badly-coded
> >>> applications
> >>> might crash if we don't have SSSE3 on a Core2Duo. But badly-coded or
> >>> not, not
> >>> exposing what's actually available on every C2D out there is bad.
> >>>
> >>> cons:
> >>> - To expose the "correct" set of feature bits for a known processor,
> >>> we also
> >>> need to check the family/model/stepping to support the exact same
> >>> feature
> >>> bits that were present in the CPU.
> >>> - We might not get some optimizations that OSes might have based on
> >>> CPU type,
> >>> even if the host CPU qualifies for such optimizations
> >>> - Standard programs like benchmarking tools, etc., might fail if
> >>> they depend
> >>> on the vendor string for their functionality
> >>>
> >>> For 32-bit guests, I think exposing a pentium4 or Athlon CPU type
> >>> should be
> >>> fine. For 64-bit guests, the newer the better.
> >>
> >> Well, we could create different CPU definitions:
> >>
> >> - migration safe (do what is safe for migration)
> >
> > There are multiple ways of approaching this: peg to a least-known
> > good CPU
> > type, all of whose instructions will work on processors from both
> > the major
> > vendors. However, you never know how the server pools change and
> > you'd want
> > to upgrade the CPU type once you know the CPUs that are installed in
> > servers.
> > This has to be dynamic and the management application has to take
> > care of
> > exposing a CPU that's of a "safe" type for the particular server
> > pool. We
> > have to provide ways to mask off CPUID bits as requested by the
> > management
> > application. (Each server sends its cpuid to the management
> > application,
> > which calculates the safest bits and then conveys this to each
> > server before
> > starting a VM.)
>
> IMHO we shouldn't really start to be smart here. There's only so much
> benefit in using the least common denominator between all CPUs in the
> datacenter vs. using the least common denominator between all possible
> CPUs. You'll basically end up enabling some newer SSE instructions.

I'm just saying the management application will do it. So it'll be local to 
the server pool the management app caters to. Not a common denominator for 
all deployments.

> So I don't think we need to go through the hassle of making this
> dynamic. If you want to migrate your machines - use the migrate
> preset. That won't give you the 150% speed boost on video encoding,
> but should not really be any slower on normal workloads. It does make
> things a lot more transparent to us and the admin of a network though,
> because you know what you'll end up with "-cpu migration".

We hardly know what uses KVM will be put to: server virtualisation, desktop 
virtualisation, some combination, whatnot. If we give the admin the 
flexibility to tune things as necessary, that's not a bad option at all. All 
userspace needs is one tool that can calculate the maximum feature set 
supported by the current CPU and send it over to the management app when 
asked. The management app does the rest; KVM is not involved at all.

> >> - CPU specific (like a Core2Duo, necessary to run Mac OS X)
> >
> > This doesn't need any more work -- we already have the ability to
> > select CPU
> > types. If the management application has knowledge of the kind of OS
> > being
> > installed in a VM (which these days is true), exposing a Core2Duo
> > for a
> > Mac-based OS isn't difficult.
>
> There is no sysenter emulation for IA-32e on AMD yet, right? That's
> the only issue I see here and your emulation patch should address that.

As I've mentioned before, I've not yet been able to test my patch because I 
haven't seen the sysenter/sysexit instructions actually being used. It's 
included at the end of this mail for review; hopefully someone finds a 
use case and we can take it forward.

> >> - host (fastest possible, but no migration)
> >
> > This should be the default.
>
> I'm not sure. Either host or migration should be the default. This

I'm suggesting that 'host' should be default. Where do we disagree?

> actually depends on the workload you have on KVM. For servers you'll
> probably want to have migration be the default. For desktop usage it's
> host. I can't think of a way we can be smart about that on the KVM
> level.

For individual runs from the command line, I'd prefer the host to be the 
default. For a wider deployment, the management app will set the defaults as 
necessary (admin-chosen).

> >> I don't think we could find one definition that fits all, so the user
> >> would have to define what the usage pattern will be.
> >>
> >>>> I'd love to hear comments and suggestions on this and hope we'll
> >>>> end
> >>>> up in a fruitful discussion on how to improve the current
> >>>> situation.
> >>>
> >>> I have a patch ready for emulating sysenter/sysexit on AMD systems
> >>> (needs
> >>> testing). Patching the guest was an option that was discouraged; I
> >>> had a hack
> >>> ready but it was quickly shelved (again, untested).
> >>
> >> That sounds useful for misbehaving guests or cases I haven't thought
> >> of yet. Are you sure you're intercepting the SYSENTER MSRs on AMD, so
> >> you don't end up only getting 32 bits?
> >
> > Can you elaborate?
>
> When you write to MSR_IA32_SYSENTER_EIP on AMD, that MSR will be
> directly passed through to the hardware (search for that MSR in
> svm.c). This is because SVM automatically writes the SYSENTER MSRs to
> the SYSENTER fields in the VMCB.

My patch just handles the case when a sysenter is attempted on a system which 
doesn't have that instruction, so I just emulate it. Accessing the MSRs and 
setting values is done at boot time by the OS, and a migration at that 
instant is a corner case and not too critical.

Now, the patch.

From e1b760d8e596811081c282484621b49c674f1c22 Mon Sep 17 00:00:00 2001
From: Amit Shah <[EMAIL PROTECTED]>
Date: Wed, 12 Nov 2008 11:31:05 +0530
Subject: [PATCH] KVM: SVM: Emulate SYSENTER/SYSEXIT on AMD processors

This patch enables emulation of the sysenter/sysexit instructions in
AMD long mode. This will enable a guest started on an Intel machine to
be migrated to an AMD machine.

Signed-off-by: Amit Shah <[EMAIL PROTECTED]>
---
 arch/x86/kvm/svm.c         |   13 ++++
 arch/x86/kvm/x86_emulate.c |  137 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 149 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f0ad4d4..4e6e1dc 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1155,6 +1155,19 @@ static int vmmcall_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
 static int invalid_op_interception(struct vcpu_svm *svm,
                                   struct kvm_run *kvm_run)
 {
+       /*
+        * If we're running in long mode on x86_64, check if we can
+        * emulate sysenter / sysexit
+        */
+       if (!is_long_mode(&svm->vcpu))
+               goto out;
+
+       if (emulate_instruction(&svm->vcpu, NULL, 0, 0, 0) == EMULATE_DONE) {
+               /* We could emulate it. */
+               return 1;
+       }
+
+ out:
        kvm_queue_exception(&svm->vcpu, UD_VECTOR);
        return 1;
 }
diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index 8f60ace..c8afd20 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -205,7 +205,9 @@ static u16 twobyte_table[256] = {
        ModRM | ImplicitOps, ModRM, ModRM | ImplicitOps, ModRM, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x30 - 0x3F */
-       ImplicitOps, 0, ImplicitOps, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+       ImplicitOps, 0, ImplicitOps, 0,
+       ImplicitOps, ImplicitOps, 0, 0,
+       0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x40 - 0x47 */
        DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
        DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
@@ -305,8 +307,11 @@ static u16 group2_table[] = {
 };
 
 /* EFLAGS bit definitions. */
+#define EFLG_VM (1<<17)
+#define EFLG_RF (1<<16)
 #define EFLG_OF (1<<11)
 #define EFLG_DF (1<<10)
+#define EFLG_IF (1<<9)
 #define EFLG_SF (1<<7)
 #define EFLG_ZF (1<<6)
 #define EFLG_AF (1<<4)
@@ -1959,6 +1964,136 @@ twobyte_insn:
                rc = X86EMUL_CONTINUE;
                c->dst.type = OP_NONE;
                break;
+       case 0x34: { /* sysenter */
+               /* Vol 2b */
+               unsigned long cr0 = ctxt->vcpu->arch.cr0;
+               struct kvm_segment cs, ss;
+               u64 data;
+
+               if (!(cr0 & X86_CR0_PE)) {
+                       kvm_inject_gp(ctxt->vcpu, 0);
+                       goto cannot_emulate;
+               }
+
+               kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_CS, &data);
+               if (!(data & 0xFFFC)) {
+                       kvm_inject_gp(ctxt->vcpu, 0);
+                       goto cannot_emulate;
+               }
+
+               ctxt->eflags &= ~(EFLG_VM | EFLG_IF | EFLG_RF);
+
+               kvm_x86_ops->get_segment(ctxt->vcpu, &cs, VCPU_SREG_CS);
+               cs.selector = (__u16) data;
+               cs.base = 0;
+               cs.limit = 0xfffff;
+               cs.g = 1;
+               cs.s = 1;
+               cs.type = 0x0b;
+               cs.db = 1;
+               cs.dpl = 0;
+               cs.selector &= ~SELECTOR_RPL_MASK;
+               cs.present = 1;
+               /* The CPL should be set to 0 */
+
+               if (ctxt->mode == X86EMUL_MODE_PROT64) {
+                       cs.l = 1;
+                       cs.limit = 0xffffffff;
+               }
+
+               ss.selector = cs.selector + 8;
+               ss.base = 0;
+               ss.limit = 0xfffff;
+               ss.g = 1;
+               ss.s = 1;
+               ss.type = 0x03;
+               ss.db = 1;
+               ss.dpl = 0;
+               ss.selector &= ~SELECTOR_RPL_MASK;
+               ss.present = 1;
+               if (ctxt->mode == X86EMUL_MODE_PROT64) {
+                       ss.limit = 0xffffffff;
+               }
+
+               kvm_x86_ops->set_segment(ctxt->vcpu, &cs, VCPU_SREG_CS);
+               kvm_x86_ops->set_segment(ctxt->vcpu, &ss, VCPU_SREG_SS);
+
+               kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_EIP, &data);
+               c->eip = data;
+
+               kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_ESP, &data);
+               c->regs[VCPU_REGS_RSP] = data;
+
+               goto writeback;
+               break;
+       }
+       case 0x35: { /* sysexit */
+               /* Vol 2b */
+               u64 data;
+               unsigned long cr0 = ctxt->vcpu->arch.cr0;
+               struct kvm_segment cs, ss;
+
+               if (!(cr0 & X86_CR0_PE)) {
+                       kvm_inject_gp(ctxt->vcpu, 0);
+                       goto cannot_emulate;
+               }
+
+               kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_CS, &data);
+               if (!(data & 0xFFFC) ||
+                   ((ctxt->mode == X86EMUL_MODE_PROT64) && !data)) {
+                       kvm_inject_gp(ctxt->vcpu, 0);
+                       goto cannot_emulate;
+               }
+
+               /* Check if CPL is 0. If not, inject_gp */
+
+               kvm_x86_ops->get_segment(ctxt->vcpu, &cs, VCPU_SREG_CS);
+               cs.selector = (u16)(data +
+                                   (ctxt->mode == X86EMUL_MODE_PROT64 ? 32 : 16));
+               cs.base = 0;
+               cs.limit = 0xfffff;
+               cs.g = 1;
+               cs.s = 1;
+               cs.type = 0x0b;
+               cs.db = 1;
+               cs.dpl = 3;
+               cs.selector |= SELECTOR_RPL_MASK;
+               cs.present = 1;
+               cs.l = 0; /* For return to compatibility mode */
+               /* The CPL should be set to 3 */
+
+               if (ctxt->mode == X86EMUL_MODE_PROT64) {
+                       cs.l = 1;
+                       /* The manual doesn't talk about CS limit */
+               }
+
+               ss.selector = cs.selector +
+                       (ctxt->mode == X86EMUL_MODE_PROT64 ? 16 : 8);
+               ss.base = 0;
+               ss.limit = 0xfffff;
+               ss.g = 1;
+               ss.s = 1;
+               ss.type = 0x03;
+               ss.db = 1;
+               ss.dpl = 3;
+               ss.selector |= SELECTOR_RPL_MASK;
+               ss.present = 1;
+               if (ctxt->mode == X86EMUL_MODE_PROT64) {
+                       ss.base = 0;
+                       ss.limit = 0xffffffff;
+               }
+
+               kvm_x86_ops->set_segment(ctxt->vcpu, &cs, VCPU_SREG_CS);
+               kvm_x86_ops->set_segment(ctxt->vcpu, &ss, VCPU_SREG_SS);
+
+               c->eip = ctxt->vcpu->arch.regs[VCPU_REGS_RDX];
+               c->regs[VCPU_REGS_RSP] = c->regs[VCPU_REGS_RCX];
+
+               /* TODO: Check if rip and rsp are canonical. inject_gp() if not */
+
+               goto writeback;
+               break;
+       }
        case 0x40 ... 0x4f:     /* cmov */
                c->dst.val = c->dst.orig_val = c->src.val;
                if (!test_cc(c->b, ctxt->eflags))
-- 
1.5.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
