date:20130716

Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-16 Thread Gleb Natapov

On Tue, Jul 16, 2013 at 09:07:53AM +0530, Raghavendra K T wrote:
 On 07/15/2013 04:06 PM, Gleb Natapov wrote:
 On Mon, Jul 15, 2013 at 03:20:06PM +0530, Raghavendra K T wrote:
 On 07/14/2013 06:42 PM, Gleb Natapov wrote:
 On Mon, Jun 24, 2013 at 06:13:42PM +0530, Raghavendra K T wrote:
 kvm : Paravirtual ticketlocks support for linux guests running on KVM 
 hypervisor
 
 From: Srivatsa Vaddagiri va...@linux.vnet.ibm.com
 
 trimming
 [...]
 +
 +static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t 
 want)
 +{
 + struct kvm_lock_waiting *w;
 + int cpu;
 + u64 start;
 + unsigned long flags;
 +
 + w = __get_cpu_var(lock_waiting);
 + cpu = smp_processor_id();
 + start = spin_time_start();
 +
 + /*
 +  * Make sure an interrupt handler can't upset things in a
 +  * partially setup state.
 +  */
 + local_irq_save(flags);
 +
 + /*
 +  * The ordering protocol on this is that the lock pointer
 +  * may only be set non-NULL if the want ticket is correct.
 +  * If we're updating want, we must first clear lock.
 +  */
 + w-lock = NULL;
 + smp_wmb();
 + w-want = want;
 + smp_wmb();
 + w-lock = lock;
 +
 + add_stats(TAKEN_SLOW, 1);
 +
 + /*
 +  * This uses set_bit, which is atomic but we should not rely on its
 +  * reordering gurantees. So barrier is needed after this call.
 +  */
 + cpumask_set_cpu(cpu, waiting_cpus);
 +
 + barrier();
 +
 + /*
 +  * Mark entry to slowpath before doing the pickup test to make
 +  * sure we don't deadlock with an unlocker.
 +  */
 + __ticket_enter_slowpath(lock);
 +
 + /*
 +  * check again make sure it didn't become free while
 +  * we weren't looking.
 +  */
 + if (ACCESS_ONCE(lock-tickets.head) == want) {
 + add_stats(TAKEN_SLOW_PICKUP, 1);
 + goto out;
 + }
 +
 + /* Allow interrupts while blocked */
 + local_irq_restore(flags);
 +
 So what happens if an interrupt comes here and an interrupt handler
 takes another spinlock that goes into the slow path? As far as I see
 lock_waiting will become overwritten and cpu will be cleared from
 waiting_cpus bitmap by nested kvm_lock_spinning(), so when halt is
 called here after returning from the interrupt handler nobody is going
 to wake this lock holder. Next random interrupt will fix it, but it
 may be several milliseconds away, or never. We should probably check
 if interrupt were enabled and call native_safe_halt() here.
 
 
 Okay you mean something like below should be done.
 if irq_enabled()
native_safe_halt()
 else
halt()
 
 It is been a complex stuff for analysis for me.
 
 So in our discussion stack would looking like this.
 
 spinlock()
kvm_lock_spinning()
-- interrupt here
halt()
 
 
  From the halt if we trace
 
 It is to early to trace the halt since it was not executed yet. Guest
 stack trace will look something like this:
 
 spinlock(a)
kvm_lock_spinning(a)
 lock_waiting = a
 set bit in waiting_cpus
  -- interrupt here
  spinlock(b)
kvm_lock_spinning(b)
  lock_waiting = b
  set bit in waiting_cpus
  halt()
  unset bit in waiting_cpus
  lock_waiting = NULL
   -- ret from interrupt
 halt()
 
 Now at the time of the last halt above lock_waiting == NULL and
 waiting_cpus is empty and not interrupt it pending, so who will unhalt
 the waiter?
 
 
 Yes. if an interrupt occurs between
 local_irq_restore() and halt(), this is possible. and since this is
 rarest of rare (possiility of irq entering slowpath and then no
 random irq to do spurious wakeup), we had never hit this problem in
 the past.
I do not think it is very rare to get interrupt between
local_irq_restore() and halt() under load since any interrupt that
occurs between local_irq_save() and local_irq_restore() will be delivered
immediately after local_irq_restore(). Of course the chance of no other
random interrupt waking lock waiter is very low, but waiter can sleep
for much longer then needed and this will be noticeable in performance.

BTW can NMI handler take spinlocks? If it can what happens if NMI is
delivered in a section protected by local_irq_save()/local_irq_restore()?

 
 So I am,
 1. trying to artificially reproduce this.
 
 2. I replaced the halt with below code,
if (arch_irqs_disabled())
 halt();
 
 and ran benchmarks.
 But this results in degradation because, it means we again go back
 and spin in irq enabled case.
 
Yes, this is not what I proposed.

 3. Now I am analyzing the performance overhead of safe_halt in irq
 enabled case.
   if (arch_irqs_disabled())
halt();
   else
safe_halt();
Use of arch_irqs_disabled() is incorrect here. If you are doing it before
local_irq_restore() it will always be false since you disabled interrupt
yourself, if you do it after then it is to late since interrupt can come
between local_irq_restore() and

Re: [PATCH 4/4] kvm: Emulate MOVBE

2013-07-16 Thread Gleb Natapov

On Tue, Jul 16, 2013 at 12:41:15AM +0200, Borislav Petkov wrote:
 Fix kvm ML already.
 
 On Tue, Jul 16, 2013 at 12:39:30AM +0200, Borislav Petkov wrote:
  On Sun, Jul 07, 2013 at 01:18:27PM +0300, Gleb Natapov wrote:
   KVM does not enables emulation because QEMU called
   KVM_GET_EMULATED_CPUID. KVM enables emulation because QEMU sets
   MOVBE bit in a guest visible cpuid using KVM_SET_CPUID2. QEMU does
   that either because host and cpu model QEMU uses both have it, or
   because user asked for it specifically and KVM reports that it can
   be emulated. Why emulator should care what is the reason MOVBE is
   enabled?
  
  Ok, I think I know what you mean:
  
  diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
  index 6fe5a838116a..c963ff8e43dd 100644
  --- a/arch/x86/kvm/emulate.c
  +++ b/arch/x86/kvm/emulate.c
  @@ -3147,10 +3147,18 @@ static int em_mov(struct x86_emulate_ctxt *ctxt)
  return X86EMUL_CONTINUE;
   }
   
  +#undef F
  +#define F(x) bit(X86_FEATURE_##x)
  +
   static int em_movbe(struct x86_emulate_ctxt *ctxt)
   {
  +   u32 ebx, ecx, edx, eax = 1;
  u16 tmp;
   
  +   ctxt-ops-get_cpuid(ctxt, eax, ebx, ecx, edx);
  +   if (!(ecx  F(MOVBE)))
  +   return X86EMUL_PROPAGATE_FAULT;
  +
  switch (ctxt-op_bytes) {
  case 2:
  /*
  ---
  
  Right?
 
Right, just replace return X86EMUL_PROPAGATE_FAULT; with return 
emulate_ud(ctxt);

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Disabling mergeable rx buffers for the guest

2013-07-16 Thread Paolo Bonzini

Il 16/07/2013 07:09, Naor Shlomo ha scritto:
 Hi,
 
 I am trying to disable the mergeable rx buffers for a Linux guest but I am 
 currently unable to do so.
 
 I tried looking in the code of libvirt in order to find the -global 
 virtio-net-pci.mrg_rxbuf=off but the only globals configured are:
 PIIX4_PM.disable_s*, isa-fdc.drive, isa-fdc.bootindex, ram_size and vram_size
 
 Is there a way to configure the XML configuration for the domain and disable 
 the mergeable buffers? 
 If not, is there a way to somehow do it from the guest? (I don't want to mess 
 with the host)

You can use something like this:

  qemu:commandline
qemu:arg value='-global'/
qemu:env name='mrg_rxbuf' value='off'/
  /qemu:commandline

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] booke: define reset and shutdown hcalls

2013-07-16 Thread Gleb Natapov

On Mon, Jul 15, 2013 at 01:17:33PM -0500, Scott Wood wrote:
 On 07/15/2013 06:30:20 AM, Gleb Natapov wrote:
 On Mon, Jul 15, 2013 at 04:41:17PM +0530, Bharat Bhushan wrote:
  KVM_HC_VM_RESET: Requests that the virtual machine be reset.
  KVM_HC_VM_SHUTDOWN: Requests that the virtual machine be
 powered-off/halted.
 
  These hcalls are handled by guest userspace.
 
  Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
  ---
   Documentation/virtual/kvm/hypercalls.txt |   16 
   include/uapi/linux/kvm_para.h|3 ++-
   2 files changed, 18 insertions(+), 1 deletions(-)
 
  diff --git a/Documentation/virtual/kvm/hypercalls.txt
 b/Documentation/virtual/kvm/hypercalls.txt
  index ea113b5..58acdc1 100644
  --- a/Documentation/virtual/kvm/hypercalls.txt
  +++ b/Documentation/virtual/kvm/hypercalls.txt
  @@ -64,3 +64,19 @@ Purpose: To enable communication between the
 hypervisor and guest there is a
   shared page that contains parts of supervisor visible register
 state.
   The guest can map this shared page to access its supervisor
 register through
   memory using this hypercall.
  +
  +5. KVM_HC_VM_RESET
  +
  +Architecture: PPC
  +Status: active
  +Purpose:  Requests that the virtual machine be reset.  The
 hcall takes no
  +arguments. If successful the hcall does not return. If an error
 occurs it
  +returns EV_INTERNAL.
  +
  +6. KVM_HC_VM_SHUTDOWN
  +
  +Architecture: PPC
  +Status: active
  +Purpose: Requests that the virtual machine be powered-off/halted.
  +The hcall takes no arguments. If successful the hcall does not
 return.
  +If an error occurs it returns EV_INTERNAL.
  diff --git a/include/uapi/linux/kvm_para.h
 b/include/uapi/linux/kvm_para.h
  index cea2c5c..218882d 100644
  --- a/include/uapi/linux/kvm_para.h
  +++ b/include/uapi/linux/kvm_para.h
  @@ -19,7 +19,8 @@
   #define KVM_HC_MMU_OP 2
   #define KVM_HC_FEATURES   3
   #define KVM_HC_PPC_MAP_MAGIC_PAGE 4
  -
  +#define KVM_HC_VM_RESET   5
  +#define KVM_HC_VM_SHUTDOWN6
 There is no much sense to share hypercalls between architectures.
 There
 is zero probability x86 will implement those for instance
 
 This is similar to the question of whether to keep device API
 enumerations per-architecture...  It costs very little to keep it in
 a common place, and it's hard to go back in the other direction if
 we later realize there are things that should be shared.

This is different from device API since with device API all arches have
to create/destroy devices, so it make sense to put device lifecycle
management into the common code, and device API has single entry point
to the code - device fd ioctl - where it makes sense to handle common
tasks, if any, and despatch others to specific device implementation.

This is totally unlike hypercalls which are, by definition, very
architecture specific (the way they are triggered, the way parameter
are passed from guest to host, what hypercalls arch needs...). The
entry point of hypercalls is in arch specific code (again unlike device
API), so they way to reuse code if need arise is different too and does
not require common name space - just call common function after
retrieving hypercalls parameters in arch specific way.

 Keeping it in a common place also makes it more visible to people
 looking to add new hcalls, which could cut down on reinventing the
 wheel.
I do not want other arches to start using hypercalls in the way powerpc
started to use them: separate device io space, so it is better to hide
this as far away from common code as possible :) But on a more serious
note hypercalls should be a last resort and added only when no other
possibility exists, so people should not look what hcalls others
implemented, so they can add them to their favorite arch, but they
should have a problem at hand that they cannot solve without hcall, but
at this point they will have pretty good idea what this hcall should do.

 
 (not sure why PPC will want them either instead of emulating
 devices that do
 shutdown/reset).
 
 Besides what Alex said, for shutdown we don't have any existing
 device to emulate (our real hardware just doesn't have that
 functionality).  For reset we currently do emulate, but it's awkward
 to describe in the device tree what we actually emulate since the
 reset functionality is part of a kitchen-sink device of which we
 emulate virtually nothing other than the reset.  Currently we
 advertise the entire thing and just ignore the rest, but that causes
 problems with the guest seeing the node and trying to use that
 functionality.
 
What about writing virtio device for shutdown and add missing emulation
to kitchen-sink device (yeah I know easily said that done), or make
the virtio device handle reset too? This of course raises the question
what address to use for a device hence the idea to use hcalls as
separate address space.
 
--

RE: Disabling mergeable rx buffers for the guest

2013-07-16 Thread Naor Shlomo

Hi Paolo and thanks for your quick reply,

I tried editing (virsh edit) the domain's XML and put the XML excerpt you gave 
me everywhere but with no success.
The moment I exit the edit mode the text was gone (I guess it didn't pass some 
sort of sanity and that's why it was automatically erased).

What am I doing wrong?

Naor

-Original Message-
From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo Bonzini
Sent: Tuesday, July 16, 2013 9:12 AM
To: Naor Shlomo
Cc: kvm@vger.kernel.org
Subject: Re: Disabling mergeable rx buffers for the guest

Il 16/07/2013 07:09, Naor Shlomo ha scritto:
 Hi,
 
 I am trying to disable the mergeable rx buffers for a Linux guest but I am 
 currently unable to do so.
 
 I tried looking in the code of libvirt in order to find the -global 
 virtio-net-pci.mrg_rxbuf=off but the only globals configured are:
 PIIX4_PM.disable_s*, isa-fdc.drive, isa-fdc.bootindex, ram_size and vram_size
 
 Is there a way to configure the XML configuration for the domain and disable 
 the mergeable buffers? 
 If not, is there a way to somehow do it from the guest? (I don't want to mess 
 with the host)

You can use something like this:

  qemu:commandline
qemu:arg value='-global'/
qemu:env name='mrg_rxbuf' value='off'/
  /qemu:commandline

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Disabling mergeable rx buffers for the guest

2013-07-16 Thread Paolo Bonzini

Il 16/07/2013 08:40, Naor Shlomo ha scritto:
 Hi Paolo and thanks for your quick reply,
 
 I tried editing (virsh edit) the domain's XML and put the XML excerpt you 
 gave me everywhere but with no success.
 The moment I exit the edit mode the text was gone (I guess it didn't pass 
 some sort of sanity and that's why it was automatically erased).
 
 What am I doing wrong?

My fault.  You need to change the domain opening tag to

domain type='qemu' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'

See http://libvirt.org/drvqemu.html#qemucommand for the docs.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: Disabling mergeable rx buffers for the guest

2013-07-16 Thread Naor Shlomo

Thanks again Paolo,

I used your string and read the documents in the site you referred me to but 
could not understand why doesn't it accept the 
xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0' string.

I tried it on the following version:
Compiled against library: libvir 0.9.8
Using library: libvir 0.9.8
Using API: QEMU 0.9.8
Running hypervisor: QEMU 1.0.0

So according to the site it should be supported.

Any idea what am I missing now?

Naor

-Original Message-
From: Paolo Bonzini [mailto:pbonz...@redhat.com] 
Sent: Tuesday, July 16, 2013 9:42 AM
To: Naor Shlomo
Cc: kvm@vger.kernel.org
Subject: Re: Disabling mergeable rx buffers for the guest

Il 16/07/2013 08:40, Naor Shlomo ha scritto:
 Hi Paolo and thanks for your quick reply,
 
 I tried editing (virsh edit) the domain's XML and put the XML excerpt you 
 gave me everywhere but with no success.
 The moment I exit the edit mode the text was gone (I guess it didn't pass 
 some sort of sanity and that's why it was automatically erased).
 
 What am I doing wrong?

My fault.  You need to change the domain opening tag to

domain type='qemu' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'

See http://libvirt.org/drvqemu.html#qemucommand for the docs.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Arthur Chunqi Li

This is the first version for VMX nested environment test case. It
contains the basic VMX instructions test cases, including VMXON/
VMXOFF/VMXPTRLD/VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch
also tests the basic execution routine in VMX nested environment and
let the VM print Hello World to inform its successfully run.

New files added:
x86/vmx.h : contains all VMX related macro declerations
x86/vmx.c : main file for VMX nested test case

Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 config-x86-common.mak |2 +
 config-x86_64.mak |1 +
 lib/x86/msr.h |5 +
 x86/cstart64.S|4 +
 x86/unittests.cfg |6 +
 x86/vmx.c |  568 +
 x86/vmx.h |  406 +++
 7 files changed, 992 insertions(+)
 create mode 100644 x86/vmx.c
 create mode 100644 x86/vmx.h

diff --git a/config-x86-common.mak b/config-x86-common.mak
index 455032b..34a41e1 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
 
 $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
 
+$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
+
 arch_clean:
$(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
$(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
diff --git a/config-x86_64.mak b/config-x86_64.mak
index 4e525f5..bb8ee89 100644
--- a/config-x86_64.mak
+++ b/config-x86_64.mak
@@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
  $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
  $(TEST_DIR)/pcid.flat
 tests += $(TEST_DIR)/svm.flat
+tests += $(TEST_DIR)/vmx.flat
 
 include config-x86-common.mak
diff --git a/lib/x86/msr.h b/lib/x86/msr.h
index 509a421..281255a 100644
--- a/lib/x86/msr.h
+++ b/lib/x86/msr.h
@@ -396,6 +396,11 @@
 #define MSR_IA32_VMX_VMCS_ENUM  0x048a
 #define MSR_IA32_VMX_PROCBASED_CTLS20x048b
 #define MSR_IA32_VMX_EPT_VPID_CAP   0x048c
+#define MSR_IA32_VMX_TRUE_PIN  0x048d
+#define MSR_IA32_VMX_TRUE_PROC 0x048e
+#define MSR_IA32_VMX_TRUE_EXIT 0x048f
+#define MSR_IA32_VMX_TRUE_ENTRY0x0490
+
 
 /* AMD-V MSRs */
 
diff --git a/x86/cstart64.S b/x86/cstart64.S
index 24df5f8..0fe76da 100644
--- a/x86/cstart64.S
+++ b/x86/cstart64.S
@@ -4,6 +4,10 @@
 .globl boot_idt
 boot_idt = 0
 
+.globl idt_descr
+.globl tss_descr
+.globl gdt64_desc
+
 ipi_vector = 0x20
 
 max_cpus = 64
diff --git a/x86/unittests.cfg b/x86/unittests.cfg
index bc9643e..e846739 100644
--- a/x86/unittests.cfg
+++ b/x86/unittests.cfg
@@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
 file = pcid.flat
 extra_params = -cpu qemu64,+pcid
 arch = x86_64
+
+[vmx]
+file = vmx.flat
+extra_params = -cpu Nehalem,+vmx
+arch = x86_64
+
diff --git a/x86/vmx.c b/x86/vmx.c
new file mode 100644
index 000..0435746
--- /dev/null
+++ b/x86/vmx.c
@@ -0,0 +1,568 @@
+#include libcflat.h
+#include processor.h
+#include vm.h
+#include desc.h
+#include vmx.h
+#include msr.h
+#include smp.h
+#include io.h
+
+
+int fails = 0, tests = 0;
+u32 *vmxon_region;
+struct vmcs *vmcs_root;
+void *io_bmp1, *io_bmp2;
+void *msr_bmp;
+u32 vpid_ctr;
+char *guest_stack, *host_stack;
+char *guest_syscall_stack, *host_syscall_stack;
+u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
+ulong fix_cr0_set, fix_cr0_clr;
+ulong fix_cr4_set, fix_cr4_clr;
+struct regs regs;
+
+extern u64 gdt64_desc[];
+extern u64 idt_descr[];
+extern u64 tss_descr[];
+extern void *entry_vmx;
+extern void *entry_sysenter;
+extern void *entry_guest;
+
+void report(const char *name, int result)
+{
+   ++tests;
+   if (result)
+   printf(PASS: %s\n, name);
+   else {
+   printf(FAIL: %s\n, name);
+   ++fails;
+   }
+}
+
+inline u64 get_rflags(void)
+{
+   u64 r;
+   asm volatile(pushf; pop %0\n\t : =q(r) : : cc);
+   return r;
+}
+
+inline void set_rflags(u64 r)
+{
+   asm volatile(push %0; popf\n\t : : q(r) : cc);
+}
+
+int vmcs_clear(struct vmcs *vmcs)
+{
+   bool ret;
+   asm volatile (vmclear %1; seta %0 : =q (ret) : m (vmcs) : cc);
+   return !ret;
+}
+
+u64 vmcs_read(enum Encoding enc)
+{
+   u64 val;
+   asm volatile (vmread %1, %0 : =rm (val) : r ((u64)enc) : cc);
+   return val;
+}
+
+int vmcs_write(enum Encoding enc, u64 val)
+{
+   bool ret;
+   asm volatile (vmwrite %1, %2; seta %0
+   : =q(ret) : rm (val), r ((u64)enc) : cc);
+   return !ret;
+}
+
+int make_vmcs_current(struct vmcs *vmcs)
+{
+   bool ret;
+
+   asm volatile (vmptrld %1; seta %0 : =q (ret) : m (vmcs) : cc);
+   return !ret;
+}
+
+int save_vmcs(struct vmcs **vmcs)
+{
+   bool ret;
+
+   asm volatile (vmptrst %1; seta %0 : =q (ret) : m (*vmcs) : cc);
+   return !ret;
+}
+
+/* entry_vmx */
+asm(
+   .align 4, 0x90\n\t
+

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Arthur Chunqi Li

Hi there,
This is a version calling for comments. Some minor changes should be
done before final commitment (TODOs in it), because these places are
related to the bugs I have commited in the previous weeks and the
relevant patches are not accpeted. But these bugs are all about some
unexpected occasions and the hypervisor can run if we simply ignore
them. After all bugs fixed, everything will be OK.

Arthur

On Tue, Jul 16, 2013 at 5:27 PM, Arthur Chunqi Li yzt...@gmail.com wrote:
 This is the first version for VMX nested environment test case. It
 contains the basic VMX instructions test cases, including VMXON/
 VMXOFF/VMXPTRLD/VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch
 also tests the basic execution routine in VMX nested environment and
 let the VM print Hello World to inform its successfully run.

 New files added:
 x86/vmx.h : contains all VMX related macro declerations
 x86/vmx.c : main file for VMX nested test case

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
  config-x86-common.mak |2 +
  config-x86_64.mak |1 +
  lib/x86/msr.h |5 +
  x86/cstart64.S|4 +
  x86/unittests.cfg |6 +
  x86/vmx.c |  568 
 +
  x86/vmx.h |  406 +++
  7 files changed, 992 insertions(+)
  create mode 100644 x86/vmx.c
  create mode 100644 x86/vmx.h

 diff --git a/config-x86-common.mak b/config-x86-common.mak
 index 455032b..34a41e1 100644
 --- a/config-x86-common.mak
 +++ b/config-x86-common.mak
 @@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o

  $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o

 +$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
 +
  arch_clean:
 $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
 $(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
 diff --git a/config-x86_64.mak b/config-x86_64.mak
 index 4e525f5..bb8ee89 100644
 --- a/config-x86_64.mak
 +++ b/config-x86_64.mak
 @@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
   $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
   $(TEST_DIR)/pcid.flat
  tests += $(TEST_DIR)/svm.flat
 +tests += $(TEST_DIR)/vmx.flat

  include config-x86-common.mak
 diff --git a/lib/x86/msr.h b/lib/x86/msr.h
 index 509a421..281255a 100644
 --- a/lib/x86/msr.h
 +++ b/lib/x86/msr.h
 @@ -396,6 +396,11 @@
  #define MSR_IA32_VMX_VMCS_ENUM  0x048a
  #define MSR_IA32_VMX_PROCBASED_CTLS20x048b
  #define MSR_IA32_VMX_EPT_VPID_CAP   0x048c
 +#define MSR_IA32_VMX_TRUE_PIN  0x048d
 +#define MSR_IA32_VMX_TRUE_PROC 0x048e
 +#define MSR_IA32_VMX_TRUE_EXIT 0x048f
 +#define MSR_IA32_VMX_TRUE_ENTRY0x0490
 +

  /* AMD-V MSRs */

 diff --git a/x86/cstart64.S b/x86/cstart64.S
 index 24df5f8..0fe76da 100644
 --- a/x86/cstart64.S
 +++ b/x86/cstart64.S
 @@ -4,6 +4,10 @@
  .globl boot_idt
  boot_idt = 0

 +.globl idt_descr
 +.globl tss_descr
 +.globl gdt64_desc
 +
  ipi_vector = 0x20

  max_cpus = 64
 diff --git a/x86/unittests.cfg b/x86/unittests.cfg
 index bc9643e..e846739 100644
 --- a/x86/unittests.cfg
 +++ b/x86/unittests.cfg
 @@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
  file = pcid.flat
  extra_params = -cpu qemu64,+pcid
  arch = x86_64
 +
 +[vmx]
 +file = vmx.flat
 +extra_params = -cpu Nehalem,+vmx
 +arch = x86_64
 +
 diff --git a/x86/vmx.c b/x86/vmx.c
 new file mode 100644
 index 000..0435746
 --- /dev/null
 +++ b/x86/vmx.c
 @@ -0,0 +1,568 @@
 +#include libcflat.h
 +#include processor.h
 +#include vm.h
 +#include desc.h
 +#include vmx.h
 +#include msr.h
 +#include smp.h
 +#include io.h
 +
 +
 +int fails = 0, tests = 0;
 +u32 *vmxon_region;
 +struct vmcs *vmcs_root;
 +void *io_bmp1, *io_bmp2;
 +void *msr_bmp;
 +u32 vpid_ctr;
 +char *guest_stack, *host_stack;
 +char *guest_syscall_stack, *host_syscall_stack;
 +u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
 +ulong fix_cr0_set, fix_cr0_clr;
 +ulong fix_cr4_set, fix_cr4_clr;
 +struct regs regs;
 +
 +extern u64 gdt64_desc[];
 +extern u64 idt_descr[];
 +extern u64 tss_descr[];
 +extern void *entry_vmx;
 +extern void *entry_sysenter;
 +extern void *entry_guest;
 +
 +void report(const char *name, int result)
 +{
 +   ++tests;
 +   if (result)
 +   printf(PASS: %s\n, name);
 +   else {
 +   printf(FAIL: %s\n, name);
 +   ++fails;
 +   }
 +}
 +
 +inline u64 get_rflags(void)
 +{
 +   u64 r;
 +   asm volatile(pushf; pop %0\n\t : =q(r) : : cc);
 +   return r;
 +}
 +
 +inline void set_rflags(u64 r)
 +{
 +   asm volatile(push %0; popf\n\t : : q(r) : cc);
 +}
 +
 +int vmcs_clear(struct vmcs *vmcs)
 +{
 +   bool ret;
 +   asm volatile (vmclear %1; seta %0 : =q (ret) : m (vmcs) : cc);
 +   return !ret;
 +}
 +
 +u64 vmcs_read(enum Encoding enc)
 +{
 +   u64 val;
 +   asm volatile (vmread

Re: Disabling mergeable rx buffers for the guest

2013-07-16 Thread Paolo Bonzini

Il 16/07/2013 10:06, Naor Shlomo ha scritto:
 Thanks again Paolo,
 
 I used your string and read the documents in the site you referred me to but 
 could not understand why doesn't it accept the 
 xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0' string.
 
 I tried it on the following version:
 Compiled against library: libvir 0.9.8
 Using library: libvir 0.9.8
 Using API: QEMU 0.9.8
 Running hypervisor: QEMU 1.0.0
 
 So according to the site it should be supported.
 
 Any idea what am I missing now?

Not sure.  Can you post here the full XML, and the one you're trying to use?

Paolo

 Naor
 
 -Original Message-
 From: Paolo Bonzini [mailto:pbonz...@redhat.com] 
 Sent: Tuesday, July 16, 2013 9:42 AM
 To: Naor Shlomo
 Cc: kvm@vger.kernel.org
 Subject: Re: Disabling mergeable rx buffers for the guest
 
 Il 16/07/2013 08:40, Naor Shlomo ha scritto:
 Hi Paolo and thanks for your quick reply,

 I tried editing (virsh edit) the domain's XML and put the XML excerpt you 
 gave me everywhere but with no success.
 The moment I exit the edit mode the text was gone (I guess it didn't pass 
 some sort of sanity and that's why it was automatically erased).

 What am I doing wrong?
 
 My fault.  You need to change the domain opening tag to
 
 domain type='qemu' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'
 
 See http://libvirt.org/drvqemu.html#qemucommand for the docs.
 
 Paolo
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Gleb Natapov

On Tue, Jul 16, 2013 at 05:35:16PM +0800, Arthur Chunqi Li wrote:
 Hi there,
 This is a version calling for comments. Some minor changes should be
Add RFC before PATCH for such submission please.

 done before final commitment (TODOs in it), because these places are
 related to the bugs I have commited in the previous weeks and the
 relevant patches are not accpeted. But these bugs are all about some
I am aware of only one patch. Did I miss/forgot something?

 unexpected occasions and the hypervisor can run if we simply ignore
 them. After all bugs fixed, everything will be OK.
 
 Arthur
 

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Arthur Chunqi Li

On Tue, Jul 16, 2013 at 5:45 PM, Gleb Natapov g...@redhat.com wrote:
 On Tue, Jul 16, 2013 at 05:35:16PM +0800, Arthur Chunqi Li wrote:
 Hi there,
 This is a version calling for comments. Some minor changes should be
 Add RFC before PATCH for such submission please.

 done before final commitment (TODOs in it), because these places are
 related to the bugs I have commited in the previous weeks and the
 relevant patches are not accpeted. But these bugs are all about some
 I am aware of only one patch. Did I miss/forgot something?
I have only commit one patch related to VMX tests. Other patches cited
here is about the bug fixes such as
[http://www.mail-archive.com/kvm@vger.kernel.org/msg92932.html]
[http://www.mail-archive.com/kvm@vger.kernel.org/msg93046.html], which
will cause some test cases fail in this patch.

I have been discussing with Jan in the past weeks because most of the
questions are technical affairs, which I don't think it's suitable to
discuss in the community. This is a rather mature version and I think
I can commit it with some minor changes.

 unexpected occasions and the hypervisor can run if we simply ignore
 them. After all bugs fixed, everything will be OK.

 Arthur


 --
 Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Gleb Natapov

On Tue, Jul 16, 2013 at 05:53:56PM +0800, Arthur Chunqi Li wrote:
 On Tue, Jul 16, 2013 at 5:45 PM, Gleb Natapov g...@redhat.com wrote:
  On Tue, Jul 16, 2013 at 05:35:16PM +0800, Arthur Chunqi Li wrote:
  Hi there,
  This is a version calling for comments. Some minor changes should be
  Add RFC before PATCH for such submission please.
 
  done before final commitment (TODOs in it), because these places are
  related to the bugs I have commited in the previous weeks and the
  relevant patches are not accpeted. But these bugs are all about some
  I am aware of only one patch. Did I miss/forgot something?
 I have only commit one patch related to VMX tests. Other patches cited
 here is about the bug fixes such as
 [http://www.mail-archive.com/kvm@vger.kernel.org/msg92932.html]
 [http://www.mail-archive.com/kvm@vger.kernel.org/msg93046.html], which
 will cause some test cases fail in this patch.
 
Those are applied to queue already. I am aware of one that is not
applied yet but it looks fine to me, waiting for Paolo's review.

 I have been discussing with Jan in the past weeks because most of the
 questions are technical affairs, which I don't think it's suitable to
 discuss in the community. This is a rather mature version and I think
 I can commit it with some minor changes.
 
  unexpected occasions and the hypervisor can run if we simply ignore
  them. After all bugs fixed, everything will be OK.
 
  Arthur
 
 
  --
  Gleb.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Paolo Bonzini

Il 16/07/2013 11:27, Arthur Chunqi Li ha scritto:
 This is the first version for VMX nested environment test case. It
 contains the basic VMX instructions test cases, including VMXON/
 VMXOFF/VMXPTRLD/VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch
 also tests the basic execution routine in VMX nested environment and
 let the VM print Hello World to inform its successfully run.
 
 New files added:
 x86/vmx.h : contains all VMX related macro declerations
 x86/vmx.c : main file for VMX nested test case
 
 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
  config-x86-common.mak |2 +
  config-x86_64.mak |1 +
  lib/x86/msr.h |5 +
  x86/cstart64.S|4 +
  x86/unittests.cfg |6 +
  x86/vmx.c |  568 
 +
  x86/vmx.h |  406 +++
  7 files changed, 992 insertions(+)
  create mode 100644 x86/vmx.c
  create mode 100644 x86/vmx.h
 
 diff --git a/config-x86-common.mak b/config-x86-common.mak
 index 455032b..34a41e1 100644
 --- a/config-x86-common.mak
 +++ b/config-x86-common.mak
 @@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
  
  $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
  
 +$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
 +
  arch_clean:
   $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
   $(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
 diff --git a/config-x86_64.mak b/config-x86_64.mak
 index 4e525f5..bb8ee89 100644
 --- a/config-x86_64.mak
 +++ b/config-x86_64.mak
 @@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
 $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
 $(TEST_DIR)/pcid.flat
  tests += $(TEST_DIR)/svm.flat
 +tests += $(TEST_DIR)/vmx.flat
  
  include config-x86-common.mak
 diff --git a/lib/x86/msr.h b/lib/x86/msr.h
 index 509a421..281255a 100644
 --- a/lib/x86/msr.h
 +++ b/lib/x86/msr.h
 @@ -396,6 +396,11 @@
  #define MSR_IA32_VMX_VMCS_ENUM  0x048a
  #define MSR_IA32_VMX_PROCBASED_CTLS20x048b
  #define MSR_IA32_VMX_EPT_VPID_CAP   0x048c
 +#define MSR_IA32_VMX_TRUE_PIN0x048d
 +#define MSR_IA32_VMX_TRUE_PROC   0x048e
 +#define MSR_IA32_VMX_TRUE_EXIT   0x048f
 +#define MSR_IA32_VMX_TRUE_ENTRY  0x0490
 +
  
  /* AMD-V MSRs */
  
 diff --git a/x86/cstart64.S b/x86/cstart64.S
 index 24df5f8..0fe76da 100644
 --- a/x86/cstart64.S
 +++ b/x86/cstart64.S
 @@ -4,6 +4,10 @@
  .globl boot_idt
  boot_idt = 0
  
 +.globl idt_descr
 +.globl tss_descr
 +.globl gdt64_desc
 +
  ipi_vector = 0x20
  
  max_cpus = 64
 diff --git a/x86/unittests.cfg b/x86/unittests.cfg
 index bc9643e..e846739 100644
 --- a/x86/unittests.cfg
 +++ b/x86/unittests.cfg
 @@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
  file = pcid.flat
  extra_params = -cpu qemu64,+pcid
  arch = x86_64
 +
 +[vmx]
 +file = vmx.flat
 +extra_params = -cpu Nehalem,+vmx

Should this use -cpu host instead? (Or -cpu host,+vmx, I don't
remember).

 +arch = x86_64
 +
 diff --git a/x86/vmx.c b/x86/vmx.c
 new file mode 100644
 index 000..0435746
 --- /dev/null
 +++ b/x86/vmx.c
 @@ -0,0 +1,568 @@
 +#include libcflat.h
 +#include processor.h
 +#include vm.h
 +#include desc.h
 +#include vmx.h
 +#include msr.h
 +#include smp.h
 +#include io.h
 +
 +
 +int fails = 0, tests = 0;
 +u32 *vmxon_region;
 +struct vmcs *vmcs_root;
 +void *io_bmp1, *io_bmp2;
 +void *msr_bmp;
 +u32 vpid_ctr;
 +char *guest_stack, *host_stack;
 +char *guest_syscall_stack, *host_syscall_stack;
 +u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
 +ulong fix_cr0_set, fix_cr0_clr;
 +ulong fix_cr4_set, fix_cr4_clr;
 +struct regs regs;
 +
 +extern u64 gdt64_desc[];
 +extern u64 idt_descr[];
 +extern u64 tss_descr[];
 +extern void *entry_vmx;
 +extern void *entry_sysenter;
 +extern void *entry_guest;
 +
 +void report(const char *name, int result)
 +{
 + ++tests;
 + if (result)
 + printf(PASS: %s\n, name);
 + else {
 + printf(FAIL: %s\n, name);
 + ++fails;
 + }
 +}
 +
 +inline u64 get_rflags(void)
 +{
 + u64 r;
 + asm volatile(pushf; pop %0\n\t : =q(r) : : cc);
 + return r;
 +}
 +
 +inline void set_rflags(u64 r)
 +{
 + asm volatile(push %0; popf\n\t : : q(r) : cc);
 +}
 +
 +int vmcs_clear(struct vmcs *vmcs)
 +{
 + bool ret;
 + asm volatile (vmclear %1; seta %0 : =q (ret) : m (vmcs) : cc);

You can use setbe, it's clearer and avoids the ! in the return statement.

We should later add tests for failure conditions, since failing to
detect errors could give rise to L2-L1 attack vectors.  When we do so,
we will have to distinguish CF from ZF.


 + return !ret;
 +}
 +
 +u64 vmcs_read(enum Encoding enc)
 +{
 + u64 val;
 + asm volatile (vmread %1, %0 : =rm (val) : r ((u64)enc) : cc);
 + return val;
 +}
 +
 +int vmcs_write(enum Encoding enc, u64 val)

Re: Disabling mergeable rx buffers for the guest

2013-07-16 Thread Paolo Bonzini

Il 16/07/2013 12:40, Naor Shlomo ha scritto:
 Hi Paolo,
 
 For some unknown reason it suddenly started to accept the changes to the XML 
 and the strings you gave me are now in place.

Good.

 Upon machine start I now receive the following error messages:
 
 virsh # start NaorDev
 error: Failed to start domain NaorDev
 error: internal error Process exited while reading console log output: kvm: 
 -global: requires an argument
 

That's because I cut-and-pasted without reading:

   qemu:commandline
 qemu:arg value='-global'/
 qemu:env name='mrg_rxbuf' value='off'/
   /qemu:commandline

The right one is (or at this point I'd better say should be):

   qemu:commandline
 qemu:arg value='-global'/
 qemu:arg value='virtio-net-pci.mrg_rxbuf=off'/
   /qemu:commandline

Paolo

 /domain
 
 Naor
 
 -Original Message-
 From: Paolo Bonzini [mailto:pbonz...@redhat.com] 
 Sent: Tuesday, July 16, 2013 12:36 PM
 To: Naor Shlomo
 Cc: kvm@vger.kernel.org
 Subject: Re: Disabling mergeable rx buffers for the guest
 
 Il 16/07/2013 10:06, Naor Shlomo ha scritto:
 Thanks again Paolo,

 I used your string and read the documents in the site you referred me to but 
 could not understand why doesn't it accept the 
 xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0' string.

 I tried it on the following version:
 Compiled against library: libvir 0.9.8 Using library: libvir 0.9.8 
 Using API: QEMU 0.9.8 Running hypervisor: QEMU 1.0.0

 So according to the site it should be supported.

 Any idea what am I missing now?
 
 Not sure.  Can you post here the full XML, and the one you're trying to use?
 
 Paolo
 
 Naor

 -Original Message-
 From: Paolo Bonzini [mailto:pbonz...@redhat.com]
 Sent: Tuesday, July 16, 2013 9:42 AM
 To: Naor Shlomo
 Cc: kvm@vger.kernel.org
 Subject: Re: Disabling mergeable rx buffers for the guest

 Il 16/07/2013 08:40, Naor Shlomo ha scritto:
 Hi Paolo and thanks for your quick reply,

 I tried editing (virsh edit) the domain's XML and put the XML excerpt you 
 gave me everywhere but with no success.
 The moment I exit the edit mode the text was gone (I guess it didn't pass 
 some sort of sanity and that's why it was automatically erased).

 What am I doing wrong?

 My fault.  You need to change the domain opening tag to

 domain type='qemu' 
 xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'

 See http://libvirt.org/drvqemu.html#qemucommand for the docs.

 Paolo

 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Disabling mergeable rx buffers for the guest

2013-07-16 Thread Daniel P. Berrange

On Tue, Jul 16, 2013 at 10:40:28AM +, Naor Shlomo wrote:
 Hi Paolo,
 
 For some unknown reason it suddenly started to accept the changes to the XML 
 and the strings you gave me are now in place.
 Upon machine start I now receive the following error messages:
 
 virsh # start NaorDev
 error: Failed to start domain NaorDev
 error: internal error Process exited while reading console log output: kvm: 
 -global: requires an argument
 
 Here's the XML:
 

   qemu:commandline
 qemu:arg value='-global'/
 qemu:env name='mrg_rxbuf' value='off'/
   /qemu:commandline

Presumably what you wanted to do was

   qemu:commandline
 qemu:arg value='-global'/
 qemu:arg value='mrg_rxbuf=off'/
   /qemu:commandline

Rather than setting an environment variable.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5] KVM: nVMX: Set segment infomation of L1 when L2 exits

2013-07-16 Thread Paolo Bonzini

Il 15/07/2013 10:04, Arthur Chunqi Li ha scritto:
 When L2 exits to L1, segment infomations of L1 are not set correctly.
 According to Intel SDM 27.5.2(Loading Host Segment and Descriptor
 Table Registers), segment base/limit/access right of L1 should be
 set to some designed value when L2 exits to L1. This patch fixes
 this.
 
 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
  arch/x86/kvm/vmx.c |   58 
 +++-
  1 file changed, 48 insertions(+), 10 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 064d0be..71eb55b 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -7948,6 +7948,8 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, 
 struct vmcs12 *vmcs12)
  static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
  struct vmcs12 *vmcs12)
  {
 + struct kvm_segment seg;
 +
   if (vmcs12-vm_exit_controls  VM_EXIT_LOAD_IA32_EFER)
   vcpu-arch.efer = vmcs12-host_ia32_efer;
   else if (vmcs12-vm_exit_controls  VM_EXIT_HOST_ADDR_SPACE_SIZE)
 @@ -8001,16 +8003,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu 
 *vcpu,
   vmcs_writel(GUEST_SYSENTER_EIP, vmcs12-host_ia32_sysenter_eip);
   vmcs_writel(GUEST_IDTR_BASE, vmcs12-host_idtr_base);
   vmcs_writel(GUEST_GDTR_BASE, vmcs12-host_gdtr_base);
 - vmcs_writel(GUEST_TR_BASE, vmcs12-host_tr_base);
 - vmcs_writel(GUEST_GS_BASE, vmcs12-host_gs_base);
 - vmcs_writel(GUEST_FS_BASE, vmcs12-host_fs_base);
 - vmcs_write16(GUEST_ES_SELECTOR, vmcs12-host_es_selector);
 - vmcs_write16(GUEST_CS_SELECTOR, vmcs12-host_cs_selector);
 - vmcs_write16(GUEST_SS_SELECTOR, vmcs12-host_ss_selector);
 - vmcs_write16(GUEST_DS_SELECTOR, vmcs12-host_ds_selector);
 - vmcs_write16(GUEST_FS_SELECTOR, vmcs12-host_fs_selector);
 - vmcs_write16(GUEST_GS_SELECTOR, vmcs12-host_gs_selector);
 - vmcs_write16(GUEST_TR_SELECTOR, vmcs12-host_tr_selector);
  
   if (vmcs12-vm_exit_controls  VM_EXIT_LOAD_IA32_PAT)
   vmcs_write64(GUEST_IA32_PAT, vmcs12-host_ia32_pat);
 @@ -8018,6 +8010,52 @@ static void load_vmcs12_host_state(struct kvm_vcpu 
 *vcpu,
   vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
   vmcs12-host_ia32_perf_global_ctrl);
  
 + /* Set L1 segment info according to Intel SDM
 + 27.5.2 Loading Host Segment and Descriptor-Table Registers */
 + seg = (struct kvm_segment) {
 + .base = 0,
 + .limit = 0x,
 + .selector = vmcs12-host_cs_selector,
 + .type = 11,
 + .present = 1,
 + .s = 1,
 + .g = 1
 + };
 + if (vmcs12-vm_exit_controls  VM_EXIT_HOST_ADDR_SPACE_SIZE)
 + seg.l = 1;
 + else
 + seg.db = 1;
 + vmx_set_segment(vcpu, seg, VCPU_SREG_CS);
 + seg = (struct kvm_segment) {
 + .base = 0,
 + .limit = 0x,
 + .type = 3,
 + .present = 1,
 + .s = 1,
 + .db = 1,
 + .g = 1
 + };
 + seg.selector = vmcs12-host_ds_selector;
 + vmx_set_segment(vcpu, seg, VCPU_SREG_DS);
 + seg.selector = vmcs12-host_es_selector;
 + vmx_set_segment(vcpu, seg, VCPU_SREG_ES);
 + seg.selector = vmcs12-host_ss_selector;
 + vmx_set_segment(vcpu, seg, VCPU_SREG_SS);
 + seg.selector = vmcs12-host_fs_selector;
 + seg.base = vmcs12-host_fs_base;
 + vmx_set_segment(vcpu, seg, VCPU_SREG_FS);
 + seg.selector = vmcs12-host_gs_selector;
 + seg.base = vmcs12-host_gs_base;
 + vmx_set_segment(vcpu, seg, VCPU_SREG_GS);
 + seg = (struct kvm_segment) {
 + .base = 0,
 + .limit = 0x67,
 + .selector = vmcs12-host_tr_selector,
 + .type = 11,
 + .present = 1
 + };
 + vmx_set_segment(vcpu, seg, VCPU_SREG_TR);
 +
   kvm_set_dr(vcpu, 7, 0x400);
   vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
  }
 

Applied, thanks.

Paolo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: Disabling mergeable rx buffers for the guest

2013-07-16 Thread Naor Shlomo

Paolo, 

Sorry for all the trouble.
We got a progress. Your last qemu:commandline really worked and I was able to 
try and start the machine.
The problem is that I reached up to the tun driver load section and then the 
machine shut itself down.

The last lines of the console showed the following:

[1.568226] Fixed MDIO Bus: probed
[1.568833] tun: Universal TUN/TAP device driver, 1.6
[1.569108] tun: (C) 1999-2004 Max Krasnyansky m...@qualcomm.com

And it got me back to the virsh command line.

Here's an output of the log file:

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin 
QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M pc-1.0 -no-kvm -m 2069 -smp 
1,sockets=1,cores=1,threads=1 -name NaorDev -uuid 
60b5a0ab-8932-47c1-8674-760c7e1f4743 -nodefconfig -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/NaorDev.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-drive 
file=/var/lib/libvirt/images/NaorDev.img,if=none,id=drive-virtio-disk0,format=raw,cache=writeback
 -device 
virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive 
file=/var/lib/libvirt/images/NaorAddonHdd.qcow,if=none,id=drive-virtio-disk1,format=qcow2,cache=none
 -device 
virtio-blk-pci,bus=pci.0,addr=0x8,drive=drive-virtio-disk1,id=virtio-disk1 
-netdev tap,fd=18,id=hostnet0,vhost=on,vhostfd=19 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c6:b7:a3,bus=pci.0,addr=0x3 
-netdev tap,fd=20,id=hostnet1,vhost=on,vhostfd=21 -device 
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:9e:44:90,bus=pci.0,addr=0x4 
-netdev tap,fd=22,id=hostnet2,vhost=on,vhostfd=23 -device 
virtio-net-pci,netdev=hostnet2,id=net2,mac=52:54:00:e7:f9:bf,bus=pci.0,addr=0x7 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-usb -vnc 127.0.0.1:0 -vga cirrus -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -global 
virtio-net-pci.mrg_rxbuf=off
Domain id=9 is tainted: custom-argv
char device redirected to /dev/pts/1
2013-07-16 12:48:26.440+: shutting down

and here's the output of lsmod:

vhost_net  32360  0 
macvtap18529  1 vhost_net
macvlan19003  1 macvtap
ip6table_filter12816  0 
ip6_tables 27686  1 ip6table_filter
ebtable_nat12808  0 
ebtables   31024  1 ebtable_nat
ipt_MASQUERADE 12760  3 
iptable_nat13230  1 
nf_nat 25646  2 ipt_MASQUERADE,iptable_nat
nf_conntrack_ipv4  14531  4 iptable_nat,nf_nat
nf_defrag_ipv4 12730  1 nf_conntrack_ipv4
xt_state   12579  1 
nf_conntrack   83300  5 
ipt_MASQUERADE,iptable_nat,nf_nat,nf_conntrack_ipv4,xt_state
ipt_REJECT 12577  2 
xt_CHECKSUM12550  1 
iptable_mangle 12735  1 
xt_tcpudp  12604  5 
iptable_filter 12811  1 
ip_tables  27474  3 iptable_nat,iptable_mangle,iptable_filter
x_tables   29892  12 
ip6table_filter,ip6_tables,ebtables,ipt_MASQUERADE,iptable_nat,xt_state,ipt_REJECT,xt_CHECKSUM,iptable_mangle,xt_tcpudp,iptable_filter,ip_tables
dm_crypt   23126  0 
coretemp   13642  0 
kvm_intel 137888  0 
kvm   422160  1 kvm_intel
ext2   73799  1 
gpio_ich   13384  0 
dm_multipath   23306  0 
scsi_dh14589  1 dm_multipath
lp 17800  0 
microcode  23030  0 
parport46563  1 lp
bridge 91498  0 
sb_edac18104  0 
stp12977  1 bridge
llc14598  2 bridge,stp
joydev 17694  0 
edac_core  53053  3 sb_edac
shpchp 37278  0 
lpc_ich17145  0 
mei41410  0 
acpi_power_meter   18124  0 
mac_hid13254  0 
btrfs 781017  0 
zlib_deflate   27140  1 btrfs
libcrc32c  12645  1 btrfs
vesafb 13846  1 
hid_generic12541  0 
usbhid 47259  0 
hid   100815  2 hid_generic,usbhid
ses17418  0 
enclosure  15313  1 ses
ghash_clmulni_intel13221  0 
aesni_intel51134  0 
cryptd 20531  2 ghash_clmulni_intel,aesni_intel
aes_x86_64 17256  1 aesni_intel
megaraid_sas   82737  2 
ixgbe 184924  0 
dca15233  1 ixgbe
mdio   13808  1 ixgbe
tg3   156594  0 
wmi19257  0

I will appreciate it if you tell me what I am missing now.

Thanks,
Naor

-Original Message-
From: Paolo Bonzini [mailto:pbonz...@redhat.com] 
Sent: Tuesday, July 16, 2013 1:42 PM
To: Naor Shlomo
Cc: kvm@vger.kernel.org
Subject: Re: Disabling mergeable rx buffers for the guest

Il 16/07/2013 12:40, Naor Shlomo ha

RE: Disabling mergeable rx buffers for the guest

2013-07-16 Thread Naor Shlomo

Thanks Daniel,

Paolo already got it right, it's:

qemu:commandline
  qemu:arg value='-global'/
  qemu:arg value='virtio-net-pci.mrg_rxbuf=off'/
/qemu:commandline

-Original Message-
From: Daniel P. Berrange [mailto:berra...@redhat.com] 
Sent: Tuesday, July 16, 2013 1:43 PM
To: Naor Shlomo
Cc: Paolo Bonzini; kvm@vger.kernel.org
Subject: Re: Disabling mergeable rx buffers for the guest

On Tue, Jul 16, 2013 at 10:40:28AM +, Naor Shlomo wrote:
 Hi Paolo,
 
 For some unknown reason it suddenly started to accept the changes to the XML 
 and the strings you gave me are now in place.
 Upon machine start I now receive the following error messages:
 
 virsh # start NaorDev
 error: Failed to start domain NaorDev
 error: internal error Process exited while reading console log output: kvm: 
 -global: requires an argument
 
 Here's the XML:
 

   qemu:commandline
 qemu:arg value='-global'/
 qemu:env name='mrg_rxbuf' value='off'/
   /qemu:commandline

Presumably what you wanted to do was

   qemu:commandline
 qemu:arg value='-global'/
 qemu:arg value='mrg_rxbuf=off'/
   /qemu:commandline

Rather than setting an environment variable.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
N�r��yb�X��ǧv�^�)޺{.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf

Re: [[Qemu-devel] [PATCH]] nVMX: Initialize IA32_FEATURE_CONTROL MSR in reset and migration

2013-07-16 Thread Gleb Natapov

On Sun, Jul 07, 2013 at 11:13:37PM +0800, Arthur Chunqi Li wrote:
 The recent KVM patch adds IA32_FEATURE_CONTROL support. QEMU needs
 to clear this MSR when reset vCPU and keep the value of it when
 migration. This patch add this feature.
 
So what happens if we migrate from qemu that does not have this patch
to qemu that does? Since msr_ia32_feature_control will not be migrated
it will not be set on the destination so destination will not be able to
use nested vmx. Since nested vmx is experimental it may be to early for
us to care about it though, and nested vmx does not work with migration
anyway.

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
  target-i386/cpu.h |2 ++
  target-i386/kvm.c |4 
  target-i386/machine.c |   22 ++
  3 files changed, 28 insertions(+)
 
 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index 62e3547..a418e17 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -301,6 +301,7 @@
  #define MSR_IA32_APICBASE_BSP   (18)
  #define MSR_IA32_APICBASE_ENABLE(111)
  #define MSR_IA32_APICBASE_BASE  (0xf12)
 +#define MSR_IA32_FEATURE_CONTROL0x003a
  #define MSR_TSC_ADJUST  0x003b
  #define MSR_IA32_TSCDEADLINE0x6e0
  
 @@ -813,6 +814,7 @@ typedef struct CPUX86State {
  
  uint64_t mcg_status;
  uint64_t msr_ia32_misc_enable;
 +uint64_t msr_ia32_feature_control;
  
  /* exception/interrupt handling */
  int error_code;
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 39f4fbb..3cb2161 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -1122,6 +1122,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  if (hyperv_vapic_recommended()) {
  kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
  }
 +kvm_msr_entry_set(msrs[n++], MSR_IA32_FEATURE_CONTROL, 
 env-msr_ia32_feature_control);
  }
  if (env-mcg_cap) {
  int i;
 @@ -1346,6 +1347,7 @@ static int kvm_get_msrs(X86CPU *cpu)
  if (has_msr_misc_enable) {
  msrs[n++].index = MSR_IA32_MISC_ENABLE;
  }
 +msrs[n++].index = MSR_IA32_FEATURE_CONTROL;
  
  if (!env-tsc_valid) {
  msrs[n++].index = MSR_IA32_TSC;
 @@ -1444,6 +1446,8 @@ static int kvm_get_msrs(X86CPU *cpu)
  case MSR_IA32_MISC_ENABLE:
  env-msr_ia32_misc_enable = msrs[i].data;
  break;
 +case MSR_IA32_FEATURE_CONTROL:
 +env-msr_ia32_feature_control = msrs[i].data;
  default:
  if (msrs[i].index = MSR_MC0_CTL 
  msrs[i].index  MSR_MC0_CTL + (env-mcg_cap  0xff) * 4) {
 diff --git a/target-i386/machine.c b/target-i386/machine.c
 index 3659db9..94ca914 100644
 --- a/target-i386/machine.c
 +++ b/target-i386/machine.c
 @@ -399,6 +399,14 @@ static bool misc_enable_needed(void *opaque)
  return env-msr_ia32_misc_enable != MSR_IA32_MISC_ENABLE_DEFAULT;
  }
  
 +static bool feature_control_needed(void *opaque)
 +{
 +X86CPU *cpu = opaque;
 +CPUX86State *env = cpu-env;
 +
 +return env-msr_ia32_feature_control != 0;
 +}
 +
  static const VMStateDescription vmstate_msr_ia32_misc_enable = {
  .name = cpu/msr_ia32_misc_enable,
  .version_id = 1,
 @@ -410,6 +418,17 @@ static const VMStateDescription 
 vmstate_msr_ia32_misc_enable = {
  }
  };
  
 +static const VMStateDescription vmstate_msr_ia32_feature_control = {
 +.name = cpu/msr_ia32_feature_control,
 +.version_id = 1,
 +.minimum_version_id = 1,
 +.minimum_version_id_old = 1,
 +.fields  = (VMStateField []) {
 +VMSTATE_UINT64(env.msr_ia32_feature_control, X86CPU),
 +VMSTATE_END_OF_LIST()
 +}
 +};
 +
  const VMStateDescription vmstate_x86_cpu = {
  .name = cpu,
  .version_id = 12,
 @@ -535,6 +554,9 @@ const VMStateDescription vmstate_x86_cpu = {
  }, {
  .vmsd = vmstate_msr_ia32_misc_enable,
  .needed = misc_enable_needed,
 +}, {
 +.vmsd = vmstate_msr_ia32_feature_control,
 +.needed = feature_control_needed,
  } , {
  /* empty */
  }
 -- 
 1.7.9.5

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Arthur Chunqi Li

Hi Paolo,

On Tue, Jul 16, 2013 at 6:28 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 16/07/2013 11:27, Arthur Chunqi Li ha scritto:
 This is the first version for VMX nested environment test case. It
 contains the basic VMX instructions test cases, including VMXON/
 VMXOFF/VMXPTRLD/VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch
 also tests the basic execution routine in VMX nested environment and
 let the VM print Hello World to inform its successfully run.

 New files added:
 x86/vmx.h : contains all VMX related macro declerations
 x86/vmx.c : main file for VMX nested test case

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
  config-x86-common.mak |2 +
  config-x86_64.mak |1 +
  lib/x86/msr.h |5 +
  x86/cstart64.S|4 +
  x86/unittests.cfg |6 +
  x86/vmx.c |  568 
 +
  x86/vmx.h |  406 +++
  7 files changed, 992 insertions(+)
  create mode 100644 x86/vmx.c
  create mode 100644 x86/vmx.h

 diff --git a/config-x86-common.mak b/config-x86-common.mak
 index 455032b..34a41e1 100644
 --- a/config-x86-common.mak
 +++ b/config-x86-common.mak
 @@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) 
 $(TEST_DIR)/asyncpf.o

  $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o

 +$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
 +
  arch_clean:
   $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
   $(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
 diff --git a/config-x86_64.mak b/config-x86_64.mak
 index 4e525f5..bb8ee89 100644
 --- a/config-x86_64.mak
 +++ b/config-x86_64.mak
 @@ -9,5 +9,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
 $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
 $(TEST_DIR)/pcid.flat
  tests += $(TEST_DIR)/svm.flat
 +tests += $(TEST_DIR)/vmx.flat

  include config-x86-common.mak
 diff --git a/lib/x86/msr.h b/lib/x86/msr.h
 index 509a421..281255a 100644
 --- a/lib/x86/msr.h
 +++ b/lib/x86/msr.h
 @@ -396,6 +396,11 @@
  #define MSR_IA32_VMX_VMCS_ENUM  0x048a
  #define MSR_IA32_VMX_PROCBASED_CTLS20x048b
  #define MSR_IA32_VMX_EPT_VPID_CAP   0x048c
 +#define MSR_IA32_VMX_TRUE_PIN0x048d
 +#define MSR_IA32_VMX_TRUE_PROC   0x048e
 +#define MSR_IA32_VMX_TRUE_EXIT   0x048f
 +#define MSR_IA32_VMX_TRUE_ENTRY  0x0490
 +

  /* AMD-V MSRs */

 diff --git a/x86/cstart64.S b/x86/cstart64.S
 index 24df5f8..0fe76da 100644
 --- a/x86/cstart64.S
 +++ b/x86/cstart64.S
 @@ -4,6 +4,10 @@
  .globl boot_idt
  boot_idt = 0

 +.globl idt_descr
 +.globl tss_descr
 +.globl gdt64_desc
 +
  ipi_vector = 0x20

  max_cpus = 64
 diff --git a/x86/unittests.cfg b/x86/unittests.cfg
 index bc9643e..e846739 100644
 --- a/x86/unittests.cfg
 +++ b/x86/unittests.cfg
 @@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
  file = pcid.flat
  extra_params = -cpu qemu64,+pcid
  arch = x86_64
 +
 +[vmx]
 +file = vmx.flat
 +extra_params = -cpu Nehalem,+vmx

 Should this use -cpu host instead? (Or -cpu host,+vmx, I don't
 remember).

 +arch = x86_64
 +
 diff --git a/x86/vmx.c b/x86/vmx.c
 new file mode 100644
 index 000..0435746
 --- /dev/null
 +++ b/x86/vmx.c
 @@ -0,0 +1,568 @@
 +#include libcflat.h
 +#include processor.h
 +#include vm.h
 +#include desc.h
 +#include vmx.h
 +#include msr.h
 +#include smp.h
 +#include io.h
 +
 +
 +int fails = 0, tests = 0;
 +u32 *vmxon_region;
 +struct vmcs *vmcs_root;
 +void *io_bmp1, *io_bmp2;
 +void *msr_bmp;
 +u32 vpid_ctr;
 +char *guest_stack, *host_stack;
 +char *guest_syscall_stack, *host_syscall_stack;
 +u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
 +ulong fix_cr0_set, fix_cr0_clr;
 +ulong fix_cr4_set, fix_cr4_clr;
 +struct regs regs;
 +
 +extern u64 gdt64_desc[];
 +extern u64 idt_descr[];
 +extern u64 tss_descr[];
 +extern void *entry_vmx;
 +extern void *entry_sysenter;
 +extern void *entry_guest;
 +
 +void report(const char *name, int result)
 +{
 + ++tests;
 + if (result)
 + printf(PASS: %s\n, name);
 + else {
 + printf(FAIL: %s\n, name);
 + ++fails;
 + }
 +}
 +
 +inline u64 get_rflags(void)
 +{
 + u64 r;
 + asm volatile(pushf; pop %0\n\t : =q(r) : : cc);
 + return r;
 +}
 +
 +inline void set_rflags(u64 r)
 +{
 + asm volatile(push %0; popf\n\t : : q(r) : cc);
 +}
 +
 +int vmcs_clear(struct vmcs *vmcs)
 +{
 + bool ret;
 + asm volatile (vmclear %1; seta %0 : =q (ret) : m (vmcs) : cc);

 You can use setbe, it's clearer and avoids the ! in the return statement.

 We should later add tests for failure conditions, since failing to
 detect errors could give rise to L2-L1 attack vectors.  When we do so,
 we will have to distinguish CF from ZF.


 + return !ret;
 +}
 +
 +u64 vmcs_read(enum Encoding enc)
 +{
 + u64 val;
 + asm volatile (vmread %1, %0 : =rm (val) : r ((u64)enc) :

Re: [[Qemu-devel] [PATCH]] nVMX: Initialize IA32_FEATURE_CONTROL MSR in reset and migration

2013-07-16 Thread Paolo Bonzini

Il 16/07/2013 13:42, Gleb Natapov ha scritto:
 On Sun, Jul 07, 2013 at 11:13:37PM +0800, Arthur Chunqi Li wrote:
 The recent KVM patch adds IA32_FEATURE_CONTROL support. QEMU needs
 to clear this MSR when reset vCPU and keep the value of it when
 migration. This patch add this feature.

 So what happens if we migrate from qemu that does not have this patch
 to qemu that does? Since msr_ia32_feature_control will not be migrated
 it will not be set on the destination so destination will not be able to
 use nested vmx.

Yes, that's the same as with every subsection.

 Since nested vmx is experimental it may be to early for
 us to care about it though, and nested vmx does not work with migration
 anyway.

Exactly.  The next time you start KVM, it will set the MSR to 5 again.

Paolo

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
  target-i386/cpu.h |2 ++
  target-i386/kvm.c |4 
  target-i386/machine.c |   22 ++
  3 files changed, 28 insertions(+)

 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index 62e3547..a418e17 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -301,6 +301,7 @@
  #define MSR_IA32_APICBASE_BSP   (18)
  #define MSR_IA32_APICBASE_ENABLE(111)
  #define MSR_IA32_APICBASE_BASE  (0xf12)
 +#define MSR_IA32_FEATURE_CONTROL0x003a
  #define MSR_TSC_ADJUST  0x003b
  #define MSR_IA32_TSCDEADLINE0x6e0
  
 @@ -813,6 +814,7 @@ typedef struct CPUX86State {
  
  uint64_t mcg_status;
  uint64_t msr_ia32_misc_enable;
 +uint64_t msr_ia32_feature_control;
  
  /* exception/interrupt handling */
  int error_code;
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 39f4fbb..3cb2161 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -1122,6 +1122,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  if (hyperv_vapic_recommended()) {
  kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
  }
 +kvm_msr_entry_set(msrs[n++], MSR_IA32_FEATURE_CONTROL, 
 env-msr_ia32_feature_control);
  }
  if (env-mcg_cap) {
  int i;
 @@ -1346,6 +1347,7 @@ static int kvm_get_msrs(X86CPU *cpu)
  if (has_msr_misc_enable) {
  msrs[n++].index = MSR_IA32_MISC_ENABLE;
  }
 +msrs[n++].index = MSR_IA32_FEATURE_CONTROL;
  
  if (!env-tsc_valid) {
  msrs[n++].index = MSR_IA32_TSC;
 @@ -1444,6 +1446,8 @@ static int kvm_get_msrs(X86CPU *cpu)
  case MSR_IA32_MISC_ENABLE:
  env-msr_ia32_misc_enable = msrs[i].data;
  break;
 +case MSR_IA32_FEATURE_CONTROL:
 +env-msr_ia32_feature_control = msrs[i].data;
  default:
  if (msrs[i].index = MSR_MC0_CTL 
  msrs[i].index  MSR_MC0_CTL + (env-mcg_cap  0xff) * 4) {
 diff --git a/target-i386/machine.c b/target-i386/machine.c
 index 3659db9..94ca914 100644
 --- a/target-i386/machine.c
 +++ b/target-i386/machine.c
 @@ -399,6 +399,14 @@ static bool misc_enable_needed(void *opaque)
  return env-msr_ia32_misc_enable != MSR_IA32_MISC_ENABLE_DEFAULT;
  }
  
 +static bool feature_control_needed(void *opaque)
 +{
 +X86CPU *cpu = opaque;
 +CPUX86State *env = cpu-env;
 +
 +return env-msr_ia32_feature_control != 0;
 +}
 +
  static const VMStateDescription vmstate_msr_ia32_misc_enable = {
  .name = cpu/msr_ia32_misc_enable,
  .version_id = 1,
 @@ -410,6 +418,17 @@ static const VMStateDescription 
 vmstate_msr_ia32_misc_enable = {
  }
  };
  
 +static const VMStateDescription vmstate_msr_ia32_feature_control = {
 +.name = cpu/msr_ia32_feature_control,
 +.version_id = 1,
 +.minimum_version_id = 1,
 +.minimum_version_id_old = 1,
 +.fields  = (VMStateField []) {
 +VMSTATE_UINT64(env.msr_ia32_feature_control, X86CPU),
 +VMSTATE_END_OF_LIST()
 +}
 +};
 +
  const VMStateDescription vmstate_x86_cpu = {
  .name = cpu,
  .version_id = 12,
 @@ -535,6 +554,9 @@ const VMStateDescription vmstate_x86_cpu = {
  }, {
  .vmsd = vmstate_msr_ia32_misc_enable,
  .needed = misc_enable_needed,
 +}, {
 +.vmsd = vmstate_msr_ia32_feature_control,
 +.needed = feature_control_needed,
  } , {
  /* empty */
  }
 -- 
 1.7.9.5
 
 --
   Gleb.
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [[Qemu-devel] [PATCH]] nVMX: Initialize IA32_FEATURE_CONTROL MSR in reset and migration

2013-07-16 Thread Arthur Chunqi Li

On Tue, Jul 16, 2013 at 7:42 PM, Gleb Natapov g...@redhat.com wrote:
 On Sun, Jul 07, 2013 at 11:13:37PM +0800, Arthur Chunqi Li wrote:
 The recent KVM patch adds IA32_FEATURE_CONTROL support. QEMU needs
 to clear this MSR when reset vCPU and keep the value of it when
 migration. This patch add this feature.

 So what happens if we migrate from qemu that does not have this patch
 to qemu that does? Since msr_ia32_feature_control will not be migrated
 it will not be set on the destination so destination will not be able to
 use nested vmx. Since nested vmx is experimental it may be to early for
 us to care about it though, and nested vmx does not work with migration
 anyway.
In my test, if migration doesn't care about msr_ia32_feature_control,
the value will be set to 0 in the destination VM and this may cause
some logical confusions, but the VMX running on it may not aware of
this (if migration nested vmx is supported in the future) because once
VMX initialized, it will not check this msr any more in normal cases.

This is also a complex problem since we don't know how many states
like this msr need to be migrated related to nested virt. If there're
a lot of states need migrating, it is better to reconstruct the
relevant codes. But now this patch is enough.

Besides, though migration is not supported in nested vmx, we should
keep the machine state consistent during migration. So this patch is
also meaningful.

Arthur

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
  target-i386/cpu.h |2 ++
  target-i386/kvm.c |4 
  target-i386/machine.c |   22 ++
  3 files changed, 28 insertions(+)

 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index 62e3547..a418e17 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -301,6 +301,7 @@
  #define MSR_IA32_APICBASE_BSP   (18)
  #define MSR_IA32_APICBASE_ENABLE(111)
  #define MSR_IA32_APICBASE_BASE  (0xf12)
 +#define MSR_IA32_FEATURE_CONTROL0x003a
  #define MSR_TSC_ADJUST  0x003b
  #define MSR_IA32_TSCDEADLINE0x6e0

 @@ -813,6 +814,7 @@ typedef struct CPUX86State {

  uint64_t mcg_status;
  uint64_t msr_ia32_misc_enable;
 +uint64_t msr_ia32_feature_control;

  /* exception/interrupt handling */
  int error_code;
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 39f4fbb..3cb2161 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -1122,6 +1122,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  if (hyperv_vapic_recommended()) {
  kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
  }
 +kvm_msr_entry_set(msrs[n++], MSR_IA32_FEATURE_CONTROL, 
 env-msr_ia32_feature_control);
  }
  if (env-mcg_cap) {
  int i;
 @@ -1346,6 +1347,7 @@ static int kvm_get_msrs(X86CPU *cpu)
  if (has_msr_misc_enable) {
  msrs[n++].index = MSR_IA32_MISC_ENABLE;
  }
 +msrs[n++].index = MSR_IA32_FEATURE_CONTROL;

  if (!env-tsc_valid) {
  msrs[n++].index = MSR_IA32_TSC;
 @@ -1444,6 +1446,8 @@ static int kvm_get_msrs(X86CPU *cpu)
  case MSR_IA32_MISC_ENABLE:
  env-msr_ia32_misc_enable = msrs[i].data;
  break;
 +case MSR_IA32_FEATURE_CONTROL:
 +env-msr_ia32_feature_control = msrs[i].data;
  default:
  if (msrs[i].index = MSR_MC0_CTL 
  msrs[i].index  MSR_MC0_CTL + (env-mcg_cap  0xff) * 4) {
 diff --git a/target-i386/machine.c b/target-i386/machine.c
 index 3659db9..94ca914 100644
 --- a/target-i386/machine.c
 +++ b/target-i386/machine.c
 @@ -399,6 +399,14 @@ static bool misc_enable_needed(void *opaque)
  return env-msr_ia32_misc_enable != MSR_IA32_MISC_ENABLE_DEFAULT;
  }

 +static bool feature_control_needed(void *opaque)
 +{
 +X86CPU *cpu = opaque;
 +CPUX86State *env = cpu-env;
 +
 +return env-msr_ia32_feature_control != 0;
 +}
 +
  static const VMStateDescription vmstate_msr_ia32_misc_enable = {
  .name = cpu/msr_ia32_misc_enable,
  .version_id = 1,
 @@ -410,6 +418,17 @@ static const VMStateDescription 
 vmstate_msr_ia32_misc_enable = {
  }
  };

 +static const VMStateDescription vmstate_msr_ia32_feature_control = {
 +.name = cpu/msr_ia32_feature_control,
 +.version_id = 1,
 +.minimum_version_id = 1,
 +.minimum_version_id_old = 1,
 +.fields  = (VMStateField []) {
 +VMSTATE_UINT64(env.msr_ia32_feature_control, X86CPU),
 +VMSTATE_END_OF_LIST()
 +}
 +};
 +
  const VMStateDescription vmstate_x86_cpu = {
  .name = cpu,
  .version_id = 12,
 @@ -535,6 +554,9 @@ const VMStateDescription vmstate_x86_cpu = {
  }, {
  .vmsd = vmstate_msr_ia32_misc_enable,
  .needed = misc_enable_needed,
 +}, {
 +.vmsd = vmstate_msr_ia32_feature_control,
 +.needed = feature_control_needed,
  } , {
  /* empty */

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Paolo Bonzini

Il 16/07/2013 13:47, Arthur Chunqi Li ha scritto:
 setjmp and longjmp is not implemented in our test environment, and
 these two functions are highly depend on architecture. Do you think we
 need to write a general code for both 32bit and 64bit, or just write
 specific one just for this test case?

Supporting it in x86-64 is enough for now.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [[Qemu-devel] [PATCH]] nVMX: Initialize IA32_FEATURE_CONTROL MSR in reset and migration

2013-07-16 Thread Gleb Natapov

On Tue, Jul 16, 2013 at 07:56:25PM +0800, Arthur Chunqi Li wrote:
 On Tue, Jul 16, 2013 at 7:42 PM, Gleb Natapov g...@redhat.com wrote:
  On Sun, Jul 07, 2013 at 11:13:37PM +0800, Arthur Chunqi Li wrote:
  The recent KVM patch adds IA32_FEATURE_CONTROL support. QEMU needs
  to clear this MSR when reset vCPU and keep the value of it when
  migration. This patch add this feature.
 
  So what happens if we migrate from qemu that does not have this patch
  to qemu that does? Since msr_ia32_feature_control will not be migrated
  it will not be set on the destination so destination will not be able to
  use nested vmx. Since nested vmx is experimental it may be to early for
  us to care about it though, and nested vmx does not work with migration
  anyway.
 In my test, if migration doesn't care about msr_ia32_feature_control,
 the value will be set to 0 in the destination VM and this may cause
 some logical confusions, but the VMX running on it may not aware of
 this (if migration nested vmx is supported in the future) because once
 VMX initialized, it will not check this msr any more in normal cases.
 
With vmm_exclusive=0 kvm does vmxon/vmxoff while running. But lest not
worry about nested kvm migration for now. There are much harder problems
to overcome before it will work.

 This is also a complex problem since we don't know how many states
 like this msr need to be migrated related to nested virt. If there're
 a lot of states need migrating, it is better to reconstruct the
 relevant codes. But now this patch is enough.
 
 Besides, though migration is not supported in nested vmx, we should
 keep the machine state consistent during migration. So this patch is
 also meaningful.
 
 Arthur
 
  Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
  ---
   target-i386/cpu.h |2 ++
   target-i386/kvm.c |4 
   target-i386/machine.c |   22 ++
   3 files changed, 28 insertions(+)
 
  diff --git a/target-i386/cpu.h b/target-i386/cpu.h
  index 62e3547..a418e17 100644
  --- a/target-i386/cpu.h
  +++ b/target-i386/cpu.h
  @@ -301,6 +301,7 @@
   #define MSR_IA32_APICBASE_BSP   (18)
   #define MSR_IA32_APICBASE_ENABLE(111)
   #define MSR_IA32_APICBASE_BASE  (0xf12)
  +#define MSR_IA32_FEATURE_CONTROL0x003a
   #define MSR_TSC_ADJUST  0x003b
   #define MSR_IA32_TSCDEADLINE0x6e0
 
  @@ -813,6 +814,7 @@ typedef struct CPUX86State {
 
   uint64_t mcg_status;
   uint64_t msr_ia32_misc_enable;
  +uint64_t msr_ia32_feature_control;
 
   /* exception/interrupt handling */
   int error_code;
  diff --git a/target-i386/kvm.c b/target-i386/kvm.c
  index 39f4fbb..3cb2161 100644
  --- a/target-i386/kvm.c
  +++ b/target-i386/kvm.c
  @@ -1122,6 +1122,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
   if (hyperv_vapic_recommended()) {
   kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
   }
  +kvm_msr_entry_set(msrs[n++], MSR_IA32_FEATURE_CONTROL, 
  env-msr_ia32_feature_control);
   }
   if (env-mcg_cap) {
   int i;
  @@ -1346,6 +1347,7 @@ static int kvm_get_msrs(X86CPU *cpu)
   if (has_msr_misc_enable) {
   msrs[n++].index = MSR_IA32_MISC_ENABLE;
   }
  +msrs[n++].index = MSR_IA32_FEATURE_CONTROL;
 
   if (!env-tsc_valid) {
   msrs[n++].index = MSR_IA32_TSC;
  @@ -1444,6 +1446,8 @@ static int kvm_get_msrs(X86CPU *cpu)
   case MSR_IA32_MISC_ENABLE:
   env-msr_ia32_misc_enable = msrs[i].data;
   break;
  +case MSR_IA32_FEATURE_CONTROL:
  +env-msr_ia32_feature_control = msrs[i].data;
   default:
   if (msrs[i].index = MSR_MC0_CTL 
   msrs[i].index  MSR_MC0_CTL + (env-mcg_cap  0xff) * 4) {
  diff --git a/target-i386/machine.c b/target-i386/machine.c
  index 3659db9..94ca914 100644
  --- a/target-i386/machine.c
  +++ b/target-i386/machine.c
  @@ -399,6 +399,14 @@ static bool misc_enable_needed(void *opaque)
   return env-msr_ia32_misc_enable != MSR_IA32_MISC_ENABLE_DEFAULT;
   }
 
  +static bool feature_control_needed(void *opaque)
  +{
  +X86CPU *cpu = opaque;
  +CPUX86State *env = cpu-env;
  +
  +return env-msr_ia32_feature_control != 0;
  +}
  +
   static const VMStateDescription vmstate_msr_ia32_misc_enable = {
   .name = cpu/msr_ia32_misc_enable,
   .version_id = 1,
  @@ -410,6 +418,17 @@ static const VMStateDescription 
  vmstate_msr_ia32_misc_enable = {
   }
   };
 
  +static const VMStateDescription vmstate_msr_ia32_feature_control = {
  +.name = cpu/msr_ia32_feature_control,
  +.version_id = 1,
  +.minimum_version_id = 1,
  +.minimum_version_id_old = 1,
  +.fields  = (VMStateField []) {
  +VMSTATE_UINT64(env.msr_ia32_feature_control, X86CPU),
  +VMSTATE_END_OF_LIST()
  +}
  +};
  +
   const VMStateDescription vmstate_x86_cpu = {
   .name = cpu,

RE: Disabling mergeable rx buffers for the guest

2013-07-16 Thread Naor Shlomo

Paolo,

Thanks for your help in this matter. Issue is resolved.

One more thing I changed in order to get the machine up and running is this:
domain type='qemu' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'
has been changed to:
domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'

I verified by running:
cat /sys/bus/virtio/devices/virtio0/features

That the 15th bit is off hence mergeable buffers are disabled.

Thanks again,
Naor


-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Naor Shlomo
Sent: Tuesday, July 16, 2013 2:06 PM
To: Paolo Bonzini
Cc: kvm@vger.kernel.org
Subject: RE: Disabling mergeable rx buffers for the guest

Paolo, 

Sorry for all the trouble.
We got a progress. Your last qemu:commandline really worked and I was able to 
try and start the machine.
The problem is that I reached up to the tun driver load section and then the 
machine shut itself down.

The last lines of the console showed the following:

[1.568226] Fixed MDIO Bus: probed
[1.568833] tun: Universal TUN/TAP device driver, 1.6
[1.569108] tun: (C) 1999-2004 Max Krasnyansky m...@qualcomm.com

And it got me back to the virsh command line.

Here's an output of the log file:

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin 
QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M pc-1.0 -no-kvm -m 2069 -smp 
1,sockets=1,cores=1,threads=1 -name NaorDev -uuid 
60b5a0ab-8932-47c1-8674-760c7e1f4743 -nodefconfig -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/NaorDev.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-drive 
file=/var/lib/libvirt/images/NaorDev.img,if=none,id=drive-virtio-disk0,format=raw,cache=writeback
 -device 
virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive 
file=/var/lib/libvirt/images/NaorAddonHdd.qcow,if=none,id=drive-virtio-disk1,format=qcow2,cache=none
 -device 
virtio-blk-pci,bus=pci.0,addr=0x8,drive=drive-virtio-disk1,id=virtio-disk1 
-netdev tap,fd=18,id=hostnet0,vhost=on,vhostfd=19 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c6:b7:a3,bus=pci.0,addr=0x3 
-netdev tap,fd=20,id=hostnet1,vhost=on,vhostfd=21 -device 
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:9e:44:90,bus=pci.0,addr=0x4 
-netdev tap,fd=22,id=hostnet2,vhost=on,vhostfd=23 -device 
virtio-net-pci,netdev=hostnet2,id=net2,mac=52:54:00:e7:f9:bf,bus=pci.0,addr=0x7 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-usb -vnc 127.0.0.1:0 -vga cirrus -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -global 
virtio-net-pci.mrg_rxbuf=off Domain id=9 is tainted: custom-argv char device 
redirected to /dev/pts/1
2013-07-16 12:48:26.440+: shutting down

and here's the output of lsmod:

vhost_net  32360  0 
macvtap18529  1 vhost_net
macvlan19003  1 macvtap
ip6table_filter12816  0 
ip6_tables 27686  1 ip6table_filter
ebtable_nat12808  0 
ebtables   31024  1 ebtable_nat
ipt_MASQUERADE 12760  3 
iptable_nat13230  1 
nf_nat 25646  2 ipt_MASQUERADE,iptable_nat
nf_conntrack_ipv4  14531  4 iptable_nat,nf_nat
nf_defrag_ipv4 12730  1 nf_conntrack_ipv4
xt_state   12579  1 
nf_conntrack   83300  5 
ipt_MASQUERADE,iptable_nat,nf_nat,nf_conntrack_ipv4,xt_state
ipt_REJECT 12577  2 
xt_CHECKSUM12550  1 
iptable_mangle 12735  1 
xt_tcpudp  12604  5 
iptable_filter 12811  1 
ip_tables  27474  3 iptable_nat,iptable_mangle,iptable_filter
x_tables   29892  12 
ip6table_filter,ip6_tables,ebtables,ipt_MASQUERADE,iptable_nat,xt_state,ipt_REJECT,xt_CHECKSUM,iptable_mangle,xt_tcpudp,iptable_filter,ip_tables
dm_crypt   23126  0 
coretemp   13642  0 
kvm_intel 137888  0 
kvm   422160  1 kvm_intel
ext2   73799  1 
gpio_ich   13384  0 
dm_multipath   23306  0 
scsi_dh14589  1 dm_multipath
lp 17800  0 
microcode  23030  0 
parport46563  1 lp
bridge 91498  0 
sb_edac18104  0 
stp12977  1 bridge
llc14598  2 bridge,stp
joydev 17694  0 
edac_core  53053  3 sb_edac
shpchp 37278  0 
lpc_ich17145  0 
mei41410  0 
acpi_power_meter   18124  0 
mac_hid13254  0 
btrfs 781017  0 
zlib_deflate   27140  1 btrfs
libcrc32c  12645  1 btrfs
vesafb 13846  1 
hid_generic12541  0 
usbhid 47259  0 
hid   100815  2 hid_generic,usbhid
ses17418  0 
enclosure

vga passthrough // vfio // qemu bridge

2013-07-16 Thread Martin Wolf

Early 2012 i tested the old vga passthrough capabilities of KVM and was
partly successful.
now with the new vfio driver i tried again according to alex's hints and
this guide:
https://bbs.archlinux.org/viewtopic.php?id=162768

since im primarily using ubuntu i used the daily build of saucy.
it ships qemu 1.5 and seabios 1.7.3 so the requirements are met.

according to the guide i prepared the vga card (amd 7870)

[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-2-generic
root=UUID=26fed560-a972-499d-ab14-7fec6439fd3d ro intel_iommu=on
pci-stub.ids=1002:6818,1002:aab0 quiet splash vt.handoff=7
[0.00] Kernel command line:
BOOT_IMAGE=/boot/vmlinuz-3.10.0-2-generic
root=UUID=26fed560-a972-499d-ab14-7fec6439fd3d ro intel_iommu=on
pci-stub.ids=1002:6818,1002:aab0 quiet splash vt.handoff=7
[0.569977] pci-stub: add 1002:6818 sub=:
cls=/
[0.569987] pci-stub :01:00.0: claimed by stub
[0.569994] pci-stub: add 1002:AAB0 sub=:
cls=/
[0.569998] pci-stub :01:00.1: claimed by stub

then did this just to be sure:
echo options vfio_iommu_type1 allow_unsafe_interrupts=1 
/etc/modprobe.d/vfio_iommu_type1.conf
(or was that wrong?)
im using a z87 haswell mainboard

after that i binded the two devices to vfio-pci with:
vfio-bind :01:00.0 :01:00.1 (the script in the guide)

afterwards i was able to start the kvm with
qemu-system-x86_64 -enable-kvm -M q35 -m 8192 -cpu host \
-smp 8,sockets=1,cores=4,threads=8 \
-bios /usr/share/qemu/bios.bin -vga none \
-device
ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 \
-device
vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on \
-device vfio-pci,host=01:00.1,bus=root.1,addr=00.1 \
-device ahci,bus=pcie.0,id=ahci \
-drive file=/home/martin/windows.img,if=none,id=disk,format=raw -device
virtio-blk-pci,drive=disk \
-drive file=/home/martin/X17-59885.iso,id=isocd -device
ide-cd,bus=ahci.0,drive=isocd \
-net nic,model=virtio \
-net user \
-usb -usbdevice host:1532:000c \
-drive file=/home/martin/Downloads/virtio-win-0.1-59.iso,id=isocd1
-device ide-cd,bus=ahci.1,drive=isocd1

to my surprise i instantly got the windows installation running
installed the virtio drivers for nic and storage
and had 15 mins later a working win7 installation.
now i installed the amd driver (13.4) and rebooted.
i got a bluescreen. similar to my old expiriences so i thought do a
clean host reboot and try again.
but still the same. so i tried to load the bios.rom for the card (found
it on techpowerup) again no luck.
maybe someone knows a hint?

---

about qemu bridge
i tried to set up a bridge with the config but qemu always told me that
qemu-bridge-helper is not present.
all i found out that it propably got removed from the package because of
the lack of control over the tap
devices.
now my question how can i still bridge the vm into my network without
that helper?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: vga passthrough // vfio // qemu bridge

2013-07-16 Thread Alex Williamson

On Tue, 2013-07-16 at 14:35 +0200, Martin Wolf wrote:
 Early 2012 i tested the old vga passthrough capabilities of KVM and was
 partly successful.
 now with the new vfio driver i tried again according to alex's hints and
 this guide:
 https://bbs.archlinux.org/viewtopic.php?id=162768
 
 since im primarily using ubuntu i used the daily build of saucy.
 it ships qemu 1.5 and seabios 1.7.3 so the requirements are met.
 
 according to the guide i prepared the vga card (amd 7870)
 
 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-2-generic
 root=UUID=26fed560-a972-499d-ab14-7fec6439fd3d ro intel_iommu=on
 pci-stub.ids=1002:6818,1002:aab0 quiet splash vt.handoff=7
 [0.00] Kernel command line:
 BOOT_IMAGE=/boot/vmlinuz-3.10.0-2-generic
 root=UUID=26fed560-a972-499d-ab14-7fec6439fd3d ro intel_iommu=on
 pci-stub.ids=1002:6818,1002:aab0 quiet splash vt.handoff=7
 [0.569977] pci-stub: add 1002:6818 sub=:
 cls=/
 [0.569987] pci-stub :01:00.0: claimed by stub
 [0.569994] pci-stub: add 1002:AAB0 sub=:
 cls=/
 [0.569998] pci-stub :01:00.1: claimed by stub
 
 then did this just to be sure:
 echo options vfio_iommu_type1 allow_unsafe_interrupts=1 
 /etc/modprobe.d/vfio_iommu_type1.conf
 (or was that wrong?)
 im using a z87 haswell mainboard

Hopefully not needed, only use this option if you need to.  vfio will
print an error to dmesg and qemu will fail to start if you need it.

 after that i binded the two devices to vfio-pci with:
 vfio-bind :01:00.0 :01:00.1 (the script in the guide)
 
 afterwards i was able to start the kvm with
 qemu-system-x86_64 -enable-kvm -M q35 -m 8192 -cpu host \
 -smp 8,sockets=1,cores=4,threads=8 \
 -bios /usr/share/qemu/bios.bin -vga none \
 -device
 ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 \
 -device
 vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on \
 -device vfio-pci,host=01:00.1,bus=root.1,addr=00.1 \
 -device ahci,bus=pcie.0,id=ahci \
 -drive file=/home/martin/windows.img,if=none,id=disk,format=raw -device
 virtio-blk-pci,drive=disk \
 -drive file=/home/martin/X17-59885.iso,id=isocd -device
 ide-cd,bus=ahci.0,drive=isocd \
 -net nic,model=virtio \
 -net user \
 -usb -usbdevice host:1532:000c \
 -drive file=/home/martin/Downloads/virtio-win-0.1-59.iso,id=isocd1
 -device ide-cd,bus=ahci.1,drive=isocd1
 
 to my surprise i instantly got the windows installation running
 installed the virtio drivers for nic and storage
 and had 15 mins later a working win7 installation.
 now i installed the amd driver (13.4) and rebooted.
 i got a bluescreen. similar to my old expiriences so i thought do a
 clean host reboot and try again.
 but still the same. so i tried to load the bios.rom for the card (found
 it on techpowerup) again no luck.
 maybe someone knows a hint?

I think most of the folks using the guide you reference are using my
vfio-vga-reset branches of qemu  kernel (or patches extracted from
them).  These add an important step for reproducibility, by being able
to reset the bus for the graphics card, giving us a way to reset the
device.  The other thing in the qemu branch are improved quirks.  I've
just posted these to qemu-devel and plan to get them pulled for 1.6.
Note that for ATI/AMD cards, a critical quirks is intercepting the byte
at I/O port address 0x3c3.  Without this, the VGA BIOS can't bootstrap
itself.  The vfio-vga-reset branch has a conditional replacement of
this, which doesn't seem to work for everyone.  I believe the version I
posted to qemu-devel yesterday is a better implementation of that quirk.

 ---
 
 about qemu bridge
 i tried to set up a bridge with the config but qemu always told me that
 qemu-bridge-helper is not present.
 all i found out that it propably got removed from the package because of
 the lack of control over the tap
 devices.
 now my question how can i still bridge the vm into my network without
 that helper?

I don't know what qemu-bridge-helper is/was, but you're probably better
off asking bridge questions in a separate thread instead of buried here.
Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 2/2] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Arthur Chunqi Li

This is the first version for VMX nested environment test case. It
contains the basic VMX instructions test cases, including VMXON/
VMXOFF/VMXPTRLD/VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch
also tests the basic execution routine in VMX nested environment and
let the VM print Hello World to inform its successfully run.

New files added:
x86/vmx.h : contains all VMX related macro declerations
x86/vmx.c : main file for VMX nested test case

Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 config-x86-common.mak |2 +
 config-x86_64.mak |1 +
 lib/x86/msr.h |5 +
 x86/cstart64.S|4 +
 x86/unittests.cfg |6 +
 x86/vmx.c |  565 +
 x86/vmx.h |  406 +++
 7 files changed, 989 insertions(+)
 create mode 100644 x86/vmx.c
 create mode 100644 x86/vmx.h

diff --git a/config-x86-common.mak b/config-x86-common.mak
index 455032b..34a41e1 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
 
 $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
 
+$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
+
 arch_clean:
$(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
$(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
diff --git a/config-x86_64.mak b/config-x86_64.mak
index 91ffcce..5d9b22a 100644
--- a/config-x86_64.mak
+++ b/config-x86_64.mak
@@ -11,5 +11,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
  $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
  $(TEST_DIR)/pcid.flat
 tests += $(TEST_DIR)/svm.flat
+tests += $(TEST_DIR)/vmx.flat
 
 include config-x86-common.mak
diff --git a/lib/x86/msr.h b/lib/x86/msr.h
index 509a421..281255a 100644
--- a/lib/x86/msr.h
+++ b/lib/x86/msr.h
@@ -396,6 +396,11 @@
 #define MSR_IA32_VMX_VMCS_ENUM  0x048a
 #define MSR_IA32_VMX_PROCBASED_CTLS20x048b
 #define MSR_IA32_VMX_EPT_VPID_CAP   0x048c
+#define MSR_IA32_VMX_TRUE_PIN  0x048d
+#define MSR_IA32_VMX_TRUE_PROC 0x048e
+#define MSR_IA32_VMX_TRUE_EXIT 0x048f
+#define MSR_IA32_VMX_TRUE_ENTRY0x0490
+
 
 /* AMD-V MSRs */
 
diff --git a/x86/cstart64.S b/x86/cstart64.S
index 24df5f8..0fe76da 100644
--- a/x86/cstart64.S
+++ b/x86/cstart64.S
@@ -4,6 +4,10 @@
 .globl boot_idt
 boot_idt = 0
 
+.globl idt_descr
+.globl tss_descr
+.globl gdt64_desc
+
 ipi_vector = 0x20
 
 max_cpus = 64
diff --git a/x86/unittests.cfg b/x86/unittests.cfg
index bc9643e..85c36aa 100644
--- a/x86/unittests.cfg
+++ b/x86/unittests.cfg
@@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
 file = pcid.flat
 extra_params = -cpu qemu64,+pcid
 arch = x86_64
+
+[vmx]
+file = vmx.flat
+extra_params = -cpu host,+vmx
+arch = x86_64
+
diff --git a/x86/vmx.c b/x86/vmx.c
new file mode 100644
index 000..9082fc7
--- /dev/null
+++ b/x86/vmx.c
@@ -0,0 +1,565 @@
+#include libcflat.h
+#include processor.h
+#include vm.h
+#include desc.h
+#include vmx.h
+#include msr.h
+#include smp.h
+#include io.h
+#include setjmp64.h
+
+int fails = 0, tests = 0;
+u32 *vmxon_region;
+struct vmcs *vmcs_root;
+void *io_bmp1, *io_bmp2;
+void *msr_bmp;
+u32 vpid_ctr;
+char *guest_stack, *host_stack;
+char *guest_syscall_stack, *host_syscall_stack;
+u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
+ulong fix_cr0_set, fix_cr0_clr;
+ulong fix_cr4_set, fix_cr4_clr;
+struct regs regs;
+struct jmp_buf env;
+
+extern u64 gdt64_desc[];
+extern u64 idt_descr[];
+extern u64 tss_descr[];
+extern void *entry_vmx;
+extern void *entry_sysenter;
+extern void *entry_guest;
+
+void report(const char *name, int result)
+{
+   ++tests;
+   if (result)
+   printf(PASS: %s\n, name);
+   else {
+   printf(FAIL: %s\n, name);
+   ++fails;
+   }
+}
+
+inline u64 get_rflags(void)
+{
+   u64 r;
+   asm volatile(pushf; pop %0\n\t : =q(r) : : cc);
+   return r;
+}
+
+inline void set_rflags(u64 r)
+{
+   asm volatile(push %0; popf\n\t : : q(r) : cc);
+}
+
+int vmcs_clear(struct vmcs *vmcs)
+{
+   bool ret;
+   asm volatile (vmclear %1; setbe %0 : =q (ret) : m (vmcs) : cc);
+   return ret;
+}
+
+u64 vmcs_read(enum Encoding enc)
+{
+   u64 val;
+   asm volatile (vmread %1, %0 : =rm (val) : r ((u64)enc) : cc);
+   return val;
+}
+
+int vmcs_write(enum Encoding enc, u64 val)
+{
+   bool ret;
+   asm volatile (vmwrite %1, %2; setbe %0
+   : =q(ret) : rm (val), r ((u64)enc) : cc);
+   return ret;
+}
+
+int make_vmcs_current(struct vmcs *vmcs)
+{
+   bool ret;
+
+   asm volatile (vmptrld %1; setbe %0 : =q (ret) : m (vmcs) : cc);
+   return ret;
+}
+
+int save_vmcs(struct vmcs **vmcs)
+{
+   bool ret;
+
+   asm volatile (vmptrst %1; setbe %0 : =q (ret) : m (*vmcs) : cc);
+   return ret;
+}
+
+/* entry_vmx */

[PATCH v2 0/2] Basic nested VMX test suite

2013-07-16 Thread Arthur Chunqi Li

These two patches are focued on providing a basic version of VMX nested
test suite. This commit provides the architecture of nested VMX similiar
to x86/svm.c.

setjmp/longjmp is designed to avoid exiting VMX in call-back form.

Arthur Chunqi Li (2):
  kvm-unit-tests : Add setjmp/longjmp to libcflat
  kvm-unit-tests : The first version of VMX nested test case

 config-x86-common.mak |2 +
 config-x86_64.mak |3 +
 lib/x86/msr.h |5 +
 lib/x86/setjmp64.S|   27 +++
 lib/x86/setjmp64.h|   15 ++
 x86/cstart64.S|4 +
 x86/unittests.cfg |6 +
 x86/vmx.c |  565 +
 x86/vmx.h |  406 +++
 9 files changed, 1033 insertions(+)
 create mode 100644 lib/x86/setjmp64.S
 create mode 100644 lib/x86/setjmp64.h
 create mode 100644 x86/vmx.c
 create mode 100644 x86/vmx.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 1/2] kvm-unit-tests : Add setjmp/longjmp to libcflat

2013-07-16 Thread Arthur Chunqi Li

Add setjmp and longjmp functions to libcflat. Now these two functions
are only supported in X86_64 arch.

New files added:
lib/x86/setjmp64.S
lib/x86/setjmp64.c

Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 config-x86_64.mak  |2 ++
 lib/x86/setjmp64.S |   27 +++
 lib/x86/setjmp64.h |   15 +++
 3 files changed, 44 insertions(+)
 create mode 100644 lib/x86/setjmp64.S
 create mode 100644 lib/x86/setjmp64.h

diff --git a/config-x86_64.mak b/config-x86_64.mak
index 4e525f5..91ffcce 100644
--- a/config-x86_64.mak
+++ b/config-x86_64.mak
@@ -4,6 +4,8 @@ bits = 64
 ldarch = elf64-x86-64
 CFLAGS += -D__x86_64__
 
+cflatobjs += lib/x86/setjmp64.o
+
 tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
  $(TEST_DIR)/emulator.flat $(TEST_DIR)/idt_test.flat \
  $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
diff --git a/lib/x86/setjmp64.S b/lib/x86/setjmp64.S
new file mode 100644
index 000..c8ae790
--- /dev/null
+++ b/lib/x86/setjmp64.S
@@ -0,0 +1,27 @@
+.globl setjmp
+setjmp:
+   mov (%rsp), %rsi
+   mov %rsi, (%rdi)
+   mov %rsp, 0x8(%rdi)
+   mov %rbp, 0x10(%rdi)
+   mov %rbx, 0x18(%rdi)
+   mov %r12, 0x20(%rdi)
+   mov %r13, 0x28(%rdi)
+   mov %r14, 0x30(%rdi)
+   mov %r15, 0x38(%rdi)
+   xor %eax, %eax
+   ret
+
+.globl longjmp
+longjmp:
+   mov %esi, %eax
+   mov 0x38(%rdi), %r15
+   mov 0x30(%rdi), %r14
+   mov 0x28(%rdi), %r13
+   mov 0x20(%rdi), %r12
+   mov 0x18(%rdi), %rbx
+   mov 0x10(%rdi), %rbp
+   mov 0x8(%rdi), %rsp
+   mov (%rdi), %rsi
+   mov %rsi, (%rsp)
+   ret
diff --git a/lib/x86/setjmp64.h b/lib/x86/setjmp64.h
new file mode 100644
index 000..98ff16a
--- /dev/null
+++ b/lib/x86/setjmp64.h
@@ -0,0 +1,15 @@
+#ifndef LIBCFLAT_SETJMP64_H
+#define LIBCFLAT_SETJMP64_H
+
+#include libcflat.h
+
+struct jmp_buf {
+   u64 calling_rip;
+   u64 rsp, rbp, rbx;
+   u64 r12, r13, r14, r15;
+};
+
+void longjmp(struct jmp_buf *env, int val);
+int setjmp(struct jmp_buf *env);
+
+#endif
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 1/2] kvm-unit-tests : Add setjmp/longjmp to libcflat

2013-07-16 Thread Paolo Bonzini

Il 16/07/2013 17:11, Arthur Chunqi Li ha scritto:
 Add setjmp and longjmp functions to libcflat. Now these two functions
 are only supported in X86_64 arch.
 
 New files added:
   lib/x86/setjmp64.S
   lib/x86/setjmp64.c
 
 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
  config-x86_64.mak  |2 ++
  lib/x86/setjmp64.S |   27 +++
  lib/x86/setjmp64.h |   15 +++
  3 files changed, 44 insertions(+)
  create mode 100644 lib/x86/setjmp64.S
  create mode 100644 lib/x86/setjmp64.h
 
 diff --git a/config-x86_64.mak b/config-x86_64.mak
 index 4e525f5..91ffcce 100644
 --- a/config-x86_64.mak
 +++ b/config-x86_64.mak
 @@ -4,6 +4,8 @@ bits = 64
  ldarch = elf64-x86-64
  CFLAGS += -D__x86_64__
  
 +cflatobjs += lib/x86/setjmp64.o
 +
  tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
 $(TEST_DIR)/emulator.flat $(TEST_DIR)/idt_test.flat \
 $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
 diff --git a/lib/x86/setjmp64.S b/lib/x86/setjmp64.S
 new file mode 100644
 index 000..c8ae790
 --- /dev/null
 +++ b/lib/x86/setjmp64.S
 @@ -0,0 +1,27 @@
 +.globl setjmp
 +setjmp:
 + mov (%rsp), %rsi
 + mov %rsi, (%rdi)
 + mov %rsp, 0x8(%rdi)
 + mov %rbp, 0x10(%rdi)
 + mov %rbx, 0x18(%rdi)
 + mov %r12, 0x20(%rdi)
 + mov %r13, 0x28(%rdi)
 + mov %r14, 0x30(%rdi)
 + mov %r15, 0x38(%rdi)
 + xor %eax, %eax
 + ret
 +
 +.globl longjmp
 +longjmp:
 + mov %esi, %eax
 + mov 0x38(%rdi), %r15
 + mov 0x30(%rdi), %r14
 + mov 0x28(%rdi), %r13
 + mov 0x20(%rdi), %r12
 + mov 0x18(%rdi), %rbx
 + mov 0x10(%rdi), %rbp
 + mov 0x8(%rdi), %rsp
 + mov (%rdi), %rsi
 + mov %rsi, (%rsp)
 + ret

Nice!

 diff --git a/lib/x86/setjmp64.h b/lib/x86/setjmp64.h
 new file mode 100644
 index 000..98ff16a
 --- /dev/null
 +++ b/lib/x86/setjmp64.h
 @@ -0,0 +1,15 @@
 +#ifndef LIBCFLAT_SETJMP64_H
 +#define LIBCFLAT_SETJMP64_H
 +
 +#include libcflat.h
 +
 +struct jmp_buf {
 + u64 calling_rip;
 + u64 rsp, rbp, rbx;
 + u64 r12, r13, r14, r15;
 +};

Please make it a typedef instead.  In fact, jmp_buf is typically an
array so that longjmp(env, 1) works without the ampersand.  A
struct-like declaration can go in setjmp64.S as a comment.

With this change, you can put it in lib/setjmp.h so that implementation
for other arches can be added later.

Otherwise looks good, thanks!

Paolo

 +void longjmp(struct jmp_buf *env, int val);
 +int setjmp(struct jmp_buf *env);
 +
 +#endif
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: RFC: vfio interface for platform devices (v2)

2013-07-16 Thread Yoder Stuart-B08248

 -Original Message-
 From: Mario Smarduch [mailto:mario.smard...@huawei.com]
 Sent: Thursday, July 04, 2013 9:45 AM
 To: Yoder Stuart-B08248
 Cc: Alex Williamson; Alexander Graf; Wood Scott-B07421; kvm@vger.kernel.org 
 list; Bhushan Bharat-R65777;
 kvm-...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; Sethi 
 Varun-B16395;
 kvm...@lists.cs.columbia.edu
 Subject: Re: RFC: vfio interface for platform devices (v2)

 I'm having trouble understanding how this works where
 the Guest Device Model != Host. How do you inform the guest
 where the device is mapped in its physical address space,
 and handle GPA faults?

The vfio mechanisms just expose hardware to user space
and the user space app may or may not QEMU.  So there
may be no 'guest' at all.

The intent of this RFC is to provide enough info to user space so
an application can use the device, or in the case of QEMU expose
the device to a VM.  Platform devices are typically exposed via
the device tree and that is how I envision them being presented
to a guest.

Are there real cases you see where guest device model != host?
I don't envision ever presenting a platform device as a PCI device
or vise versa.

Stuart

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Arthur Chunqi Li

On Tue, Jul 16, 2013 at 11:20 PM, Gleb Natapov g...@redhat.com wrote:
 On Tue, Jul 16, 2013 at 12:28:05PM +0200, Paolo Bonzini wrote:
  +void vmx_exit(void)
  +{
  +   test_vmxoff();
  +   printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
  +   exit(fails ? -1 : 0);
  +}

 Can you try to jump back to main, and do test_vmxoff there?  This will
 avoid having to write our tests in callback style, which is a pain.
 Basically something similar to setjmp/longjmp.  In main:

   if (setjmp(jmpbuf) == 0) {
   vmx_run();
   /* Should not reach here */
   report(test vmlaunch, 0);
   }
   test_vmxoff();

 exit:
   printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
   return fails ? 1 : 0;

 In vmx_handler:

   case VMX_HLT:
   printf(\nVM exit.\n);
   longjmp(jmpbuf, 1);

 Why not just make vmexit occur after vmlaunch/vmresume like KVM does. It
 will make code much more straightforward and easer to follow.
The concept easier to follow may have different meanings in
different view. This achievement puts all the test cases in main
function instead of scattering everywhere, which is another view to
easy to follow. As this is just a test case, I prefer this one.

Besides, this way we can start another VM following the previous one
simply in main function. This is flexible if we want to test re-enter
to VMX mode or so.

Arthur

 --
 Gleb.



--
Arthur Chunqi Li
Department of Computer Science
School of EECS
Peking University
Beijing, China
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Gleb Natapov

On Tue, Jul 16, 2013 at 12:28:05PM +0200, Paolo Bonzini wrote:
  +void vmx_exit(void)
  +{
  +   test_vmxoff();
  +   printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
  +   exit(fails ? -1 : 0);
  +}
 
 Can you try to jump back to main, and do test_vmxoff there?  This will
 avoid having to write our tests in callback style, which is a pain.
 Basically something similar to setjmp/longjmp.  In main:
 
   if (setjmp(jmpbuf) == 0) {
   vmx_run();
   /* Should not reach here */
   report(test vmlaunch, 0);
   }
   test_vmxoff();
 
 exit:
   printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
   return fails ? 1 : 0;
 
 In vmx_handler:
 
   case VMX_HLT:
   printf(\nVM exit.\n);
   longjmp(jmpbuf, 1);
 
Why not just make vmexit occur after vmlaunch/vmresume like KVM does. It
will make code much more straightforward and easer to follow.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-16 Thread Peter Zijlstra

On Tue, Jul 16, 2013 at 09:02:15AM +0300, Gleb Natapov wrote:
 BTW can NMI handler take spinlocks? 

No -- that is, yes you can using trylock, but you still shouldn't.

 If it can what happens if NMI is
 delivered in a section protected by local_irq_save()/local_irq_restore()?

You deadlock.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 0/2] Basic nested VMX test suite

2013-07-16 Thread Arthur Chunqi Li

These two patches are focued on providing a basic version of VMX nested
test suite. This commit provides the architecture of nested VMX similiar
to x86/svm.c.

setjmp/longjmp is designed to avoid exiting VMX in call-back form.

Arthur Chunqi Li (2):
  kvm-unit-tests : Add setjmp/longjmp to libcflat
  kvm-unit-tests : The first version of VMX nested test case

 config-x86-common.mak |2 +
 config-x86_64.mak |3 +
 lib/setjmp.h  |   11 +
 lib/x86/msr.h |5 +
 lib/x86/setjmp64.S|   27 +++
 x86/cstart64.S|4 +
 x86/unittests.cfg |6 +
 x86/vmx.c |  561 +
 x86/vmx.h |  406 +++
 9 files changed, 1025 insertions(+)
 create mode 100644 lib/setjmp.h
 create mode 100644 lib/x86/setjmp64.S
 create mode 100644 x86/vmx.c
 create mode 100644 x86/vmx.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 1/2] kvm-unit-tests : Add setjmp/longjmp to libcflat

2013-07-16 Thread Arthur Chunqi Li

Add setjmp and longjmp functions to libcflat. Now these two functions
are only supported in X86_64 arch.

New files added:
lib/x86/setjmp64.S
lib/x86/setjmp64.c

Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 config-x86_64.mak  |2 ++
 lib/setjmp.h   |   11 +++
 lib/x86/setjmp64.S |   27 +++
 3 files changed, 40 insertions(+)
 create mode 100644 lib/setjmp.h
 create mode 100644 lib/x86/setjmp64.S

diff --git a/config-x86_64.mak b/config-x86_64.mak
index 4e525f5..91ffcce 100644
--- a/config-x86_64.mak
+++ b/config-x86_64.mak
@@ -4,6 +4,8 @@ bits = 64
 ldarch = elf64-x86-64
 CFLAGS += -D__x86_64__
 
+cflatobjs += lib/x86/setjmp64.o
+
 tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
  $(TEST_DIR)/emulator.flat $(TEST_DIR)/idt_test.flat \
  $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
diff --git a/lib/setjmp.h b/lib/setjmp.h
new file mode 100644
index 000..eca70d9
--- /dev/null
+++ b/lib/setjmp.h
@@ -0,0 +1,11 @@
+#ifndef LIBCFLAT_SETJMP64_H
+#define LIBCFLAT_SETJMP64_H
+
+#include libcflat.h
+
+typedef char jmp_buf[64];
+
+void longjmp(jmp_buf env, int val);
+int setjmp(jmp_buf env);
+
+#endif
diff --git a/lib/x86/setjmp64.S b/lib/x86/setjmp64.S
new file mode 100644
index 000..c8ae790
--- /dev/null
+++ b/lib/x86/setjmp64.S
@@ -0,0 +1,27 @@
+.globl setjmp
+setjmp:
+   mov (%rsp), %rsi
+   mov %rsi, (%rdi)
+   mov %rsp, 0x8(%rdi)
+   mov %rbp, 0x10(%rdi)
+   mov %rbx, 0x18(%rdi)
+   mov %r12, 0x20(%rdi)
+   mov %r13, 0x28(%rdi)
+   mov %r14, 0x30(%rdi)
+   mov %r15, 0x38(%rdi)
+   xor %eax, %eax
+   ret
+
+.globl longjmp
+longjmp:
+   mov %esi, %eax
+   mov 0x38(%rdi), %r15
+   mov 0x30(%rdi), %r14
+   mov 0x28(%rdi), %r13
+   mov 0x20(%rdi), %r12
+   mov 0x18(%rdi), %rbx
+   mov 0x10(%rdi), %rbp
+   mov 0x8(%rdi), %rsp
+   mov (%rdi), %rsi
+   mov %rsi, (%rsp)
+   ret
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 2/2] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Arthur Chunqi Li

This is the first version for VMX nested environment test case. It
contains the basic VMX instructions test cases, including VMXON/
VMXOFF/VMXPTRLD/VMXPTRST/VMCLEAR/VMLAUNCH/VMRESUME/VMCALL. This patch
also tests the basic execution routine in VMX nested environment and
let the VM print Hello World to inform its successfully run.

New files added:
x86/vmx.h : contains all VMX related macro declerations
x86/vmx.c : main file for VMX nested test case

Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 config-x86-common.mak |2 +
 config-x86_64.mak |1 +
 lib/x86/msr.h |5 +
 x86/cstart64.S|4 +
 x86/unittests.cfg |6 +
 x86/vmx.c |  561 +
 x86/vmx.h |  406 +++
 7 files changed, 985 insertions(+)
 create mode 100644 x86/vmx.c
 create mode 100644 x86/vmx.h

diff --git a/config-x86-common.mak b/config-x86-common.mak
index 455032b..34a41e1 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -101,6 +101,8 @@ $(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
 
 $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
 
+$(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o
+
 arch_clean:
$(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
$(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
diff --git a/config-x86_64.mak b/config-x86_64.mak
index 91ffcce..5d9b22a 100644
--- a/config-x86_64.mak
+++ b/config-x86_64.mak
@@ -11,5 +11,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
  $(TEST_DIR)/xsave.flat $(TEST_DIR)/rmap_chain.flat \
  $(TEST_DIR)/pcid.flat
 tests += $(TEST_DIR)/svm.flat
+tests += $(TEST_DIR)/vmx.flat
 
 include config-x86-common.mak
diff --git a/lib/x86/msr.h b/lib/x86/msr.h
index 509a421..281255a 100644
--- a/lib/x86/msr.h
+++ b/lib/x86/msr.h
@@ -396,6 +396,11 @@
 #define MSR_IA32_VMX_VMCS_ENUM  0x048a
 #define MSR_IA32_VMX_PROCBASED_CTLS20x048b
 #define MSR_IA32_VMX_EPT_VPID_CAP   0x048c
+#define MSR_IA32_VMX_TRUE_PIN  0x048d
+#define MSR_IA32_VMX_TRUE_PROC 0x048e
+#define MSR_IA32_VMX_TRUE_EXIT 0x048f
+#define MSR_IA32_VMX_TRUE_ENTRY0x0490
+
 
 /* AMD-V MSRs */
 
diff --git a/x86/cstart64.S b/x86/cstart64.S
index 24df5f8..0fe76da 100644
--- a/x86/cstart64.S
+++ b/x86/cstart64.S
@@ -4,6 +4,10 @@
 .globl boot_idt
 boot_idt = 0
 
+.globl idt_descr
+.globl tss_descr
+.globl gdt64_desc
+
 ipi_vector = 0x20
 
 max_cpus = 64
diff --git a/x86/unittests.cfg b/x86/unittests.cfg
index bc9643e..85c36aa 100644
--- a/x86/unittests.cfg
+++ b/x86/unittests.cfg
@@ -149,3 +149,9 @@ extra_params = --append 1000 `date +%s`
 file = pcid.flat
 extra_params = -cpu qemu64,+pcid
 arch = x86_64
+
+[vmx]
+file = vmx.flat
+extra_params = -cpu host,+vmx
+arch = x86_64
+
diff --git a/x86/vmx.c b/x86/vmx.c
new file mode 100644
index 000..b858400
--- /dev/null
+++ b/x86/vmx.c
@@ -0,0 +1,561 @@
+#include libcflat.h
+#include processor.h
+#include vm.h
+#include desc.h
+#include vmx.h
+#include msr.h
+#include smp.h
+#include io.h
+#include setjmp.h
+
+int fails = 0, tests = 0;
+u32 *vmxon_region;
+struct vmcs *vmcs_root;
+void *io_bmp1, *io_bmp2;
+void *msr_bmp;
+u32 vpid_ctr;
+char *guest_stack, *host_stack;
+char *guest_syscall_stack, *host_syscall_stack;
+u32 ctrl_pin, ctrl_enter, ctrl_exit, ctrl_cpu[2];
+ulong fix_cr0_set, fix_cr0_clr;
+ulong fix_cr4_set, fix_cr4_clr;
+struct regs regs;
+jmp_buf env;
+
+extern u64 gdt64_desc[];
+extern u64 idt_descr[];
+extern u64 tss_descr[];
+extern void *entry_vmx;
+extern void *entry_sysenter;
+extern void *entry_guest;
+
+void report(const char *name, int result)
+{
+   ++tests;
+   if (result)
+   printf(PASS: %s\n, name);
+   else {
+   printf(FAIL: %s\n, name);
+   ++fails;
+   }
+}
+
+inline u64 get_rflags(void)
+{
+   u64 r;
+   asm volatile(pushf; pop %0\n\t : =q(r) : : cc);
+   return r;
+}
+
+inline void set_rflags(u64 r)
+{
+   asm volatile(push %0; popf\n\t : : q(r) : cc);
+}
+
+int vmcs_clear(struct vmcs *vmcs)
+{
+   bool ret;
+   asm volatile (vmclear %1; setbe %0 : =q (ret) : m (vmcs) : cc);
+   return ret;
+}
+
+u64 vmcs_read(enum Encoding enc)
+{
+   u64 val;
+   asm volatile (vmread %1, %0 : =rm (val) : r ((u64)enc) : cc);
+   return val;
+}
+
+int vmcs_write(enum Encoding enc, u64 val)
+{
+   bool ret;
+   asm volatile (vmwrite %1, %2; setbe %0
+   : =q(ret) : rm (val), r ((u64)enc) : cc);
+   return ret;
+}
+
+int make_vmcs_current(struct vmcs *vmcs)
+{
+   bool ret;
+
+   asm volatile (vmptrld %1; setbe %0 : =q (ret) : m (vmcs) : cc);
+   return ret;
+}
+
+int save_vmcs(struct vmcs **vmcs)
+{
+   bool ret;
+
+   asm volatile (vmptrst %1; setbe %0 : =q (ret) : m (*vmcs) : cc);
+   return ret;
+}
+
+/* entry_vmx */
+asm(
+

Re: [PATCH v3 0/2] Basic nested VMX test suite

2013-07-16 Thread Arthur Chunqi Li

On this commit, I delete two TODO lines in PATCH 2/2, and it can
only run with all my previous patches.

Arthur

On Tue, Jul 16, 2013 at 11:58 PM, Arthur Chunqi Li yzt...@gmail.com wrote:
 These two patches are focued on providing a basic version of VMX nested
 test suite. This commit provides the architecture of nested VMX similiar
 to x86/svm.c.

 setjmp/longjmp is designed to avoid exiting VMX in call-back form.

 Arthur Chunqi Li (2):
   kvm-unit-tests : Add setjmp/longjmp to libcflat
   kvm-unit-tests : The first version of VMX nested test case

  config-x86-common.mak |2 +
  config-x86_64.mak |3 +
  lib/setjmp.h  |   11 +
  lib/x86/msr.h |5 +
  lib/x86/setjmp64.S|   27 +++
  x86/cstart64.S|4 +
  x86/unittests.cfg |6 +
  x86/vmx.c |  561 
 +
  x86/vmx.h |  406 +++
  9 files changed, 1025 insertions(+)
  create mode 100644 lib/setjmp.h
  create mode 100644 lib/x86/setjmp64.S
  create mode 100644 x86/vmx.c
  create mode 100644 x86/vmx.h

 --
 1.7.9.5




-- 
Arthur Chunqi Li
Department of Computer Science
School of EECS
Peking University
Beijing, China
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-16 Thread Gleb Natapov

On Tue, Jul 16, 2013 at 05:48:52PM +0200, Peter Zijlstra wrote:
 On Tue, Jul 16, 2013 at 09:02:15AM +0300, Gleb Natapov wrote:
  BTW can NMI handler take spinlocks? 
 
 No -- that is, yes you can using trylock, but you still shouldn't.
 
Great news for this code. Thanks.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Gleb Natapov

On Tue, Jul 16, 2013 at 11:29:20PM +0800, Arthur Chunqi Li wrote:
 On Tue, Jul 16, 2013 at 11:20 PM, Gleb Natapov g...@redhat.com wrote:
  On Tue, Jul 16, 2013 at 12:28:05PM +0200, Paolo Bonzini wrote:
   +void vmx_exit(void)
   +{
   +   test_vmxoff();
   +   printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
   +   exit(fails ? -1 : 0);
   +}
 
  Can you try to jump back to main, and do test_vmxoff there?  This will
  avoid having to write our tests in callback style, which is a pain.
  Basically something similar to setjmp/longjmp.  In main:
 
if (setjmp(jmpbuf) == 0) {
vmx_run();
/* Should not reach here */
report(test vmlaunch, 0);
}
test_vmxoff();
 
  exit:
printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
return fails ? 1 : 0;
 
  In vmx_handler:
 
case VMX_HLT:
printf(\nVM exit.\n);
longjmp(jmpbuf, 1);
 
  Why not just make vmexit occur after vmlaunch/vmresume like KVM does. It
  will make code much more straightforward and easer to follow.
 The concept easier to follow may have different meanings in
 different view. This achievement puts all the test cases in main
 function instead of scattering everywhere, which is another view to
 easy to follow. As this is just a test case, I prefer this one.
 
I do not see why what I propose will prevent you to put all tests into main.

vmx_run() will looks like that:

   vmlaunch
   while(1) {
   vmresume
  vmexit jumps here
   switch(exit reason) {
  case reason1:
  break;
  case reason2:
  break;
  case HLT
  return;
   }
   }  
  
 Besides, this way we can start another VM following the previous one
 simply in main function. This is flexible if we want to test re-enter
 to VMX mode or so.
 
That's what I am missing. How do one writes more tests now?

I was thinking about interface like that:

guest_func_test1()
{
}

tes1t_exit_handlers[] = {test1_handle_hlt, test1_handle_exception, }

main()
{

   init_vmcs(); /* generic stuff */
   init_vmcs_test1(); /* test1 related stuff */
   r = run_in_guest(guest_func_test1, test1_exit_handlers);
   report(test1, r);
}

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Arthur Chunqi Li

On Wed, Jul 17, 2013 at 12:45 AM, Gleb Natapov g...@redhat.com wrote:
 On Tue, Jul 16, 2013 at 11:29:20PM +0800, Arthur Chunqi Li wrote:
 On Tue, Jul 16, 2013 at 11:20 PM, Gleb Natapov g...@redhat.com wrote:
  On Tue, Jul 16, 2013 at 12:28:05PM +0200, Paolo Bonzini wrote:
   +void vmx_exit(void)
   +{
   +   test_vmxoff();
   +   printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
   +   exit(fails ? -1 : 0);
   +}
 
  Can you try to jump back to main, and do test_vmxoff there?  This will
  avoid having to write our tests in callback style, which is a pain.
  Basically something similar to setjmp/longjmp.  In main:
 
if (setjmp(jmpbuf) == 0) {
vmx_run();
/* Should not reach here */
report(test vmlaunch, 0);
}
test_vmxoff();
 
  exit:
printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
return fails ? 1 : 0;
 
  In vmx_handler:
 
case VMX_HLT:
printf(\nVM exit.\n);
longjmp(jmpbuf, 1);
 
  Why not just make vmexit occur after vmlaunch/vmresume like KVM does. It
  will make code much more straightforward and easer to follow.
 The concept easier to follow may have different meanings in
 different view. This achievement puts all the test cases in main
 function instead of scattering everywhere, which is another view to
 easy to follow. As this is just a test case, I prefer this one.

 I do not see why what I propose will prevent you to put all tests into main.

 vmx_run() will looks like that:

vmlaunch
while(1) {
vmresume
   vmexit jumps here
switch(exit reason) {
   case reason1:
   break;
   case reason2:
   break;
   case HLT
   return;
}
}
Yes, this recalls me some KVM codes I have read before. This mixes
vmlaunch/resume and vmx_handler into one piece of code. It is a good
way to explicitly show the execution sequence though, it increases LOC
in one function.

 Besides, this way we can start another VM following the previous one
 simply in main function. This is flexible if we want to test re-enter
 to VMX mode or so.

 That's what I am missing. How do one writes more tests now?

 I was thinking about interface like that:

 guest_func_test1()
 {
 }

 tes1t_exit_handlers[] = {test1_handle_hlt, test1_handle_exception, }

 main()
 {

init_vmcs(); /* generic stuff */
init_vmcs_test1(); /* test1 related stuff */
r = run_in_guest(guest_func_test1, test1_exit_handlers);
report(test1, r);
 }

I have thought about this question and I'm not quite sure how to solve
it now. I have two ways. The first is that we just leave vmx.c as the
VMX instructions and execution routine test suite, and develop other
test cases in other files. Since all other tests of nested vmx is
independent to the basic routine and it is hard for us to put all test
cases for nested VMX in one file, so we just let this file do simple
things and reuse some of its functions in other test suites of nested
vmx. Your proposal of adding new test cases can be implemented in
other test suites.

The other way is not splitting nested vmx tests cases in contrast.
This method may cause a HUGE vmx.c file, and tests for different parts
are not distinctive.

Actually, I prefer the former solution.
 --
 Gleb.


Besides, there is also another pseudo bug in PATCH 2/2, here:

+int test_vmx_capability(void)
+{
+   struct cpuid r;
+   u64 ret1, ret2;
+   r = cpuid(1);
+   ret1 = ((r.c)  5)  1;
+   ret2 = ((rdmsr(MSR_IA32_FEATURE_CONTROL)  0x5) == 0x5);
+   report(test vmx capability, ret1  ret2);
+   return !(ret1  ret2);
+}

The IA32_FEATURE_CONTROL MSR should be set by seabios. SInce there's
no patch for seabios and in fact software can also set this MSR if its
lock bit is unset. So I prefer to change it like this:

+int test_vmx_capability(void)
+{
+   struct cpuid r;
+   u64 ret1, ret2;
+   u64 ia32_feature_control;
+   r = cpuid(1);
+   ret1 = ((r.c)  5)  1;
+   ia32_feature_control = rdmsr(MSR_IA32_FEATURE_CONTROL);
+   ret2 = ((ia32_feature_control  0x5) == 0x5);
+   if ((!ret2)  ((ia32_feature_control  0x1) == 0)){
+   wrmsr(MSR_IA32_FEATURE_CONTROL, 0x5);
+   ia32_feature_control = rdmsr(MSR_IA32_FEATURE_CONTROL);
+   ret2 = ((ia32_feature_control  0x5) == 0x5);
+   }
+   report(test vmx capability, ret1  ret2);
+   return !(ret1  ret2);
+}


Arthur
--
Arthur Chunqi Li
Department of Computer Science
School of EECS
Peking University
Beijing, China
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Paolo Bonzini

Il 16/07/2013 19:13, Arthur Chunqi Li ha scritto:
 On Wed, Jul 17, 2013 at 12:45 AM, Gleb Natapov g...@redhat.com wrote:
 On Tue, Jul 16, 2013 at 11:29:20PM +0800, Arthur Chunqi Li wrote:
 On Tue, Jul 16, 2013 at 11:20 PM, Gleb Natapov g...@redhat.com wrote:
 On Tue, Jul 16, 2013 at 12:28:05PM +0200, Paolo Bonzini wrote:
 +void vmx_exit(void)
 +{
 +   test_vmxoff();
 +   printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
 +   exit(fails ? -1 : 0);
 +}

 Can you try to jump back to main, and do test_vmxoff there?  This will
 avoid having to write our tests in callback style, which is a pain.
 Basically something similar to setjmp/longjmp.  In main:

   if (setjmp(jmpbuf) == 0) {
   vmx_run();
   /* Should not reach here */
   report(test vmlaunch, 0);
   }
   test_vmxoff();

 exit:
   printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
   return fails ? 1 : 0;

 In vmx_handler:

   case VMX_HLT:
   printf(\nVM exit.\n);
   longjmp(jmpbuf, 1);

 Why not just make vmexit occur after vmlaunch/vmresume like KVM does. It
 will make code much more straightforward and easer to follow.
 The concept easier to follow may have different meanings in
 different view. This achievement puts all the test cases in main
 function instead of scattering everywhere, which is another view to
 easy to follow. As this is just a test case, I prefer this one.

 I do not see why what I propose will prevent you to put all tests into main.

 vmx_run() will looks like that:

vmlaunch
while(1) {
vmresume
   vmexit jumps here
switch(exit reason) {
   case reason1:
   break;
   case reason2:
   break;
   case HLT
   return;
}
}
 Yes, this recalls me some KVM codes I have read before. This mixes
 vmlaunch/resume and vmx_handler into one piece of code. It is a good
 way to explicitly show the execution sequence though, it increases LOC
 in one function.

 Besides, this way we can start another VM following the previous one
 simply in main function. This is flexible if we want to test re-enter
 to VMX mode or so.

 That's what I am missing. How do one writes more tests now?

 I was thinking about interface like that:

 guest_func_test1()
 {
 }

 tes1t_exit_handlers[] = {test1_handle_hlt, test1_handle_exception, }

 main()
 {

init_vmcs(); /* generic stuff */
init_vmcs_test1(); /* test1 related stuff */
r = run_in_guest(guest_func_test1, test1_exit_handlers);
report(test1, r);
 }

 I have thought about this question and I'm not quite sure how to solve
 it now.

Why can't you just use a different vmx_handler (e.g. with an indirect
call in entry_vmx) for each test (as in Gleb's test1_exit_handlers)?
run_in_guest would prepare the function pointers and do

init_vmcs(vmcs_root);

if (setjmp(env) == 0){
vmx_run();
/* Should not reach here */
report(test vmlaunch, 0);
}

as in your current testcase.

vmx.c would be a library, and testcases could be either grouped in the
same file or spread across many of them, as you see fit.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm-unit-tests : The first version of VMX nested test case

2013-07-16 Thread Gleb Natapov

On Wed, Jul 17, 2013 at 01:13:56AM +0800, Arthur Chunqi Li wrote:
 On Wed, Jul 17, 2013 at 12:45 AM, Gleb Natapov g...@redhat.com wrote:
  On Tue, Jul 16, 2013 at 11:29:20PM +0800, Arthur Chunqi Li wrote:
  On Tue, Jul 16, 2013 at 11:20 PM, Gleb Natapov g...@redhat.com wrote:
   On Tue, Jul 16, 2013 at 12:28:05PM +0200, Paolo Bonzini wrote:
+void vmx_exit(void)
+{
+   test_vmxoff();
+   printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
+   exit(fails ? -1 : 0);
+}
  
   Can you try to jump back to main, and do test_vmxoff there?  This will
   avoid having to write our tests in callback style, which is a pain.
   Basically something similar to setjmp/longjmp.  In main:
  
 if (setjmp(jmpbuf) == 0) {
 vmx_run();
 /* Should not reach here */
 report(test vmlaunch, 0);
 }
 test_vmxoff();
  
   exit:
 printf(\nSUMMARY: %d tests, %d failures\n, tests, fails);
 return fails ? 1 : 0;
  
   In vmx_handler:
  
 case VMX_HLT:
 printf(\nVM exit.\n);
 longjmp(jmpbuf, 1);
  
   Why not just make vmexit occur after vmlaunch/vmresume like KVM does. It
   will make code much more straightforward and easer to follow.
  The concept easier to follow may have different meanings in
  different view. This achievement puts all the test cases in main
  function instead of scattering everywhere, which is another view to
  easy to follow. As this is just a test case, I prefer this one.
 
  I do not see why what I propose will prevent you to put all tests into main.
 
  vmx_run() will looks like that:
 
 vmlaunch
 while(1) {
 vmresume
    vmexit jumps here
 switch(exit reason) {
case reason1:
break;
case reason2:
break;
case HLT
return;
 }
 }
 Yes, this recalls me some KVM codes I have read before. This mixes
 vmlaunch/resume and vmx_handler into one piece of code. It is a good
 way to explicitly show the execution sequence though, it increases LOC
 in one function.
LOC in one function is not an issue to be considered at all. Besides you
can put the switch into separate vmx_handler() function, or have an
array of vmexits.

 
  Besides, this way we can start another VM following the previous one
  simply in main function. This is flexible if we want to test re-enter
  to VMX mode or so.
 
  That's what I am missing. How do one writes more tests now?
 
  I was thinking about interface like that:
 
  guest_func_test1()
  {
  }
 
  tes1t_exit_handlers[] = {test1_handle_hlt, test1_handle_exception, }
 
  main()
  {
 
 init_vmcs(); /* generic stuff */
 init_vmcs_test1(); /* test1 related stuff */
 r = run_in_guest(guest_func_test1, test1_exit_handlers);
 report(test1, r);
  }
 
 I have thought about this question and I'm not quite sure how to solve
 it now. I have two ways. The first is that we just leave vmx.c as the
 VMX instructions and execution routine test suite, and develop other
 test cases in other files. Since all other tests of nested vmx is
 independent to the basic routine and it is hard for us to put all test
 cases for nested VMX in one file, so we just let this file do simple
 things and reuse some of its functions in other test suites of nested
 vmx. Your proposal of adding new test cases can be implemented in
 other test suites.
 
 The other way is not splitting nested vmx tests cases in contrast.
 This method may cause a HUGE vmx.c file, and tests for different parts
 are not distinctive.
 
 Actually, I prefer the former solution.
I do not think we need separate infrastructure just to test basic
instructions. Actually testing those are less interesting part (well
just to sorely test vmlaunch/vmresume we will likely have to write
hundred of tests to verify that all things that should cause failure do
that, but we likely settle for only a couple). I am not worrying about
vmx.c become a huge file as long as writing test is easy and more or
less self contained task.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC V10 16/18] kvm hypervisor : Simplify kvm_for_each_vcpu with kvm_irq_delivery_to_apic

2013-07-16 Thread Raghavendra K T


On 07/15/2013 09:16 PM, Gleb Natapov wrote:

On Mon, Jul 15, 2013 at 09:06:13PM +0530, Raghavendra K T wrote:

On 07/14/2013 06:54 PM, Gleb Natapov wrote:

On Mon, Jun 24, 2013 at 06:13:53PM +0530, Raghavendra K T wrote:

Simplify kvm_for_each_vcpu with kvm_irq_delivery_to_apic

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

Note that we are using APIC_DM_REMRD which has reserved usage.
In future if APIC_DM_REMRD usage is standardized, then we should
find some other way or go back to old method.

Suggested-by: Gleb Natapov g...@redhat.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
  arch/x86/kvm/lapic.c |5 -
  arch/x86/kvm/x86.c   |   25 ++---
  2 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e1adbb4..3f5f82e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -706,7 +706,10 @@ out:
break;

case APIC_DM_REMRD:
-   apic_debug(Ignoring delivery mode 3\n);
+   result = 1;
+   vcpu-arch.pv.pv_unhalted = 1;
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
+   kvm_vcpu_kick(vcpu);
break;

case APIC_DM_SMI:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 92a9932..b963c86 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5456,27 +5456,14 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
   */
  static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
  {
-   struct kvm_vcpu *vcpu = NULL;
-   int i;
+   struct kvm_lapic_irq lapic_irq;

-   kvm_for_each_vcpu(i, vcpu, kvm) {
-   if (!kvm_apic_present(vcpu))
-   continue;
+   lapic_irq.shorthand = 0;
+   lapic_irq.dest_mode = 0;
+   lapic_irq.dest_id = apicid;

-   if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0))
-   break;
-   }
-   if (vcpu) {
-   /*
-* Setting unhalt flag here can result in spurious runnable
-* state when unhalt reset does not happen in vcpu_block.
-* But that is harmless since that should soon result in halt.
-*/
-   vcpu-arch.pv.pv_unhalted = true;
-   /* We need everybody see unhalt before vcpu unblocks */
-   smp_wmb();
-   kvm_vcpu_kick(vcpu);
-   }
+   lapic_irq.delivery_mode = APIC_DM_REMRD;

Need to make sure that delivery_mode cannot be set to APIC_DM_REMRD

from MSI/IOAPIC/IPI path.

I Gleb,
I need your help here since I do not know much about apic.

so are you saying explicitly checking that, kvm_set_msi_irq,
apic_send_ipi, native_setup_ioapic_entry, does not have
APIC_DM_REMRD as delivery_mode set?


Yes, but on a second thought what bad can happen if we will not check it?
If guest configures irq to inject APIC_DM_REMRD interrupt this may
needlessly wakeup sleeping vcpu and it will try to accrue spinlock
again, so no correctness problem only performance. If this is the case
lets leave it as it for now.


Okay.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-16 Thread Raghavendra K T


On 07/16/2013 11:32 AM, Gleb Natapov wrote:

On Tue, Jul 16, 2013 at 09:07:53AM +0530, Raghavendra K T wrote:

On 07/15/2013 04:06 PM, Gleb Natapov wrote:

On Mon, Jul 15, 2013 at 03:20:06PM +0530, Raghavendra K T wrote:

On 07/14/2013 06:42 PM, Gleb Natapov wrote:

On Mon, Jun 24, 2013 at 06:13:42PM +0530, Raghavendra K T wrote:

kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

From: Srivatsa Vaddagiri va...@linux.vnet.ibm.com


trimming
[...]

+
+static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
+{
+   struct kvm_lock_waiting *w;
+   int cpu;
+   u64 start;
+   unsigned long flags;
+
+   w = __get_cpu_var(lock_waiting);
+   cpu = smp_processor_id();
+   start = spin_time_start();
+
+   /*
+* Make sure an interrupt handler can't upset things in a
+* partially setup state.
+*/
+   local_irq_save(flags);
+
+   /*
+* The ordering protocol on this is that the lock pointer
+* may only be set non-NULL if the want ticket is correct.
+* If we're updating want, we must first clear lock.
+*/
+   w-lock = NULL;
+   smp_wmb();
+   w-want = want;
+   smp_wmb();
+   w-lock = lock;
+
+   add_stats(TAKEN_SLOW, 1);
+
+   /*
+* This uses set_bit, which is atomic but we should not rely on its
+* reordering gurantees. So barrier is needed after this call.
+*/
+   cpumask_set_cpu(cpu, waiting_cpus);
+
+   barrier();
+
+   /*
+* Mark entry to slowpath before doing the pickup test to make
+* sure we don't deadlock with an unlocker.
+*/
+   __ticket_enter_slowpath(lock);
+
+   /*
+* check again make sure it didn't become free while
+* we weren't looking.
+*/
+   if (ACCESS_ONCE(lock-tickets.head) == want) {
+   add_stats(TAKEN_SLOW_PICKUP, 1);
+   goto out;
+   }
+
+   /* Allow interrupts while blocked */
+   local_irq_restore(flags);
+

So what happens if an interrupt comes here and an interrupt handler
takes another spinlock that goes into the slow path? As far as I see
lock_waiting will become overwritten and cpu will be cleared from
waiting_cpus bitmap by nested kvm_lock_spinning(), so when halt is
called here after returning from the interrupt handler nobody is going
to wake this lock holder. Next random interrupt will fix it, but it
may be several milliseconds away, or never. We should probably check
if interrupt were enabled and call native_safe_halt() here.



Okay you mean something like below should be done.
if irq_enabled()
   native_safe_halt()
else
   halt()

It is been a complex stuff for analysis for me.

So in our discussion stack would looking like this.

spinlock()
   kvm_lock_spinning()
   -- interrupt here
   halt()


 From the halt if we trace


It is to early to trace the halt since it was not executed yet. Guest
stack trace will look something like this:

spinlock(a)
   kvm_lock_spinning(a)
lock_waiting = a
set bit in waiting_cpus
 -- interrupt here
 spinlock(b)
   kvm_lock_spinning(b)
 lock_waiting = b
 set bit in waiting_cpus
 halt()
 unset bit in waiting_cpus
 lock_waiting = NULL
  -- ret from interrupt
halt()

Now at the time of the last halt above lock_waiting == NULL and
waiting_cpus is empty and not interrupt it pending, so who will unhalt
the waiter?



Yes. if an interrupt occurs between
local_irq_restore() and halt(), this is possible. and since this is
rarest of rare (possiility of irq entering slowpath and then no
random irq to do spurious wakeup), we had never hit this problem in
the past.

I do not think it is very rare to get interrupt between
local_irq_restore() and halt() under load since any interrupt that
occurs between local_irq_save() and local_irq_restore() will be delivered
immediately after local_irq_restore(). Of course the chance of no other
random interrupt waking lock waiter is very low, but waiter can sleep
for much longer then needed and this will be noticeable in performance.


Yes, I meant the entire thing. I did infact turned WARN on w-lock==null 
before halt() [ though we can potentially have irq right after that ], 
but did not hit so far.



BTW can NMI handler take spinlocks? If it can what happens if NMI is
delivered in a section protected by local_irq_save()/local_irq_restore()?



Had another idea if NMI, halts are causing problem until I saw PeterZ's 
reply similar to V2 of pvspinlock posted  here:


 https://lkml.org/lkml/2011/10/23/211

Instead of halt we started with a sleep hypercall in those
 versions. Changed to halt() once Avi suggested to reuse existing sleep.

If we use older hypercall with few changes like below:

kvm_pv_wait_for_kick_op(flags,

Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-16 Thread Raghavendra K T


On 07/16/2013 09:18 PM, Peter Zijlstra wrote:

On Tue, Jul 16, 2013 at 09:02:15AM +0300, Gleb Natapov wrote:

BTW can NMI handler take spinlocks?


No -- that is, yes you can using trylock, but you still shouldn't.


Thanks Peter for the clarification.

I had started checking few of nmi handlers code to confirm.
Did saw a raw spinlock in drivers/acpi/apei/ghes.c, but then stopped.




If it can what happens if NMI is
delivered in a section protected by local_irq_save()/local_irq_restore()?


You deadlock.



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: RFC: vfio interface for platform devices

2013-07-16 Thread Yoder Stuart-B08248

 -Original Message-
 From: Wood Scott-B07421
 Sent: Wednesday, July 03, 2013 5:32 PM
 To: Yoder Stuart-B08248
 Cc: Alex Williamson; Alexander Graf; Wood Scott-B07421; Bhushan 
 Bharat-R65777; Sethi Varun-B16395;
 virtualizat...@lists.linux-foundation.org; Antonios Motakis; 
 kvm@vger.kernel.org list; kvm-
 p...@vger.kernel.org; kvm...@lists.cs.columbia.edu
 Subject: Re: RFC: vfio interface for platform devices

 On 07/02/2013 06:25:59 PM, Yoder Stuart-B08248 wrote:
  The write-up below is the first draft of a proposal for how the
  kernel can expose
  platform devices to user space using vfio.

  In short, I'm proposing a new ioctl VFIO_DEVICE_GET_DEVTREE_INFO which
  allows user space to correlate regions and interrupts to the
  corresponding
  device tree node structure that is defined for most platform devices.

  Regards,
  Stuart Yoder

  --
  VFIO for Platform Devices

  The existing infrastructure for vfio-pci is pretty close to what we
  need:
 -mechanism to create a container
 -add groups/devices to a container
 -set the IOMMU model
 -map DMA regions
 -get an fd for a specific device, which allows user space to
  determine
  info about device regions (e.g. registers) and interrupt info
 -support for mmapping device regions
 -mechanism to set how interrupts are signaled

  Platform devices can get complicated-- potentially with a tree
  hierarchy
  of nodes, and links/phandles pointing to other platform
  devices.   The kernel doesn't expose relationships between
  devices.  The kernel just exposes mappable register regions and
  interrupts.
  It's up to user space to work out relationships between devices
  if it needs to-- this can be determined in the device tree exposed in
  /proc/device-tree.

  I think the changes needed for vfio are around some of the device tree
  related info that needs to be available with the device fd.

  1.  VFIO_GROUP_GET_DEVICE_FD

User space has to know which device it is accessing and will call
VFIO_GROUP_GET_DEVICE_FD passing a specific platform device path to
get the device information:

fd = ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
  /soc@ffe00/usb@21);

(whether the path is a device tree path or a sysfs path is up for
discussion, e.g. /sys/bus/platform/devices/ffe21.usb)

 Doesn't VFIO need to operate on an actual Linux device, rather than
 just an OF node?

 Are we going to have a fixed assumption that you always want all the
 children of the node corresponding to the assigned device, or will it
 be possible to exclude some?

My assumption is that you always get all the children of the
node corresponding to the assigned device.

  2.  VFIO_DEVICE_GET_INFO

 Don't think any changes are needed to VFIO_DEVICE_GET_INFO other
 than adding a new flag identifying a devices as a 'platform'
 device.

 This ioctl simply returns the number of regions and number of irqs.

 The number of regions corresponds to the number of regions
 that can be mapped for the device-- corresponds to the regions
  defined
 in reg and ranges in the device tree.

  3.  VFIO_DEVICE_GET_REGION_INFO

 No changes needed, except perhaps adding a new flag.  Freescale
  has some
 devices with regions that must be mapped cacheable.

 While I don't object to making the information available to the user
 just in case, the main thing we need here is to influence what the
 kernel does when the user tries to map it.  At least on PPC it's not up
 to userspace to select whether a mmap is cacheable.

If user space really can't do anything with the 'cacheable'
flag, do you think there is good reason to keep it?   Will it
help any decision that user space makes?  Maybe we should just
drop it.

  4. VFIO_DEVICE_GET_DEVTREE_INFO

 The VFIO_DEVICE_GET_REGION_INFO and VFIO_DEVICE_GET_IRQ_INFO APIs
 expose device regions and interrupts, but it's not enough to know
 that there are X regions and Y interrupts.  User space needs to
 know what the resources are for-- to correlate those
  regions/interrupts
 to the device tree structure that drivers use.  The device tree
 structure could consist of multiple nodes and it is necessary to
 identify the node corresponding to the region/interrupt exposed
 by VFIO.

 The following information is needed:
-the device tree path to the node corresponding to the
 region or interrupt
-for a region, whether it corresponds to a reg or ranges
 property
-there could be multiple sub-regions per reg or ranges and
 the sub-index within the reg/ranges is needed

 The VFIO_DEVICE_GET_DEVTREE_INFO operates on a device fd.

 ioctl: VFIO_DEVICE_GET_DEVTREE_INFO

 struct vfio_path_info {
  __u32   argsz;
  __u32   flags;
 #define VFIO_DEVTREE_INFO_RANGES  (1  3) /* the

RE: RFC: vfio interface for platform devices (v2)

2013-07-16 Thread Yoder Stuart-B08248

(sorry for the delayed response, but I've been on PTO)

  1.  VFIO_GROUP_GET_DEVICE_FD
 
User space knows by out-of-band means which device it is accessing
and will call VFIO_GROUP_GET_DEVICE_FD passing a specific sysfs path
to get the device information:
 
fd = ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
   /sys/bus/platform/devices/ffe21.usb));
 
 FWIW, I'm in favor of whichever way works out cleaner in the code for
 pre-pending /sys/bus or not.  It sort of seems like it's unnecessary.
 It's also a little inconsistent that the returned path doesn't
 pre-pend /sys in the examples below.

Ok.  For the returned path in the examples I have the actual device tree
path which is slightly different from the path in /sys.  The device
tree path is what user space would need to interpret /proc/device-tree.

  2.  VFIO_DEVICE_GET_INFO
 
 The number of regions corresponds to the regions defined
 in reg and ranges in the device tree.
 
 Two new flags are added to struct vfio_device_info:
 
 #define VFIO_DEVICE_FLAGS_PLATFORM (1  ?) /* A platform bus device */
 #define VFIO_DEVICE_FLAGS_DEVTREE  (1  ?) /* device tree info 
  available */
 
 It is possible that there could be platform bus devices
 that are not in the device tree, so we use 2 flags to
 allow for that.
 
 If just VFIO_DEVICE_FLAGS_PLATFORM is set, it means
 that there are regions and IRQs but no device tree info
 available.
 
 If just VFIO_DEVICE_FLAGS_DEVTREE is set, it means
 there is device tree info available.
 
 But it would be invalid to only have DEVTREE w/o PLATFORM for now,
 right?

Right.  The way I stated it is incorrect. DEVTREE would never
be set by itself.

  3. VFIO_DEVICE_GET_REGION_INFO
 
 For platform devices with multiple regions, information
 is needed to correlate the regions with the device
 tree structure that drivers use to determine the meaning
 of device resources.
 
 The VFIO_DEVICE_GET_REGION_INFO is extended to provide
 device tree information.
 
 The following information is needed:
-the device tree path to the node corresponding to the
 region
-whether it corresponds to a reg or ranges property
-there could be multiple sub-regions per reg or ranges and
 the sub-index within the reg/ranges is needed
 
 There are 5 new flags added to vfio_region_info :
 
 struct vfio_region_info {
  __u32   argsz;
  __u32   flags;
 #define VFIO_REGION_INFO_FLAG_CACHEABLE (1  ?)
 #define VFIO_DEVTREE_REGION_INFO_FLAG_REG (1  ?)
 #define VFIO_DEVTREE_REGION_INFO_FLAG_RANGE (1  ?)
 #define VFIO_DEVTREE_REGION_INFO_FLAG_INDEX (1  ?)
 #define VFIO_DEVTREE_REGION_INFO_FLAG_PATH (1  ?)
  __u32   index;  /* Region index */
  __u32   resv;   /* Reserved for alignment */
  __u64   size;   /* Region size (bytes) */
  __u64   offset; /* Region offset from start of device fd */
 };
 
 VFIO_REGION_INFO_FLAG_CACHEABLE
 -if set indicates that the region must be mapped as cacheable
 
 VFIO_DEVTREE_REGION_INFO_FLAG_REG
 -if set indicates that the region corresponds to a reg property
  in the device tree representation of the device
 
 VFIO_DEVTREE_REGION_INFO_FLAG_RANGE
 -if set indicates that the region corresponds to a ranges property
  in the device tree representation of the device
 
 VFIO_DEVTREE_REGION_INFO_FLAG_INDEX
 -if set indicates that there is a dword aligned struct
  struct vfio_devtree_region_info_index appended to the
  end of vfio_region_info:
 
  struct vfio_devtree_region_info_index
  {
u32 index;
  }
 
  A reg or ranges property may have multiple regsion.  The index
  specifies the index within the reg or ranges
  that this region corresponds to.
 
 VFIO_DEVTREE_REGION_INFO_FLAG_PATH
 -if set indicates that there is a dword aligned struct
  struct vfio_devtree_info_path appended to the
  end of vfio_region_info:
 
  struct vfio_devtree_info_path
  {
  u32 len;
  u8 path[];
  }
 
  The path is the full path to the corresponding device
  tree node.  The len field specifies the length of the
  path string.
 
 If multiple flags are set that indicate that there is
 an appended struct, the order of the flags indicates
 the order of the structs.
 
 argsz is set by the kernel specifying the total size of
 struct vfio_region_info and all appended structs.
 
 Suggested usage:
-call VFIO_DEVICE_GET_REGION_INFO with argsz =
 sizeof(struct vfio_region_info)
-realloc the buffer
-call VFIO_DEVICE_GET_REGION_INFO again, and the appended
 structs will be returned
 
  4.  VFIO_DEVICE_GET_IRQ_INFO

Re: RFC: vfio interface for platform devices

2013-07-16 Thread Scott Wood


On 07/16/2013 04:51:12 PM, Yoder Stuart-B08248 wrote:

  3.  VFIO_DEVICE_GET_REGION_INFO
 
 No changes needed, except perhaps adding a new flag.  Freescale
  has some
 devices with regions that must be mapped cacheable.

 While I don't object to making the information available to the user
 just in case, the main thing we need here is to influence what the
 kernel does when the user tries to map it.  At least on PPC it's  
not up

 to userspace to select whether a mmap is cacheable.

If user space really can't do anything with the 'cacheable'
flag, do you think there is good reason to keep it?   Will it
help any decision that user space makes?  Maybe we should just
drop it.


As long as we can be sure all architectures will map things correctly  
without any flags needing to be specified, that's fine.



 struct vfio_path_info {
  __u32   argsz;
  __u32   flags;
 #define VFIO_DEVTREE_INFO_RANGES  (1  3) /* the region  
is a

  ranges property */

 What about distinguishing a normal interrupt from one found in an
 interrupt-map?

I'm not sure we need that.  The kernel needs to use the interrupt
map to get interrupts hooked up right, but all user space needs to
know is that there are N interrupts and possibly device tree
paths to help user space interpret which interrupt is which.


What if the interrupt map is for devices without explicit nodes, such  
as with a PCI controller (ignore the fact that we would normally use  
vfio_pci for the indivdual PCI devices instead)?


You could say the same thing about ranges -- why expose ranges instead  
of the individual child node regs after translation?



 In the case of both ranges and interrupt-maps, we'll also want to
 decide what the policy is for when to expose them directly, versus  
just

 using them to translate regs and interrupts of child nodes

Yes, not sure the best approach there...but guess we can cross
that bridge when we implement this.  It doesn't affect this
interface.


It does affect the interface, because if you allow either of them to be  
mapped directly (rather than implicitly used when mapping a child  
node), you need a way to indicate which type of resource it is you're  
describing (as you already do for reg/ranges).


It also affects how vfio device binding is done, even if only to the  
point of specifying default behavior in the absence of knobs which  
change whether interrupt maps and/or ranges are mapped.



  __u8path[]; /* output: Full path to associated
  device tree node */

 How does the caller know what size buffer to supply for this?


Ping

-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: RFC: vfio interface for platform devices

2013-07-16 Thread Yoder Stuart-B08248

 -Original Message-
 From: Wood Scott-B07421
 Sent: Tuesday, July 16, 2013 5:01 PM
 To: Yoder Stuart-B08248
 Cc: Wood Scott-B07421; Alex Williamson; Alexander Graf; Bhushan 
 Bharat-R65777; Sethi Varun-B16395;
 virtualizat...@lists.linux-foundation.org; Antonios Motakis; 
 kvm@vger.kernel.org list; kvm-
 p...@vger.kernel.org; kvm...@lists.cs.columbia.edu
 Subject: Re: RFC: vfio interface for platform devices

 On 07/16/2013 04:51:12 PM, Yoder Stuart-B08248 wrote:
3.  VFIO_DEVICE_GET_REGION_INFO

   No changes needed, except perhaps adding a new flag.  Freescale
has some
   devices with regions that must be mapped cacheable.

   While I don't object to making the information available to the user
   just in case, the main thing we need here is to influence what the
   kernel does when the user tries to map it.  At least on PPC it's
  not up
   to userspace to select whether a mmap is cacheable.

  If user space really can't do anything with the 'cacheable'
  flag, do you think there is good reason to keep it?   Will it
  help any decision that user space makes?  Maybe we should just
  drop it.

 As long as we can be sure all architectures will map things correctly
 without any flags needing to be specified, that's fine.

   struct vfio_path_info {
__u32   argsz;
__u32   flags;
   #define VFIO_DEVTREE_INFO_RANGES  (1  3) /* the region
  is a
ranges property */

   What about distinguishing a normal interrupt from one found in an
   interrupt-map?

  I'm not sure we need that.  The kernel needs to use the interrupt
  map to get interrupts hooked up right, but all user space needs to
  know is that there are N interrupts and possibly device tree
  paths to help user space interpret which interrupt is which.

 What if the interrupt map is for devices without explicit nodes, such
 as with a PCI controller (ignore the fact that we would normally use
 vfio_pci for the indivdual PCI devices instead)?

 You could say the same thing about ranges -- why expose ranges instead
 of the individual child node regs after translation?

Hmm...yes, I guess ranges and interrupt-map fall into the same
basic type of resource category.  I'm not sure it's realistic
to pass entire bus controllers through to user space vs
just individual devices on a bus, but I guess it's theoretically
possible.

So the question is whether we future proof by adding flags 
for both ranges and interrupt-map, or wait until there is
an actual need for it.

   In the case of both ranges and interrupt-maps, we'll also want to
   decide what the policy is for when to expose them directly, versus
  just
   using them to translate regs and interrupts of child nodes

  Yes, not sure the best approach there...but guess we can cross
  that bridge when we implement this.  It doesn't affect this
  interface.

 It does affect the interface, because if you allow either of them to be
 mapped directly (rather than implicitly used when mapping a child
 node), you need a way to indicate which type of resource it is you're
 describing (as you already do for reg/ranges).

 It also affects how vfio device binding is done, even if only to the
 point of specifying default behavior in the absence of knobs which
 change whether interrupt maps and/or ranges are mapped.

My opinion is that we want to expose the regs and interrupts for
individual nodes by default, not ranges (or interrupt maps).   When someone
needs ranges/interrupt-map in the future they'll need to figure out some
means for the vfio layer to do the right thing.  It's complicated
and I would be surprised to see someone need it.

__u8path[]; /* output: Full path to associated
device tree node */

   How does the caller know what size buffer to supply for this?

 Ping

This is in the v2 RFC... the caller invokes the ioctl which returns
the complete/full size, then re-allocs the buffer and calls the
ioctl again.  Or, as Alex suggested, just use a sufficiently large
buffer to start with.

Stuart

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFC: vfio interface for platform devices

2013-07-16 Thread Scott Wood

On 07/16/2013 05:41:04 PM, Yoder Stuart-B08248 wrote:

 -Original Message-
 From: Wood Scott-B07421
 Sent: Tuesday, July 16, 2013 5:01 PM
 To: Yoder Stuart-B08248
 Cc: Wood Scott-B07421; Alex Williamson; Alexander Graf; Bhushan  
Bharat-R65777; Sethi Varun-B16395;
 virtualizat...@lists.linux-foundation.org; Antonios Motakis;  
kvm@vger.kernel.org list; kvm-

 p...@vger.kernel.org; kvm...@lists.cs.columbia.edu
 Subject: Re: RFC: vfio interface for platform devices

 What if the interrupt map is for devices without explicit nodes,  
such

 as with a PCI controller (ignore the fact that we would normally use
 vfio_pci for the indivdual PCI devices instead)?

 You could say the same thing about ranges -- why expose ranges  
instead

 of the individual child node regs after translation?

Hmm...yes, I guess ranges and interrupt-map fall into the same
basic type of resource category.  I'm not sure it's realistic
to pass entire bus controllers through to user space vs
just individual devices on a bus, but I guess it's theoretically
possible.

Where theoretically possible means we've done it before in other  
contexts. :-)

So the question is whether we future proof by adding flags
for both ranges and interrupt-map, or wait until there is
an actual need for it.

We don't need to actually add a flag for it, but we should have a  
flag/type for the resources we do support, so that code written to the  
current API would recognize that it doesn't recognize an interrupt-map  
entry if it's added later.

__u8path[]; /* output: Full path to  
associated

device tree node */

   How does the caller know what size buffer to supply for this?

 Ping

This is in the v2 RFC... the caller invokes the ioctl which returns
the complete/full size, then re-allocs the buffer and calls the
ioctl again.

OK.

Or, as Alex suggested, just use a sufficiently large buffer to start  
with.

It's fine for a user of the API to simplify things by using a large  
fixed buffer, but the API shouldn't force that approach.

-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] booke: define reset and shutdown hcalls

2013-07-16 Thread Scott Wood


On 07/16/2013 01:35:55 AM, Gleb Natapov wrote:

On Mon, Jul 15, 2013 at 01:17:33PM -0500, Scott Wood wrote:
 On 07/15/2013 06:30:20 AM, Gleb Natapov wrote:
 There is no much sense to share hypercalls between architectures.
 There
 is zero probability x86 will implement those for instance

 This is similar to the question of whether to keep device API
 enumerations per-architecture...  It costs very little to keep it in
 a common place, and it's hard to go back in the other direction if
 we later realize there are things that should be shared.

This is different from device API since with device API all arches  
have

to create/destroy devices, so it make sense to put device lifecycle
management into the common code, and device API has single entry point
to the code - device fd ioctl - where it makes sense to handle common
tasks, if any, and despatch others to specific device implementation.

This is totally unlike hypercalls which are, by definition, very
architecture specific (the way they are triggered, the way parameter
are passed from guest to host, what hypercalls arch needs...).


The ABI is architecture specific.  The API doesn't need to be, any more  
than it does with syscalls (I consider the architecture-specific  
definition of syscall numbers and similar constants in Linux to be  
unfortunate, especially for tools such as strace or QEMU's linux-user  
emulation).



 Keeping it in a common place also makes it more visible to people
 looking to add new hcalls, which could cut down on reinventing the
 wheel.
I do not want other arches to start using hypercalls in the way  
powerpc

started to use them: separate device io space, so it is better to hide
this as far away from common code as possible :) But on a more serious
note hypercalls should be a last resort and added only when no other
possibility exists, so people should not look what hcalls others
implemented, so they can add them to their favorite arch, but they
should have a problem at hand that they cannot solve without hcall,  
but
at this point they will have pretty good idea what this hcall should  
do.


Why are hcalls such a bad thing?

Should new Linux syscalls be avoided too, in favor of new emulated  
devices exposed via vfio? :-)



 (not sure why PPC will want them either instead of emulating
 devices that do
 shutdown/reset).

 Besides what Alex said, for shutdown we don't have any existing
 device to emulate (our real hardware just doesn't have that
 functionality).  For reset we currently do emulate, but it's awkward
 to describe in the device tree what we actually emulate since the
 reset functionality is part of a kitchen-sink device of which we
 emulate virtually nothing other than the reset.  Currently we
 advertise the entire thing and just ignore the rest, but that causes
 problems with the guest seeing the node and trying to use that
 functionality.

What about writing virtio device for shutdown


That sounds like quite a bit more work than hcalls.  It also ties up a  
virtual PCI slot -- some machines don't have very many (mpc8544ds has  
2, though we could and should expand that in the paravirt e500 machine).



and add missing emulation
to kitchen-sink device (yeah I know easily said that done),


Not going to happen... there's lots of low-level and chip-specific  
stuff in there.  We'd have to make several versions, even for the  
paravirt platform so it would correspond to some chip that goes with  
the cpu being used.  Even then we couldn't do everything, at least with  
KVM -- one of the things in there is the ability to freeze the  
timebase, but reading the timebase doesn't trap, and we aren't going to  
freeze the host timebase.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] powerpc: using reset hcall when kvm,has-reset

2013-07-16 Thread Alexander Graf


On 16.07.2013, at 00:23, Scott Wood wrote:

 On 07/15/2013 03:55:08 PM, Alexander Graf wrote:
 On 15.07.2013, at 22:52, Scott Wood wrote:
  On 07/15/2013 03:28:46 PM, Alexander Graf wrote:
  On 15.07.2013, at 20:21, Scott Wood wrote:
   On 07/15/2013 10:16:41 AM, Bhushan Bharat-R65777 wrote:
 +printk(error: system reset returned with error %ld\n, 
 ret);

 So we should fall back to the normal reset handler here.

 Do you mean return normally from here, no BUG() etc?
   
If we guard the patching against everything, we can treat a broken 
hcall as BUG.
However, if we don't we want to fall back to the normal guts based 
reset.
   Will let Scott comment on this?
   But ppc_md.restart can point to only one handler and during paravirt 
   patching we changed this to new handler. So we cannot jump back to 
   guts type handler
  
   I don't think it's worth implementing a fall-back scheme -- if KVM 
   advertises that the reset hcall exists, then it had better exist.
  If we also check for kvm_para_available() I agree. Otherwise QEMU might 
  advertise the reset hcall, but the guest kernel may not implement KVM 
  hypercalls. In that case the device tree check will succeed, but the 
  actual hypercall will not.
 
  Wouldn't that be a bug in QEMU?  Or in KVM for exposing the hcall 
  capability without implementing them?
 No, because it would be the guest that doesn't know how to handle kvm 
 hypercalls.
 
 Oh, I misread guest kernel as host kernel. :-P
 
 Still, I'm not sure what sort of error you're thinking of.  If the guest 
 didn't support the hcall mechanism we would have returned from the function 
 by that point.  In fact, seeing kvm,has-reset on a different hypervisor ought 
 to mean that that hypervisor is emulating KVM in this particular respect.

Imagine we're running on KVM with this reset hcall support. Now if the guest 
has CONFIG_EPAPR_PARAVIRT enabled but CONFIG_KVM_GUEST disabled, we would patch 
the callback, but kvm_hypercall0() would be implemented as a nop.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] powerpc: using reset hcall when kvm,has-reset

2013-07-16 Thread Scott Wood


On 07/16/2013 06:21:51 PM, Alexander Graf wrote:


On 16.07.2013, at 00:23, Scott Wood wrote:

 Still, I'm not sure what sort of error you're thinking of.  If the  
guest didn't support the hcall mechanism we would have returned from  
the function by that point.  In fact, seeing kvm,has-reset on a  
different hypervisor ought to mean that that hypervisor is emulating  
KVM in this particular respect.


Imagine we're running on KVM with this reset hcall support. Now if  
the guest has CONFIG_EPAPR_PARAVIRT enabled but CONFIG_KVM_GUEST  
disabled, we would patch the callback, but kvm_hypercall0() would be  
implemented as a nop.


Ugh -- that should be renamed epapr_hypercall and moved to  
epapr_paravirt.c.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] powerpc: using reset hcall when kvm,has-reset

2013-07-16 Thread Scott Wood


On 07/16/2013 06:26:40 PM, Scott Wood wrote:

On 07/16/2013 06:21:51 PM, Alexander Graf wrote:


On 16.07.2013, at 00:23, Scott Wood wrote:

 Still, I'm not sure what sort of error you're thinking of.  If the  
guest didn't support the hcall mechanism we would have returned from  
the function by that point.  In fact, seeing kvm,has-reset on a  
different hypervisor ought to mean that that hypervisor is emulating  
KVM in this particular respect.


Imagine we're running on KVM with this reset hcall support. Now if  
the guest has CONFIG_EPAPR_PARAVIRT enabled but CONFIG_KVM_GUEST  
disabled, we would patch the callback, but kvm_hypercall0() would be  
implemented as a nop.


Ugh -- that should be renamed epapr_hypercall and moved to  
epapr_paravirt.c.


Or rather, kvm_hypercall() should become epapr_hypercall() in  
epapr_paravirt.c -- there's nothing KVM-specific about it.


kvm_hypercall0() and friends could become epapr_hypercall0() in  
epapr_hcalls.h, with the KVM_HCALL_TOKEN() moved to the caller.  Or  
they could stay as they are but depend on CONFIG_EPAPR_PARAVIRT rather  
than CONFIG_KVM_GUEST -- there's no real dependency on the rest of the  
KVM guest code.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linux Plumbers ACPI/PM, PCI Microconference

2013-07-16 Thread Myron Stowe

Linux Plumbers has approved an ACPI/PM, PCI microconference. The
overview page is here:

http://wiki.linuxplumbersconf.org/2013:pci_subsystem

We would like to start receiving volunteers for presenting topics of
interest.  There is a lot of activity in these subsystems so please
respond by submitting presentation or discussion proposals that you
would be willing to cover for consideration.  You should also feel
free to submit ideas as proposals that others could cover.  The
instructions for submitting ideas should be at:

http://www.linuxplumbersconf.org/2013/submitting-topic/

This is the Plumbers conference so topics concerning userspace
interaction with the kernel and not just topics concerning pure kernel
subsystem internals would be greatly welcomed.  As a possible aid
towards promoting topics for submissions we have listed a few
candidates that seem to meet the criteria in the microconference
proposals wiki located at
http://wiki.linuxplumbersconf.org/2013:pci_subsystem

Last year, we had a good discussion on general issues in PCI space -
that also might be worth giving updates on.

If you have topics that you would like to add, wait until the
instructions get posted at the link above. If you are impatient, feel
free to email me directly (but probably best to drop the broad mailing
lists from the reply).

Thanks!

Len, Bjorn, Myron, Rafael



Yinghai - An update on host bridge hot-add capabilities, what is in
place and what still remains to be done, would seem like a good topic.

Shuah - You brought up the idea about Converting drivers from Legacy
PM ops to dev_pm_ops; would you like to present what you have
done/encountered so far?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: vga passthrough // vfio // qemu bridge

2013-07-16 Thread Martin Wolf

thank you for the quick response alex,

but i still need your help ;)
i cloned both git trees
(http://lists.gnu.org/archive/html/qemu-devel/2013-05/msg00432.html)
that was the easy part for me (it boots like a charm ...) ... then i
built the qemu tree and found out
that it is just 1.4.50 and something was missing
so qemu would not start up ...

would you be so kind alex and supply me with
the patches you meant yesterday i would need for qemu?

ty in advance



Am 16.07.2013 16:25, schrieb Alex Williamson:
 On Tue, 2013-07-16 at 14:35 +0200, Martin Wolf wrote:
 Early 2012 i tested the old vga passthrough capabilities of KVM and was
 partly successful.
 now with the new vfio driver i tried again according to alex's hints and
 this guide:
 https://bbs.archlinux.org/viewtopic.php?id=162768

 since im primarily using ubuntu i used the daily build of saucy.
 it ships qemu 1.5 and seabios 1.7.3 so the requirements are met.

 according to the guide i prepared the vga card (amd 7870)

 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-2-generic
 root=UUID=26fed560-a972-499d-ab14-7fec6439fd3d ro intel_iommu=on
 pci-stub.ids=1002:6818,1002:aab0 quiet splash vt.handoff=7
 [0.00] Kernel command line:
 BOOT_IMAGE=/boot/vmlinuz-3.10.0-2-generic
 root=UUID=26fed560-a972-499d-ab14-7fec6439fd3d ro intel_iommu=on
 pci-stub.ids=1002:6818,1002:aab0 quiet splash vt.handoff=7
 [0.569977] pci-stub: add 1002:6818 sub=:
 cls=/
 [0.569987] pci-stub :01:00.0: claimed by stub
 [0.569994] pci-stub: add 1002:AAB0 sub=:
 cls=/
 [0.569998] pci-stub :01:00.1: claimed by stub

 then did this just to be sure:
 echo options vfio_iommu_type1 allow_unsafe_interrupts=1 
 /etc/modprobe.d/vfio_iommu_type1.conf
 (or was that wrong?)
 im using a z87 haswell mainboard
 Hopefully not needed, only use this option if you need to.  vfio will
 print an error to dmesg and qemu will fail to start if you need it.

 after that i binded the two devices to vfio-pci with:
 vfio-bind :01:00.0 :01:00.1 (the script in the guide)

 afterwards i was able to start the kvm with
 qemu-system-x86_64 -enable-kvm -M q35 -m 8192 -cpu host \
 -smp 8,sockets=1,cores=4,threads=8 \
 -bios /usr/share/qemu/bios.bin -vga none \
 -device
 ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 \
 -device
 vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on \
 -device vfio-pci,host=01:00.1,bus=root.1,addr=00.1 \
 -device ahci,bus=pcie.0,id=ahci \
 -drive file=/home/martin/windows.img,if=none,id=disk,format=raw -device
 virtio-blk-pci,drive=disk \
 -drive file=/home/martin/X17-59885.iso,id=isocd -device
 ide-cd,bus=ahci.0,drive=isocd \
 -net nic,model=virtio \
 -net user \
 -usb -usbdevice host:1532:000c \
 -drive file=/home/martin/Downloads/virtio-win-0.1-59.iso,id=isocd1
 -device ide-cd,bus=ahci.1,drive=isocd1

 to my surprise i instantly got the windows installation running
 installed the virtio drivers for nic and storage
 and had 15 mins later a working win7 installation.
 now i installed the amd driver (13.4) and rebooted.
 i got a bluescreen. similar to my old expiriences so i thought do a
 clean host reboot and try again.
 but still the same. so i tried to load the bios.rom for the card (found
 it on techpowerup) again no luck.
 maybe someone knows a hint?
 I think most of the folks using the guide you reference are using my
 vfio-vga-reset branches of qemu  kernel (or patches extracted from
 them).  These add an important step for reproducibility, by being able
 to reset the bus for the graphics card, giving us a way to reset the
 device.  The other thing in the qemu branch are improved quirks.  I've
 just posted these to qemu-devel and plan to get them pulled for 1.6.
 Note that for ATI/AMD cards, a critical quirks is intercepting the byte
 at I/O port address 0x3c3.  Without this, the VGA BIOS can't bootstrap
 itself.  The vfio-vga-reset branch has a conditional replacement of
 this, which doesn't seem to work for everyone.  I believe the version I
 posted to qemu-devel yesterday is a better implementation of that quirk.

 ---

 about qemu bridge
 i tried to set up a bridge with the config but qemu always told me that
 qemu-bridge-helper is not present.
 all i found out that it propably got removed from the package because of
 the lack of control over the tap
 devices.
 now my question how can i still bridge the vm into my network without
 that helper?
 I don't know what qemu-bridge-helper is/was, but you're probably better
 off asking bridge questions in a separate thread instead of buried here.
 Thanks,

 Alex


-- 
Adiumentum GmbH
Gf. Martin Wolf
Banderbacherstraße 76
90513 Zirndorf

0911 / 9601470
mw...@adiumentum.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] booke: define reset and shutdown hcalls

2013-07-16 Thread Gleb Natapov

On Mon, Jul 15, 2013 at 01:17:33PM -0500, Scott Wood wrote:
 On 07/15/2013 06:30:20 AM, Gleb Natapov wrote:
 On Mon, Jul 15, 2013 at 04:41:17PM +0530, Bharat Bhushan wrote:
  KVM_HC_VM_RESET: Requests that the virtual machine be reset.
  KVM_HC_VM_SHUTDOWN: Requests that the virtual machine be
 powered-off/halted.
 
  These hcalls are handled by guest userspace.
 
  Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
  ---
   Documentation/virtual/kvm/hypercalls.txt |   16 
   include/uapi/linux/kvm_para.h|3 ++-
   2 files changed, 18 insertions(+), 1 deletions(-)
 
  diff --git a/Documentation/virtual/kvm/hypercalls.txt
 b/Documentation/virtual/kvm/hypercalls.txt
  index ea113b5..58acdc1 100644
  --- a/Documentation/virtual/kvm/hypercalls.txt
  +++ b/Documentation/virtual/kvm/hypercalls.txt
  @@ -64,3 +64,19 @@ Purpose: To enable communication between the
 hypervisor and guest there is a
   shared page that contains parts of supervisor visible register
 state.
   The guest can map this shared page to access its supervisor
 register through
   memory using this hypercall.
  +
  +5. KVM_HC_VM_RESET
  +
  +Architecture: PPC
  +Status: active
  +Purpose:  Requests that the virtual machine be reset.  The
 hcall takes no
  +arguments. If successful the hcall does not return. If an error
 occurs it
  +returns EV_INTERNAL.
  +
  +6. KVM_HC_VM_SHUTDOWN
  +
  +Architecture: PPC
  +Status: active
  +Purpose: Requests that the virtual machine be powered-off/halted.
  +The hcall takes no arguments. If successful the hcall does not
 return.
  +If an error occurs it returns EV_INTERNAL.
  diff --git a/include/uapi/linux/kvm_para.h
 b/include/uapi/linux/kvm_para.h
  index cea2c5c..218882d 100644
  --- a/include/uapi/linux/kvm_para.h
  +++ b/include/uapi/linux/kvm_para.h
  @@ -19,7 +19,8 @@
   #define KVM_HC_MMU_OP 2
   #define KVM_HC_FEATURES   3
   #define KVM_HC_PPC_MAP_MAGIC_PAGE 4
  -
  +#define KVM_HC_VM_RESET   5
  +#define KVM_HC_VM_SHUTDOWN6
 There is no much sense to share hypercalls between architectures.
 There
 is zero probability x86 will implement those for instance
 
 This is similar to the question of whether to keep device API
 enumerations per-architecture...  It costs very little to keep it in
 a common place, and it's hard to go back in the other direction if
 we later realize there are things that should be shared.

This is different from device API since with device API all arches have
to create/destroy devices, so it make sense to put device lifecycle
management into the common code, and device API has single entry point
to the code - device fd ioctl - where it makes sense to handle common
tasks, if any, and despatch others to specific device implementation.

This is totally unlike hypercalls which are, by definition, very
architecture specific (the way they are triggered, the way parameter
are passed from guest to host, what hypercalls arch needs...). The
entry point of hypercalls is in arch specific code (again unlike device
API), so they way to reuse code if need arise is different too and does
not require common name space - just call common function after
retrieving hypercalls parameters in arch specific way.

 Keeping it in a common place also makes it more visible to people
 looking to add new hcalls, which could cut down on reinventing the
 wheel.
I do not want other arches to start using hypercalls in the way powerpc
started to use them: separate device io space, so it is better to hide
this as far away from common code as possible :) But on a more serious
note hypercalls should be a last resort and added only when no other
possibility exists, so people should not look what hcalls others
implemented, so they can add them to their favorite arch, but they
should have a problem at hand that they cannot solve without hcall, but
at this point they will have pretty good idea what this hcall should do.

 
 (not sure why PPC will want them either instead of emulating
 devices that do
 shutdown/reset).
 
 Besides what Alex said, for shutdown we don't have any existing
 device to emulate (our real hardware just doesn't have that
 functionality).  For reset we currently do emulate, but it's awkward
 to describe in the device tree what we actually emulate since the
 reset functionality is part of a kitchen-sink device of which we
 emulate virtually nothing other than the reset.  Currently we
 advertise the entire thing and just ignore the rest, but that causes
 problems with the guest seeing the node and trying to use that
 functionality.
 
What about writing virtio device for shutdown and add missing emulation
to kitchen-sink device (yeah I know easily said that done), or make
the virtio device handle reset too? This of course raises the question
what address to use for a device hence the idea to use hcalls as
separate address space.
 
--

RE: RFC: vfio interface for platform devices (v2)

2013-07-16 Thread Yoder Stuart-B08248

 -Original Message-
 From: Mario Smarduch [mailto:mario.smard...@huawei.com]
 Sent: Thursday, July 04, 2013 9:45 AM
 To: Yoder Stuart-B08248
 Cc: Alex Williamson; Alexander Graf; Wood Scott-B07421; k...@vger.kernel.org 
 list; Bhushan Bharat-R65777;
 kvm-ppc@vger.kernel.org; virtualizat...@lists.linux-foundation.org; Sethi 
 Varun-B16395;
 kvm...@lists.cs.columbia.edu
 Subject: Re: RFC: vfio interface for platform devices (v2)

 I'm having trouble understanding how this works where
 the Guest Device Model != Host. How do you inform the guest
 where the device is mapped in its physical address space,
 and handle GPA faults?

The vfio mechanisms just expose hardware to user space
and the user space app may or may not QEMU.  So there
may be no 'guest' at all.

The intent of this RFC is to provide enough info to user space so
an application can use the device, or in the case of QEMU expose
the device to a VM.  Platform devices are typically exposed via
the device tree and that is how I envision them being presented
to a guest.

Are there real cases you see where guest device model != host?
I don't envision ever presenting a platform device as a PCI device
or vise versa.

Stuart

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: RFC: vfio interface for platform devices

2013-07-16 Thread Yoder Stuart-B08248

 -Original Message-
 From: Wood Scott-B07421
 Sent: Wednesday, July 03, 2013 5:32 PM
 To: Yoder Stuart-B08248
 Cc: Alex Williamson; Alexander Graf; Wood Scott-B07421; Bhushan 
 Bharat-R65777; Sethi Varun-B16395;
 virtualizat...@lists.linux-foundation.org; Antonios Motakis; 
 k...@vger.kernel.org list; kvm-
 p...@vger.kernel.org; kvm...@lists.cs.columbia.edu
 Subject: Re: RFC: vfio interface for platform devices

 On 07/02/2013 06:25:59 PM, Yoder Stuart-B08248 wrote:
  The write-up below is the first draft of a proposal for how the
  kernel can expose
  platform devices to user space using vfio.

  In short, I'm proposing a new ioctl VFIO_DEVICE_GET_DEVTREE_INFO which
  allows user space to correlate regions and interrupts to the
  corresponding
  device tree node structure that is defined for most platform devices.

  Regards,
  Stuart Yoder

  --
  VFIO for Platform Devices

  The existing infrastructure for vfio-pci is pretty close to what we
  need:
 -mechanism to create a container
 -add groups/devices to a container
 -set the IOMMU model
 -map DMA regions
 -get an fd for a specific device, which allows user space to
  determine
  info about device regions (e.g. registers) and interrupt info
 -support for mmapping device regions
 -mechanism to set how interrupts are signaled

  Platform devices can get complicated-- potentially with a tree
  hierarchy
  of nodes, and links/phandles pointing to other platform
  devices.   The kernel doesn't expose relationships between
  devices.  The kernel just exposes mappable register regions and
  interrupts.
  It's up to user space to work out relationships between devices
  if it needs to-- this can be determined in the device tree exposed in
  /proc/device-tree.

  I think the changes needed for vfio are around some of the device tree
  related info that needs to be available with the device fd.

  1.  VFIO_GROUP_GET_DEVICE_FD

User space has to know which device it is accessing and will call
VFIO_GROUP_GET_DEVICE_FD passing a specific platform device path to
get the device information:

fd = ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
  /soc@ffe00/usb@21);

(whether the path is a device tree path or a sysfs path is up for
discussion, e.g. /sys/bus/platform/devices/ffe21.usb)

 Doesn't VFIO need to operate on an actual Linux device, rather than
 just an OF node?

 Are we going to have a fixed assumption that you always want all the
 children of the node corresponding to the assigned device, or will it
 be possible to exclude some?

My assumption is that you always get all the children of the
node corresponding to the assigned device.

  2.  VFIO_DEVICE_GET_INFO

 Don't think any changes are needed to VFIO_DEVICE_GET_INFO other
 than adding a new flag identifying a devices as a 'platform'
 device.

 This ioctl simply returns the number of regions and number of irqs.

 The number of regions corresponds to the number of regions
 that can be mapped for the device-- corresponds to the regions
  defined
 in reg and ranges in the device tree.

  3.  VFIO_DEVICE_GET_REGION_INFO

 No changes needed, except perhaps adding a new flag.  Freescale
  has some
 devices with regions that must be mapped cacheable.

 While I don't object to making the information available to the user
 just in case, the main thing we need here is to influence what the
 kernel does when the user tries to map it.  At least on PPC it's not up
 to userspace to select whether a mmap is cacheable.

If user space really can't do anything with the 'cacheable'
flag, do you think there is good reason to keep it?   Will it
help any decision that user space makes?  Maybe we should just
drop it.

  4. VFIO_DEVICE_GET_DEVTREE_INFO

 The VFIO_DEVICE_GET_REGION_INFO and VFIO_DEVICE_GET_IRQ_INFO APIs
 expose device regions and interrupts, but it's not enough to know
 that there are X regions and Y interrupts.  User space needs to
 know what the resources are for-- to correlate those
  regions/interrupts
 to the device tree structure that drivers use.  The device tree
 structure could consist of multiple nodes and it is necessary to
 identify the node corresponding to the region/interrupt exposed
 by VFIO.

 The following information is needed:
-the device tree path to the node corresponding to the
 region or interrupt
-for a region, whether it corresponds to a reg or ranges
 property
-there could be multiple sub-regions per reg or ranges and
 the sub-index within the reg/ranges is needed

 The VFIO_DEVICE_GET_DEVTREE_INFO operates on a device fd.

 ioctl: VFIO_DEVICE_GET_DEVTREE_INFO

 struct vfio_path_info {
  __u32   argsz;
  __u32   flags;
 #define VFIO_DEVTREE_INFO_RANGES  (1  3) /*

RE: RFC: vfio interface for platform devices (v2)

2013-07-16 Thread Yoder Stuart-B08248

(sorry for the delayed response, but I've been on PTO)

  1.  VFIO_GROUP_GET_DEVICE_FD
 
User space knows by out-of-band means which device it is accessing
and will call VFIO_GROUP_GET_DEVICE_FD passing a specific sysfs path
to get the device information:
 
fd = ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
   /sys/bus/platform/devices/ffe21.usb));
 
 FWIW, I'm in favor of whichever way works out cleaner in the code for
 pre-pending /sys/bus or not.  It sort of seems like it's unnecessary.
 It's also a little inconsistent that the returned path doesn't
 pre-pend /sys in the examples below.

Ok.  For the returned path in the examples I have the actual device tree
path which is slightly different from the path in /sys.  The device
tree path is what user space would need to interpret /proc/device-tree.

  2.  VFIO_DEVICE_GET_INFO
 
 The number of regions corresponds to the regions defined
 in reg and ranges in the device tree.
 
 Two new flags are added to struct vfio_device_info:
 
 #define VFIO_DEVICE_FLAGS_PLATFORM (1  ?) /* A platform bus device */
 #define VFIO_DEVICE_FLAGS_DEVTREE  (1  ?) /* device tree info 
  available */
 
 It is possible that there could be platform bus devices
 that are not in the device tree, so we use 2 flags to
 allow for that.
 
 If just VFIO_DEVICE_FLAGS_PLATFORM is set, it means
 that there are regions and IRQs but no device tree info
 available.
 
 If just VFIO_DEVICE_FLAGS_DEVTREE is set, it means
 there is device tree info available.
 
 But it would be invalid to only have DEVTREE w/o PLATFORM for now,
 right?

Right.  The way I stated it is incorrect. DEVTREE would never
be set by itself.

  3. VFIO_DEVICE_GET_REGION_INFO
 
 For platform devices with multiple regions, information
 is needed to correlate the regions with the device
 tree structure that drivers use to determine the meaning
 of device resources.
 
 The VFIO_DEVICE_GET_REGION_INFO is extended to provide
 device tree information.
 
 The following information is needed:
-the device tree path to the node corresponding to the
 region
-whether it corresponds to a reg or ranges property
-there could be multiple sub-regions per reg or ranges and
 the sub-index within the reg/ranges is needed
 
 There are 5 new flags added to vfio_region_info :
 
 struct vfio_region_info {
  __u32   argsz;
  __u32   flags;
 #define VFIO_REGION_INFO_FLAG_CACHEABLE (1  ?)
 #define VFIO_DEVTREE_REGION_INFO_FLAG_REG (1  ?)
 #define VFIO_DEVTREE_REGION_INFO_FLAG_RANGE (1  ?)
 #define VFIO_DEVTREE_REGION_INFO_FLAG_INDEX (1  ?)
 #define VFIO_DEVTREE_REGION_INFO_FLAG_PATH (1  ?)
  __u32   index;  /* Region index */
  __u32   resv;   /* Reserved for alignment */
  __u64   size;   /* Region size (bytes) */
  __u64   offset; /* Region offset from start of device fd */
 };
 
 VFIO_REGION_INFO_FLAG_CACHEABLE
 -if set indicates that the region must be mapped as cacheable
 
 VFIO_DEVTREE_REGION_INFO_FLAG_REG
 -if set indicates that the region corresponds to a reg property
  in the device tree representation of the device
 
 VFIO_DEVTREE_REGION_INFO_FLAG_RANGE
 -if set indicates that the region corresponds to a ranges property
  in the device tree representation of the device
 
 VFIO_DEVTREE_REGION_INFO_FLAG_INDEX
 -if set indicates that there is a dword aligned struct
  struct vfio_devtree_region_info_index appended to the
  end of vfio_region_info:
 
  struct vfio_devtree_region_info_index
  {
u32 index;
  }
 
  A reg or ranges property may have multiple regsion.  The index
  specifies the index within the reg or ranges
  that this region corresponds to.
 
 VFIO_DEVTREE_REGION_INFO_FLAG_PATH
 -if set indicates that there is a dword aligned struct
  struct vfio_devtree_info_path appended to the
  end of vfio_region_info:
 
  struct vfio_devtree_info_path
  {
  u32 len;
  u8 path[];
  }
 
  The path is the full path to the corresponding device
  tree node.  The len field specifies the length of the
  path string.
 
 If multiple flags are set that indicate that there is
 an appended struct, the order of the flags indicates
 the order of the structs.
 
 argsz is set by the kernel specifying the total size of
 struct vfio_region_info and all appended structs.
 
 Suggested usage:
-call VFIO_DEVICE_GET_REGION_INFO with argsz =
 sizeof(struct vfio_region_info)
-realloc the buffer
-call VFIO_DEVICE_GET_REGION_INFO again, and the appended
 structs will be returned
 
  4.  VFIO_DEVICE_GET_IRQ_INFO

Re: RFC: vfio interface for platform devices

2013-07-16 Thread Scott Wood


On 07/16/2013 04:51:12 PM, Yoder Stuart-B08248 wrote:

  3.  VFIO_DEVICE_GET_REGION_INFO
 
 No changes needed, except perhaps adding a new flag.  Freescale
  has some
 devices with regions that must be mapped cacheable.

 While I don't object to making the information available to the user
 just in case, the main thing we need here is to influence what the
 kernel does when the user tries to map it.  At least on PPC it's  
not up

 to userspace to select whether a mmap is cacheable.

If user space really can't do anything with the 'cacheable'
flag, do you think there is good reason to keep it?   Will it
help any decision that user space makes?  Maybe we should just
drop it.


As long as we can be sure all architectures will map things correctly  
without any flags needing to be specified, that's fine.



 struct vfio_path_info {
  __u32   argsz;
  __u32   flags;
 #define VFIO_DEVTREE_INFO_RANGES  (1  3) /* the region  
is a

  ranges property */

 What about distinguishing a normal interrupt from one found in an
 interrupt-map?

I'm not sure we need that.  The kernel needs to use the interrupt
map to get interrupts hooked up right, but all user space needs to
know is that there are N interrupts and possibly device tree
paths to help user space interpret which interrupt is which.


What if the interrupt map is for devices without explicit nodes, such  
as with a PCI controller (ignore the fact that we would normally use  
vfio_pci for the indivdual PCI devices instead)?


You could say the same thing about ranges -- why expose ranges instead  
of the individual child node regs after translation?



 In the case of both ranges and interrupt-maps, we'll also want to
 decide what the policy is for when to expose them directly, versus  
just

 using them to translate regs and interrupts of child nodes

Yes, not sure the best approach there...but guess we can cross
that bridge when we implement this.  It doesn't affect this
interface.


It does affect the interface, because if you allow either of them to be  
mapped directly (rather than implicitly used when mapping a child  
node), you need a way to indicate which type of resource it is you're  
describing (as you already do for reg/ranges).


It also affects how vfio device binding is done, even if only to the  
point of specifying default behavior in the absence of knobs which  
change whether interrupt maps and/or ranges are mapped.



  __u8path[]; /* output: Full path to associated
  device tree node */

 How does the caller know what size buffer to supply for this?


Ping

-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: RFC: vfio interface for platform devices

2013-07-16 Thread Yoder Stuart-B08248

 -Original Message-
 From: Wood Scott-B07421
 Sent: Tuesday, July 16, 2013 5:01 PM
 To: Yoder Stuart-B08248
 Cc: Wood Scott-B07421; Alex Williamson; Alexander Graf; Bhushan 
 Bharat-R65777; Sethi Varun-B16395;
 virtualizat...@lists.linux-foundation.org; Antonios Motakis; 
 k...@vger.kernel.org list; kvm-
 p...@vger.kernel.org; kvm...@lists.cs.columbia.edu
 Subject: Re: RFC: vfio interface for platform devices

 On 07/16/2013 04:51:12 PM, Yoder Stuart-B08248 wrote:
3.  VFIO_DEVICE_GET_REGION_INFO

   No changes needed, except perhaps adding a new flag.  Freescale
has some
   devices with regions that must be mapped cacheable.

   While I don't object to making the information available to the user
   just in case, the main thing we need here is to influence what the
   kernel does when the user tries to map it.  At least on PPC it's
  not up
   to userspace to select whether a mmap is cacheable.

  If user space really can't do anything with the 'cacheable'
  flag, do you think there is good reason to keep it?   Will it
  help any decision that user space makes?  Maybe we should just
  drop it.

 As long as we can be sure all architectures will map things correctly
 without any flags needing to be specified, that's fine.

   struct vfio_path_info {
__u32   argsz;
__u32   flags;
   #define VFIO_DEVTREE_INFO_RANGES  (1  3) /* the region
  is a
ranges property */

   What about distinguishing a normal interrupt from one found in an
   interrupt-map?

  I'm not sure we need that.  The kernel needs to use the interrupt
  map to get interrupts hooked up right, but all user space needs to
  know is that there are N interrupts and possibly device tree
  paths to help user space interpret which interrupt is which.

 What if the interrupt map is for devices without explicit nodes, such
 as with a PCI controller (ignore the fact that we would normally use
 vfio_pci for the indivdual PCI devices instead)?

 You could say the same thing about ranges -- why expose ranges instead
 of the individual child node regs after translation?

Hmm...yes, I guess ranges and interrupt-map fall into the same
basic type of resource category.  I'm not sure it's realistic
to pass entire bus controllers through to user space vs
just individual devices on a bus, but I guess it's theoretically
possible.

So the question is whether we future proof by adding flags 
for both ranges and interrupt-map, or wait until there is
an actual need for it.

   In the case of both ranges and interrupt-maps, we'll also want to
   decide what the policy is for when to expose them directly, versus
  just
   using them to translate regs and interrupts of child nodes

  Yes, not sure the best approach there...but guess we can cross
  that bridge when we implement this.  It doesn't affect this
  interface.

 It does affect the interface, because if you allow either of them to be
 mapped directly (rather than implicitly used when mapping a child
 node), you need a way to indicate which type of resource it is you're
 describing (as you already do for reg/ranges).

 It also affects how vfio device binding is done, even if only to the
 point of specifying default behavior in the absence of knobs which
 change whether interrupt maps and/or ranges are mapped.

My opinion is that we want to expose the regs and interrupts for
individual nodes by default, not ranges (or interrupt maps).   When someone
needs ranges/interrupt-map in the future they'll need to figure out some
means for the vfio layer to do the right thing.  It's complicated
and I would be surprised to see someone need it.

__u8path[]; /* output: Full path to associated
device tree node */

   How does the caller know what size buffer to supply for this?

 Ping

This is in the v2 RFC... the caller invokes the ioctl which returns
the complete/full size, then re-allocs the buffer and calls the
ioctl again.  Or, as Alex suggested, just use a sufficiently large
buffer to start with.

Stuart

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFC: vfio interface for platform devices

2013-07-16 Thread Scott Wood

On 07/16/2013 05:41:04 PM, Yoder Stuart-B08248 wrote:

 -Original Message-
 From: Wood Scott-B07421
 Sent: Tuesday, July 16, 2013 5:01 PM
 To: Yoder Stuart-B08248
 Cc: Wood Scott-B07421; Alex Williamson; Alexander Graf; Bhushan  
Bharat-R65777; Sethi Varun-B16395;
 virtualizat...@lists.linux-foundation.org; Antonios Motakis;  
k...@vger.kernel.org list; kvm-

 p...@vger.kernel.org; kvm...@lists.cs.columbia.edu
 Subject: Re: RFC: vfio interface for platform devices

 What if the interrupt map is for devices without explicit nodes,  
such

 as with a PCI controller (ignore the fact that we would normally use
 vfio_pci for the indivdual PCI devices instead)?

 You could say the same thing about ranges -- why expose ranges  
instead

 of the individual child node regs after translation?

Hmm...yes, I guess ranges and interrupt-map fall into the same
basic type of resource category.  I'm not sure it's realistic
to pass entire bus controllers through to user space vs
just individual devices on a bus, but I guess it's theoretically
possible.

Where theoretically possible means we've done it before in other  
contexts. :-)

So the question is whether we future proof by adding flags
for both ranges and interrupt-map, or wait until there is
an actual need for it.

We don't need to actually add a flag for it, but we should have a  
flag/type for the resources we do support, so that code written to the  
current API would recognize that it doesn't recognize an interrupt-map  
entry if it's added later.

__u8path[]; /* output: Full path to  
associated

device tree node */

   How does the caller know what size buffer to supply for this?

 Ping

This is in the v2 RFC... the caller invokes the ioctl which returns
the complete/full size, then re-allocs the buffer and calls the
ioctl again.

OK.

Or, as Alex suggested, just use a sufficiently large buffer to start  
with.

It's fine for a user of the API to simplify things by using a large  
fixed buffer, but the API shouldn't force that approach.

-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] booke: define reset and shutdown hcalls

2013-07-16 Thread Scott Wood


On 07/16/2013 01:35:55 AM, Gleb Natapov wrote:

On Mon, Jul 15, 2013 at 01:17:33PM -0500, Scott Wood wrote:
 On 07/15/2013 06:30:20 AM, Gleb Natapov wrote:
 There is no much sense to share hypercalls between architectures.
 There
 is zero probability x86 will implement those for instance

 This is similar to the question of whether to keep device API
 enumerations per-architecture...  It costs very little to keep it in
 a common place, and it's hard to go back in the other direction if
 we later realize there are things that should be shared.

This is different from device API since with device API all arches  
have

to create/destroy devices, so it make sense to put device lifecycle
management into the common code, and device API has single entry point
to the code - device fd ioctl - where it makes sense to handle common
tasks, if any, and despatch others to specific device implementation.

This is totally unlike hypercalls which are, by definition, very
architecture specific (the way they are triggered, the way parameter
are passed from guest to host, what hypercalls arch needs...).


The ABI is architecture specific.  The API doesn't need to be, any more  
than it does with syscalls (I consider the architecture-specific  
definition of syscall numbers and similar constants in Linux to be  
unfortunate, especially for tools such as strace or QEMU's linux-user  
emulation).



 Keeping it in a common place also makes it more visible to people
 looking to add new hcalls, which could cut down on reinventing the
 wheel.
I do not want other arches to start using hypercalls in the way  
powerpc

started to use them: separate device io space, so it is better to hide
this as far away from common code as possible :) But on a more serious
note hypercalls should be a last resort and added only when no other
possibility exists, so people should not look what hcalls others
implemented, so they can add them to their favorite arch, but they
should have a problem at hand that they cannot solve without hcall,  
but
at this point they will have pretty good idea what this hcall should  
do.


Why are hcalls such a bad thing?

Should new Linux syscalls be avoided too, in favor of new emulated  
devices exposed via vfio? :-)



 (not sure why PPC will want them either instead of emulating
 devices that do
 shutdown/reset).

 Besides what Alex said, for shutdown we don't have any existing
 device to emulate (our real hardware just doesn't have that
 functionality).  For reset we currently do emulate, but it's awkward
 to describe in the device tree what we actually emulate since the
 reset functionality is part of a kitchen-sink device of which we
 emulate virtually nothing other than the reset.  Currently we
 advertise the entire thing and just ignore the rest, but that causes
 problems with the guest seeing the node and trying to use that
 functionality.

What about writing virtio device for shutdown


That sounds like quite a bit more work than hcalls.  It also ties up a  
virtual PCI slot -- some machines don't have very many (mpc8544ds has  
2, though we could and should expand that in the paravirt e500 machine).



and add missing emulation
to kitchen-sink device (yeah I know easily said that done),


Not going to happen... there's lots of low-level and chip-specific  
stuff in there.  We'd have to make several versions, even for the  
paravirt platform so it would correspond to some chip that goes with  
the cpu being used.  Even then we couldn't do everything, at least with  
KVM -- one of the things in there is the ability to freeze the  
timebase, but reading the timebase doesn't trap, and we aren't going to  
freeze the host timebase.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] powerpc: using reset hcall when kvm,has-reset

2013-07-16 Thread Alexander Graf


On 16.07.2013, at 00:23, Scott Wood wrote:

 On 07/15/2013 03:55:08 PM, Alexander Graf wrote:
 On 15.07.2013, at 22:52, Scott Wood wrote:
  On 07/15/2013 03:28:46 PM, Alexander Graf wrote:
  On 15.07.2013, at 20:21, Scott Wood wrote:
   On 07/15/2013 10:16:41 AM, Bhushan Bharat-R65777 wrote:
 +printk(error: system reset returned with error %ld\n, 
 ret);

 So we should fall back to the normal reset handler here.

 Do you mean return normally from here, no BUG() etc?
   
If we guard the patching against everything, we can treat a broken 
hcall as BUG.
However, if we don't we want to fall back to the normal guts based 
reset.
   Will let Scott comment on this?
   But ppc_md.restart can point to only one handler and during paravirt 
   patching we changed this to new handler. So we cannot jump back to 
   guts type handler
  
   I don't think it's worth implementing a fall-back scheme -- if KVM 
   advertises that the reset hcall exists, then it had better exist.
  If we also check for kvm_para_available() I agree. Otherwise QEMU might 
  advertise the reset hcall, but the guest kernel may not implement KVM 
  hypercalls. In that case the device tree check will succeed, but the 
  actual hypercall will not.
 
  Wouldn't that be a bug in QEMU?  Or in KVM for exposing the hcall 
  capability without implementing them?
 No, because it would be the guest that doesn't know how to handle kvm 
 hypercalls.
 
 Oh, I misread guest kernel as host kernel. :-P
 
 Still, I'm not sure what sort of error you're thinking of.  If the guest 
 didn't support the hcall mechanism we would have returned from the function 
 by that point.  In fact, seeing kvm,has-reset on a different hypervisor ought 
 to mean that that hypervisor is emulating KVM in this particular respect.

Imagine we're running on KVM with this reset hcall support. Now if the guest 
has CONFIG_EPAPR_PARAVIRT enabled but CONFIG_KVM_GUEST disabled, we would patch 
the callback, but kvm_hypercall0() would be implemented as a nop.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] powerpc: using reset hcall when kvm,has-reset

2013-07-16 Thread Scott Wood


On 07/16/2013 06:21:51 PM, Alexander Graf wrote:


On 16.07.2013, at 00:23, Scott Wood wrote:

 Still, I'm not sure what sort of error you're thinking of.  If the  
guest didn't support the hcall mechanism we would have returned from  
the function by that point.  In fact, seeing kvm,has-reset on a  
different hypervisor ought to mean that that hypervisor is emulating  
KVM in this particular respect.


Imagine we're running on KVM with this reset hcall support. Now if  
the guest has CONFIG_EPAPR_PARAVIRT enabled but CONFIG_KVM_GUEST  
disabled, we would patch the callback, but kvm_hypercall0() would be  
implemented as a nop.


Ugh -- that should be renamed epapr_hypercall and moved to  
epapr_paravirt.c.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] powerpc: using reset hcall when kvm,has-reset

2013-07-16 Thread Scott Wood


On 07/16/2013 06:26:40 PM, Scott Wood wrote:

On 07/16/2013 06:21:51 PM, Alexander Graf wrote:


On 16.07.2013, at 00:23, Scott Wood wrote:

 Still, I'm not sure what sort of error you're thinking of.  If the  
guest didn't support the hcall mechanism we would have returned from  
the function by that point.  In fact, seeing kvm,has-reset on a  
different hypervisor ought to mean that that hypervisor is emulating  
KVM in this particular respect.


Imagine we're running on KVM with this reset hcall support. Now if  
the guest has CONFIG_EPAPR_PARAVIRT enabled but CONFIG_KVM_GUEST  
disabled, we would patch the callback, but kvm_hypercall0() would be  
implemented as a nop.


Ugh -- that should be renamed epapr_hypercall and moved to  
epapr_paravirt.c.


Or rather, kvm_hypercall() should become epapr_hypercall() in  
epapr_paravirt.c -- there's nothing KVM-specific about it.


kvm_hypercall0() and friends could become epapr_hypercall0() in  
epapr_hcalls.h, with the KVM_HCALL_TOKEN() moved to the caller.  Or  
they could stay as they are but depend on CONFIG_EPAPR_PARAVIRT rather  
than CONFIG_KVM_GUEST -- there's no real dependency on the rest of the  
KVM guest code.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

70 matches

Mail list logo