[PATCH v3 6/9] KVM-GST: Add a pv_ops stub for steal time

2011-06-29 Thread Glauber Costa
This patch adds a function pointer in one of the many paravirt_ops
structs, to allow guests to register a steal time function.
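
For context, a later patch in this series registers a guest-side
implementation roughly like this (a sketch, not part of this patch):

	/* sketch: a KVM guest overrides the native stub at init time */
	pv_time_ops.steal_clock = kvm_steal_clock;

Guests that register nothing keep the native_steal_clock() stub added
below, which simply returns 0.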

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/paravirt.h   |9 +
 arch/x86/include/asm/paravirt_types.h |1 +
 arch/x86/kernel/paravirt.c|9 +
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ebbc4d8..a7d2db9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void)
return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
 }
 
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+}
+
 static inline unsigned long long paravirt_read_pmc(int counter)
 {
return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 8288509..2c76521 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -89,6 +89,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
+   unsigned long long (*steal_clock)(int cpu);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 869e1ae..613a793 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr)
__native_flush_tlb_single(addr);
 }
 
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+   return 0;
+}
+
 /* These are in entry.S */
 extern void native_iret(void);
 extern void native_irq_enable_sysexit(void);
@@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = {
 
 struct pv_time_ops pv_time_ops = {
.sched_clock = native_sched_clock,
+   .steal_clock = native_steal_clock,
 };
 
 struct pv_irq_ops pv_irq_ops = {
-- 
1.7.3.4



[PATCH v3 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit

2011-06-29 Thread Glauber Costa
This patch is simple; it is put in a different commit so it can be more easily
shared between guest and hypervisor. It just defines a named constant
to indicate the enable bit for KVM-specific MSRs.
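
As an illustration (a sketch following the kvmclock convention, not part
of this patch), a guest enables such an MSR by writing the guest physical
address of a per-cpu area with the enable bit set:

	/* sketch: some_percpu_area is a placeholder for the published data */
	u64 pa = __pa(&some_percpu_area) | KVM_MSR_ENABLED;
	wrmsrl(MSR_KVM_SYSTEM_TIME_NEW, pa);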

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/kvm_para.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..d6cd79b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -30,6 +30,7 @@
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
+#define KVM_MSR_ENABLED 1
 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
-- 
1.7.3.4



Re: KVM call agenda for June 28

2011-06-29 Thread Marcelo Tosatti
On Wed, Jun 29, 2011 at 11:08:23AM +0100, Stefan Hajnoczi wrote:
 On Wed, Jun 29, 2011 at 8:57 AM, Kevin Wolf kw...@redhat.com wrote:
  Am 28.06.2011 21:41, schrieb Marcelo Tosatti:
  stream
  --
 
  1) base -> remote
  2) base -> remote -> local
  3) base -> local
 
  local image is always valid. Requires backing file support.
 
  With the above, this restriction wouldn't apply any more.
 
  Also I don't think we should mix approaches. Either both block copy and
  image streaming use backing files, or none of them do. Mixing means
  duplicating more code, and even worse, that you can't stop a block copy
  in the middle and continue with streaming (which I believe is a really
  valuable feature to have).
 
 Here is how the image streaming feature is used from HMP/QMP:
 
 The guest is running from an image file with a backing file.  The aim
 is to pull the data from the backing file and populate the image file
 so that the dependency on the backing file can be eliminated.
 
 1. Start a background streaming operation:
 
 (qemu) block_stream -a ide0-hd
 
 2. Check the status of the operation:
 
 (qemu) info block-stream
 Streaming device ide0-hd: Completed 512 of 34359738368 bytes
 
 3. The status changes when the operation completes:
 
 (qemu) info block-stream
 No active stream
 
 On completion the image file no longer has a backing file dependency.
 When streaming completes QEMU updates the image file metadata to
 indicate that no backing file is used.
 
 The QMP interface is similar but provides QMP events to signal
 streaming completion and failure.  Polling to query the streaming
 status is only used when the management application wishes to refresh
 progress information.
 
 If guest execution is interrupted by a power failure or QEMU crash,
 then the image file is still valid but streaming may be incomplete.
 When QEMU is launched again the block_stream command can be issued to
 resume streaming.
 
 In the future we could add a 'base' argument to block_stream.  If base
 is specified then data contained in the base image will not be copied.

This is a present requirement.

  This can be used to merge data from an intermediate image without
 merging the base image.  When streaming completes the backing file
 will be set to the base image.  The backing file relationship would
 typically look like this:
 
 1. Before block_stream -a -b base.img ide0-hd completion:
 
 base.img <- sn1 <- ... <- ide0-hd.qed
 
 2. After streaming completes:
 
 base.img <- ide0-hd.qed
 
 This describes the image streaming use cases that I, Adam, and Anthony
 propose to support.  In the course of the discussion we've sometimes
 been distracted with the internals of what a unified live block
 copy/image streaming implementation should do.  I wanted to post this
 summary of image streaming to refocus us on the use case and the APIs
 that users will see.
 
 Stefan

OK, with an external COW file for formats that do not support it, the
interface can be similar. Also, there is no need to mirror writes and
no switch operation; always use the destination image.



[PATCH 0/3] Preparatory perf patches for KVM PMU support

2011-06-29 Thread Avi Kivity
The following three patches pave the way for KVM in-guest performance
monitoring.  One is a perf API improvement, another fixes the constraints
for the version 1 architectural PMU (which we will emulate), and the third
adds an export that KVM will use.

Please consider for merging; this will make further work on the KVM PMU
easier.

Avi Kivity (3):
  perf: add context field to perf_event
  x86, perf: add constraints for architectural PMU v1
  perf: export perf_event_refresh() to modules

 arch/arm/kernel/ptrace.c|3 ++-
 arch/powerpc/kernel/ptrace.c|2 +-
 arch/sh/kernel/ptrace_32.c  |3 ++-
 arch/x86/kernel/cpu/perf_event_intel.c  |   23 ++-
 arch/x86/kernel/kgdb.c  |2 +-
 arch/x86/kernel/ptrace.c|3 ++-
 drivers/oprofile/oprofile_perf.c|2 +-
 include/linux/hw_breakpoint.h   |   10 --
 include/linux/perf_event.h  |9 -
 kernel/events/core.c|   24 +---
 kernel/events/hw_breakpoint.c   |   10 +++---
 kernel/watchdog.c   |2 +-
 samples/hw_breakpoint/data_breakpoint.c |2 +-
 13 files changed, 69 insertions(+), 26 deletions(-)

-- 
1.7.5.3



[PATCH 1/3] perf: add context field to perf_event

2011-06-29 Thread Avi Kivity
The perf_event overflow handler does not receive any caller-derived
argument, so many callers need to resort to looking up the perf_event
in their local data structure.  This is ugly and doesn't scale if a
single callback services many perf_events.

Fix by adding a context parameter to perf_event_create_kernel_counter()
(and derived hardware breakpoints APIs) and storing it in the perf_event.
The field can be accessed from the callback as event->overflow_handler_context.
All callers are updated.
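
As a sketch of the resulting calling convention (illustrative only;
my_handler, my_ctx and struct my_data are placeholder names, and the
five-argument create call assumes this patch is applied):

	/* sketch: recover the caller's context in the overflow callback */
	static void my_handler(struct perf_event *event, int nmi,
			       struct perf_sample_data *data,
			       struct pt_regs *regs)
	{
		struct my_data *ctx = event->overflow_handler_context;
		/* use ctx directly instead of searching a global table */
	}

	/* sketch: pass the context when creating the counter */
	event = perf_event_create_kernel_counter(&attr, cpu, NULL,
						 my_handler, my_ctx);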

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/arm/kernel/ptrace.c|3 ++-
 arch/powerpc/kernel/ptrace.c|2 +-
 arch/sh/kernel/ptrace_32.c  |3 ++-
 arch/x86/kernel/kgdb.c  |2 +-
 arch/x86/kernel/ptrace.c|3 ++-
 drivers/oprofile/oprofile_perf.c|2 +-
 include/linux/hw_breakpoint.h   |   10 --
 include/linux/perf_event.h  |4 +++-
 kernel/events/core.c|   21 +++--
 kernel/events/hw_breakpoint.c   |   10 +++---
 kernel/watchdog.c   |2 +-
 samples/hw_breakpoint/data_breakpoint.c |2 +-
 12 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 9726006..4911c94 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -479,7 +479,8 @@ static struct perf_event *ptrace_hbp_create(struct 
task_struct *tsk, int type)
	attr.bp_type	= type;
	attr.disabled	= 1;
 
-	return register_user_hw_breakpoint(&attr, ptrace_hbptriggered, tsk);
+	return register_user_hw_breakpoint(&attr, ptrace_hbptriggered, NULL,
+					   tsk);
 }
 
 static int ptrace_gethbpregs(struct task_struct *tsk, long num,
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index cb22024..5249308 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -973,7 +973,7 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned 
long addr,
attr.bp_type);
 
	thread->ptrace_bps[0] = bp = register_user_hw_breakpoint(&attr,
-					ptrace_triggered, task);
+					ptrace_triggered, NULL, task);
	if (IS_ERR(bp)) {
		thread->ptrace_bps[0] = NULL;
ptrace_put_breakpoints(task);
diff --git a/arch/sh/kernel/ptrace_32.c b/arch/sh/kernel/ptrace_32.c
index 3d7b209..930312f 100644
--- a/arch/sh/kernel/ptrace_32.c
+++ b/arch/sh/kernel/ptrace_32.c
@@ -91,7 +91,8 @@ static int set_single_step(struct task_struct *tsk, unsigned 
long addr)
attr.bp_len = HW_BREAKPOINT_LEN_2;
attr.bp_type = HW_BREAKPOINT_R;
 
-	bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
+	bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
+					 NULL, tsk);
if (IS_ERR(bp))
return PTR_ERR(bp);
 
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 5f9ecff..473ab53 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -638,7 +638,7 @@ void kgdb_arch_late(void)
	for (i = 0; i < HBP_NUM; i++) {
		if (breakinfo[i].pev)
			continue;
-		breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL);
+		breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL, NULL);
		if (IS_ERR((void * __force)breakinfo[i].pev)) {
			printk(KERN_ERR "kgdb: Could not allocate hw"
			       "breakpoints\nDisabling the kernel debugger\n");
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 807c2a2..28092ae 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -715,7 +715,8 @@ static int ptrace_set_breakpoint_addr(struct task_struct 
*tsk, int nr,
attr.bp_type = HW_BREAKPOINT_W;
attr.disabled = 1;
 
-	bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
+	bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
+					 NULL, tsk);
 
/*
 * CHECKME: the previous code returned -EIO if the addr wasn't
diff --git a/drivers/oprofile/oprofile_perf.c b/drivers/oprofile/oprofile_perf.c
index 9046f7b..59acf9e 100644
--- a/drivers/oprofile/oprofile_perf.c
+++ b/drivers/oprofile/oprofile_perf.c
@@ -79,7 +79,7 @@ static int op_create_counter(int cpu, int event)
 
	pevent = perf_event_create_kernel_counter(&counter_config[event].attr,
  cpu, NULL,
- op_overflow_handler);
+

[PATCH 3/3] perf: export perf_event_refresh() to modules

2011-06-29 Thread Avi Kivity
KVM needs one-shot samples, since a PMC programmed to -X will fire after X
events and then again after 2^40 events (i.e. variable period).
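
For illustration, the intended KVM-side pattern looks roughly like this
(kvm_perf_overflow and vcpu are placeholders; the five-argument create
call assumes patch 1/3 of this series):

	/* sketch: create the counter disabled, then arm it for one overflow */
	attr.disabled = 1;
	event = perf_event_create_kernel_counter(&attr, cpu, NULL,
						 kvm_perf_overflow, vcpu);
	perf_event_refresh(event, 1);	/* enabled until one overflow fires */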

Signed-off-by: Avi Kivity a...@redhat.com
---
 include/linux/perf_event.h |5 +
 kernel/events/core.c   |3 ++-
 2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 40264b5..91342ac 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -973,6 +973,7 @@ extern void perf_pmu_disable(struct pmu *pmu);
 extern void perf_pmu_enable(struct pmu *pmu);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
+extern int perf_event_refresh(struct perf_event *event, int refresh);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
@@ -1168,6 +1169,10 @@ static inline void perf_event_delayed_put(struct task_struct *task) { }
 static inline void perf_event_print_debug(void)			{ }
 static inline int perf_event_task_disable(void)			{ return -EINVAL; }
 static inline int perf_event_task_enable(void)			{ return -EINVAL; }
+static inline int perf_event_refresh(struct perf_event *event, int refresh)
+{
+	return -EINVAL;
+}
 
 static inline void
 perf_sw_event(u32 event_id, u64 nr, int nmi,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6dd4819..f69cc9f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1739,7 +1739,7 @@ out:
	raw_spin_unlock_irq(&ctx->lock);
 }
 
-static int perf_event_refresh(struct perf_event *event, int refresh)
+int perf_event_refresh(struct perf_event *event, int refresh)
 {
/*
 * not supported on inherited events
@@ -1752,6 +1752,7 @@ static int perf_event_refresh(struct perf_event *event, 
int refresh)
 
return 0;
 }
+EXPORT_SYMBOL_GPL(perf_event_refresh);
 
 static void ctx_sched_out(struct perf_event_context *ctx,
  struct perf_cpu_context *cpuctx,
-- 
1.7.5.3



[PATCH 2/3] x86, perf: add constraints for architectural PMU v1

2011-06-29 Thread Avi Kivity
The v1 PMU does not have any fixed counters.  Using the v2 constraints,
which do have fixed counters, causes an additional choice to be present
in the weight calculation, but not when actually scheduling the event,
leading to an event not being scheduled at all.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kernel/cpu/perf_event_intel.c |   23 ++-
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index 41178c8..b46b70e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -137,6 +137,11 @@ static struct event_constraint 
intel_westmere_percore_constraints[] __read_mostl
EVENT_CONSTRAINT_END
 };
 
+static struct event_constraint intel_v1_event_constraints[] __read_mostly =
+{
+   EVENT_CONSTRAINT_END
+};
+
 static struct event_constraint intel_gen_event_constraints[] __read_mostly =
 {
FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
@@ -1512,11 +1517,19 @@ static __init int intel_pmu_init(void)
break;
 
default:
-		/*
-		 * default constraints for v2 and up
-		 */
-		x86_pmu.event_constraints = intel_gen_event_constraints;
-		pr_cont("generic architected perfmon, ");
+		switch (x86_pmu.version) {
+		case 1:
+			x86_pmu.event_constraints = intel_v1_event_constraints;
+			pr_cont("generic architected perfmon v1, ");
+			break;
+		default:
+			/*
+			 * default constraints for v2 and up
+			 */
+			x86_pmu.event_constraints = intel_gen_event_constraints;
+			pr_cont("generic architected perfmon, ");
+			break;
+		}
}
return 0;
 }
-- 
1.7.5.3



Re: kvm monitor socket - connection refused

2011-06-29 Thread Michael Tokarev
29.06.2011 19:20, Iordan Iordanov wrote:
 
 
 On 06/28/11 18:29, Michael Tokarev wrote:
 The process listening on this socket no longer exist,
 it finished.  With this command line it should stay in
 foreground till finished (there's no -daemonize etc),
 so you should see error messages if any.
 
 The kvm command was backgrounded, not -daemonize(d). It was still
 running, and I was accessing the VM via VNC.

So kvm was running at the time you tried to access
the monitor.

 How about checking who is actually listening on this
 socket before asking?
 
 I thought it's the kvm process that listens on the socket. I haven't
 seen other processes spun off by kvm until now. Is that not the case?

It is the kvm process that listens on the socket, it spawns
no other processes.

The only other explanation I can think of is that you tried
to run two instances of kvm, and when the second instance initialized
it re-created the monitor socket but failed later (eg, when
initializing the network or something else) and exited, leaving the
stray socket behind (JFYI, you can remove a unix-domain socket
where some process is listening and create another - that
one will really be a different socket, even if named the same
way, just like you can re-create a plain file the same
way).

In any case, there haven't been any problems/bugs in that area
for ages.

/mjt


Re: [PATCH 1/3] perf: add context field to perf_event

2011-06-29 Thread Frederic Weisbecker
On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote:
 The perf_event overflow handler does not receive any caller-derived
 argument, so many callers need to resort to looking up the perf_event
 in their local data structure.  This is ugly and doesn't scale if a
 single callback services many perf_events.
 
 Fix by adding a context parameter to perf_event_create_kernel_counter()
 (and derived hardware breakpoints APIs) and storing it in the perf_event.
 The field can be accessed from the callback as 
 event->overflow_handler_context.
 All callers are updated.
 
 Signed-off-by: Avi Kivity a...@redhat.com

I believe it can micro-optimize ptrace through register_user_hw_breakpoint() 
because
we could store the index of the breakpoint that way, instead of iterating 
through 4 slots.

Perhaps it can help in arm too, adding Will in Cc.

But for register_wide_hw_breakpoint, I'm not sure. kgdb is the main user, may 
be Jason
could find some use of it.


Re: [PATCH 1/3] perf: add context field to perf_event

2011-06-29 Thread Avi Kivity

On 06/29/2011 07:08 PM, Frederic Weisbecker wrote:

On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote:
  The perf_event overflow handler does not receive any caller-derived
  argument, so many callers need to resort to looking up the perf_event
  in their local data structure.  This is ugly and doesn't scale if a
  single callback services many perf_events.

  Fix by adding a context parameter to perf_event_create_kernel_counter()
  (and derived hardware breakpoints APIs) and storing it in the perf_event.
  The field can be accessed from the callback as 
event->overflow_handler_context.
  All callers are updated.

Signed-off-by: Avi Kivity a...@redhat.com

I believe it can micro-optimize ptrace through register_user_hw_breakpoint() 
because
we could store the index of the breakpoint that way, instead of iterating 
through 4 slots.



Right, I noticed that while writing the patch.


Perhaps it can help in arm too, adding Will in Cc.

But for register_wide_hw_breakpoint, I'm not sure. kgdb is the main user, may 
be Jason
could find some use of it.


I think an API should not require its users to iterate in their 
callbacks, even if it doesn't affect current users for some reason.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 1/3] perf: add context field to perf_event

2011-06-29 Thread Will Deacon
Hi Frederic,

Thanks for including me on CC.

On Wed, Jun 29, 2011 at 05:08:45PM +0100, Frederic Weisbecker wrote:
 On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote:
  The perf_event overflow handler does not receive any caller-derived
  argument, so many callers need to resort to looking up the perf_event
  in their local data structure.  This is ugly and doesn't scale if a
  single callback services many perf_events.
 
  Fix by adding a context parameter to perf_event_create_kernel_counter()
  (and derived hardware breakpoints APIs) and storing it in the perf_event.
  The field can be accessed from the callback as 
  event->overflow_handler_context.
  All callers are updated.
 
  Signed-off-by: Avi Kivity a...@redhat.com
 
 I believe it can micro-optimize ptrace through register_user_hw_breakpoint() 
 because
 we could store the index of the breakpoint that way, instead of iterating 
 through 4 slots.
 
 Perhaps it can help in arm too, adding Will in Cc.

Yes, we could store the breakpoint index in there and it would save us
walking over the breakpoints when one fires. Not sure this helps us for
anything else though. My main gripe with the ptrace interface to
hw_breakpoints is that we have to convert all the breakpoint information
from ARM_BREAKPOINT_* to HW_BREAKPOINT_* and then convert it all back again
in the hw_breakpoint code. Yuck!

Will


[PATCH 1/9] kvm tools: Don't dynamically allocate threadpool jobs

2011-06-29 Thread Sasha Levin
To allow efficient use of shorter-term threadpool jobs, don't
allocate them dynamically upon creation. Instead, store them
within 'job' structures.

This avoids some of the overhead of creating/destroying jobs which live
for a short time.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/include/kvm/threadpool.h |   29 ++---
 tools/kvm/threadpool.c |   30 ++
 tools/kvm/virtio/9p.c  |   12 ++--
 tools/kvm/virtio/blk.c |8 
 tools/kvm/virtio/console.c |   10 +-
 tools/kvm/virtio/rng.c |   16 
 6 files changed, 51 insertions(+), 54 deletions(-)

diff --git a/tools/kvm/include/kvm/threadpool.h 
b/tools/kvm/include/kvm/threadpool.h
index 62826a6..768239f 100644
--- a/tools/kvm/include/kvm/threadpool.h
+++ b/tools/kvm/include/kvm/threadpool.h
@@ -1,14 +1,37 @@
 #ifndef KVM__THREADPOOL_H
 #define KVM__THREADPOOL_H
 
+#include "kvm/mutex.h"
+
+#include <linux/list.h>
+
 struct kvm;
 
 typedef void (*kvm_thread_callback_fn_t)(struct kvm *kvm, void *data);
 
-int thread_pool__init(unsigned long thread_count);
+struct thread_pool__job {
+	kvm_thread_callback_fn_t	callback;
+	struct kvm			*kvm;
+	void				*data;
+
+	int				signalcount;
+	pthread_mutex_t			mutex;
 
-void *thread_pool__add_job(struct kvm *kvm, kvm_thread_callback_fn_t callback, void *data);
+	struct list_head		queue;
+};
+
+static inline void thread_pool__init_job(struct thread_pool__job *job, struct kvm *kvm, kvm_thread_callback_fn_t callback, void *data)
+{
+	*job = (struct thread_pool__job) {
+		.kvm		= kvm,
+		.callback	= callback,
+		.data		= data,
+		.mutex		= PTHREAD_MUTEX_INITIALIZER,
+	};
+}
+
+int thread_pool__init(unsigned long thread_count);
 
-void thread_pool__do_job(void *job);
+void thread_pool__do_job(struct thread_pool__job *job);
 
 #endif
diff --git a/tools/kvm/threadpool.c b/tools/kvm/threadpool.c
index 2db02184..fdc5fa7 100644
--- a/tools/kvm/threadpool.c
+++ b/tools/kvm/threadpool.c
@@ -6,17 +6,6 @@
 #include <pthread.h>
 #include <stdbool.h>
 
-struct thread_pool__job {
-	kvm_thread_callback_fn_t	callback;
-	struct kvm			*kvm;
-	void				*data;
-
-	int				signalcount;
-	pthread_mutex_t			mutex;
-
-	struct list_head		queue;
-};
-
 static pthread_mutex_t job_mutex   = PTHREAD_MUTEX_INITIALIZER;
 static pthread_mutex_t thread_mutex= PTHREAD_MUTEX_INITIALIZER;
 static pthread_cond_t  job_cond= PTHREAD_COND_INITIALIZER;
@@ -139,26 +128,11 @@ int thread_pool__init(unsigned long thread_count)
return i;
 }
 
-void *thread_pool__add_job(struct kvm *kvm,
-  kvm_thread_callback_fn_t callback, void *data)
-{
-   struct thread_pool__job *job = calloc(1, sizeof(*job));
-
-   *job = (struct thread_pool__job) {
-   .kvm= kvm,
-   .data   = data,
-   .callback   = callback,
-   .mutex  = PTHREAD_MUTEX_INITIALIZER
-   };
-
-   return job;
-}
-
-void thread_pool__do_job(void *job)
+void thread_pool__do_job(struct thread_pool__job *job)
 {
struct thread_pool__job *jobinfo = job;
 
-	if (jobinfo == NULL)
+	if (jobinfo == NULL || jobinfo->callback == NULL)
 		return;
 
 	mutex_lock(&jobinfo->mutex);
diff --git a/tools/kvm/virtio/9p.c b/tools/kvm/virtio/9p.c
index d2d738d..b1a8c01 100644
--- a/tools/kvm/virtio/9p.c
+++ b/tools/kvm/virtio/9p.c
@@ -46,9 +46,9 @@ struct p9_fid {
 };
 
 struct p9_dev_job {
-   struct virt_queue   *vq;
-   struct p9_dev   *p9dev;
-	void			*job_id;
+   struct virt_queue   *vq;
+   struct p9_dev   *p9dev;
+   struct thread_pool__job job_id;
 };
 
 struct p9_dev {
@@ -696,7 +696,7 @@ static void ioevent_callback(struct kvm *kvm, void *param)
 {
struct p9_dev_job *job = param;
 
-	thread_pool__do_job(job->job_id);
+	thread_pool__do_job(&job->job_id);
 }
 
 static bool virtio_p9_pci_io_out(struct ioport *ioport, struct kvm *kvm,
@@ -731,7 +731,7 @@ static bool virtio_p9_pci_io_out(struct ioport *ioport, 
struct kvm *kvm,
.vq = queue,
.p9dev  = p9dev,
};
-	job->job_id = thread_pool__add_job(kvm, virtio_p9_do_io, job);
+	thread_pool__init_job(&job->job_id, kvm, virtio_p9_do_io, job);
 
ioevent = (struct ioevent) {
		.io_addr	= p9dev->base_addr + 

[PATCH 2/9] kvm tools: Process virtio-blk requests in parallel

2011-06-29 Thread Sasha Levin
Process multiple requests within a virtio-blk device's vring
in parallel.

Doing so may improve performance in cases where a request which can
be completed using cached data is queued after a request whose data
is not yet cached.

bonnie++ benchmarks have shown a 6% improvement with reads, and 2%
improvement in writes.

Suggested-by: Anthony Liguori aligu...@us.ibm.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/virtio/blk.c |   74 ---
 1 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/tools/kvm/virtio/blk.c b/tools/kvm/virtio/blk.c
index 1fdfc1e..f2a728c 100644
--- a/tools/kvm/virtio/blk.c
+++ b/tools/kvm/virtio/blk.c
@@ -31,6 +31,8 @@
 struct blk_dev_job {
struct virt_queue   *vq;
struct blk_dev  *bdev;
+	struct iovec		iov[VIRTIO_BLK_QUEUE_SIZE];
+	u16			out, in, head;
struct thread_pool__job job_id;
 };
 
@@ -51,7 +53,8 @@ struct blk_dev {
u16 queue_selector;
 
struct virt_queue   vqs[NUM_VIRT_QUEUES];
-   struct blk_dev_job  jobs[NUM_VIRT_QUEUES];
+   struct blk_dev_job  jobs[VIRTIO_BLK_QUEUE_SIZE];
+   u16 job_idx;
	struct pci_device_header	pci_hdr;
 };
 
@@ -118,20 +121,26 @@ static bool virtio_blk_pci_io_in(struct ioport *ioport, 
struct kvm *kvm, u16 por
return ret;
 }
 
-static bool virtio_blk_do_io_request(struct kvm *kvm,
-   struct blk_dev *bdev,
-   struct virt_queue *queue)
+static void virtio_blk_do_io_request(struct kvm *kvm, void *param)
 {
-   struct iovec iov[VIRTIO_BLK_QUEUE_SIZE];
struct virtio_blk_outhdr *req;
-   ssize_t block_cnt = -1;
-   u16 out, in, head;
u8 *status;
+   ssize_t block_cnt;
+   struct blk_dev_job *job;
+   struct blk_dev *bdev;
+   struct virt_queue *queue;
+   struct iovec *iov;
+   u16 out, in, head;
 
-	head		= virt_queue__get_iov(queue, iov, &out, &in, kvm);
-
-	/* head */
-	req		= iov[0].iov_base;
+	block_cnt	= -1;
+	job		= param;
+	bdev		= job->bdev;
+	queue		= job->vq;
+	iov		= job->iov;
+	out		= job->out;
+	in		= job->in;
+	head		= job->head;
+	req		= iov[0].iov_base;
 
	switch (req->type) {
case VIRTIO_BLK_T_IN:
@@ -153,24 +162,27 @@ static bool virtio_blk_do_io_request(struct kvm *kvm,
	status		= iov[out + in - 1].iov_base;
	*status		= (block_cnt < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
 
+	mutex_lock(&bdev->mutex);
	virt_queue__set_used_elem(queue, head, block_cnt);
+	mutex_unlock(&bdev->mutex);
 
-	return true;
+	virt_queue__trigger_irq(queue, bdev->pci_hdr.irq_line, &bdev->isr, kvm);
 }
 
-static void virtio_blk_do_io(struct kvm *kvm, void *param)
+static void virtio_blk_do_io(struct kvm *kvm, struct virt_queue *vq, struct blk_dev *bdev)
 {
-	struct blk_dev_job *job = param;
-	struct virt_queue *vq;
-	struct blk_dev *bdev;
+	while (virt_queue__available(vq)) {
+		struct blk_dev_job *job = &bdev->jobs[bdev->job_idx++ % VIRTIO_BLK_QUEUE_SIZE];
 
-	vq	= job->vq;
-	bdev	= job->bdev;
-
-	while (virt_queue__available(vq))
-		virtio_blk_do_io_request(kvm, bdev, vq);
+		*job	= (struct blk_dev_job) {
+			.vq	= vq,
+			.bdev	= bdev,
+		};
+		job->head = virt_queue__get_iov(vq, job->iov, &job->out, &job->in, kvm);
 
-	virt_queue__trigger_irq(vq, bdev->pci_hdr.irq_line, &bdev->isr, kvm);
+		thread_pool__init_job(&job->job_id, kvm, virtio_blk_do_io_request, job);
+		thread_pool__do_job(&job->job_id);
+	}
 }
 
 static bool virtio_blk_pci_io_out(struct ioport *ioport, struct kvm *kvm, u16 
port, void *data, int size, u32 count)
@@ -190,24 +202,14 @@ static bool virtio_blk_pci_io_out(struct ioport *ioport, 
struct kvm *kvm, u16 po
break;
case VIRTIO_PCI_QUEUE_PFN: {
struct virt_queue *queue;
-   struct blk_dev_job *job;
void *p;
 
-		job	= &bdev->jobs[bdev->queue_selector];
-
		queue		= &bdev->vqs[bdev->queue_selector];
		queue->pfn	= ioport__read32(data);
		p		= guest_pfn_to_host(kvm, queue->pfn);
 
		vring_init(&queue->vring, VIRTIO_BLK_QUEUE_SIZE, p, 
VIRTIO_PCI_VRING_ALIGN);
 
-

[PATCH 3/9] kvm tools: Allow giving instance names

2011-06-29 Thread Sasha Levin
This will allow tracking instance names and sending commands
to specific instances if multiple instances are running.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/include/kvm/kvm.h |5 +++-
 tools/kvm/kvm-run.c |5 +++-
 tools/kvm/kvm.c |   55 ++-
 tools/kvm/term.c|3 ++
 4 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/tools/kvm/include/kvm/kvm.h b/tools/kvm/include/kvm/kvm.h
index 7d90d35..5ad3236 100644
--- a/tools/kvm/include/kvm/kvm.h
+++ b/tools/kvm/include/kvm/kvm.h
@@ -41,9 +41,11 @@ struct kvm {
const char  *vmlinux;
struct disk_image   **disks;
int nr_disks;
+
+   const char  *name;
 };
 
-struct kvm *kvm__init(const char *kvm_dev, u64 ram_size);
+struct kvm *kvm__init(const char *kvm_dev, u64 ram_size, const char *name);
 int kvm__max_cpus(struct kvm *kvm);
 void kvm__init_ram(struct kvm *kvm);
 void kvm__delete(struct kvm *kvm);
@@ -61,6 +63,7 @@ bool kvm__deregister_mmio(struct kvm *kvm, u64 phys_addr);
 void kvm__pause(void);
 void kvm__continue(void);
 void kvm__notify_paused(void);
+int kvm__get_pid_by_instance(const char *name);
 
 /*
  * Debugging
diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c
index 0dece2d..a4abf76 100644
--- a/tools/kvm/kvm-run.c
+++ b/tools/kvm/kvm-run.c
@@ -69,6 +69,7 @@ static const char *network;
 static const char *host_ip_addr;
 static const char *guest_mac;
 static const char *script;
+static const char *guest_name;
 static bool single_step;
 static bool readonly_image[MAX_DISK_IMAGES];
 static bool vnc;
@@ -132,6 +133,8 @@ static int virtio_9p_rootdir_parser(const struct option 
*opt, const char *arg, i
 
 static const struct option options[] = {
OPT_GROUP(Basic options:),
+	OPT_STRING('\0', "name", &guest_name, "guest name",
+			"A name for the guest"),
	OPT_INTEGER('c', "cpus", &nrcpus, "Number of CPUs"),
	OPT_U64('m', "mem", &ram_size, "Virtual machine memory size in MiB."),
	OPT_CALLBACK('d', "disk", NULL, "image", "Disk image", img_name_parser),
@@ -546,7 +549,7 @@ int kvm_cmd_run(int argc, const char **argv, const char 
*prefix)
 
term_init();
 
-   kvm = kvm__init(kvm_dev, ram_size);
+   kvm = kvm__init(kvm_dev, ram_size, guest_name);
 
ioeventfd__init();
 
diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c
index c400c70..4f723a6 100644
--- a/tools/kvm/kvm.c
+++ b/tools/kvm/kvm.c
@@ -113,11 +113,60 @@ static struct kvm *kvm__new(void)
return kvm;
 }
 
+static void kvm__create_pidfile(struct kvm *kvm)
+{
+	int fd;
+	char full_name[PATH_MAX], pid[10];
+
+	if (!kvm->name)
+		return;
+
+	mkdir("/var/run/kvm-tools", 0777);
+	sprintf(full_name, "/var/run/kvm-tools/%s.pid", kvm->name);
+	fd = open(full_name, O_CREAT | O_WRONLY, 0666);
+	sprintf(pid, "%u\n", getpid());
+	if (write(fd, pid, strlen(pid)) <= 0)
+		die("Failed creating PID file");
+	close(fd);
+}
+
+static void kvm__remove_pidfile(struct kvm *kvm)
+{
+	char full_name[PATH_MAX];
+
+	if (!kvm->name)
+		return;
+
+	sprintf(full_name, "/var/run/kvm-tools/%s.pid", kvm->name);
+	unlink(full_name);
+}
+
+int kvm__get_pid_by_instance(const char *name)
+{
+	int fd, pid;
+	char pid_str[10], pid_file[PATH_MAX];
+
+	sprintf(pid_file, "/var/run/kvm-tools/%s.pid", name);
+	fd = open(pid_file, O_RDONLY);
+	if (fd < 0)
+		return -1;
+
+	if (read(fd, pid_str, 10) == 0)
+		return -1;
+
+	pid = atoi(pid_str);
+	if (pid < 0)
+		return -1;
+
+	return pid;
+}
+
 void kvm__delete(struct kvm *kvm)
 {
kvm__stop_timer(kvm);
 
	munmap(kvm->ram_start, kvm->ram_size);
+	kvm__remove_pidfile(kvm);
free(kvm);
 }
 
@@ -237,7 +286,7 @@ int kvm__max_cpus(struct kvm *kvm)
return ret;
 }
 
-struct kvm *kvm__init(const char *kvm_dev, u64 ram_size)
+struct kvm *kvm__init(const char *kvm_dev, u64 ram_size, const char *name)
 {
struct kvm_pit_config pit_config = { .flags = 0, };
struct kvm *kvm;
@@ -300,6 +349,10 @@ struct kvm *kvm__init(const char *kvm_dev, u64 ram_size)
	if (ret < 0)
		die_perror("KVM_CREATE_IRQCHIP ioctl");
 
+	kvm->name = name;
+
+   kvm__create_pidfile(kvm);
+
return kvm;
 }
 
diff --git a/tools/kvm/term.c b/tools/kvm/term.c
index 9947223..a0cb03f 100644
--- a/tools/kvm/term.c
+++ b/tools/kvm/term.c
@@ -9,7 +9,9 @@
 #include "kvm/read-write.h"
 #include "kvm/term.h"
 #include "kvm/util.h"
+#include "kvm/kvm.h"
 
+extern struct kvm *kvm;
 static struct termios  orig_term;
 
 int term_escape_char   = 0x01; /* ctrl-a is used for escape */
@@ -32,6 +34,7 @@ int term_getc(int who)
if (term_got_escape) {
term_got_escape = false;
if (c == 'x') {
+   

[PATCH 4/9] kvm tools: Provide instance name when running 'kvm debug'

2011-06-29 Thread Sasha Levin
Instead of sending a signal to the first instance found, send it
to a specific instance.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/kvm-debug.c |   19 +++
 1 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/tools/kvm/kvm-debug.c b/tools/kvm/kvm-debug.c
index 58782dd..432ae84 100644
--- a/tools/kvm/kvm-debug.c
+++ b/tools/kvm/kvm-debug.c
@@ -1,11 +1,22 @@
-#include <stdio.h>
-#include <string.h>
-
 #include "kvm/util.h"
 #include "kvm/kvm-cmd.h"
 #include "kvm/kvm-debug.h"
+#include "kvm/kvm.h"
+
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
 
 int kvm_cmd_debug(int argc, const char **argv, const char *prefix)
 {
-	return system("kill -3 $(pidof kvm)");
+	int pid;
+
+	if (argc != 1)
+		die("Usage: kvm debug [instance name]\n");
+
+	pid = kvm__get_pid_by_instance(argv[0]);
+	if (pid < 0)
+		die("Failed locating instance name");
+
+	return kill(pid, SIGQUIT);
 }
-- 
1.7.6



[PATCH 5/9] kvm tools: Provide instance name when running 'kvm pause'

2011-06-29 Thread Sasha Levin
Instead of sending a signal to the first instance found, send it
to a specific instance.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/kvm-pause.c |   13 +++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/tools/kvm/kvm-pause.c b/tools/kvm/kvm-pause.c
index fdf8714..0cb6f29 100644
--- a/tools/kvm/kvm-pause.c
+++ b/tools/kvm/kvm-pause.c
@@ -5,9 +5,18 @@
 #include "kvm/util.h"
 #include "kvm/kvm-cmd.h"
 #include "kvm/kvm-pause.h"
+#include "kvm/kvm.h"
 
 int kvm_cmd_pause(int argc, const char **argv, const char *prefix)
 {
-	signal(SIGUSR2, SIG_IGN);
-	return system("kill -USR2 $(pidof kvm)");
+	int pid;
+
+	if (argc != 1)
+		die("Usage: kvm pause [instance name]\n");
+
+	pid = kvm__get_pid_by_instance(argv[0]);
+	if (pid < 0)
+		die("Failed locating instance name");
+
+	return kill(pid, SIGUSR2);
 }
-- 
1.7.6



[PATCH 7/9] kvm tools: Advise memory allocated for guest RAM as KSM mergable

2011-06-29 Thread Sasha Levin
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/kvm.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c
index 4f723a6..15bcf08 100644
--- a/tools/kvm/kvm.c
+++ b/tools/kvm/kvm.c
@@ -345,6 +345,8 @@ struct kvm *kvm__init(const char *kvm_dev, u64 ram_size, 
const char *name)
	if (kvm->ram_start == MAP_FAILED)
		die("out of memory");
 
+	madvise(kvm->ram_start, kvm->ram_size, MADV_MERGEABLE);
+
	ret = ioctl(kvm->vm_fd, KVM_CREATE_IRQCHIP);
	if (ret < 0)
		die_perror("KVM_CREATE_IRQCHIP ioctl");
-- 
1.7.6



[PATCH 6/9] kvm tools: Add virtio-balloon device

2011-06-29 Thread Sasha Levin
From the virtio spec:

The virtio memory balloon device is a primitive device for managing guest
memory: the device asks for a certain amount of memory, and the guest supplies
it (or withdraws it, if the device has more than it asks for). This allows the
guest to adapt to changes in allowance of underlying physical memory.

To activate the virtio-balloon device run kvm tools with the '--balloon'
command line parameter.

Current implementation listens for two signals:

 - SIGKVMADDMEM: Adds 1M to the balloon driver (inflate). This will decrease
available memory within the guest.
 - SIGKVMDELMEM: Remove 1M from the balloon driver (deflate). This will
increase available memory within the guest.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/Makefile |1 +
 tools/kvm/include/kvm/kvm.h|3 +
 tools/kvm/include/kvm/virtio-balloon.h |8 +
 tools/kvm/include/kvm/virtio-pci-dev.h |1 +
 tools/kvm/kvm-run.c|6 +
 tools/kvm/virtio/balloon.c |  265 
 6 files changed, 284 insertions(+), 0 deletions(-)
 create mode 100644 tools/kvm/include/kvm/virtio-balloon.h
 create mode 100644 tools/kvm/virtio/balloon.c

diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile
index d368c22..a1b2f4c 100644
--- a/tools/kvm/Makefile
+++ b/tools/kvm/Makefile
@@ -40,6 +40,7 @@ OBJS  += virtio/console.o
 OBJS   += virtio/core.o
 OBJS   += virtio/net.o
 OBJS   += virtio/rng.o
+OBJS	+= virtio/balloon.o
 OBJS   += disk/blk.o
 OBJS   += disk/qcow.o
 OBJS   += disk/raw.o
diff --git a/tools/kvm/include/kvm/kvm.h b/tools/kvm/include/kvm/kvm.h
index 5ad3236..1fdfcf7 100644
--- a/tools/kvm/include/kvm/kvm.h
+++ b/tools/kvm/include/kvm/kvm.h
@@ -6,6 +6,7 @@
 #include stdbool.h
 #include linux/types.h
 #include time.h
+#include signal.h
 
 #define KVM_NR_CPUS	(255)
 
@@ -17,6 +18,8 @@
 
 #define SIGKVMEXIT (SIGRTMIN + 0)
 #define SIGKVMPAUSE(SIGRTMIN + 1)
+#define SIGKVMADDMEM   (SIGRTMIN + 2)
+#define SIGKVMDELMEM   (SIGRTMIN + 3)
 
 struct kvm {
int sys_fd; /* For system ioctls(), i.e. 
/dev/kvm */
diff --git a/tools/kvm/include/kvm/virtio-balloon.h 
b/tools/kvm/include/kvm/virtio-balloon.h
new file mode 100644
index 000..eb49fd4
--- /dev/null
+++ b/tools/kvm/include/kvm/virtio-balloon.h
@@ -0,0 +1,8 @@
+#ifndef KVM__BLN_VIRTIO_H
+#define KVM__BLN_VIRTIO_H
+
+struct kvm;
+
+void virtio_bln__init(struct kvm *kvm);
+
+#endif /* KVM__BLN_VIRTIO_H */
diff --git a/tools/kvm/include/kvm/virtio-pci-dev.h 
b/tools/kvm/include/kvm/virtio-pci-dev.h
index ca373df..4eee831 100644
--- a/tools/kvm/include/kvm/virtio-pci-dev.h
+++ b/tools/kvm/include/kvm/virtio-pci-dev.h
@@ -12,6 +12,7 @@
 #define PCI_DEVICE_ID_VIRTIO_BLK   0x1001
 #define PCI_DEVICE_ID_VIRTIO_CONSOLE   0x1003
 #define PCI_DEVICE_ID_VIRTIO_RNG   0x1004
+#define PCI_DEVICE_ID_VIRTIO_BLN   0x1005
 #define PCI_DEVICE_ID_VIRTIO_P90x1009
 #define PCI_DEVICE_ID_VESA 0x2000
 
diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c
index a4abf76..3b1d586 100644
--- a/tools/kvm/kvm-run.c
+++ b/tools/kvm/kvm-run.c
@@ -18,6 +18,7 @@
 #include kvm/virtio-net.h
 #include kvm/virtio-console.h
 #include kvm/virtio-rng.h
+#include kvm/virtio-balloon.h
 #include kvm/disk-image.h
 #include kvm/util.h
 #include kvm/pci.h
@@ -74,6 +75,7 @@ static bool single_step;
 static bool readonly_image[MAX_DISK_IMAGES];
 static bool vnc;
 static bool sdl;
+static bool balloon;
 extern bool ioport_debug;
 extern int  active_console;
 extern int  debug_iodelay;
@@ -145,6 +147,7 @@ static const struct option options[] = {
	OPT_STRING('\0', "kvm-dev", &kvm_dev, "kvm-dev", "KVM device file"),
	OPT_CALLBACK('\0', "virtio-9p", NULL, "dirname,tag_name",
		     "Enable 9p over virtio", virtio_9p_rootdir_parser),
+	OPT_BOOLEAN('\0', "balloon", &balloon, "Enable virtio balloon"),
	OPT_BOOLEAN('\0', "vnc", &vnc, "Enable VNC framebuffer"),
	OPT_BOOLEAN('\0', "sdl", &sdl, "Enable SDL framebuffer"),
 
@@ -629,6 +632,9 @@ int kvm_cmd_run(int argc, const char **argv, const char 
*prefix)
while (virtio_rng--)
virtio_rng__init(kvm);
 
+   if (balloon)
+   virtio_bln__init(kvm);
+
if (!network)
network = DEFAULT_NETWORK;
 
diff --git a/tools/kvm/virtio/balloon.c b/tools/kvm/virtio/balloon.c
new file mode 100644
index 000..ab9ccb7
--- /dev/null
+++ b/tools/kvm/virtio/balloon.c
@@ -0,0 +1,265 @@
+#include "kvm/virtio-balloon.h"
+
+#include "kvm/virtio-pci-dev.h"
+
+#include "kvm/disk-image.h"
+#include "kvm/virtio.h"
+#include "kvm/ioport.h"
+#include "kvm/util.h"
+#include "kvm/kvm.h"
+#include "kvm/pci.h"
+#include "kvm/threadpool.h"
+#include "kvm/irq.h"
+#include "kvm/ioeventfd.h"
+
+#include <linux/virtio_ring.h>
+#include <linux/virtio_balloon.h>
+
+#include 

[PATCH 8/9] kvm tools: Add 'kvm balloon' command

2011-06-29 Thread Sasha Levin
Add a command to allow easily inflate/deflate the balloon driver in running
instances.

Usage:
kvm balloon [command] [instance name] [size]

command is either inflate or deflate, and size is represented in MB.
Target instance must be named (started with '--name').
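
For example, to add 64 MB to the balloon of an instance started with
'--name my-guest' (hypothetical name):

  $ kvm balloon inflate my-guest 64

Internally this just sends the corresponding signal (SIGKVMADDMEM here)
to the instance's pid once per MB, as the code below shows.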

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/Makefile  |1 +
 tools/kvm/include/kvm/kvm-balloon.h |6 ++
 tools/kvm/kvm-balloon.c |   34 ++
 tools/kvm/kvm-cmd.c |   12 +++-
 tools/kvm/virtio/balloon.c  |8 
 5 files changed, 52 insertions(+), 9 deletions(-)
 create mode 100644 tools/kvm/include/kvm/kvm-balloon.h
 create mode 100644 tools/kvm/kvm-balloon.c

diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile
index a1b2f4c..4823c77 100644
--- a/tools/kvm/Makefile
+++ b/tools/kvm/Makefile
@@ -50,6 +50,7 @@ OBJS  += kvm-cmd.o
 OBJS   += kvm-debug.o
 OBJS   += kvm-help.o
 OBJS	+= kvm-pause.o
+OBJS	+= kvm-balloon.o
 OBJS   += kvm-run.o
 OBJS   += mptable.o
 OBJS   += rbtree.o
diff --git a/tools/kvm/include/kvm/kvm-balloon.h 
b/tools/kvm/include/kvm/kvm-balloon.h
new file mode 100644
index 000..f5f92b9
--- /dev/null
+++ b/tools/kvm/include/kvm/kvm-balloon.h
@@ -0,0 +1,6 @@
+#ifndef KVM__BALLOON_H
+#define KVM__BALLOON_H
+
+int kvm_cmd_balloon(int argc, const char **argv, const char *prefix);
+
+#endif
diff --git a/tools/kvm/kvm-balloon.c b/tools/kvm/kvm-balloon.c
new file mode 100644
index 000..277cada
--- /dev/null
+++ b/tools/kvm/kvm-balloon.c
@@ -0,0 +1,34 @@
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
+
+#include "kvm/util.h"
+#include "kvm/kvm-cmd.h"
+#include "kvm/kvm-balloon.h"
+#include "kvm/kvm.h"
+
+int kvm_cmd_balloon(int argc, const char **argv, const char *prefix)
+{
+   int pid;
+   int amount, i;
+   int inflate = 0;
+
+	if (argc != 3)
+		die("Usage: kvm balloon [command] [instance name] [amount]\n");
+
+	pid = kvm__get_pid_by_instance(argv[1]);
+	if (pid < 0)
+		die("Failed locating instance name");
+
+	if (strcmp(argv[0], "inflate") == 0)
+		inflate = 1;
+	else if (strcmp(argv[0], "deflate"))
+		die("command can be either 'inflate' or 'deflate'");
+
+	amount = atoi(argv[2]);
+
+	for (i = 0; i < amount; i++)
+		kill(pid, inflate ? SIGKVMADDMEM : SIGKVMDELMEM);
+
+   return 0;
+}
diff --git a/tools/kvm/kvm-cmd.c b/tools/kvm/kvm-cmd.c
index ffbc4ff..1598781 100644
--- a/tools/kvm/kvm-cmd.c
+++ b/tools/kvm/kvm-cmd.c
@@ -7,16 +7,18 @@
 /* user defined header files */
 #include "kvm/kvm-debug.h"
 #include "kvm/kvm-pause.h"
+#include "kvm/kvm-balloon.h"
 #include "kvm/kvm-help.h"
 #include "kvm/kvm-cmd.h"
 #include "kvm/kvm-run.h"
 
 struct cmd_struct kvm_commands[] = {
-	{ "pause", kvm_cmd_pause, NULL, 0 },
-	{ "debug", kvm_cmd_debug, NULL, 0 },
-	{ "help",  kvm_cmd_help,  NULL, 0 },
-	{ "run",   kvm_cmd_run,   kvm_run_help, 0 },
-	{ NULL,    NULL,          NULL, 0 },
+	{ "pause",   kvm_cmd_pause,   NULL,         0 },
+	{ "debug",   kvm_cmd_debug,   NULL,         0 },
+	{ "balloon", kvm_cmd_balloon, NULL,         0 },
+	{ "help",    kvm_cmd_help,    NULL,         0 },
+	{ "run",     kvm_cmd_run,     kvm_run_help, 0 },
+	{ NULL,      NULL,            NULL,         0 },
 };
 
 /*
diff --git a/tools/kvm/virtio/balloon.c b/tools/kvm/virtio/balloon.c
index ab9ccb7..854d04b 100644
--- a/tools/kvm/virtio/balloon.c
+++ b/tools/kvm/virtio/balloon.c
@@ -39,7 +39,7 @@ struct bln_dev {
/* virtio queue */
u16 queue_selector;
struct virt_queue   vqs[NUM_VIRT_QUEUES];
-	void			*jobs[NUM_VIRT_QUEUES];
+   struct thread_pool__job jobs[NUM_VIRT_QUEUES];
 
struct virtio_balloon_config config;
 };
@@ -174,13 +174,13 @@ static bool virtio_bln_pci_io_out(struct ioport *ioport, 
struct kvm *kvm, u16 po
 
	vring_init(&queue->vring, VIRTIO_BLN_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);
 
-	bdev.jobs[bdev.queue_selector] = thread_pool__add_job(kvm, virtio_bln_do_io, queue);
+	thread_pool__init_job(&bdev.jobs[bdev.queue_selector], kvm, virtio_bln_do_io, queue);
 
	ioevent = (struct ioevent) {
		.io_addr	= bdev.base_addr + VIRTIO_PCI_QUEUE_NOTIFY,
		.io_len		= sizeof(u16),
		.fn		= ioevent_callback,
-		.fn_ptr		= bdev.jobs[bdev.queue_selector],
+		.fn_ptr		= &bdev.jobs[bdev.queue_selector],
		.datamatch	= bdev.queue_selector,
		.fn_kvm		= kvm,
		.fd		= 

[PATCH 9/9] kvm tools: Stop VCPUs before freeing struct kvm

2011-06-29 Thread Sasha Levin
Not stopping the VCPUs before freeing leads to segfaults and other
errors due to missing synchronization between threads.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/term.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/tools/kvm/term.c b/tools/kvm/term.c
index a0cb03f..2a3e1f0 100644
--- a/tools/kvm/term.c
+++ b/tools/kvm/term.c
@@ -10,6 +10,7 @@
 #include "kvm/term.h"
 #include "kvm/util.h"
 #include "kvm/kvm.h"
+#include "kvm/kvm-cpu.h"
 
 extern struct kvm *kvm;
 static struct termios  orig_term;
@@ -34,6 +35,7 @@ int term_getc(int who)
if (term_got_escape) {
term_got_escape = false;
if (c == 'x') {
+   kvm_cpu__reboot();
kvm__delete(kvm);
			printf("\n  # KVM session terminated.\n");
exit(1);
-- 
1.7.6



Re: kvm monitor socket - connection refused

2011-06-29 Thread Iordan Iordanov

Hi Michael,

On 06/29/11 11:52, Michael Tokarev wrote:

The only other explanation I can think of is that you tried
to run two instances of kvm, and when the second instance initialized
it re-created the monitor socket but failed later (eg, when
initializing the network or something else) and exited, leaving the
stray socket behind (JFYI, you can remove a unix-domain socket
where some process is listening and create another - that
one will really be a different socket, even if named the same
way, just like you can re-create a plain file the same
way).


This may have been what happened. I'll try to reproduce this scenario.

Is there no way to prevent the accidental overwriting of a monitor 
socket that is still being used? I.e. is there no way for kvm to realize 
that the socket is in use and complain?
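
(One common approach - sketched here, not something kvm does today - is
to probe the socket before unlinking it: a live listener accepts the
connection, while a stale socket yields ECONNREFUSED. addr and path are
assumed to be already set up for the socket file.)

	/* sketch: refuse to clobber a monitor socket with a live listener */
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
		die("monitor socket is already in use");
	if (errno == ECONNREFUSED)
		unlink(path);	/* stale leftover, safe to remove */
	close(fd);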




In any case, there haven't been any problems/bugs in that area
for ages.


This is what I was hoping to hear! :)

Thanks!
Iordan


Any problem if I use ionice on KVM?

2011-06-29 Thread Emmanuel Noobadmin
I keep running into a situation where a KVM guest will lock up on some
kind of disk process it seems. System load goes way up but cpu % is
relatively low based on a crond script collecting data before
everything goes south. As a result, the host becomes unresponsive as
well. Initially it appeared to be due to a routine maintenance script
which I resolved with a combination of noatime and ionice on the
script.

However, now it appears that some other event/process is also causing a
lockup at random points in time. It's practically impossible (or I'm
too noob) to troubleshoot and figure out what exactly is causing this.

So I'm wondering if it's safe to run ionice on the KVM process so that
a runaway guest will not pull down the host with it, which would
perhaps allow me to figure out what is going on.
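
(For reference, the I/O class of an already-running process can be
changed with util-linux ionice, e.g. putting it in the idle class by
pid:

  $ ionice -c3 -p $(pidof kvm)

Note this only takes effect with an I/O scheduler that honors
priorities, such as CFQ.)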


Re: [PATCH v3 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit

2011-06-29 Thread Eric B Munson
On Wed, 29 Jun 2011, Glauber Costa wrote:

 This patch is simple; it is put in a different commit so it can be more easily
 shared between guest and hypervisor. It just defines a named constant
 to indicate the enable bit for KVM-specific MSRs.
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 CC: Rik van Riel r...@redhat.com
 CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 CC: Peter Zijlstra pet...@infradead.org
 CC: Avi Kivity a...@redhat.com
 CC: Anthony Liguori aligu...@us.ibm.com
 CC: Eric B Munson emun...@mgebm.net

My mail provider seems to have dropped patch 1 of the series, so I can't reply
directly to it; please add my Tested-by there as well.

Tested-by: Eric B Munson emun...@mgebm.net




Re: [PATCH v3 3/9] KVM-HDR: KVM Steal time implementation

2011-06-29 Thread Eric B Munson
On Wed, 29 Jun 2011, Glauber Costa wrote:

 To implement steal time, we need the hypervisor to pass the guest information
 about how much time was spent running other processes outside the VM.
 This is per-vcpu, and using the kvmclock structure for that is an abuse
 we decided not to make.
 
 In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
 holds the memory area address containing information about steal time
 
 This patch contains the headers for it. I am keeping it separate to facilitate
 backports for people who want to backport the kernel part but not the
 hypervisor, or the other way around.
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 CC: Rik van Riel r...@redhat.com
 CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 CC: Peter Zijlstra pet...@infradead.org
 CC: Avi Kivity a...@redhat.com
 CC: Anthony Liguori aligu...@us.ibm.com
 CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net





Re: [PATCH v3 4/9] KVM-HV: KVM Steal time implementation

2011-06-29 Thread Eric B Munson
On Wed, 29 Jun 2011, Glauber Costa wrote:

 To implement steal time, we need the hypervisor to pass the guest information
 about how much time was spent running other processes outside the VM.
 This is per-vcpu, and using the kvmclock structure for that is an abuse
 we decided not to make.
 
 In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
 holds the memory area address containing information about steal time
 
 This patch contains the hypervisor part for it. I am keeping it separate from
 the headers to facilitate backports for people who want to backport the kernel
 part but not the hypervisor, or the other way around.
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 CC: Rik van Riel r...@redhat.com
 CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 CC: Peter Zijlstra pet...@infradead.org
 CC: Avi Kivity a...@redhat.com
 CC: Anthony Liguori aligu...@us.ibm.com
 CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net





Re: [PATCH v3 5/9] KVM-HV: use schedstats to calculate steal time

2011-06-29 Thread Eric B Munson
On Wed, 29 Jun 2011, Glauber Costa wrote:

 SCHEDSTATS provides a precise source of information about the time tasks
 spent on a runqueue, but not running (among other things). It is
 especially useful for the steal time implementation, because it doesn't
 record halt time at all.
 
 To avoid a hard dependency on schedstats, since it is possible one won't
 want to record statistics about all running processes, the previous method
 of time measurement on vcpu put/load is kept for !SCHEDSTATS.
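
A minimal sketch of that compile-time split (run_delay is the real schedstats
field that accumulates runqueue wait time; the !SCHEDSTATS fallback field
shown is hypothetical):

    #ifdef CONFIG_SCHEDSTATS
            /* time spent runnable but not running; halted time is
             * never accumulated here */
            steal = current->sched_info.run_delay;
    #else
            /* fall back to timestamps taken around vcpu put/load */
            steal = vcpu->arch.time_not_running;    /* hypothetical field */
    #endif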
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 CC: Rik van Riel r...@redhat.com
 CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 CC: Peter Zijlstra pet...@infradead.org
 CC: Avi Kivity a...@redhat.com
 CC: Anthony Liguori aligu...@us.ibm.com
 CC: Eric B Munson emun...@mgebm.net
 CC: Marcelo Tosatti mtosa...@redhat.com

Tested-by: Eric B Munson emun...@mgebm.net





Re: [PATCH v3 6/9] KVM-GST: Add a pv_ops stub for steal time

2011-06-29 Thread Eric B Munson
On Wed, 29 Jun 2011, Glauber Costa wrote:

 This patch adds a function pointer in one of the many paravirt_ops
 structs, to allow guests to register a steal time function.
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 CC: Rik van Riel r...@redhat.com
 CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 CC: Peter Zijlstra pet...@infradead.org
 CC: Avi Kivity a...@redhat.com
 CC: Anthony Liguori aligu...@us.ibm.com
 CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net





Re: [PATCH v3 7/9] KVM-GST: KVM Steal time accounting

2011-06-29 Thread Eric B Munson
On Wed, 29 Jun 2011, Glauber Costa wrote:

 This patch accounts steal time in kernel/sched.
 I kept it from the last proposal, because I still see advantages
 in it: doing it here gives us easier access to scheduler
 variables such as the cpu rq. The next patch shows an example of
 its usage.
 
 Since functions like account_idle_time() can be called from
 multiple places, not only account_process_tick(), steal time
 grabbing is repeated in each account function separately.
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 CC: Rik van Riel r...@redhat.com
 CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 CC: Peter Zijlstra pet...@infradead.org
 CC: Avi Kivity a...@redhat.com
 CC: Anthony Liguori aligu...@us.ibm.com
 CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net





Re: [PATCH v3 8/9] KVM-GST: adjust scheduler cpu power

2011-06-29 Thread Eric B Munson
On Wed, 29 Jun 2011, Glauber Costa wrote:

 This is a first proposal for using steal time information
 to influence the scheduler. There are a lot of optimizations
 and fine-grained adjustments to be done, but it is working reasonably
 well so far for me (mostly).
 
 With this patch (and some host pinnings to demonstrate the situation),
 two vcpus with very different steal time (say 80% vs 1%) will not get
 an even distribution of processes. This is a situation that can naturally
 arise, especially in overcommitted scenarios. Previously, the guest scheduler
 would wrongly think that all cpus have the same ability to run processes,
 lowering the overall throughput.
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 CC: Rik van Riel r...@redhat.com
 CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 CC: Peter Zijlstra pet...@infradead.org
 CC: Avi Kivity a...@redhat.com
 CC: Anthony Liguori aligu...@us.ibm.com
 CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net





Re: [PATCH v3 9/9] KVM-GST: KVM Steal time registration

2011-06-29 Thread Eric B Munson
On Wed, 29 Jun 2011, Glauber Costa wrote:

 Register steal time within KVM. Every time we sample the steal time
 information, we update a local variable that records the last value
 read. We then account the difference.
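
In other words, delta accounting against a saved snapshot; a minimal sketch
with illustrative names:

    static DEFINE_PER_CPU(u64, steal_snapshot);

    static u64 steal_delta(int cpu)
    {
            u64 now = paravirt_steal_clock(cpu);    /* stub from patch 6/9 */
            u64 delta = now - per_cpu(steal_snapshot, cpu);

            per_cpu(steal_snapshot, cpu) = now;
            return delta;   /* fed into the steal time accounting */
    }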
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 CC: Rik van Riel r...@redhat.com
 CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 CC: Peter Zijlstra pet...@infradead.org
 CC: Avi Kivity a...@redhat.com
 CC: Anthony Liguori aligu...@us.ibm.com
 CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net





[PATCH] virt: Cleaning up debug messages

2011-06-29 Thread Lucas Meneghel Rodrigues
In order to make it easier for people to read KVM autotest logs, I
went through the virt module and the kvm test, removing some not
overly useful debug messages and modifying others. Some of the things
that were changed:

1) Removed MAC address management messages
2) Removed ellipses from most of the debug messages, as they're
unnecessary

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/tests/kvm/kvm.py |2 -
 client/virt/kvm_vm.py   |   15 ---
 client/virt/virt_env_process.py |   50 ++
 client/virt/virt_test_setup.py  |   18 +++---
 client/virt/virt_test_utils.py  |   14 +-
 client/virt/virt_utils.py   |   16 
 client/virt/virt_vm.py  |   13 -
 7 files changed, 57 insertions(+), 71 deletions(-)

diff --git a/client/tests/kvm/kvm.py b/client/tests/kvm/kvm.py
index 84c361e..c69ad46 100644
--- a/client/tests/kvm/kvm.py
+++ b/client/tests/kvm/kvm.py
@@ -45,8 +45,6 @@ class kvm(test.test):
 virt_utils.set_log_file_dir(self.debugdir)
 
 # Open the environment file
-logging.info("Unpickling env. You may see some harmless error "
- "messages.")
 env_filename = os.path.join(self.bindir, params.get("env", "env"))
 env = virt_utils.Env(env_filename, self.env_version)
 
diff --git a/client/virt/kvm_vm.py b/client/virt/kvm_vm.py
index b7afeeb..a2f22b4 100644
--- a/client/virt/kvm_vm.py
+++ b/client/virt/kvm_vm.py
@@ -393,9 +393,6 @@ class VM(virt_vm.BaseVM):
 
 qemu_binary = virt_utils.get_path(root_dir, params.get("qemu_binary",
   "qemu"))
-# Get the output of 'qemu -help' (log a message in case this call never
-# returns or causes some other kind of trouble)
-logging.debug("Getting output of 'qemu -help'")
 help = commands.getoutput("%s -help" % qemu_binary)
 
 # Start constructing the qemu command
@@ -877,11 +874,11 @@ class VM(virt_vm.BaseVM):
 if self.is_dead():
 return
 
-logging.debug("Destroying VM with PID %s...", self.get_pid())
+logging.debug("Destroying VM with PID %s", self.get_pid())
 
 if gracefully and self.params.get("shutdown_command"):
     # Try to destroy with shell command
-    logging.debug("Trying to shutdown VM with shell command...")
+    logging.debug("Trying to shutdown VM with shell command")
 try:
 session = self.login()
 except (virt_utils.LoginError, virt_vm.VMError), e:
@@ -891,7 +888,7 @@ class VM(virt_vm.BaseVM):
 # Send the shutdown command
 session.sendline(self.params.get("shutdown_command"))
 logging.debug("Shutdown command sent; waiting for VM "
-  "to go down...")
+  "to go down")
 if virt_utils.wait_for(self.is_dead, 60, 1, 1):
 logging.debug("VM is down")
 return
@@ -900,7 +897,7 @@ class VM(virt_vm.BaseVM):
 
 if self.monitor:
 # Try to destroy with a monitor command
-logging.debug("Trying to kill VM with monitor command...")
+logging.debug("Trying to kill VM with monitor command")
 try:
 self.monitor.quit()
 except kvm_monitor.MonitorError, e:
@@ -912,8 +909,8 @@ class VM(virt_vm.BaseVM):
 return
 
 # If the VM isn't dead yet...
-logging.debug("Cannot quit normally; sending a kill to close the "
-  "deal...")
+logging.debug("Cannot quit normally, sending a kill to close the "
+  "deal")
 virt_utils.kill_process_tree(self.process.get_pid(), 9)
 # Wait for the VM to be really dead
 if virt_utils.wait_for(self.is_dead, 5, 0.5, 0.5):
diff --git a/client/virt/virt_env_process.py b/client/virt/virt_env_process.py
index b237ed2..b47a9a5 100644
--- a/client/virt/virt_env_process.py
+++ b/client/virt/virt_env_process.py
@@ -29,11 +29,10 @@ def preprocess_image(test, params):
 create_image = False
 
 if params.get("force_create_image") == "yes":
-    logging.debug("'force_create_image' specified; creating image...")
+    logging.debug("Param 'force_create_image' specified, creating image")
     create_image = True
 elif (params.get("create_image") == "yes" and not
       os.path.exists(image_filename)):
-    logging.debug("Creating image...")
     create_image = True
 
 if create_image and not virt_vm.create_image(params, test.bindir):
@@ -50,10 +49,10 @@ def preprocess_vm(test, params, env, name):
 @param env: The environment (a dict-like object).
 @param name: The name of the VM object.
   

Re: [PATCH v2 00/11] KVM in-guest performance monitoring

2011-06-29 Thread Avi Kivity

On 06/13/2011 04:34 PM, Avi Kivity wrote:

This patchset exposes an emulated version 1 architectural performance
monitoring unit to KVM guests.  The PMU is emulated using perf_events,
so the host kernel can multiplex host-wide, host-user, and the
guest on available resources.

Caveats:
- counters that have PMI (interrupt) enabled stop counting after the
   interrupt is signalled.  This is because we need one-shot samples
   that keep counting, which perf doesn't support yet
- some combinations of INV and CMASK are not supported
- counters keep on counting in the host as well as the guest

perf maintainers: please consider the first three patches for merging (the
first two make sense even without the rest).  If you're familiar with the Intel
PMU, please review patch 5 as well - it effectively undoes all your work
of abstracting the PMU into perf_events by unabstracting perf_events into what
is hoped is a very similar PMU.

v2:
  -  don't pass perf_event handler context to the callback; extract it via the
 'event' parameter instead
  -  RDPMC emulation and interception
  -  CR4.PCE emulation


Peter, can you look at 1-3 please?

--
error compiling committee.c: too many arguments to function



Re: KVM call agenda for June 28

2011-06-29 Thread Kevin Wolf
On 28.06.2011 21:41, Marcelo Tosatti wrote:
 On Tue, Jun 28, 2011 at 02:38:15PM +0100, Stefan Hajnoczi wrote:
 On Mon, Jun 27, 2011 at 3:32 PM, Juan Quintela quint...@redhat.com wrote:
 Please send in any agenda items you are interested in covering.

 Live block copy and image streaming:
  * The differences between Marcelo and Kevin's approaches
  * Which approach to choose and who can help implement it
 
 After more thinking, I dislike the image metadata approach. Management
 must carry the information anyway, so it's pointless to duplicate it
 inside an image format.
 
 After the discussion today, I think the internal mechanism and interface
 should be different for copy and stream:
 
 block copy
 --
 
 With backing files:
 
 1) base - sn1 - sn2
 2) base - copy
 
 Without:
 
 1) source
 2) destination
 
 Copy is only valid after switch has been performed. Same interface and
 crash recovery characteristics for all image formats.
 
 If management wants to support continuation, it must specify
 blkcopy:sn2:copy on startup.

We can use almost the same interface and still have an image that is
always valid (assuming that you provide the right format on the command
line, which is already a requirement today).

base - sn1 - sn2 - copy.raw

You just add the file name for an external COW file, like
blkcopy:sn2:copy.raw:copy.cow (we can even have a default filename for
HMP instead of requiring to specify it, like $IMAGE.cow) and if the
destination doesn't support backing files by itself, blkcopy creates the
COW overlay BlockDriverState that uses this file.

No difference for management at all, except that it needs to allow
access to another file.

 stream
 --
 
 1) base - remote
 2) base - remote - local
 3) base - local
 
 local image is always valid. Requires backing file support.

With the above, this restriction wouldn't apply any more.

Also I don't think we should mix approaches. Either both block copy and
image streaming use backing files, or none of them do. Mixing means
duplicating more code, and even worse, that you can't stop a block copy
in the middle and continue with streaming (which I believe is a really
valuable feature to have).

Kevin


Re: [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary

2011-06-29 Thread Avi Kivity

On 06/22/2011 05:29 PM, Xiao Guangrong wrote:

If the range spans a page boundary, the mmio access can be broken; fix it in
the same way as write emulation.

And we already get the guest physical address, so use it to read guest data
directly and avoid walking the guest page table again.

Signed-off-by: Xiao Guangrongxiaoguangr...@cn.fujitsu.com
---
  arch/x86/kvm/x86.c |   41 -
  1 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0b803f0..eb27be4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3944,14 +3944,13 @@ out:
  }
  EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);

-static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
-				  unsigned long addr,
-				  void *val,
-				  unsigned int bytes,
-				  struct x86_exception *exception)
+static int emulator_read_emulated_onepage(unsigned long addr,
+					  void *val,
+					  unsigned int bytes,
+					  struct x86_exception *exception,
+					  struct kvm_vcpu *vcpu)
 {
-	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-	gpa_t gpa;
+	gpa_t gpa;
 	int handled;
 
 	if (vcpu->mmio_read_completed) {
@@ -3971,8 +3970,7 @@ static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
 	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
 		goto mmio;
 
-	if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception)
-	    == X86EMUL_CONTINUE)
+	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
 		return X86EMUL_CONTINUE;


This doesn't perform the cpl check.

I suggest dropping this part for now and doing it later.

--
error compiling committee.c: too many arguments to function



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Paolo Bonzini

On 06/12/2011 09:51 AM, Michael S. Tsirkin wrote:


  If a device uses more than one queue it is the responsibility of the
  device to ensure strict request ordering.

Maybe I misunderstand - how can this be the responsibility of
the device if the device does not get the information about
the original ordering of the requests?

For example, if the driver is crazy enough to put
all write requests on one queue and all barriers
on another one, how is the device supposed to ensure
ordering?


I agree here, in fact I misread Hannes's comment as if a driver uses 
more than one queue it is responsibility of the driver to ensure strict 
request ordering.  If you send requests to different queues, you know 
that those requests are independent.  I don't think anything else is 
feasible in the virtio framework.


Paolo


Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code

2011-06-29 Thread Avi Kivity

On 06/22/2011 05:29 PM, Xiao Guangrong wrote:

Introduce vcpu_gva_to_gpa to translate the gva to gpa; we can use it
to clean up the code shared between read emulation and write emulation.

Signed-off-by: Xiao Guangrongxiaoguangr...@cn.fujitsu.com
---
  arch/x86/kvm/x86.c |   38 +-
  1 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb27be4..c29ef96 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3944,6 +3944,27 @@ out:
  }
  EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);

+static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
+			   gpa_t *gpa, struct x86_exception *exception,
+			   bool write)
+{
+	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+
+	if (write)
+		access |= PFERR_WRITE_MASK;


Needs fetch as well so NX/SMEP can work.


+
+	*gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
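
Concretely, the fix being asked for could look like this sketch (the fetch
parameter is hypothetical; PFERR_FETCH_MASK is the existing fetch bit in the
page-fault error code):

    u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;

    if (write)
            access |= PFERR_WRITE_MASK;
    if (fetch)                          /* would be a new parameter */
            access |= PFERR_FETCH_MASK;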


--
error compiling committee.c: too many arguments to function



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Paolo Bonzini

On 06/14/2011 10:39 AM, Hannes Reinecke wrote:

If, however, we decide to expose some details about the backend, we
could be using the values from the backend directly.
EG we could be forwarding the SCSI target port identifier here
(if backed by real hardware) or creating our own SAS-type
identifier when backed by qemu block. Then we could just query
the backend via a new command on the controlq
(eg 'list target ports') and wouldn't have to worry about any protocol
specific details here.


Besides the controlq command, which I can certainly add, this is 
actually quite similar to what I had in mind (though my plan likely 
would not have worked because I was expecting hierarchical LUNs used 
uniformly).  So, "list target ports" would return a set of LUN values to 
which you can send REPORT LUNS, or something like that?  I suppose that 
if you're using real hardware as the backing storage the in-kernel 
target can provide that.


For the QEMU backend I'd keep hierarchical LUNs, though of course one 
could add a FC or SAS bus to QEMU, each implementing its own identifier 
scheme.


If I understand it correctly, it should remain possible to use a single 
host for both pass-through and emulated targets.


Would you draft the command structure, so I can incorporate it into the 
spec?



Of course, when doing so we would lose the ability to freely remap
LUNs. But then remapping LUNs doesn't gain you much imho.
Plus you could always use the qemu block backend here if you want
to hide the details.


And you could always use the QEMU block backend with scsi-generic if you 
want to remap LUNs, instead of true passthrough via the kernel target.


Paolo


Re: [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation

2011-06-29 Thread Avi Kivity

On 06/22/2011 05:30 PM, Xiao Guangrong wrote:

The operations of read emulation and write emulation are very similar, so we
can abstract them; in a later patch, this is used to clean up the
duplicated code.

Signed-off-by: Xiao Guangrongxiaoguangr...@cn.fujitsu.com
---
  arch/x86/kvm/x86.c |   72 
  1 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c29ef96..887714f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4056,6 +4056,78 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
return 1;
  }

+struct read_write_emulator_ops {
+   int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
+ int bytes);
+   int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
+ void *val, int bytes);
+   int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
+  int bytes, void *val);
+   int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
+   void *val, int bytes);
+   bool write;
+};



Interesting!

This structure combines two unrelated operations, though.  One is the 
internals of iterating over a virtual address that is split across various 
physical addresses.  The other is the interaction with userspace on mmio 
exits.  They should be split, but I think it's fine to do that in a later 
patch.  This series is long enough already.


I was also annoyed by the duplication.  The way I thought of fixing it 
is having gva_to_gpa() return two gpas, and having the access function 
accept gpa vectors.  The reason was so that we could implement locked 
cross-page operations (which we now emulate as unlocked writes).


But I think we can do without it, and instead emulate locked cross-page 
ops by stalling all other vcpus while we write, or by unmapping the 
pages involved.  It isn't pretty but it doesn't need to be fast since 
it's a very rare operation.  So I think we can go with your approach.
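
For concreteness, the two flavors would each instantiate the table along
these lines (a sketch; the callback names are illustrative, not taken from
the patch):

    static struct read_write_emulator_ops read_emul_ops = {
            .read_write_emulate   = read_emulate,    /* read guest memory */
            .read_write_mmio      = read_mmio,       /* in-kernel mmio */
            .read_write_exit_mmio = read_exit_mmio,  /* exit to userspace */
            .write                = false,
    };

    static struct read_write_emulator_ops write_emul_ops = {
            .read_write_emulate   = write_emulate,
            .read_write_mmio      = write_mmio,
            .read_write_exit_mmio = write_exit_mmio,
            .write                = true,
    };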


--
error compiling committee.c: too many arguments to function



Re: [PATCH v2 00/11] KVM in-guest performance monitoring

2011-06-29 Thread Peter Zijlstra
On Wed, 2011-06-29 at 10:52 +0300, Avi Kivity wrote:
 On 06/13/2011 04:34 PM, Avi Kivity wrote:
  This patchset exposes an emulated version 1 architectural performance
  monitoring unit to KVM guests.  The PMU is emulated using perf_events,
  so the host kernel can multiplex host-wide, host-user, and the
  guest on available resources.
 
  Caveats:
  - counters that have PMI (interrupt) enabled stop counting after the
 interrupt is signalled.  This is because we need one-shot samples
 that keep counting, which perf doesn't support yet
  - some combinations of INV and CMASK are not supported
  - counters keep on counting in the host as well as the guest
 
  perf maintainers: please consider the first three patches for merging (the
  first two make sense even without the rest).  If you're familiar with the 
  Intel
  PMU, please review patch 5 as well - it effectively undoes all your work
  of abstracting the PMU into perf_events by unabstracting perf_events into 
  what
  is hoped is a very similar PMU.
 
  v2:
    -  don't pass perf_event handler context to the callback; extract it via
       the 'event' parameter instead
-  RDPMC emulation and interception
-  CR4.PCE emulation
 
 Peter, can you look at 1-3 please?

Queued them, thanks!

I was more or less waiting for a next iteration of the series because of
those problems reported, but those three stand well on their own.




Re: RFT: virtio_net: limit xmit polling

2011-06-29 Thread Michael S. Tsirkin
On Tue, Jun 28, 2011 at 11:08:07AM -0500, Tom Lendacky wrote:
 On Sunday, June 19, 2011 05:27:00 AM Michael S. Tsirkin wrote:
  OK, different people seem to test different trees.  In the hope to get
  everyone on the same page, I created several variants of this patch so
  they can be compared. Whoever's interested, please check out the
  following, and tell me how these compare:
  
  kernel:
  
  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
  
  virtio-net-limit-xmit-polling/base - this is net-next baseline to test
  against virtio-net-limit-xmit-polling/v0 - fixes checks on out of capacity
  virtio-net-limit-xmit-polling/v1 - previous revision of the patch
  this does xmit,free,xmit,2*free,free
  virtio-net-limit-xmit-polling/v2 - new revision of the patch
  this does free,xmit,2*free,free
  
 
 Here's a summary of the results.  I've also attached an ODS format spreadsheet
 (30 KB in size) that might be easier to analyze and also has some pinned VM
 results data.  I broke the tests down into a local guest-to-guest scenario
 and a remote host-to-guest scenario.
 
 Within the local guest-to-guest scenario I ran:
   - TCP_RR tests using two different message sizes and four different
 instance counts among 1 pair of VMs and 2 pairs of VMs.
   - TCP_STREAM tests using four different message sizes and two different
 instance counts among 1 pair of VMs and 2 pairs of VMs.
 
 Within the remote host-to-guest scenario I ran:
   - TCP_RR tests using two different message sizes and four different
 instance counts to 1 VM and 4 VMs.
   - TCP_STREAM and TCP_MAERTS tests using four different message sizes and
 two different instance counts to 1 VM and 4 VMs, over a 10GbE link.

roprabhu, Tom,

Thanks very much for the testing. So at first glance
one seems to see a significant performance gain in V0 here,
and a slightly less significant one in V2, with V1
being worse than base. But I'm afraid that's not the
whole story, and we'll need to work some more to
know what really goes on; please see below.


Some comments on the results: I found out that V0, because of a mistake
on my part, was actually almost identical to base.
I pushed out virtio-net-limit-xmit-polling/v1a instead, which
actually does what I intended to check. However,
the fact that we get such a huge spread in Tom's results
most likely means that the noise factor is very large.


From my experience, one way to get stable results is to
divide the throughput by the host CPU utilization
(measured by something like mpstat).
Sometimes throughput doesn't increase (e.g. guest-host)
but CPU utilization does decrease. So it's interesting.
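
For example (hypothetical numbers): 10,000 transactions/s at 50% host CPU
normalizes to 20,000 per fully-used CPU, while 9,400 transactions/s at 40%
normalizes to 23,500, so the raw ranking and the normalized ranking can
disagree.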


Another issue is that we are trying to improve the latency
of a busy queue here. However, STREAM/MAERTS tests ignore latency
(more or less), while TCP_RR by default runs a single packet per queue.
Without arguing about whether these are practically interesting
workloads, these results are thus unlikely to be significantly affected
by the optimization in question.

What we are interested in, thus, is either TCP_RR with a -b flag
(netperf configured with --enable-burst) or multiple concurrent
TCP_RRs.



 *** Local Guest-to-Guest ***
 
 Here's the local guest-to-guest summary for 1 VM pair doing TCP_RR with
 256/256 request/response message size in transactions per second:
 
 Instances       Base         V0          V1          V2
 1           8,151.56   8,460.72    8,439.16    9,990.37
 25         48,761.74  51,032.62   51,103.25   49,533.52
 50         55,687.38  55,974.18   56,854.10   54,888.65
 100        58,255.06  58,255.86   60,380.90   59,308.36
 
 Here's the local guest-to-guest summary for 2 VM pairs doing TCP_RR with
 256/256 request/response message size in transactions per second:
 
 Instances       Base         V0          V1          V2
 1          18,758.48  19,112.50   18,597.07   19,252.04
 25         80,500.50  78,801.78   80,590.68   78,782.07
 50         80,594.20  77,985.44   80,431.72   77,246.90
 100        82,023.23  81,325.96   81,303.32   81,727.54
 
 Here's the local guest-to-guest summary for 1 VM pair doing TCP_STREAM with
 256, 1K, 4K and 16K message size in Mbps:
 
 256:
 Instances       Base         V0          V1          V2
 1             961.78   1,115.92      794.02      740.37
 4           2,498.33   2,541.82    2,441.60    2,308.26
 
 1K:
 1           3,476.61   3,522.02    2,170.86    1,395.57
 4           6,344.30   7,056.57    7,275.16    7,174.09
 
 4K:
 1           9,213.57  10,647.44    9,883.42    9,007.29
 4          11,070.66  11,300.37   11,001.02   12,103.72
 
 16K:
 1          12,065.94   9,437.78

Re: virtio scsi host draft specification, v3

2011-06-29 Thread Michael S. Tsirkin
On Wed, Jun 29, 2011 at 10:23:26AM +0200, Paolo Bonzini wrote:
 On 06/12/2011 09:51 AM, Michael S. Tsirkin wrote:
 
   If a device uses more than one queue it is the responsibility of the
   device to ensure strict request ordering.
 Maybe I misunderstand - how can this be the responsibility of
 the device if the device does not get the information about
 the original ordering of the requests?
 
 For example, if the driver is crazy enough to put
 all write requests on one queue and all barriers
 on another one, how is the device supposed to ensure
 ordering?
 
 I agree here, in fact I misread Hannes's comment as if a driver
 uses more than one queue it is responsibility of the driver to
 ensure strict request ordering.  If you send requests to different
 queues, you know that those requests are independent.  I don't think
 anything else is feasible in the virtio framework.
 
 Paolo

Like this then?

  If a driver uses more than one queue it is the responsibility of the
  driver to ensure strict request ordering: the device does not
  supply any guarantees about the ordering of requests between different
  virtqueues.




Re: [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path

2011-06-29 Thread Avi Kivity

On 06/22/2011 05:31 PM, Xiao Guangrong wrote:

If the page fault is caused by mmio, we can cache the mmio info; later, we do
not need to walk the guest page table again and can quickly tell it is an mmio
fault while we emulate the mmio instruction.


Does this work if the mmio spans two pages?

--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/5] perf support for amd guest/host-only bits v2

2011-06-29 Thread Peter Zijlstra
On Tue, 2011-06-28 at 18:10 +0200, Joerg Roedel wrote:
 On Fri, Jun 17, 2011 at 03:37:29PM +0200, Joerg Roedel wrote:
  this is the second version of the patch-set to support the AMD
  guest-/host only bits in the performance counter MSRs. Due to lack of
 time I haven't looked into emulating support for this feature on Intel or
  other architectures, but the other comments should be worked in. The
  changes to v1 include:
  
  * Rebased patches to v3.0-rc3
  * Allow exclude_guest and exclude_host set at the same time
  * Reworked event-parse logic for the new exclude-bits
  * Only count guest-events per default from perf-kvm
 
 Hi Peter, Ingo,
 
 have you had a chance to look at this patch-set? Are any changes
 required?

I would feel a lot more comfortable by having it implemented on all of
x86 as well as at least one !x86 platform. Avi graciously volunteered
for the Intel bits.  

Paulus, I hear from benh that you're also responsible for the ppc-kvm
bits, could you possibly find some time to implement this feature for
ppc?


Re: [PATCH V7 4/4 net-next] vhost: vhost TX zero-copy support

2011-06-29 Thread Michael S. Tsirkin
On Sat, May 28, 2011 at 12:34:27PM -0700, Shirley Ma wrote:
 Hello Michael,
 
 In order to use wait-for-completion in shutting down, it seems to me
 another work thread is needed to call vhost_zerocopy_add_used,

Hmm I don't see vhost_zerocopy_add_used here.

 it seems
 too much work to address a minor issue here. Do we really need it?

Assuming you mean vhost_zerocopy_signal_used, here's how I would do it:
add a kref and a completion; signal the completion in the kref_put
callback; when the backend is set, kref_get; on cleanup, kref_put and
then wait_for_completion_interruptible.
Where's the need for another thread coming from?
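
A minimal sketch of that pattern, with made-up names:

    struct vhost_ubuf_ref {
            struct kref kref;
            struct completion done;
    };

    static void vhost_ubuf_release(struct kref *kref)
    {
            struct vhost_ubuf_ref *u =
                    container_of(kref, struct vhost_ubuf_ref, kref);
            complete(&u->done);
    }

    /* when the backend is set */
    kref_get(&u->kref);

    /* on cleanup */
    kref_put(&u->kref, vhost_ubuf_release);
    wait_for_completion_interruptible(&u->done);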

If you like, post a patch with busywait + a FIXME comment,
and I can write up a patch on top.

(BTW, ideally the function that does the signalling should be
in core networking bits so that it's still around
even if the vhost module gets removed).

 Right now, the approach I am using is to ignore any outstanding userspace
 buffers during shutdown; the device might DMA some wrong
 data to the wire, but do we really care?
 
 Thanks
 Shirley

I think so, yes; the guest is told that memory can be reused, so
it might put the credit card number or whatever there :)

 
 
 This patch maintains the outstanding userspace buffers in the 
 sequence they are delivered to vhost. The outstanding userspace buffers 
 will be marked as done once the lower device's DMA has finished. 
 This is monitored through the last-reference kfree_skb callback. Two
 buffer indexes are used for this purpose.
 
 The vhost passes the userspace buffer info to the lower device skb 
 through msg control. Since there will be some completed DMAs when
 entering vhost handle_tx, and the worst case is that all buffers in the
 vq are in pending/done status, we need to notify the guest to release
 DMA-done buffers first before getting any new buffers from the vq.
 
 Signed-off-by: Shirley x...@us.ibm.com
 ---
 
  drivers/vhost/net.c   |   46 +-
  drivers/vhost/vhost.c |   47 +++
  drivers/vhost/vhost.h |   15 +++
  3 files changed, 107 insertions(+), 1 deletions(-)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 2f7c76a..e2eaba6 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -32,6 +32,11 @@
   * Using this limit prevents one virtqueue from starving others. */
  #define VHOST_NET_WEIGHT 0x80000
  
 +/* MAX number of TX used buffers for outstanding zerocopy */
 +#define VHOST_MAX_PEND 128
 +/* change it to 256 when small message size performance issue is addressed */
 +#define VHOST_GOODCOPY_LEN 2048
 +
  enum {
  	VHOST_NET_VQ_RX = 0,
  	VHOST_NET_VQ_TX = 1,
 @@ -151,6 +156,10 @@ static void handle_tx(struct vhost_net *net)
  	hdr_size = vq->vhost_hlen;
  
  	for (;;) {
 +		/* Release DMAs done buffers first */
 +		if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND)
 +			vhost_zerocopy_signal_used(vq, false);
 +
  		head = vhost_get_vq_desc(net->dev, vq, vq->iov,
  					 ARRAY_SIZE(vq->iov),
  					 &out, &in,
 @@ -166,6 +175,12 @@ static void handle_tx(struct vhost_net *net)
  			set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
  			break;
  		}
 +		/* If more outstanding DMAs, queue the work */
 +		if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND) {
 +			tx_poll_start(net, sock);
 +			set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 +			break;
 +		}
  		if (unlikely(vhost_enable_notify(vq))) {
  			vhost_disable_notify(vq);
  			continue;
 @@ -188,6 +203,26 @@ static void handle_tx(struct vhost_net *net)
  				       iov_length(vq->hdr, s), hdr_size);
  			break;
  		}
 +		/* use msg_control to pass vhost zerocopy ubuf info to skb */
 +		if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
 +			vq->heads[vq->upend_idx].id = head;
 +			if (len < VHOST_GOODCOPY_LEN)
 +				/* copy don't need to wait for DMA done */
 +				vq->heads[vq->upend_idx].len =
 +							VHOST_DMA_DONE_LEN;
 +			else {
 +				struct ubuf_info *ubuf = &vq->ubuf_info[head];
 +
 +				vq->heads[vq->upend_idx].len = len;
 +				ubuf->callback = vhost_zerocopy_callback;
 +				ubuf->arg = vq;
 +				ubuf->desc = vq->upend_idx;
 +				msg.msg_control = ubuf;
 +				msg.msg_controllen = sizeof(ubuf);
 +			}
 +

Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table

2011-06-29 Thread Avi Kivity

On 06/22/2011 05:35 PM, Xiao Guangrong wrote:

Use RCU to protect the shadow page tables being freed, so we can safely walk
them; this should run fast and is needed by the mmio page fault path.




  static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list)
  {
@@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,

kvm_flush_remote_tlbs(kvm);

+	if (atomic_read(&kvm->arch.reader_counter)) {
+		kvm_mmu_isolate_pages(invalid_list);
+		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
+		list_del_init(invalid_list);
+		call_rcu(&sp->rcu, free_pages_rcu);
+		return;
+	}
+
+


I think we should do this unconditionally.  The cost of ping-ponging the 
shared cache line containing reader_counter will increase with large smp 
counts.  On the other hand, zap_page is very rare, so it can be a little 
slower.  Also, fewer code paths = easier to understand.
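
The unconditional variant would collapse the function to roughly this sketch
(illustrative, reusing the helpers quoted above):

    static void kvm_mmu_commit_zap_page(struct kvm *kvm,
                                        struct list_head *invalid_list)
    {
            struct kvm_mmu_page *sp;

            if (list_empty(invalid_list))
                    return;

            kvm_flush_remote_tlbs(kvm);

            /* always defer freeing to RCU; no reader_counter check */
            kvm_mmu_isolate_pages(invalid_list);
            sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
            list_del_init(invalid_list);
            call_rcu(&sp->rcu, free_pages_rcu);
    }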


--
error compiling committee.c: too many arguments to function



Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support

2011-06-29 Thread Avi Kivity

On 06/22/2011 05:36 PM, Xiao Guangrong wrote:

The idea is from Avi:

| We could cache the result of a miss in an spte by using a reserved bit, and
| checking the page fault error code (or seeing if we get an ept violation or
| ept misconfiguration), so if we get repeated mmio on a page, we don't need to
| search the slot list/tree.
| (https://lkml.org/lkml/2011/2/22/221)

When the page fault is caused by mmio, we cache the info in the shadow page
table, and also set the reserved bits in the shadow page table, so if the mmio
access happens again, we can quickly identify it and emulate it directly.

Searching for an mmio gfn in the memslots is heavy since we need to walk all
memslots; this cost is reduced by this feature, which also avoids walking the
guest page table for the soft mmu.

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1319050..e69a47a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
 static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
 static u64 __read_mostly shadow_user_mask;
 static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
+static u64 __read_mostly shadow_mmio_mask = (0xffull << 49 | 1ULL);


One bit is shifted out.   And it will fail with 52-bit MAXPHYADDR.

Please, in addition, set the xwr bits to an invalid pattern on EPT (there 
is an MSR which specifies which patterns are valid; for example 
execute-only or write-only are invalid).  If all patterns are valid AND 
MAXPHYADDR == 52, then just set the mask to 0 and the optimization 
will be disabled.
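
In code, that fallback could look like this sketch (both helpers are
hypothetical, named here only for illustration):

    if (boot_cpu_data.x86_phys_bits == 52 && ept_all_xwr_patterns_valid())
            shadow_mmio_mask = 0;   /* no reserved bit free: cache disabled */
    else
            shadow_mmio_mask = mmio_mask_for(boot_cpu_data.x86_phys_bits);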


--
error compiling committee.c: too many arguments to function



Re: [PATCH v2 0/22] KVM: optimize for MMIO handled

2011-06-29 Thread Avi Kivity

On 06/22/2011 05:27 PM, Xiao Guangrong wrote:

In this version, we fix the bugs in v1:
- fix broken read emulation that spans a page boundary
- fix getting an invalid spte pointer when we walk the shadow page table
   outside the mmu lock

And we also introduce some rules for modifying sptes in this version,
so we no longer need to atomically clear/set sptes on an x86_32 host;
the performance report for the x86_32 host is in a later section.

Avi,

I have sampled the operation of lockless shadow page walking with the
following steps:
- mark walk_shadow_page_get_mmio_spte as 'noinline'
- do the netperf test; the client is in the guest (NIC is e1000) and the server
   is on the host, which generates heavy mmio access pressure
- use perf to sample it; the result of 'perf report' is attached

The ratio of walk_shadow_page_get_mmio_spte is 0.09%, the ratio of
handle_ept_misconfig is 0.11%, and the ratio of handle_mmio_page_fault_common
is 0.07%.

I think this is acceptable; what's your opinion?



Yes.

The patchset scares me, but it is nice work!  Good optimization and good 
clean up.


--
error compiling committee.c: too many arguments to function



Re: [PATCH v2 00/11] KVM in-guest performance monitoring

2011-06-29 Thread Avi Kivity

On 06/29/2011 11:38 AM, Peter Zijlstra wrote:


  Peter, can you look at 1-3 please?

Queued them, thanks!

I was more or less waiting for a next iteration of the series because of
those problems reported, but those three stand well on their own.


Thanks.  I'm mired in other work but will return to investigate  fix 
those issues.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/5] perf support for amd guest/host-only bits v2

2011-06-29 Thread Avi Kivity

On 06/29/2011 12:02 PM, Peter Zijlstra wrote:


  have you had a chance to look at this patch-set? Are any changes
  required?

I would feel a lot more comfortable by having it implemented on all of
x86 as well as at least one !x86 platform. Avi graciously volunteered
for the Intel bits.


Silly me.  Joerg, can you post the git tree publicly please?

--
error compiling committee.c: too many arguments to function



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Stefan Hajnoczi
On Wed, Jun 29, 2011 at 9:33 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 06/14/2011 10:39 AM, Hannes Reinecke wrote:
 If, however, we decide to expose some details about the backend, we
 could be using the values from the backend directly.
 EG we could be forwarding the SCSI target port identifier here
 (if backed by real hardware) or creating our own SAS-type
 identifier when backed by qemu block. Then we could just query
 the backend via a new command on the controlq
 (eg 'list target ports') and wouldn't have to worry about any protocol
 specific details here.

 Besides the controlq command, which I can certainly add, this is
 actually quite similar to what I had in mind (though my plan likely
 would not have worked because I was expecting hierarchical LUNs used
 uniformly).  So, list target ports would return a set of LUN values to
 which you can send REPORT LUNS, or something like that?

I think we're missing a level of addressing.  We need the ability to
talk to multiple target ports in order for "list target ports" to make
sense.  Right now there is one implicit target that handles all
commands.  That means there is one fixed I_T Nexus.

If we introduce "list target ports" we also need a way to say "This
CDB is destined for target port #0".  Then it is possible to enumerate
target ports and address targets independently of the LUN field in the
CDB.

I'm pretty sure this is also how SAS and other transports work.  In
their framing they include the target port.
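
To illustrate the missing addressing level, the request framing could carry
an explicit target port, roughly like this sketch (all field names are
hypothetical, not from the draft spec):

    struct virtio_scsi_cmd_hdr {
            u32 target_port;    /* selects the I_T nexus, e.g. port #0 */
            u8  lun[8];         /* SAM LUN within that target */
            u64 tag;            /* command identifier */
            u8  cdb[32];        /* the SCSI CDB itself */
    };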

The question is whether we really need to support multiple targets on
a virtio-scsi adapter or not.  If you are selectively mapping LUNs
that the guest may access, then multiple targets are not necessary.
If we want to do pass-through of the entire SCSI bus then we need
multiple targets but I'm not sure if there are other challenges like
dependencies on the transport (Fibre Channel, SAS, etc) which make it
impossible to pass through bus-level access?

 If I understand it correctly, it should remain possible to use a single
 host for both pass-through and emulated targets.

Yes.

 Of course, when doing so we would lose the ability to freely remap
 LUNs. But then remapping LUNs doesn't gain you much imho.
 Plus you could always use the qemu block backend here if you want
 to hide the details.

 And you could always use the QEMU block backend with scsi-generic if you
 want to remap LUNs, instead of true passthrough via the kernel target.

IIUC the in-kernel target always does remapping.  It passes through
individual LUNs rather than entire targets and you pick LU Numbers to
map to the backing storage (which may or may not be a SCSI
pass-through device).  Nicholas Bellinger can confirm whether this is
correct.

Stefan


Re: [PATCHv4] qemu-img: Add cache command line option

2011-06-29 Thread Kevin Wolf
On 20.06.2011 18:48, Federico Simoncelli wrote:
 qemu-img currently writes disk images using writeback, filling
 up the cache buffers, which are then flushed by the kernel, preventing
 other processes from accessing the storage.
 This is particularly bad in cluster environments where time-based
 algorithms might be in place and accessing the storage within
 certain timeouts is critical.
 This patch adds the option to choose a cache method when writing
 disk images.
 
 Signed-off-by: Federico Simoncelli fsimo...@redhat.com

Thanks, applied to the block branch.

Kevin


Re: [PATCH 0/5] perf support for amd guest/host-only bits v2

2011-06-29 Thread Joerg Roedel
On Wed, Jun 29, 2011 at 12:27:48PM +0300, Avi Kivity wrote:
 On 06/29/2011 12:02 PM, Peter Zijlstra wrote:
 
   have you had a chance to look at this patch-set? Are any changes
   required?

 I would feel a lot more comfortable by having it implemented on all of
 x86 as well as at least one !x86 platform. Avi graciously volunteered
 for the Intel bits.

 Silly me.  Joerg, can you post the git tree publicly please?

Okay, I pushed it to

git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux-2.6-kvm.git 
perf-guest-counting

It probably takes some time until it appears on the mirrors.

Thanks,

Joerg


Re: [PATCH 0/5] perf support for amd guest/host-only bits v2

2011-06-29 Thread Roedel, Joerg
On Wed, Jun 29, 2011 at 05:02:54AM -0400, Peter Zijlstra wrote:
 On Tue, 2011-06-28 at 18:10 +0200, Joerg Roedel wrote:
  On Fri, Jun 17, 2011 at 03:37:29PM +0200, Joerg Roedel wrote:
   this is the second version of the patch-set to support the AMD
   guest-/host only bits in the performance counter MSRs. Due to lack of
   time I havn't looked into emulating support for this feature on Intel or
   other architectures, but the other comments should be worked in. The
   changes to v1 include:
   
 * Rebased patches to v3.0-rc3
 * Allow exclude_guest and exclude_host set at the same time
 * Reworked event-parse logic for the new exclude-bits
 * Only count guest-events per default from perf-kvm
  
  Hi Peter, Ingo,
  
  have you had a chance to look at this patch-set? Are any changes
  required?
 
 I would feel a lot more comfortable by having it implemented on all of
 x86 as well as at least one !x86 platform. Avi graciously volunteered
 for the Intel bits.

Ok, since no changes are required from my side then, how about adding
support for more hardware incrementally, as was done for perf-kvm?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Christoph Hellwig
On Tue, Jun 14, 2011 at 05:30:24PM +0200, Hannes Reinecke wrote:
 Which is exactly the problem I was referring to.
 When using more than one channel the request ordering
 _as seen by the initiator_ has to be preserved.
 
 This is quite hard to do from a device's perspective;
 it might be able to process the requests _in the order_ they've
 arrived, but it won't be able to figure out the latency of each
 request, ie how it'll take the request to be delivered to the
 initiator.
 
 What we need to do here is to ensure that virtio will deliver
 the requests in-order across all virtqueues. Not sure whether it
 does this already.

This only matters for ordered tags, or implicit or explicit HEAD OF
QUEUE tags.  For everything else there's no ordering requirement.
Given that ordered tags don't matter in practice and we don't have
to support them this just leaves HEAD OF QUEUE.  I suspect the
HEAD OF QUEUE semantics need to be implemented using underlying
draining of all queues, which should be okay given that it's
usually used in slow path commands.



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Christoph Hellwig
On Sun, Jun 12, 2011 at 10:51:41AM +0300, Michael S. Tsirkin wrote:
 For example, if the driver is crazy enough to put
 all write requests on one queue and all barriers
 on another one, how is the device supposed to ensure
 ordering?

There is no such things as barriers in SCSI.  The thing that comes
closest is ordered tags, which neither Linux nor any mainstream OS
uses, and which we don't have to (and generally don't want to)
implement.



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Christoph Hellwig
On Wed, Jun 29, 2011 at 10:23:26AM +0200, Paolo Bonzini wrote:
 I agree here, in fact I misread Hannes's comment as if a driver
 uses more than one queue it is responsibility of the driver to
 ensure strict request ordering.  If you send requests to different
 queues, you know that those requests are independent.  I don't think
 anything else is feasible in the virtio framework.

That doesn't really fit very well with the SAM model.  If we want
to use multiple queues for a single LUN it has to be transparent to
the SCSI command stream.  Then again I don't quite see the use for
that anyway.



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Paolo Bonzini

On 06/29/2011 12:03 PM, Christoph Hellwig wrote:

  I agree here, in fact I misread Hannes's comment as if a driver
  uses more than one queue it is responsibility of the driver to
  ensure strict request ordering.  If you send requests to different
  queues, you know that those requests are independent.  I don't think
  anything else is feasible in the virtio framework.

That doesn't really fit very well with the SAM model.  If we want
to use multiple queues for a single LUN it has to be transparent to
the SCSI command stream.  Then again I don't quite see the use for
that anyway.


Agreed, I see a use for multiple queues (MSI-X), but not for multiple 
queues shared by a single LUN.


Paolo


Re: virtio scsi host draft specification, v3

2011-06-29 Thread Christoph Hellwig
On Wed, Jun 29, 2011 at 10:39:42AM +0100, Stefan Hajnoczi wrote:
 I think we're missing a level of addressing.  We need the ability to
 talk to multiple target ports in order for list target ports to make
 sense.  Right now there is one implicit target that handles all
 commands.  That means there is one fixed I_T Nexus.
 
 If we introduce list target ports we also need a way to say This
 CDB is destined for target port #0.  Then it is possible to enumerate
 target ports and address targets independently of the LUN field in the
 CDB.
 
 I'm pretty sure this is also how SAS and other transports work.  In
 their framing they include the target port.

Yes, exactly.  Hierarchical LUNs are a nasty fringe feature that we should
avoid as much as possible, that is, for everything but IBM vSCSI, which is
braindead enough to force them.

 The question is whether we really need to support multiple targets on
 a virtio-scsi adapter or not.  If you are selectively mapping LUNs
 that the guest may access, then multiple targets are not necessary.
 If we want to do pass-through of the entire SCSI bus then we need
 multiple targets but I'm not sure if there are other challenges like
 dependencies on the transport (Fibre Channel, SAS, etc) which make it
 impossible to pass through bus-level access?

I don't think bus-level pass-through is either easily possible or
desirable.  What multiple targets are useful for is allowing more
virtual disks than we have virtual PCI slots.  We could do this by
supporting multiple LUNs, but given that many SCSI resources are
target-based, doing multiple targets is most likely the more scalable
and more logical variant.  E.g. we could much more easily have one
virtqueue per target than one per LUN.



Re: KVM call agenda for June 28

2011-06-29 Thread Stefan Hajnoczi
On Wed, Jun 29, 2011 at 8:57 AM, Kevin Wolf kw...@redhat.com wrote:
 On 28.06.2011 21:41, Marcelo Tosatti wrote:
 stream
 --

 1) base - remote
 2) base - remote - local
 3) base - local

 local image is always valid. Requires backing file support.

 With the above, this restriction wouldn't apply any more.

 Also I don't think we should mix approaches. Either both block copy and
 image streaming use backing files, or none of them do. Mixing means
 duplicating more code, and even worse, that you can't stop a block copy
 in the middle and continue with streaming (which I believe is a really
 valuable feature to have).

Here is how the image streaming feature is used from HMP/QMP:

The guest is running from an image file with a backing file.  The aim
is to pull the data from the backing file and populate the image file
so that the dependency on the backing file can be eliminated.

1. Start a background streaming operation:

(qemu) block_stream -a ide0-hd

2. Check the status of the operation:

(qemu) info block-stream
Streaming device ide0-hd: Completed 512 of 34359738368 bytes

3. The status changes when the operation completes:

(qemu) info block-stream
No active stream

On completion the image file no longer has a backing file dependency.
When streaming completes QEMU updates the image file metadata to
indicate that no backing file is used.

The QMP interface is similar but provides QMP events to signal
streaming completion and failure.  Polling to query the streaming
status is only used when the management application wishes to refresh
progress information.

If guest execution is interrupted by a power failure or QEMU crash,
then the image file is still valid but streaming may be incomplete.
When QEMU is launched again the block_stream command can be issued to
resume streaming.

In the future we could add a 'base' argument to block_stream.  If base
is specified then data contained in the base image will not be copied.
This can be used to merge data from an intermediate image without
merging the base image.  When streaming completes the backing file
will be set to the base image.  The backing file relationship would
typically look like this:

1. Before block_stream -a -b base.img ide0-hd completion:

base.img - sn1 - ... - ide0-hd.qed

2. After streaming completes:

base.img - ide0-hd.qed

This describes the image streaming use cases that I, Adam, and Anthony
propose to support.  In the course of the discussion we've sometimes
been distracted with the internals of what a unified live block
copy/image streaming implementation should do.  I wanted to post this
summary of image streaming to refocus us on the use case and the APIs
that users will see.

Stefan


Re: virtio scsi host draft specification, v3

2011-06-29 Thread Hannes Reinecke

On 06/29/2011 12:07 PM, Christoph Hellwig wrote:

On Wed, Jun 29, 2011 at 10:39:42AM +0100, Stefan Hajnoczi wrote:

I think we're missing a level of addressing.  We need the ability to
talk to multiple target ports in order for "list target ports" to make
sense.  Right now there is one implicit target that handles all
commands.  That means there is one fixed I_T Nexus.

If we introduce "list target ports" we also need a way to say "This
CDB is destined for target port #0".  Then it is possible to enumerate
target ports and address targets independently of the LUN field in the
CDB.

I'm pretty sure this is also how SAS and other transports work.  In
their framing they include the target port.


Yes, exactly.  Hierarchical LUNs are a nasty fringe feature that we should
avoid as much as possible, that is, for everything but IBM vSCSI, which is
braindead enough to force them.


Yep.


The question is whether we really need to support multiple targets on
a virtio-scsi adapter or not.  If you are selectively mapping LUNs
that the guest may access, then multiple targets are not necessary.
If we want to do pass-through of the entire SCSI bus then we need
multiple targets but I'm not sure if there are other challenges like
dependencies on the transport (Fibre Channel, SAS, etc) which make it
impossible to pass through bus-level access?


I don't think bus-level pass-through is either easily possible or
desirable.  What multiple targets are useful for is allowing more
virtual disks than we have virtual PCI slots.  We could do this by
supporting multiple LUNs, but given that many SCSI resources are
target-based, doing multiple targets is most likely the more scalable
and more logical variant.  E.g. we could much more easily have one
virtqueue per target than per LUN.


The general idea here is that we can support NPIV.
With NPIV we'll have several scsi_hosts, each of which is assigned a 
different set of LUNs by the array.
With virtio we need to be able to react to LUN remapping on the array 
side, ie we need to be able to issue a 'REPORT LUNS' command and 
add/remove LUNs on the fly. This means we have to expose the 
scsi_host in some way via virtio.


This is impossible with a one-to-one mapping between targets and 
LUNs. The actual bus-level pass-through will be just on the SCSI 
layer, ie 'REPORT LUNS' should be possible. If and how we do a LUN 
remapping internally on the host is a totally different matter.
Same goes for the transport details; I doubt we will expose all the 
dingy details of the various transports, but rather restrict 
ourselves to an abstract transport.


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


[PATCH kvm-unit-tests v2] access: check SMEP on prefetch pte path

2011-06-29 Thread Yang, Wei
This patch adds SMEP to all test cases and checks SMEP on prefetch
pte path when cr0.wp=0.

 changes since v1:
Add SMEP to all test cases and verify it before setting cr4

 Signed-off-by: Yang, Wei wei.y.y...@intel.com
 Signed-off-by: Shan, Haitao haitao.s...@intel.com
 Signed-off-by: Li, Xin xin...@intel.com

---
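For reference, SMEP support is reported in CPUID leaf 7 (subleaf 0),
EBX bit 7; a minimal sketch of the check this patch relies on, using
the cpuid() helper already present in x86/access.c:

    static inline int cpu_has_smep(void)
    {
        /* CPUID.(EAX=7,ECX=0):EBX bit 7 set => SMEP is supported */
        return !!(cpuid(7).b & (1 << 7));
    }
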
 x86/access.c   |  108 ++--
 x86/cstart64.S |1 +
 2 files changed, 106 insertions(+), 3 deletions(-)

diff --git a/x86/access.c b/x86/access.c
index 7c8b9a5..22e5988 100644
--- a/x86/access.c
+++ b/x86/access.c
@@ -27,6 +27,7 @@ typedef unsigned long pt_element_t;
 #define PT_NX_MASK ((pt_element_t)1 << 63)
 
 #define CR0_WP_MASK (1UL << 16)
+#define CR4_SMEP_MASK (1UL << 20)
 
 #define PFERR_PRESENT_MASK (1U << 0)
 #define PFERR_WRITE_MASK (1U << 1)
@@ -70,6 +71,7 @@ enum {
 
 AC_CPU_EFER_NX,
 AC_CPU_CR0_WP,
+AC_CPU_CR4_SMEP,
 
 NR_AC_FLAGS
 };
@@ -96,6 +98,7 @@ const char *ac_names[] = {
 [AC_ACCESS_TWICE] = twice,
 [AC_CPU_EFER_NX] = efer.nx,
 [AC_CPU_CR0_WP] = cr0.wp,
+[AC_CPU_CR4_SMEP] = cr4.smep,
 };
 
 static inline void *va(pt_element_t phys)
@@ -130,6 +133,14 @@ typedef struct {
 
 static void ac_test_show(ac_test_t *at);
 
+int write_cr4_checking(unsigned long val)
+{
+asm volatile(ASM_TRY("1f")
+    "mov %0,%%cr4\n\t"
+    "1:" : : "r" (val));
+return exception_vector();
+}
+
 void set_cr0_wp(int wp)
 {
 unsigned long cr0 = read_cr0();
@@ -140,6 +151,16 @@ void set_cr0_wp(int wp)
 write_cr0(cr0);
 }
 
+void set_cr4_smep(int smep)
+{
+unsigned long cr4 = read_cr4();
+
+cr4 &= ~CR4_SMEP_MASK;
+if (smep)
+   cr4 |= CR4_SMEP_MASK;
+write_cr4(cr4);
+}
+
 void set_efer_nx(int nx)
 {
 unsigned long long efer;
@@ -187,7 +208,12 @@ int ac_test_bump_one(ac_test_t *at)
 
 _Bool ac_test_legal(ac_test_t *at)
 {
-if (at->flags[AC_ACCESS_FETCH] && at->flags[AC_ACCESS_WRITE])
+/*
+ * Since we convert current page to kernel page when cr4.smep=1,
+ * we can't switch to user mode.
+ */
+if ((at->flags[AC_ACCESS_FETCH] && at->flags[AC_ACCESS_WRITE]) ||
+    (at->flags[AC_ACCESS_USER] && at->flags[AC_CPU_CR4_SMEP]))
return false;
 return true;
 }
@@ -287,6 +313,9 @@ void ac_set_expected_status(ac_test_t *at)
 if (at->flags[AC_PDE_PSE]) {
	if (at->flags[AC_ACCESS_WRITE] && !at->expected_fault)
	at->expected_pde |= PT_DIRTY_MASK;
+	if (at->flags[AC_ACCESS_FETCH] && at->flags[AC_PDE_USER]
+	    && at->flags[AC_CPU_CR4_SMEP])
+	    at->expected_fault = 1;
goto no_pte;
 }
 
@@ -306,7 +335,11 @@ void ac_set_expected_status(ac_test_t *at)
 (at->flags[AC_CPU_CR0_WP] || at->flags[AC_ACCESS_USER]))
	at->expected_fault = 1;
 
-if (at->flags[AC_ACCESS_FETCH] && at->flags[AC_PTE_NX])
+if (at->flags[AC_ACCESS_FETCH]
+    && (at->flags[AC_PTE_NX]
+        || (at->flags[AC_CPU_CR4_SMEP]
+            && at->flags[AC_PDE_USER]
+            && at->flags[AC_PTE_USER])))
 at->expected_fault = 1;
 
 if (at->expected_fault)
@@ -320,7 +353,7 @@ no_pte:
 fault:
 if (!at->expected_fault)
 at->ignore_pde = 0;
-if (!at->flags[AC_CPU_EFER_NX])
+if (!at->flags[AC_CPU_EFER_NX] && !at->flags[AC_CPU_CR4_SMEP])
 at->expected_error &= ~PFERR_FETCH_MASK;
 }
 
@@ -469,6 +502,14 @@ int ac_test_do_access(ac_test_t *at)
 unsigned r = unique;
 set_cr0_wp(at->flags[AC_CPU_CR0_WP]);
 set_efer_nx(at->flags[AC_CPU_EFER_NX]);
+if (at->flags[AC_CPU_CR4_SMEP] && !(cpuid(7).b & (1 << 7))) {
+   unsigned long cr4 = read_cr4();
+   if (write_cr4_checking(cr4 | CR4_SMEP_MASK) == GP_VECTOR)
+   goto done;
+   printf("Set SMEP in CR4 - expect #GP: FAIL!\n");
+   return 0;
+}
+set_cr4_smep(at->flags[AC_CPU_CR4_SMEP]);
 
 if (at->flags[AC_ACCESS_TWICE]) {
asm volatile (
@@ -544,6 +585,7 @@ int ac_test_do_access(ac_test_t *at)
   !pt_match(*at->pdep, at->expected_pde, at->ignore_pde),
   "pde %x expected %x", *at->pdep, at->expected_pde);
 
+done:
 if (success && verbose) {
 printf("PASS\n");
 }
@@ -645,6 +687,59 @@ err:
 return 0;
 }
 
+static int check_smep_on_prefetch_pte(ac_pool_t *pool)
+{
+   ac_test_t at1;
+   int err_prepare_notwp, err_smep_notwp;
+   extern u64 ptl2[];
+
+   ac_test_init(&at1, (void *)(0x123406001000));
+
+   at1.flags[AC_PDE_PRESENT] = 1;
+   at1.flags[AC_PTE_PRESENT] = 1;
+   at1.flags[AC_PDE_USER] = 1;
+   at1.flags[AC_PTE_USER] = 1;
+   at1.flags[AC_PDE_ACCESSED] = 1;
+   at1.flags[AC_PTE_ACCESSED] = 1;
+   at1.flags[AC_CPU_CR4_SMEP] = 1;
+   at1.flags[AC_CPU_CR0_WP] = 0;
+   at1.flags[AC_ACCESS_WRITE] = 1;
+   ac_test_setup_pte(&at1, pool);
+   ptl2[2] -= 0x4;
+
+   /*
+* Here we write the ro user page when
+* cr0.wp=0, then we execute it and SMEP
+* fault should happen.
+*/
+   err_prepare_notwp 

Re: virtio scsi host draft specification, v3

2011-06-29 Thread Christoph Hellwig
On Wed, Jun 29, 2011 at 12:23:38PM +0200, Hannes Reinecke wrote:
 The general idea here is that we can support NPIV.
 With NPIV we'll have several scsi_hosts, each of which is assigned a
 different set of LUNs by the array.
 With virtio we need to be able to react to LUN remapping on the array
 side, ie we need to be able to issue a 'REPORT LUNS' command and
 add/remove LUNs on the fly. This means we have to expose the
 scsi_host in some way via virtio.
 
 This is impossible with a one-to-one mapping between targets and
 LUNs. The actual bus-level pass-through will be just on the SCSI
 layer, ie 'REPORT LUNS' should be possible. If and how we do a LUN
 remapping internally on the host is a totally different matter.
 Same goes for the transport details; I doubt we will expose all the
 dingy details of the various transports, but rather restrict
 ourselves to an abstract transport.

If we want to support traditional NPIV that's what we have to do.
I still hope we'll see broad SR-IOV support for FC adapters soon,
which would ease all this greatly.



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Michael S. Tsirkin
On Wed, Jun 29, 2011 at 12:06:29PM +0200, Paolo Bonzini wrote:
 On 06/29/2011 12:03 PM, Christoph Hellwig wrote:
   I agree here, in fact I misread Hannes's comment as if a driver
   uses more than one queue it is responsibility of the driver to
   ensure strict request ordering.  If you send requests to different
   queues, you know that those requests are independent.  I don't think
   anything else is feasible in the virtio framework.
 
 That doesn't really fit very well with the SAM model.  If we want
 to use multiple queues for a single LUN it has to be transparent to
 the SCSI command stream.  Then again I don't quite see the use for
 that anyway.
 
 Agreed, I see a use for multiple queues (MSI-X), but not for
 multiple queues shared by a single LUN.
 
 Paolo

Then let's make it explicit in the spec?

-- 
MST


Re: virtio scsi host draft specification, v3

2011-06-29 Thread Paolo Bonzini

On 06/29/2011 12:31 PM, Michael S. Tsirkin wrote:

On Wed, Jun 29, 2011 at 12:06:29PM +0200, Paolo Bonzini wrote:

On 06/29/2011 12:03 PM, Christoph Hellwig wrote:

  I agree here, in fact I misread Hannes's comment as if a driver
  uses more than one queue it is responsibility of the driver to
  ensure strict request ordering.  If you send requests to different
  queues, you know that those requests are independent.  I don't think
  anything else is feasible in the virtio framework.


That doesn't really fit very well with the SAM model.  If we want
to use multiple queues for a single LUN it has to be transparent to
the SCSI command stream.  Then again I don't quite see the use for
that anyway.


Agreed, I see a use for multiple queues (MSI-X), but not for
multiple queues shared by a single LUN.


Then let's make it explicit in the spec?


What, forbid it or say ordering is not guaranteed?  The latter is 
already explicit with the wording suggested in the thread.


Paolo


[PATCH 06/17] KVM: PPC: Pass init/destroy vm and prepare/commit memory region ops down

2011-06-29 Thread Paul Mackerras
This arranges for the top-level arch/powerpc/kvm/powerpc.c file to
pass down some of the calls it gets to the lower-level subarchitecture
specific code.  The lower-level implementations (in booke.c and book3s.c)
are no-ops.  The coming book3s_hv.c will need this.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_ppc.h |7 +++
 arch/powerpc/kvm/book3s_pr.c   |   20 
 arch/powerpc/kvm/booke.c   |   20 
 arch/powerpc/kvm/powerpc.c |9 ++---
 4 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index c662f14..9b6f3f9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -111,6 +111,13 @@ extern void kvmppc_core_destroy_mmu(struct kvm_vcpu *vcpu);
 extern int kvmppc_kvm_pv(struct kvm_vcpu *vcpu);
 extern void kvmppc_map_magic(struct kvm_vcpu *vcpu);
 
+extern int kvmppc_core_init_vm(struct kvm *kvm);
+extern void kvmppc_core_destroy_vm(struct kvm *kvm);
+extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
+   struct kvm_userspace_memory_region *mem);
+extern void kvmppc_core_commit_memory_region(struct kvm *kvm,
+   struct kvm_userspace_memory_region *mem);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index fcdc97e..72b20b8 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -984,6 +984,26 @@ int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
return ret;
 }
 
+int kvmppc_core_prepare_memory_region(struct kvm *kvm,
+ struct kvm_userspace_memory_region *mem)
+{
+   return 0;
+}
+
+void kvmppc_core_commit_memory_region(struct kvm *kvm,
+   struct kvm_userspace_memory_region *mem)
+{
+}
+
+int kvmppc_core_init_vm(struct kvm *kvm)
+{
+   return 0;
+}
+
+void kvmppc_core_destroy_vm(struct kvm *kvm)
+{
+}
+
 static int kvmppc_book3s_init(void)
 {
int r;
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 9f2e4a5..9066325 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -865,6 +865,26 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
return -ENOTSUPP;
 }
 
+int kvmppc_core_prepare_memory_region(struct kvm *kvm,
+ struct kvm_userspace_memory_region *mem)
+{
+   return 0;
+}
+
+void kvmppc_core_commit_memory_region(struct kvm *kvm,
+   struct kvm_userspace_memory_region *mem)
+{
+}
+
+int kvmppc_core_init_vm(struct kvm *kvm)
+{
+   return 0;
+}
+
+void kvmppc_core_destroy_vm(struct kvm *kvm)
+{
+}
+
 int __init kvmppc_booke_init(void)
 {
unsigned long ivor[16];
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 24e2b64..0c80e15 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -148,7 +148,7 @@ void kvm_arch_check_processor_compat(void *rtn)
 
 int kvm_arch_init_vm(struct kvm *kvm)
 {
-   return 0;
+   return kvmppc_core_init_vm(kvm);
 }
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
@@ -164,6 +164,9 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 kvm->vcpus[i] = NULL;
 
 atomic_set(&kvm->online_vcpus, 0);
+
+   kvmppc_core_destroy_vm(kvm);
+
 mutex_unlock(&kvm->lock);
 }
 
@@ -212,7 +215,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
int user_alloc)
 {
-   return 0;
+   return kvmppc_core_prepare_memory_region(kvm, mem);
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
@@ -220,7 +223,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
struct kvm_memory_slot old,
int user_alloc)
 {
-   return;
+   kvmppc_core_commit_memory_region(kvm, mem);
 }
 
 
-- 
1.7.5.4



[PATCH 04/17] powerpc, KVM: Rework KVM checks in first-level interrupt handlers

2011-06-29 Thread Paul Mackerras
Instead of branching out-of-line with the DO_KVM macro to check if we
are in a KVM guest at the time of an interrupt, this moves the KVM
check inline in the first-level interrupt handlers.  This speeds up
the non-KVM case and makes sure that none of the interrupt handlers
are missing the check.

Because the first-level interrupt handlers are now larger, some things
had to be moved out of line in exceptions-64s.S.

This all necessitated some minor changes to the interrupt entry code
in KVM.  This also streamlines the book3s_32 KVM test.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/exception-64s.h   |  121 --
 arch/powerpc/kernel/exceptions-64s.S   |  189 +---
 arch/powerpc/kvm/book3s_rmhandlers.S   |   78 ++--
 arch/powerpc/kvm/book3s_segment.S  |7 +
 arch/powerpc/platforms/iseries/exception.S |2 +-
 arch/powerpc/platforms/iseries/exception.h |4 +-
 6 files changed, 247 insertions(+), 154 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index f5dfe34..b6a3a44 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -61,19 +61,22 @@
 #define EXC_HV H
 #define EXC_STD
 
-#define EXCEPTION_PROLOG_1(area)   \
+#define __EXCEPTION_PROLOG_1(area, extra, vec) \
GET_PACA(r13);  \
std r9,area+EX_R9(r13); /* save r9 - r12 */ \
std r10,area+EX_R10(r13);   \
-   std r11,area+EX_R11(r13);   \
-   std r12,area+EX_R12(r13);   \
BEGIN_FTR_SECTION_NESTED(66);   \
mfspr   r10,SPRN_CFAR;  \
std r10,area+EX_CFAR(r13);  \
END_FTR_SECTION_NESTED(CPU_FTR_CFAR, CPU_FTR_CFAR, 66); \
-   GET_SCRATCH0(r9);   \
-   std r9,area+EX_R13(r13);\
-   mfcr    r9
+   mfcr    r9; \
+   extra(vec); \
+   std r11,area+EX_R11(r13);   \
+   std r12,area+EX_R12(r13);   \
+   GET_SCRATCH0(r10);  \
+   std r10,area+EX_R13(r13)
+#define EXCEPTION_PROLOG_1(area, extra, vec)   \
+   __EXCEPTION_PROLOG_1(area, extra, vec)
 
 #define __EXCEPTION_PROLOG_PSERIES_1(label, h) \
ld  r12,PACAKBASE(r13); /* get high part of label */   \
@@ -85,13 +88,54 @@
mtspr   SPRN_##h##SRR1,r10; \
h##rfid;\
b   .   /* prevent speculative execution */
-#define EXCEPTION_PROLOG_PSERIES_1(label, h) \
+#define EXCEPTION_PROLOG_PSERIES_1(label, h)   \
__EXCEPTION_PROLOG_PSERIES_1(label, h)
 
-#define EXCEPTION_PROLOG_PSERIES(area, label, h)   \
-   EXCEPTION_PROLOG_1(area);   \
+#define EXCEPTION_PROLOG_PSERIES(area, label, h, extra, vec)   \
+   EXCEPTION_PROLOG_1(area, extra, vec);   \
EXCEPTION_PROLOG_PSERIES_1(label, h);
 
+#define __KVMTEST(n)   \
+   lbz r10,PACA_KVM_SVCPU+SVCPU_IN_GUEST(r13); \
+   cmpwi   r10,0;  \
+   bne do_kvm_##n
+
+#define __KVM_HANDLER(area, h, n)  \
+do_kvm_##n:\
+   ld  r10,area+EX_R10(r13);   \
+   stw r9,PACA_KVM_SVCPU+SVCPU_SCRATCH1(r13);  \
+   ld  r9,area+EX_R9(r13); \
+   std r12,PACA_KVM_SVCPU+SVCPU_SCRATCH0(r13); \
+   li  r12,n;  \
+   b   kvmppc_interrupt
+
+#define __KVM_HANDLER_SKIP(area, h, n) \
+do_kvm_##n:\
+   cmpwi   r10,KVM_GUEST_MODE_SKIP;\
+   ld  r10,area+EX_R10(r13);   \
+   beq 89f;\
+   stw r9,PACA_KVM_SVCPU+SVCPU_SCRATCH1(r13);  \
+   ld  r9,area+EX_R9(r13); 

[PATCH 07/17] KVM: PPC: Move guest enter/exit down into subarch-specific code

2011-06-29 Thread Paul Mackerras
Instead of doing the kvm_guest_enter/exit() and local_irq_dis/enable()
calls in powerpc.c, this moves them down into the subarch-specific
book3s_pr.c and booke.c.  This eliminates an extra local_irq_enable()
call in book3s_pr.c, and will be needed for when we do SMT4 guest
support in the book3s hypervisor mode code.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_ppc.h   |1 +
 arch/powerpc/kvm/book3s_interrupts.S |2 +-
 arch/powerpc/kvm/book3s_pr.c |   12 ++--
 arch/powerpc/kvm/booke.c |   13 +
 arch/powerpc/kvm/powerpc.c   |6 +-
 5 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 9b6f3f9..48b7ab7 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -42,6 +42,7 @@ enum emulation_result {
EMULATE_AGAIN,/* something went wrong. go again */
 };
 
+extern int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
 extern int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
 extern char kvmppc_handlers_start[];
 extern unsigned long kvmppc_handler_len;
diff --git a/arch/powerpc/kvm/book3s_interrupts.S 
b/arch/powerpc/kvm/book3s_interrupts.S
index 2f0bc92..8c5e0e1 100644
--- a/arch/powerpc/kvm/book3s_interrupts.S
+++ b/arch/powerpc/kvm/book3s_interrupts.S
@@ -85,7 +85,7 @@
  *  r3: kvm_run pointer
  *  r4: vcpu pointer
  */
-_GLOBAL(__kvmppc_vcpu_entry)
+_GLOBAL(__kvmppc_vcpu_run)
 
 kvm_start_entry:
/* Write correct stack frame */
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 72b20b8..0c0d3f2 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -891,8 +891,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
vfree(vcpu_book3s);
 }
 
-extern int __kvmppc_vcpu_entry(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
-int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
+int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 {
int ret;
double fpr[32][TS_FPRWIDTH];
@@ -944,14 +943,15 @@ int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
/* Remember the MSR with disabled extensions */
 ext_msr = current->thread.regs->msr;
 
-   /* XXX we get called with irq disabled - change that! */
-   local_irq_enable();
-
/* Preload FPU if it's enabled */
 if (vcpu->arch.shared->msr & MSR_FP)
kvmppc_handle_ext(vcpu, BOOK3S_INTERRUPT_FP_UNAVAIL, MSR_FP);
 
-   ret = __kvmppc_vcpu_entry(kvm_run, vcpu);
+   kvm_guest_enter();
+
+   ret = __kvmppc_vcpu_run(kvm_run, vcpu);
+
+   kvm_guest_exit();
 
local_irq_disable();
 
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 9066325..ee45fa0 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -312,6 +312,19 @@ void kvmppc_core_deliver_interrupts(struct kvm_vcpu *vcpu)
 vcpu->arch.shared->int_pending = 0;
 }
 
+int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
+{
+   int ret;
+
+   local_irq_disable();
+   kvm_guest_enter();
+   ret = __kvmppc_vcpu_run(kvm_run, vcpu);
+   kvm_guest_exit();
+   local_irq_enable();
+
+   return ret;
+}
+
 /**
  * kvmppc_handle_exit
  *
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 0c80e15..026036e 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -500,11 +500,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
 
kvmppc_core_deliver_interrupts(vcpu);
 
-   local_irq_disable();
-   kvm_guest_enter();
-   r = __kvmppc_vcpu_run(run, vcpu);
-   kvm_guest_exit();
-   local_irq_enable();
+   r = kvmppc_vcpu_run(run, vcpu);
 
 if (vcpu->sigset_active)
 sigprocmask(SIG_SETMASK, &sigsaved, NULL);
-- 
1.7.5.4



[PATCH 05/17] KVM: PPC: Deliver program interrupts right away instead of queueing them

2011-06-29 Thread Paul Mackerras
Doing so means that we don't have to save the flags anywhere and gets
rid of the last reference to to_book3s(vcpu) in arch/powerpc/kvm/book3s.c.

Doing so is OK because a program interrupt won't be generated at the
same time as any other synchronous interrupt.  If a program interrupt
and an asynchronous interrupt (external or decrementer) are generated
at the same time, the program interrupt will be delivered, which is
correct because it has a higher priority, and then the asynchronous
interrupt will be masked.

We don't ever generate system reset or machine check interrupts to the
guest, but if we did, then we would need to make sure they got delivered
rather than the program interrupt.  The current code would be wrong in
this situation anyway since it would deliver the program interrupt as
well as the reset/machine check interrupt.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s.c |8 +++-
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 163e3e1..f68a34d 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -129,8 +129,8 @@ void kvmppc_book3s_queue_irqprio(struct kvm_vcpu *vcpu, 
unsigned int vec)
 
 void kvmppc_core_queue_program(struct kvm_vcpu *vcpu, ulong flags)
 {
-   to_book3s(vcpu)->prog_flags = flags;
-   kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_PROGRAM);
+   /* might as well deliver this straight away */
+   kvmppc_inject_interrupt(vcpu, BOOK3S_INTERRUPT_PROGRAM, flags);
 }
 
 void kvmppc_core_queue_dec(struct kvm_vcpu *vcpu)
@@ -170,7 +170,6 @@ int kvmppc_book3s_irqprio_deliver(struct kvm_vcpu *vcpu, 
unsigned int priority)
 {
int deliver = 1;
int vec = 0;
-   ulong flags = 0ULL;
bool crit = kvmppc_critical_section(vcpu);
 
switch (priority) {
@@ -206,7 +205,6 @@ int kvmppc_book3s_irqprio_deliver(struct kvm_vcpu *vcpu, 
unsigned int priority)
break;
case BOOK3S_IRQPRIO_PROGRAM:
vec = BOOK3S_INTERRUPT_PROGRAM;
-   flags = to_book3s(vcpu)->prog_flags;
break;
case BOOK3S_IRQPRIO_VSX:
vec = BOOK3S_INTERRUPT_VSX;
@@ -237,7 +235,7 @@ int kvmppc_book3s_irqprio_deliver(struct kvm_vcpu *vcpu, 
unsigned int priority)
 #endif
 
if (deliver)
-   kvmppc_inject_interrupt(vcpu, vec, flags);
+   kvmppc_inject_interrupt(vcpu, vec, 0);
 
return deliver;
 }
-- 
1.7.5.4



[PATCH 11/17] KVM: PPC: Handle some PAPR hcalls in the kernel

2011-06-29 Thread Paul Mackerras
This adds the infrastructure for handling PAPR hcalls in the kernel,
either early in the guest exit path while we are still in real mode,
or later once the MMU has been turned back on and we are in the full
kernel context.  The advantage of handling hcalls in real mode if
possible is that we avoid two partition switches -- and this will
become more important when we support SMT4 guests, since a partition
switch means we have to pull all of the threads in the core out of
the guest.  The disadvantage is that we can only access the kernel
linear mapping, not anything vmalloced or ioremapped, since the MMU
is off.

This also adds code to handle the following hcalls in real mode:

H_ENTER   Add an HPTE to the hashed page table
H_REMOVE  Remove an HPTE from the hashed page table
H_READRead HPTEs from the hashed page table
H_PROTECT Change the protection bits in an HPTE
H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
H_SET_DABRSet the data address breakpoint register

Plus code to handle the following hcalls in the kernel:

H_CEDEIdle the vcpu until an interrupt or H_PROD hcall arrives
H_PRODWake up a ceded vcpu
H_REGISTER_VPA Register a virtual processor area (VPA)

The code that runs in real mode has to be in the base kernel, not in
the module, if KVM is compiled as a module.  The real-mode code can
only access the kernel linear mapping, not vmalloc or ioremap space.

Signed-off-by: Paul Mackerras pau...@samba.org
---
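For reference, a minimal sketch of the real-mode fallback convention
described above (handler and helper names are illustrative, not the
actual patch code):

    long kvmppc_rm_h_example(struct kvm_vcpu *vcpu, unsigned long gpa)
    {
        /* In real mode only the kernel linear mapping is usable. */
        if (!gpa_in_linear_mapping(gpa))    /* hypothetical check */
            return H_TOO_HARD;  /* redo in full kernel context */
        /* ... fast-path work on the linear mapping only ... */
        return H_SUCCESS;
    }
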
 arch/powerpc/include/asm/hvcall.h   |5 +
 arch/powerpc/include/asm/kvm_host.h |   11 +
 arch/powerpc/include/asm/kvm_ppc.h  |1 +
 arch/powerpc/kernel/asm-offsets.c   |2 +
 arch/powerpc/kvm/Makefile   |8 +-
 arch/powerpc/kvm/book3s_hv.c|  170 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  368 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  158 +-
 arch/powerpc/kvm/powerpc.c  |2 +-
 9 files changed, 718 insertions(+), 7 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_rm_mmu.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index fd8201d..1c324ff 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -29,6 +29,10 @@
 #define H_LONG_BUSY_ORDER_100_SEC  9905  /* Long busy, hint that 100sec \
 is a good time to retry */
 #define H_LONG_BUSY_END_RANGE  9905  /* End of long busy range */
+
+/* Internal value used in book3s_hv kvm support; not returned to guests */
+#define H_TOO_HARD	9999
+
 #define H_HARDWARE -1  /* Hardware error */
 #define H_FUNCTION -2  /* Function not supported */
 #define H_PRIVILEGE-3  /* Caller not privileged */
@@ -100,6 +104,7 @@
 #define H_PAGE_SET_ACTIVE  H_PAGE_STATE_CHANGE
 #define H_AVPN			(1UL << (63-32))	/* An avpn is provided as a sanity test */
 #define H_ANDCOND		(1UL << (63-33))
+#define H_LOCAL		(1UL << (63-35))
 #define H_ICACHE_INVALIDATE	(1UL << (63-40))	/* icbi, etc.  (ignored for IO pages) */
 #define H_ICACHE_SYNCHRONIZE	(1UL << (63-41))	/* dcbst, icbi, etc (ignored for IO pages */
 #define H_COALESCE_CAND	(1UL << (63-42))	/* page is a good candidate for coalescing */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 4a3f790..6ebf172 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -59,6 +59,10 @@ struct kvm;
 struct kvm_run;
 struct kvm_vcpu;
 
+struct lppaca;
+struct slb_shadow;
+struct dtl;
+
 struct kvm_vm_stat {
u32 remote_tlb_flush;
 };
@@ -344,7 +348,14 @@ struct kvm_vcpu_arch {
u64 dec_expires;
unsigned long pending_exceptions;
u16 last_cpu;
+   u8 ceded;
+   u8 prodded;
u32 last_inst;
+
+   struct lppaca *vpa;
+   struct slb_shadow *slb_shadow;
+   struct dtl *dtl;
+   struct dtl *dtl_end;
int trap;
struct kvm_vcpu_arch_shared *shared;
unsigned long magic_page_pa; /* phys addr to map the magic page to */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 0dafd53..2afe92e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -118,6 +118,7 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
struct kvm_userspace_memory_region *mem);
 extern void kvmppc_map_vrma(struct kvm *kvm,
struct kvm_userspace_memory_region *mem);
+extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern int kvmppc_core_init_vm(struct kvm *kvm);
 extern void kvmppc_core_destroy_vm(struct kvm *kvm);
 extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 9362674..c70d106 

[PATCH 01/17] KVM: PPC: Fix machine checks on 32-bit Book3S

2011-06-29 Thread Paul Mackerras
Commit 69acc0d3ba (KVM: PPC: Resolve real-mode handlers through
function exports) resulted in vcpu->arch.trampoline_lowmem and
vcpu->arch.trampoline_enter ending up with kernel virtual addresses
rather than physical addresses.  This is OK on 64-bit Book3S machines,
which ignore the top 4 bits of the effective address in real mode,
but on 32-bit Book3S machines, accessing these addresses in real mode
causes machine check interrupts, as the hardware uses the whole
effective address as the physical address in real mode.

This fixes the problem by using __pa() to convert these addresses
to physical addresses.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 73fdab8..83500fb 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -28,6 +28,7 @@
 #include asm/kvm_ppc.h
 #include asm/kvm_book3s.h
 #include asm/mmu_context.h
+#include asm/page.h
 #include linux/gfp.h
 #include linux/sched.h
 #include linux/vmalloc.h
@@ -1342,8 +1343,8 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, 
unsigned int id)
vcpu_book3s-slb_nr = 64;
 
/* remember where some real-mode handlers are */
-   vcpu->arch.trampoline_lowmem = (ulong)kvmppc_handler_lowmem_trampoline;
-   vcpu->arch.trampoline_enter = (ulong)kvmppc_handler_trampoline_enter;
+   vcpu->arch.trampoline_lowmem = __pa(kvmppc_handler_lowmem_trampoline);
+   vcpu->arch.trampoline_enter = __pa(kvmppc_handler_trampoline_enter);
 vcpu->arch.highmem_handler = (ulong)kvmppc_handler_highmem;
 #ifdef CONFIG_PPC_BOOK3S_64
 vcpu->arch.rmcall = *(ulong*)kvmppc_rmcall;
-- 
1.7.5.4



[PATCH 15/17] powerpc, KVM: Split HVMODE_206 cpu feature bit into separate HV and architecture bits

2011-06-29 Thread Paul Mackerras
This replaces the single CPU_FTR_HVMODE_206 bit with two bits, one to
indicate that we have a usable hypervisor mode, and another to indicate
that the processor conforms to PowerISA version 2.06.  We also add
another bit to indicate that the processor conforms to ISA version 2.01
and set that for PPC970 and derivatives.

Some PPC970 chips (specifically those in Apple machines) have a
hypervisor mode in that MSR[HV] is always 1, but the hypervisor mode
is not useful in the sense that there is no way to run any code in
supervisor mode (HV=0 PR=0).  On these processors, the LPES0 and LPES1
bits in HID4 are always 0, and we use that as a way of detecting that
hypervisor mode is not useful.

Where we have a feature section in assembly code around code that
only applies on POWER7 in hypervisor mode, we use a construct like

END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)

The definition of END_FTR_SECTION_IFSET is such that the code will
be enabled (not overwritten with nops) only if all bits in the
provided mask are set.

Note that the CPU feature check in __tlbie() only needs to check the
ARCH_206 bit, not the HVMODE bit, because __tlbie() can only get called
if we are running bare-metal, i.e. in hypervisor mode.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/cputable.h|   14 --
 arch/powerpc/include/asm/reg.h |   16 
 arch/powerpc/kernel/cpu_setup_power7.S |4 ++--
 arch/powerpc/kernel/cpu_setup_ppc970.S |   26 ++
 arch/powerpc/kernel/exceptions-64s.S   |4 ++--
 arch/powerpc/kernel/paca.c |2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c|3 ++-
 arch/powerpc/kvm/book3s_hv.c   |3 ++-
 arch/powerpc/kvm/book3s_hv_builtin.c   |4 ++--
 arch/powerpc/kvm/book3s_segment.S  |2 +-
 arch/powerpc/mm/hash_native_64.c   |4 ++--
 11 files changed, 56 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index c0d842c..e30442c 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -179,8 +179,9 @@ extern const char *powerpc_base_platform;
 #define LONG_ASM_CONST(x)  0
 #endif
 
-
-#define CPU_FTR_HVMODE_206 LONG_ASM_CONST(0x0008)
+#define CPU_FTR_HVMODE LONG_ASM_CONST(0x0002)
+#define CPU_FTR_ARCH_201   LONG_ASM_CONST(0x0004)
+#define CPU_FTR_ARCH_206   LONG_ASM_CONST(0x0008)
 #define CPU_FTR_CFAR   LONG_ASM_CONST(0x0010)
 #define CPU_FTR_IABR   LONG_ASM_CONST(0x0020)
 #define CPU_FTR_MMCRA  LONG_ASM_CONST(0x0040)
@@ -401,9 +402,10 @@ extern const char *powerpc_base_platform;
CPU_FTR_MMCRA | CPU_FTR_CP_USE_DCBTZ | \
CPU_FTR_STCX_CHECKS_ADDRESS)
 #define CPU_FTRS_PPC970(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
-   CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
+   CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_201 | \
CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA | \
-   CPU_FTR_CP_USE_DCBTZ | CPU_FTR_STCX_CHECKS_ADDRESS)
+   CPU_FTR_CP_USE_DCBTZ | CPU_FTR_STCX_CHECKS_ADDRESS | \
+   CPU_FTR_HVMODE)
 #define CPU_FTRS_POWER5(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -417,13 +419,13 @@ extern const char *powerpc_base_platform;
CPU_FTR_DSCR | CPU_FTR_UNALIGNED_LD_STD | \
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_CFAR)
 #define CPU_FTRS_POWER7 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
-   CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_HVMODE_206 |\
+   CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_206 |\
CPU_FTR_MMCRA | CPU_FTR_SMT | \
CPU_FTR_COHERENT_ICACHE | \
CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
CPU_FTR_DSCR | CPU_FTR_SAO  | CPU_FTR_ASYM_SMT | \
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
-   CPU_FTR_ICSWX | CPU_FTR_CFAR)
+   CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE)
 #define CPU_FTRS_CELL  (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 20a053c..ddbe57a 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -307,6 +307,7 @@
 #define SPRN_HASH1 0x3D2   /* Primary Hash Address Register */
 #define SPRN_HASH2 0x3D3   /* Secondary Hash Address Resgister */
 #define SPRN_HID0  0x3F0   /* Hardware Implementation Register 0 */
+#define HID0_HDICE_SH  (63 - 23)   /* 

[RFC PATCH 17/17] KVM: PPC: Add an ioctl for userspace to select which platform to emulate

2011-06-29 Thread Paul Mackerras
This new ioctl allows userspace to specify what paravirtualization
interface (if any) KVM should implement, what architecture version
the guest virtual processors should conform to, and whether the guest
can be permitted to use a real supervisor mode.

At present the only effect of the ioctl is to indicate whether the
requested emulation is available, but in future it may be used to
select between different emulation techniques (book3s_pr vs. book3s_hv)
or set the CPU compatibility mode for the guest.

If book3s_pr KVM is enabled in the kernel config, then this new
ioctl accepts platform values of KVM_PPC_PV_NONE and KVM_PPC_PV_KVM,
but not KVM_PPC_PV_SPAPR.  If book3s_hv KVM is enabled, then this
ioctl requires that the platform is KVM_PPC_PV_SPAPR and the
guest_arch field contains one of 201 or 206 (for architecture versions
2.01 and 2.06) -- when running on a PPC970, it must contain 201, and
when running on a POWER7, it must contain 206.

Signed-off-by: Paul Mackerras pau...@samba.org
---
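For illustration, userspace might invoke the new ioctl roughly like
this (a hedged sketch; vm_fd is the VM file descriptor from
KVM_CREATE_VM, and the error path is hypothetical):

    struct kvm_ppc_set_platform plat = {
        .platform   = KVM_PPC_PV_SPAPR,
        .guest_arch = 206,      /* POWER7, ISA v2.06 */
        .flags      = 0,
    };
    if (ioctl(vm_fd, KVM_PPC_SET_PLATFORM, &plat) < 0)
        handle_unsupported();   /* requested emulation not supported */
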
 Documentation/virtual/kvm/api.txt   |   35 +++
 arch/powerpc/include/asm/kvm.h  |   15 +++
 arch/powerpc/include/asm/kvm_host.h |1 +
 arch/powerpc/kvm/powerpc.c  |   28 
 include/linux/kvm.h |1 +
 5 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index b0e4b9c..3ab012c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1430,6 +1430,41 @@ is supported; 2 if the processor requires all virtual 
machines to have
 an RMA, or 1 if the processor can use an RMA but doesn't require it,
 because it supports the Virtual RMA (VRMA) facility.
 
+4.64 KVM_PPC_SET_PLATFORM
+
+Capability: none
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_ppc_set_platform (in)
+Returns: 0, or -1 on error
+
+This is used by userspace to tell KVM what sort of platform it should
+emulate.  The return value of the ioctl tells userspace whether the
+emulation it is requesting is supported by KVM.
+
+struct kvm_ppc_set_platform {
+   __u16 platform; /* defines the OS/hypervisor ABI */
+   __u16 guest_arch;   /* e.g. decimal 206 for v2.06 */
+   __u32 flags;
+};
+
+/* Values for platform */
+#define KVM_PPC_PV_NONE    0   /* bare-metal, non-paravirtualized */
+#define KVM_PPC_PV_KVM     1   /* as defined in kvm_para.h */
+#define KVM_PPC_PV_SPAPR   2   /* IBM Server PAPR (a la PowerVM) */
+
+/* Values for flags */
+#define KVM_PPC_CROSS_ARCH 1   /* guest architecture != host */
+
+The KVM_PPC_CROSS_ARCH bit being 1 indicates that the guest is of a
+sufficiently different architecture to the host that the guest cannot
+be permitted to use supervisor mode.  For example, if the host is a
+64-bit machine and the guest is a 32-bit machine, then this bit should
+be set.
+
+The return value is 0 if KVM supports the requested emulation, or -1
+with errno == EINVAL if not.
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index a4f6c85..0dd5cfb 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -287,4 +287,19 @@ struct kvm_allocate_rma {
__u64 rma_size;
 };
 
+/* for KVM_PPC_SET_PLATFORM */
+struct kvm_ppc_set_platform {
+   __u16 platform; /* defines the OS/hypervisor ABI */
+   __u16 guest_arch;   /* e.g. decimal 206 for v2.06 */
+   __u32 flags;
+};
+
+/* Values for platform */
+#define KVM_PPC_PV_NONE    0   /* bare-metal, non-paravirtualized */
+#define KVM_PPC_PV_KVM     1   /* as defined in kvm_para.h */
+#define KVM_PPC_PV_SPAPR   2   /* IBM Server PAPR (a la PowerVM) */
+
+/* Values for flags */
+#define KVM_PPC_CROSS_ARCH 1   /* guest architecture != host */
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index cc22b28..00e7f1b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -167,6 +167,7 @@ struct kvmppc_rma_info {
 };
 
 struct kvm_arch {
+   struct kvm_ppc_set_platform platform;
 #ifdef CONFIG_KVM_BOOK3S_64_HV
unsigned long hpt_virt;
unsigned long ram_npages;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index a107c9b..83265cd 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -690,6 +690,34 @@ long kvm_arch_vm_ioctl(struct file *filp,
break;
}
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
+   case KVM_PPC_SET_PLATFORM: {
+   struct kvm_ppc_set_platform plat;
 struct kvm *kvm = filp->private_data;
+
+   r = -EFAULT;
+   if (copy_from_user(&plat, argp, 

[PATCH 08/17] powerpc: Set up LPCR for running guest partitions

2011-06-29 Thread Paul Mackerras
In hypervisor mode, the LPCR controls several aspects of guest
partitions, including virtual partition memory mode, and also controls
whether the hypervisor decrementer interrupts are enabled.  This sets
up LPCR at boot time so that guest partitions will use a virtual real
memory area (VRMA) composed of 16MB large pages, and hypervisor
decrementer interrupts are disabled.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/reg.h |4 
 arch/powerpc/kernel/cpu_setup_power7.S |   18 +++---
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index c5cae0d..d879a6b 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -232,10 +232,12 @@
 #define   LPCR_VPM0    (1ul << (63-0))
 #define   LPCR_VPM1    (1ul << (63-1))
 #define   LPCR_ISL     (1ul << (63-2))
+#define   LPCR_VC_SH   (63-2)
 #define   LPCR_DPFD_SH (63-11)
 #define   LPCR_VRMA_L  (1ul << (63-12))
 #define   LPCR_VRMA_LP0 (1ul << (63-15))
 #define   LPCR_VRMA_LP1 (1ul << (63-16))
+#define   LPCR_VRMASD_SH (63-16)
 #define   LPCR_RMLS    0x1C00  /* impl dependent rmo limit sel */
 #define   LPCR_ILE     0x0200  /* !HV irqs set MSR:LE */
 #define   LPCR_PECE    0x7000  /* powersave exit cause enable */
@@ -243,8 +245,10 @@
 #define LPCR_PECE1 0x2000  /* decrementer can cause exit */
 #define LPCR_PECE2 0x1000  /* machine check etc can cause exit */
 #define   LPCR_MER 0x0800  /* Mediated External Exception */
+#define   LPCR_LPES    0x000c
 #define   LPCR_LPES0   0x0008  /* LPAR Env selector 0 */
 #define   LPCR_LPES1   0x0004  /* LPAR Env selector 1 */
+#define   LPCR_LPES_SH 2
 #define   LPCR_RMI 0x0002  /* real mode is cache inhibit */
 #define   LPCR_HDICE   0x0001  /* Hyp Decr enable (HV,PR,EE) */
 #define SPRN_LPID  0x13F   /* Logical Partition Identifier */
diff --git a/arch/powerpc/kernel/cpu_setup_power7.S 
b/arch/powerpc/kernel/cpu_setup_power7.S
index 4f9a93f..2ef6749 100644
--- a/arch/powerpc/kernel/cpu_setup_power7.S
+++ b/arch/powerpc/kernel/cpu_setup_power7.S
@@ -61,19 +61,23 @@ __init_LPCR:
 *   LPES = 0b01 (HSRR0/1 used for 0x500)
 *   PECE = 0b111
 *   DPFD = 4
+*   HDICE = 0
+*   VC = 0b100 (VPM0=1, VPM1=0, ISL=0)
+*   VRMASD = 0b1 (L=1, LP=00)
 *
 * Other bits untouched for now
 */
mfspr   r3,SPRN_LPCR
-   ori r3,r3,(LPCR_LPES0|LPCR_LPES1)
-   xorir3,r3, LPCR_LPES0
+   li  r5,1
+   rldimi  r3,r5, LPCR_LPES_SH, 64-LPCR_LPES_SH-2
ori r3,r3,(LPCR_PECE0|LPCR_PECE1|LPCR_PECE2)
-   li  r5,7
-   sldir5,r5,LPCR_DPFD_SH
-   andcr3,r3,r5
li  r5,4
-   sldir5,r5,LPCR_DPFD_SH
-   or  r3,r3,r5
+   rldimi  r3,r5, LPCR_DPFD_SH, 64-LPCR_DPFD_SH-3
+   clrrdi  r3,r3,1 /* clear HDICE */
+   li  r5,4
+   rldimi  r3,r5, LPCR_VC_SH, 0
+   li  r5,0x10
+   rldimi  r3,r5, LPCR_VRMASD_SH, 64-LPCR_VRMASD_SH-5
mtspr   SPRN_LPCR,r3
isync
blr
-- 
1.7.5.4



[PATCH 16/17] KVM: PPC: book3s_hv: Add support for PPC970-family processors

2011-06-29 Thread Paul Mackerras
This adds support for running KVM guests in supervisor mode on those
PPC970 processors that have a usable hypervisor mode.  Unfortunately,
Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
1), but the YDL PowerStation does have a usable hypervisor mode.

There are several differences between the PPC970 and POWER7 in how
guests are managed.  These differences are accommodated using the
CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
bits.  Notably, on PPC970:

* The LPCR, LPID or RMOR registers don't exist, and the functions of
  those registers are provided by bits in HID4 and one bit in HID0.

* External interrupts can be directed to the hypervisor, but unlike
  POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
  SRR0/1 not HSRR0/1.

* There is no virtual RMA (VRMA) mode; the guest must use an RMO
  (real mode offset) area.

* The TLB entries are not tagged with the LPID, so it is necessary to
  flush the whole TLB on partition switch.  Furthermore, when switching
  partitions we have to ensure that no other CPU is executing the tlbie
  or tlbsync instructions in either the old or the new partition,
  otherwise undefined behaviour can occur.

* The PMU has 8 counters (PMC registers) rather than 6.

* The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.

* The SLB has 64 entries rather than 32.

* There is no mediated external interrupt facility, so if we switch to
  a guest that has a virtual external interrupt pending but the guest
  has MSR[EE] = 0, we have to arrange to have an interrupt pending for
  it so that we can get control back once it re-enables interrupts.  We
  do that by sending ourselves an IPI with smp_send_reschedule after
  hard-disabling interrupts.

Signed-off-by: Paul Mackerras pau...@samba.org
---
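For reference, a hedged sketch of the self-IPI trick in the last
bullet above (the pending-interrupt predicate is hypothetical;
smp_send_reschedule() is the existing kernel primitive):

    hard_irq_disable();
    if (guest_irq_pending(vcpu))    /* hypothetical check */
        /* latch an IPI so we regain control once the guest sets MSR[EE] */
        smp_send_reschedule(smp_processor_id());
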
 arch/powerpc/include/asm/exception-64s.h  |4 +
 arch/powerpc/include/asm/kvm_book3s_asm.h |2 +-
 arch/powerpc/include/asm/kvm_host.h   |2 +-
 arch/powerpc/kernel/asm-offsets.c |1 +
 arch/powerpc/kernel/exceptions-64s.S  |2 +-
 arch/powerpc/kvm/Kconfig  |   13 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   30 +++-
 arch/powerpc/kvm/book3s_hv.c  |   60 ++--
 arch/powerpc/kvm/book3s_hv_builtin.c  |   11 +-
 arch/powerpc/kvm/book3s_hv_interrupts.S   |   30 
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |6 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  230 -
 arch/powerpc/kvm/powerpc.c|3 +
 arch/powerpc/mm/hash_native_64.c  |2 +-
 14 files changed, 354 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index 69435da..8057f4f 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -246,6 +246,10 @@ label##_hv:
\
KVMTEST(vec);   \
_SOFTEN_TEST(EXC_HV)
 
+#define SOFTEN_TEST_HV_201(vec)
\
+   KVMTEST(vec);   \
+   _SOFTEN_TEST(EXC_STD)
+
 #define __MASKABLE_EXCEPTION_PSERIES(vec, label, h, extra) \
HMT_MEDIUM; \
SET_SCRATCH0(r13);/* save r13 */\
diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 9cfd543..ef7b368 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -82,7 +82,7 @@ struct kvmppc_host_state {
unsigned long xics_phys;
u64 dabr;
u64 host_mmcr[3];
-   u32 host_pmc[6];
+   u32 host_pmc[8];
u64 host_purr;
u64 host_spurr;
u64 host_dscr;
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index f572d9c..cc22b28 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -353,7 +353,7 @@ struct kvm_vcpu_arch {
u32 dbsr;
 
u64 mmcr[3];
-   u32 pmc[6];
+   u32 pmc[8];
 
 #ifdef CONFIG_KVM_EXIT_TIMING
struct mutex exit_timing_lock;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index f4aba93..54b935f 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -128,6 +128,7 @@ int main(void)
DEFINE(ICACHEL1LINESPERPAGE, offsetof(struct ppc64_caches, 
ilines_per_page));
/* paca */
DEFINE(PACA_SIZE, sizeof(struct paca_struct));
+   DEFINE(PACA_LOCK_TOKEN, offsetof(struct paca_struct, lock_token));
DEFINE(PACAPACAINDEX, offsetof(struct paca_struct, paca_index));
DEFINE(PACAPROCSTART, offsetof(struct paca_struct, cpu_start));
DEFINE(PACAKSAVE, offsetof(struct 

[PATCH 09/17] KVM: PPC: Split host-state fields out of kvmppc_book3s_shadow_vcpu

2011-06-29 Thread Paul Mackerras
There are several fields in struct kvmppc_book3s_shadow_vcpu that
temporarily store bits of host state while a guest is running,
rather than anything relating to the particular guest or vcpu.
This splits them out into a new kvmppc_host_state structure and
modifies the definitions in asm-offsets.c to suit.

On 32-bit, we have a kvmppc_host_state structure inside the
kvmppc_book3s_shadow_vcpu since the assembly code needs to be able
to get to them both with one pointer.  On 64-bit they are separate
fields in the PACA.  This means that on 64-bit we don't need to
copy the kvmppc_host_state in and out on vcpu load/unload, and
in future will mean that the book3s_hv code doesn't need a
shadow_vcpu struct in the PACA at all.  That does mean that we
have to be careful not to rely on any values persisting in the
hstate field of the paca across any point where we could block
or get preempted.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/exception-64s.h  |   10 ++--
 arch/powerpc/include/asm/kvm_book3s_asm.h |   27 ++---
 arch/powerpc/include/asm/paca.h   |1 +
 arch/powerpc/kernel/asm-offsets.c |   94 ++--
 arch/powerpc/kernel/exceptions-64s.S  |2 +-
 arch/powerpc/kvm/book3s_interrupts.S  |   19 ++
 arch/powerpc/kvm/book3s_rmhandlers.S  |   18 +++---
 arch/powerpc/kvm/book3s_segment.S |   76 ---
 8 files changed, 127 insertions(+), 120 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index b6a3a44..296c9b6 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -96,16 +96,16 @@
EXCEPTION_PROLOG_PSERIES_1(label, h);
 
 #define __KVMTEST(n)   \
-   lbz r10,PACA_KVM_SVCPU+SVCPU_IN_GUEST(r13); \
+   lbz r10,HSTATE_IN_GUEST(r13);   \
cmpwi   r10,0;  \
bne do_kvm_##n
 
 #define __KVM_HANDLER(area, h, n)  \
 do_kvm_##n:\
ld  r10,area+EX_R10(r13);   \
-   stw r9,PACA_KVM_SVCPU+SVCPU_SCRATCH1(r13);  \
+   stw r9,HSTATE_SCRATCH1(r13);\
ld  r9,area+EX_R9(r13); \
-   std r12,PACA_KVM_SVCPU+SVCPU_SCRATCH0(r13); \
+   std r12,HSTATE_SCRATCH0(r13);   \
li  r12,n;  \
b   kvmppc_interrupt
 
@@ -114,9 +114,9 @@ do_kvm_##n: 
\
cmpwi   r10,KVM_GUEST_MODE_SKIP;\
ld  r10,area+EX_R10(r13);   \
beq 89f;\
-   stw r9,PACA_KVM_SVCPU+SVCPU_SCRATCH1(r13);  \
+   stw r9,HSTATE_SCRATCH1(r13);\
ld  r9,area+EX_R9(r13); \
-   std r12,PACA_KVM_SVCPU+SVCPU_SCRATCH0(r13); \
+   std r12,HSTATE_SCRATCH0(r13);   \
li  r12,n;  \
b   kvmppc_interrupt;   \
89:	mtocrf	0x80,r9;	\
diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index d5a8a38..3126175 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -60,6 +60,22 @@ kvmppc_resume_\intno:
 
 #else  /*__ASSEMBLY__ */
 
+/*
+ * This struct goes in the PACA on 64-bit processors.  It is used
+ * to store host state that needs to be saved when we enter a guest
+ * and restored when we exit, but isn't specific to any particular
+ * guest or vcpu.  It also has some scratch fields used by the guest
+ * exit code.
+ */
+struct kvmppc_host_state {
+   ulong host_r1;
+   ulong host_r2;
+   ulong vmhandler;
+   ulong scratch0;
+   ulong scratch1;
+   u8 in_guest;
+};
+
 struct kvmppc_book3s_shadow_vcpu {
ulong gpr[14];
u32 cr;
@@ -73,17 +89,12 @@ struct kvmppc_book3s_shadow_vcpu {
ulong shadow_srr1;
ulong fault_dar;
 
-   ulong host_r1;
-   ulong host_r2;
-   ulong handler;
-   ulong scratch0;
-   ulong scratch1;
-   ulong vmhandler;
-   u8 in_guest;
-
 #ifdef CONFIG_PPC_BOOK3S_32
u32 sr[16]; /* Guest SRs */
+
+   struct kvmppc_host_state hstate;
 #endif
+
 #ifdef CONFIG_PPC_BOOK3S_64
u8 slb_max; /* 

[PATCH 0/17] Hypervisor-mode KVM on POWER7 and PPC970

2011-06-29 Thread Paul Mackerras
The first patch of the following series is a pure bug-fix for 32-bit
kernels.

The remainder of the following series of patches enable KVM to exploit
the hardware hypervisor mode on 64-bit Power ISA Book3S machines.  At
present, POWER7 and PPC970 processors are supported.  (Note that the
PPC970 processors in Apple G5 machines don't have a usable hypervisor
mode and are not supported by these patches.)

Running the KVM host in hypervisor mode means that the guest can use
both supervisor mode and user mode.  That means that the guest can
execute supervisor-privilege instructions and access supervisor-
privilege registers.  In addition the hardware directs most exceptions
to the guest.  Thus we don't need to emulate any instructions in the
host.  Generally, the only times we need to exit the guest are when it
does a hypercall or when an external interrupt or host timer
(decrementer) interrupt occurs.

The focus of this KVM implementation is to run guests that use the
PAPR (Power Architecture Platform Requirements) paravirtualization
interface, which is the interface supplied by PowerVM on IBM pSeries
machines.  Currently the pseries machine type in qemu is only
supported by book3s_hv KVM, and book3s_hv KVM only supports the
pseries machine type.  That will hopefully change in future.

These patches are against the master branch of the kvm tree.

Paul.


[PATCH 12/17] KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode

2011-06-29 Thread Paul Mackerras
From: David Gibson d...@au1.ibm.com

This improves I/O performance for guests using the PAPR
paravirtualization interface by making the H_PUT_TCE hcall faster, by
implementing it in real mode.  H_PUT_TCE is used for updating virtual
IOMMU tables, and is used both for virtual I/O and for real I/O in the
PAPR interface.

Since this moves the IOMMU tables into the kernel, we define a new
KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables.  The
ioctl returns a file descriptor which can be used to mmap the newly
created table.  The qemu driver models use them in the same way as
userspace managed tables, but they can be updated directly by the
guest with a real-mode H_PUT_TCE implementation, reducing the number
of host/guest context switches during guest IO.

There are certain circumstances where it is useful for userland qemu
to write to the TCE table even if the kernel H_PUT_TCE path is used
most of the time.  Specifically, allowing this will avoid awkwardness
when we need to reset the table.  More importantly, we will in the
future need to write the table in order to restore its state after a
checkpoint resume or migration.

Signed-off-by: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Paul Mackerras pau...@samba.org
---
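For illustration, userspace creation and mapping of a TCE table might
look roughly like this (a hedged sketch; error handling omitted, vm_fd
is the VM file descriptor):

    struct kvm_create_spapr_tce args = {
        .liobn = liobn,
        .window_size = 256 << 20,   /* 256MiB DMA window */
    };
    int fd = ioctl(vm_fd, KVM_CREATE_SPAPR_TCE, &args);
    /* one 64-bit TCE per 4kiB page of the window */
    uint64_t *tce = mmap(NULL, (args.window_size >> 12) * sizeof(uint64_t),
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
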
 Documentation/virtual/kvm/api.txt|   35 +
 arch/powerpc/include/asm/kvm.h   |9 +++
 arch/powerpc/include/asm/kvm_book3s_64.h |2 +
 arch/powerpc/include/asm/kvm_host.h  |9 +++
 arch/powerpc/include/asm/kvm_ppc.h   |2 +
 arch/powerpc/kvm/Makefile|3 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c  |   73 +++
 arch/powerpc/kvm/book3s_hv.c |  116 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |2 +-
 arch/powerpc/kvm/powerpc.c   |   18 +
 include/linux/kvm.h  |2 +
 11 files changed, 268 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_64_vio_hv.c

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index e8875fe..a1d344d 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1350,6 +1350,41 @@ The following flags are defined:
 If datamatch flag is set, the event will be signaled only if the written value
 to the registered address is equal to datamatch in struct kvm_ioeventfd.
 
+4.62 KVM_CREATE_SPAPR_TCE
+
+Capability: KVM_CAP_SPAPR_TCE
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce (in)
+Returns: file descriptor for manipulating the created TCE table
+
+This creates a virtual TCE (translation control entry) table, which
+is an IOMMU for PAPR-style virtual I/O.  It is used to translate
+logical addresses used in virtual I/O into guest physical addresses,
+and provides a scatter/gather capability for PAPR virtual I/O.
+
+/* for KVM_CAP_SPAPR_TCE */
+struct kvm_create_spapr_tce {
+   __u64 liobn;
+   __u32 window_size;
+};
+
+The liobn field gives the logical IO bus number for which to create a
+TCE table.  The window_size field specifies the size of the DMA window
+which this TCE table will translate - the table will contain one 64
+bit TCE entry for every 4kiB of the DMA window.
+
+When the guest issues an H_PUT_TCE hcall on a liobn for which a TCE
+table has been created using this ioctl(), the kernel will handle it
+in real mode, updating the TCE table.  H_PUT_TCE calls for other
+liobns will cause a vm exit and must be handled by userspace.
+
+The return value is a file descriptor which can be passed to mmap(2)
+to map the created TCE table into userspace.  This lets userspace read
+the entries written by kernel-handled H_PUT_TCE calls, and also lets
+userspace update the TCE table directly which is useful in some
+circumstances.
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index d2ca5ed..c3ec990 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -22,6 +22,9 @@
 
 #include linux/types.h
 
+/* Select powerpc specific features in linux/kvm.h */
+#define __KVM_HAVE_SPAPR_TCE
+
 struct kvm_regs {
__u64 pc;
__u64 cr;
@@ -272,4 +275,10 @@ struct kvm_guest_debug_arch {
 #define KVM_INTERRUPT_UNSET	-2U
 #define KVM_INTERRUPT_SET_LEVEL	-3U
 
+/* for KVM_CAP_SPAPR_TCE */
+struct kvm_create_spapr_tce {
+   __u64 liobn;
+   __u32 window_size;
+};
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 5f73388..e43fe42 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -27,4 +27,6 @@ static inline struct kvmppc_book3s_shadow_vcpu 
*to_svcpu(struct kvm_vcpu *vcpu)
 }
 #endif
 
+#define SPAPR_TCE_SHIFT	12
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git 

[PATCH 13/17] KVM: PPC: Allow book3s_hv guests to use SMT processor modes

2011-06-29 Thread Paul Mackerras
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7.  The host still has to run single-threaded.

This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability.  The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.

To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode.  KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline).  To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c.  In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it.  Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.

When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host.  This number is exported
to userspace via the KVM_CAP_PPC_SMT capability.  If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
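
As an illustration (not part of this patch), qemu-side code to get
single-threaded vcores might look roughly like this; kvm_fd, vmfd and
ncores are placeholders:

	int smt_ways = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_SMT);

	for (int core = 0; core < ncores; core++) {
		/* vcpu ids that are multiples of smt_ways land in
		 * separate vcores, giving one thread per vcore */
		int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, core * smt_ways);
		/* ... */
	}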

We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host.  We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked.  This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.

When a vcore starts to run, it executes in the context of one of the
vcpu threads.  The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).

It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running.  In that case it can start to run immediately as long as
none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest.  It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
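
As an aside, that synchronization can be pictured roughly as follows.
This is only a sketch: the split of entry_exit_count into an entry byte
and an exit count is an assumption for illustration, and real code would
need atomic updates:

	/* hypothetical layout: low 8 bits = threads entered,
	 * upper bits = threads that have started to exit */
	static int kvmppc_try_enter_guest(struct kvmppc_vcore *vc)
	{
		if (vc->entry_exit_count & ~0xffUL)
			return 0;	/* some thread is already exiting */
		vc->entry_exit_count++;	/* join this run of the vcore */
		return 1;
	}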

Note that there is no fixed relationship between the hardware thread
number and the vcpu number.  Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 Documentation/virtual/kvm/api.txt |   13 ++
 arch/powerpc/include/asm/kvm.h|1 +
 arch/powerpc/include/asm/kvm_book3s_asm.h |2 +
 arch/powerpc/include/asm/kvm_host.h   |   46 -
 arch/powerpc/include/asm/kvm_ppc.h|   13 ++
 arch/powerpc/kernel/asm-offsets.c |6 +
 arch/powerpc/kernel/exceptions-64s.S  |   31 ++-
 arch/powerpc/kernel/idle_power7.S |2 -
 arch/powerpc/kvm/book3s_hv.c  |  316 ++---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  168 +++-
 arch/powerpc/kvm/powerpc.c|4 +
 arch/powerpc/sysdev/xics/icp-native.c |9 +
 include/linux/kvm.h   |1 +
 13 files changed, 567 insertions(+), 45 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index a1d344d..6818713 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -180,6 +180,19 @@ KVM_CHECK_EXTENSION ioctl() to determine the value for 
max_vcpus at run-time.
 If the KVM_CAP_NR_VCPUS does not exist, you should assume that max_vcpus is 4
 cpus max.
 
+On powerpc using book3s_hv mode, the vcpus are mapped onto virtual
+threads in one or more virtual CPU cores.  (This is because the
+hardware requires all the hardware threads in a CPU core to be in the
+same partition.)  The KVM_CAP_PPC_SMT capability indicates the number
+of vcpus per virtual core (vcore).  The vcore id is obtained by
+dividing the vcpu id by the number of vcpus per vcore.  The vcpus in a
+given vcore will always be in the same physical core as each other
+(though that might be a different physical core from time to time).
+Userspace can control the threading (SMT) mode of the guest by its
+allocation of vcpu ids: to run the guest single-threaded, userspace
+should make all vcpu ids multiples of the number of vcpus per vcore.

[PATCH 02/17] KVM: PPC: Move fields between struct kvm_vcpu_arch and kvmppc_vcpu_book3s

2011-06-29 Thread Paul Mackerras
This moves the slb field, which represents the state of the emulated
SLB, from the kvmppc_vcpu_book3s struct to the kvm_vcpu_arch, and the
hpte_hash_[v]pte[_long] fields from kvm_vcpu_arch to kvmppc_vcpu_book3s.
This is in accord with the principle that the kvm_vcpu_arch struct
represents the state of the emulated CPU, and the kvmppc_vcpu_book3s
struct holds the auxiliary data structures used in the emulation.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s.h |   35 +---
 arch/powerpc/include/asm/kvm_host.h   |   34 +++-
 arch/powerpc/kvm/book3s.c |9 ++--
 arch/powerpc/kvm/book3s_64_mmu.c  |   54 +++-
 arch/powerpc/kvm/book3s_mmu_hpte.c|   71 +++-
 arch/powerpc/kvm/trace.h  |2 +-
 6 files changed, 107 insertions(+), 98 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 70c409b..f7b2baf 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -24,20 +24,6 @@
 #include linux/kvm_host.h
 #include asm/kvm_book3s_asm.h
 
-struct kvmppc_slb {
-   u64 esid;
-   u64 vsid;
-   u64 orige;
-   u64 origv;
-   bool valid  : 1;
-   bool Ks : 1;
-   bool Kp : 1;
-   bool nx : 1;
-   bool large  : 1;/* PTEs are 16MB */
-   bool tb : 1;/* 1TB segment */
-   bool class  : 1;
-};
-
 struct kvmppc_bat {
u64 raw;
u32 bepi;
@@ -67,11 +53,22 @@ struct kvmppc_sid_map {
 #define VSID_POOL_SIZE (SID_CONTEXTS * 16)
 #endif
 
+struct hpte_cache {
+   struct hlist_node list_pte;
+   struct hlist_node list_pte_long;
+   struct hlist_node list_vpte;
+   struct hlist_node list_vpte_long;
+   struct rcu_head rcu_head;
+   u64 host_va;
+   u64 pfn;
+   ulong slot;
+   struct kvmppc_pte pte;
+};
+
 struct kvmppc_vcpu_book3s {
struct kvm_vcpu vcpu;
struct kvmppc_book3s_shadow_vcpu *shadow_vcpu;
struct kvmppc_sid_map sid_map[SID_MAP_NUM];
-   struct kvmppc_slb slb[64];
struct {
u64 esid;
u64 vsid;
@@ -81,7 +78,6 @@ struct kvmppc_vcpu_book3s {
struct kvmppc_bat dbat[8];
u64 hid[6];
u64 gqr[8];
-   int slb_nr;
u64 sdr1;
u64 hior;
u64 msr_mask;
@@ -94,6 +90,13 @@ struct kvmppc_vcpu_book3s {
 #endif
int context_id[SID_CONTEXTS];
ulong prog_flags; /* flags to inject when giving a 700 trap */
+
+   struct hlist_head hpte_hash_pte[HPTEG_HASH_NUM_PTE];
+   struct hlist_head hpte_hash_pte_long[HPTEG_HASH_NUM_PTE_LONG];
+   struct hlist_head hpte_hash_vpte[HPTEG_HASH_NUM_VPTE];
+   struct hlist_head hpte_hash_vpte_long[HPTEG_HASH_NUM_VPTE_LONG];
+   int hpte_cache_count;
+   spinlock_t mmu_lock;
 };
 
 #define CONTEXT_HOST   0
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 6e05b2d..069eb9f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -163,16 +163,18 @@ struct kvmppc_mmu {
bool (*is_dcbz32)(struct kvm_vcpu *vcpu);
 };
 
-struct hpte_cache {
-   struct hlist_node list_pte;
-   struct hlist_node list_pte_long;
-   struct hlist_node list_vpte;
-   struct hlist_node list_vpte_long;
-   struct rcu_head rcu_head;
-   u64 host_va;
-   u64 pfn;
-   ulong slot;
-   struct kvmppc_pte pte;
+struct kvmppc_slb {
+   u64 esid;
+   u64 vsid;
+   u64 orige;
+   u64 origv;
+   bool valid  : 1;
+   bool Ks : 1;
+   bool Kp : 1;
+   bool nx : 1;
+   bool large  : 1;/* PTEs are 16MB */
+   bool tb : 1;/* 1TB segment */
+   bool class  : 1;
 };
 
 struct kvm_vcpu_arch {
@@ -187,6 +189,9 @@ struct kvm_vcpu_arch {
ulong highmem_handler;
ulong rmcall;
ulong host_paca_phys;
+   struct kvmppc_slb slb[64];
+   int slb_max;/* # valid entries in slb[] */
+   int slb_nr; /* total number of entries in SLB */
struct kvmppc_mmu mmu;
 #endif
 
@@ -305,15 +310,6 @@ struct kvm_vcpu_arch {
struct kvm_vcpu_arch_shared *shared;
unsigned long magic_page_pa; /* phys addr to map the magic page to */
unsigned long magic_page_ea; /* effect. addr to map the magic page to */
-
-#ifdef CONFIG_PPC_BOOK3S
-   struct hlist_head hpte_hash_pte[HPTEG_HASH_NUM_PTE];
-   struct hlist_head hpte_hash_pte_long[HPTEG_HASH_NUM_PTE_LONG];
-   struct hlist_head hpte_hash_vpte[HPTEG_HASH_NUM_VPTE];
-   struct hlist_head hpte_hash_vpte_long[HPTEG_HASH_NUM_VPTE_LONG];
-   int hpte_cache_count;
-   spinlock_t mmu_lock;
-#endif
 };
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/kvm/book3s.c 

[PATCH 14/17] KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests

2011-06-29 Thread Paul Mackerras
This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility.  These processors require a physically
contiguous, aligned area of memory for each guest.  When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access.  The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.

Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator.  The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.
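For example, booting the host with kvm_rma_size=128M kvm_rma_count=16
(illustrative values) would preallocate sixteen 128MB RMAs.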

KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs.  The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.

This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA.  It
also returns the size of the RMA in the argument structure.
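
A sketch of the userspace side, assuming vmfd is an open VM file
descriptor (illustrative only):

	struct kvm_allocate_rma rma;
	int rmafd = ioctl(vmfd, KVM_ALLOCATE_RMA, &rma);

	if (rmafd >= 0) {
		/* rma.rma_size was filled in by the kernel */
		void *mem = mmap(NULL, rma.rma_size, PROT_READ | PROT_WRITE,
				 MAP_SHARED, rmafd, 0);
		/* ... use mem as the guest's real-mode area ... */
	}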

Having an RMA means we will get multiple KVM_SET_USER_MEMORY_REGION
ioctl calls from userspace.  To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory.  Subsequently we will get rid of this
array and use memory associated with each memslot instead.

This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region.  Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB.  However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.

Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest.  This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 Documentation/virtual/kvm/api.txt   |   32 
 arch/powerpc/include/asm/kvm.h  |5 +
 arch/powerpc/include/asm/kvm_book3s.h   |8 -
 arch/powerpc/include/asm/kvm_host.h |   15 ++-
 arch/powerpc/include/asm/kvm_ppc.h  |   10 ++
 arch/powerpc/include/asm/reg.h  |1 +
 arch/powerpc/kernel/asm-offsets.c   |4 +-
 arch/powerpc/kernel/setup_64.c  |3 +
 arch/powerpc/kvm/Makefile   |3 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |   97 +---
 arch/powerpc/kvm/book3s_hv.c|  259 +-
 arch/powerpc/kvm/book3s_hv_builtin.c|  152 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   19 ++-
 arch/powerpc/kvm/powerpc.c  |   13 ++
 include/linux/kvm.h |3 +
 15 files changed, 505 insertions(+), 119 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_builtin.c

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 6818713..b0e4b9c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1398,6 +1398,38 @@ the entries written by kernel-handled H_PUT_TCE calls, 
and also lets
 userspace update the TCE table directly which is useful in some
 circumstances.
 
+4.63 KVM_ALLOCATE_RMA
+
+Capability: KVM_CAP_PPC_RMA
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_allocate_rma (out)
+Returns: file descriptor for mapping the allocated RMA
+
+This allocates a Real Mode Area (RMA) from the pool allocated at boot
+time by the kernel.  An RMA is a physically-contiguous, aligned region
+of memory used on older POWER processors to provide the memory which
+will be accessed by real-mode (MMU off) accesses in a KVM guest.
+POWER processors support a set of sizes for the RMA that usually
+includes 64MB, 128MB, 256MB and some larger powers of two.
+
+/* for KVM_ALLOCATE_RMA */
+struct kvm_allocate_rma {
+   __u64 rma_size;
+};
+
+The return value is a file descriptor which can be passed to mmap(2)
+to map the allocated RMA into userspace.

Re: [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary

2011-06-29 Thread Xiao Guangrong
On 06/29/2011 04:21 PM, Avi Kivity wrote:


 -if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception)
 -== X86EMUL_CONTINUE)
 +if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
   return X86EMUL_CONTINUE;
 
 This doesn't perform the cpl check.
 

Firstly, it calls kvm_mmu_gva_to_gpa_read to translate the gva to a gpa, and
the cpl is checked in that function; is that not enough?

 I suggest dropping this part for now and doing it later.
 

OK, I will post this part in a separate patchset. :-)


Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code

2011-06-29 Thread Xiao Guangrong
On 06/29/2011 04:24 PM, Avi Kivity wrote:

 +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 +   gpa_t *gpa, struct x86_exception *exception,
 +   bool write)
 +{
 +u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
 +
 +if (write)
 +access |= PFERR_WRITE_MASK;
 
 Needs fetch as well so NX/SMEP can work.
 

This function is only used by the read/write emulator; execute permission is
not needed for read/write, no?


Re: [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation

2011-06-29 Thread Xiao Guangrong
On 06/29/2011 04:37 PM, Avi Kivity wrote:

 +struct read_write_emulator_ops {
 +int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
 +  int bytes);
 +int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
 +  void *val, int bytes);
 +int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
 +   int bytes, void *val);
 +int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
 +void *val, int bytes);
 +bool write;
 +};
 
 
 Interesting!
 
 This structure combines two unrelated operations, though.  One is the 
 internals of the iteration on a virtual address that is split to various 
 physical addresses.  The other is the interaction with userspace on mmio 
 exits.  They should be split, but I think it's fine to do it in a later 
 patch.  This series is long enough already.
 
 I was also annoyed by the duplication.  The way I thought of fixing it is 
 having gva_to_gpa() return two gpas, and having the access function accept 
 gpa vectors.  The reason was so that we can implemented locked cross-page 
 operations (which we now emulate as unlocked writes).
 
 But I think we can do without it, and instead emulate locked cross-page ops 
 by stalling all other vcpus while we write, or by unmapping the pages 
 involved.  It isn't pretty but it doesn't need to be fast since it's a very 
 rare operation.  So I think we can go with your approach.
 

OK, I'll post it in a separate patchset. Thanks, Avi.
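
For reference, the common iteration that such an ops table enables would
collapse to something roughly like this (a sketch with hypothetical
helpers, not the exact posted code):

	static int emulator_read_write(struct kvm_vcpu *vcpu, unsigned long gva,
				       void *val, int bytes,
				       struct read_write_emulator_ops *ops)
	{
		while (bytes) {
			gpa_t gpa;
			/* never cross a page boundary in one step */
			int now = min(bytes,
				      (int)(PAGE_SIZE - offset_in_page(gva)));

			if (vcpu_gva_to_gpa(vcpu, gva, &gpa, NULL, ops->write))
				return X86EMUL_PROPAGATE_FAULT;
			if (!ops->read_write_emulate(vcpu, gpa, val, now))
				ops->read_write_mmio(vcpu, gpa, now, val);
			gva += now;
			val += now;
			bytes -= now;
		}
		return X86EMUL_CONTINUE;
	}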


Re: [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path

2011-06-29 Thread Xiao Guangrong
On 06/29/2011 04:48 PM, Avi Kivity wrote:
 On 06/22/2011 05:31 PM, Xiao Guangrong wrote:
 If the page fault is caused by mmio, we can cache the mmio info, later, we do
 not need to walk guest page table and quickly know it is a mmio fault while 
 we
 emulate the mmio instruction
 
 Does this work if the mmio spans two pages?
 

If the mmio spans two pages, we already split the emulation into two parts,
and the mmio cache info is only matched for one page, so I think it works
well :-)
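
To make the idea concrete, the cached info is conceptually just the last
mmio translation, checked page-by-page; a sketch (field names are
illustrative, not the posted code):

	static bool mmio_info_cached(struct kvm_vcpu *vcpu, gva_t gva, gpa_t *gpa)
	{
		/* only one page is cached, so each half of a cross-page
		 * access is matched (or misses) separately */
		if ((vcpu->arch.mmio_gva ^ gva) & PAGE_MASK)
			return false;
		*gpa = ((gpa_t)vcpu->arch.mmio_gfn << PAGE_SHIFT) |
		       (gva & ~PAGE_MASK);
		return true;
	}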


Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code

2011-06-29 Thread Avi Kivity

On 06/29/2011 01:56 PM, Xiao Guangrong wrote:

On 06/29/2011 04:24 PM, Avi Kivity wrote:

  +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
  +   gpa_t *gpa, struct x86_exception *exception,
  +   bool write)
  +{
  +u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
  +
  +if (write)
  +access |= PFERR_WRITE_MASK;

  Needs fetch as well so NX/SMEP can work.


This function is only used by the read/write emulator; execute permission is
not needed for read/write, no?


It's not good to have a function which only implements the functionality 
partially.  It can later be misused.


You can pass the page-fault-error-code instead of the write parameter, I 
think it will be simpler.
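
Something along these lines, say (just a sketch, untested):

	static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
				   gpa_t *gpa, struct x86_exception *exception,
				   u32 error_code)
	{
		/* derive the access mask from the fault error code */
		u32 access = error_code & (PFERR_WRITE_MASK | PFERR_FETCH_MASK);

		if (kvm_x86_ops->get_cpl(vcpu) == 3)
			access |= PFERR_USER_MASK;
		*gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access,
						       exception);
		return *gpa == UNMAPPED_GVA ? -1 : 0;
	}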


--
error compiling committee.c: too many arguments to function



Re: [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path

2011-06-29 Thread Avi Kivity

On 06/29/2011 02:09 PM, Xiao Guangrong wrote:

On 06/29/2011 04:48 PM, Avi Kivity wrote:
  On 06/22/2011 05:31 PM, Xiao Guangrong wrote:
  If the page fault is caused by mmio, we can cache the mmio info, later, we 
do
  not need to walk guest page table and quickly know it is a mmio fault while 
we
  emulate the mmio instruction

  Does this work if the mmio spans two pages?


If the mmio spans two pages, we already split the emulation into two parts,
and the mmio cache info is only matched for one page, so i thinks it works
well :-)


Ok, thanks.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table

2011-06-29 Thread Xiao Guangrong
On 06/29/2011 05:16 PM, Avi Kivity wrote:
 On 06/22/2011 05:35 PM, Xiao Guangrong wrote:
 Use rcu to protect shadow page tables being freed, so we can safely walk
 them; this should run fast and is needed by the mmio page fault

 
   static void kvm_mmu_commit_zap_page(struct kvm *kvm,
   struct list_head *invalid_list)
   {
 @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,

   kvm_flush_remote_tlbs(kvm);

  +if (atomic_read(&kvm->arch.reader_counter)) {
 +kvm_mmu_isolate_pages(invalid_list);
 +sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
 +list_del_init(invalid_list);
  +call_rcu(&sp->rcu, free_pages_rcu);
 +return;
 +}
 +
 
 I think we should do this unconditionally.  The cost of ping-ponging the 
 shared cache line containing reader_counter will increase with large smp 
 counts.  On the other hand, zap_page is very rare, so it can be a little 
 slower.  Also, less code paths = easier to understand.
 

On soft mmu, zap_page happens very frequently; it caused a performance
regression in my test.
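
For context, the reader side that pairs with the hunk above looks
roughly like this (sketch only):

	static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
	{
		rcu_read_lock();
		atomic_inc(&vcpu->kvm->arch.reader_counter);
		/* make the counter visible before we touch the page tables */
		smp_mb__after_atomic_inc();
	}

	static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
	{
		smp_mb__before_atomic_dec();
		atomic_dec(&vcpu->kvm->arch.reader_counter);
		rcu_read_unlock();
	}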


Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table

2011-06-29 Thread Avi Kivity

On 06/29/2011 02:16 PM, Xiao Guangrong wrote:

  @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,

kvm_flush_remote_tlbs(kvm);

  +if (atomic_read(&kvm->arch.reader_counter)) {
  +kvm_mmu_isolate_pages(invalid_list);
  +sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
  +list_del_init(invalid_list);
  +call_rcu(&sp->rcu, free_pages_rcu);
  +return;
  +}
  +

  I think we should do this unconditionally.  The cost of ping-ponging the 
shared cache line containing reader_counter will increase with large smp counts.  
On the other hand, zap_page is very rare, so it can be a little slower.  Also, 
less code paths = easier to understand.


On soft mmu, zap_page happens very frequently; it caused a performance 
regression in my test.


Any idea what the cause of the regression is?  It seems to me that 
simply deferring freeing shouldn't have a large impact.


--
error compiling committee.c: too many arguments to function



Re: [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary

2011-06-29 Thread Avi Kivity

On 06/29/2011 01:53 PM, Xiao Guangrong wrote:

On 06/29/2011 04:21 PM, Avi Kivity wrote:


  -if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception)
  -== X86EMUL_CONTINUE)
 +if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
return X86EMUL_CONTINUE;

  This doesn't perform the cpl check.


Firstly, it calls kvm_mmu_gva_to_gpa_read to translate the gva to a gpa, and
the cpl is checked in that function; is that not enough?


You are right, it is enough.  I don't know how I missed it.

--
error compiling committee.c: too many arguments to function



Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code

2011-06-29 Thread Xiao Guangrong
On 06/29/2011 07:09 PM, Avi Kivity wrote:
 On 06/29/2011 01:56 PM, Xiao Guangrong wrote:
 On 06/29/2011 04:24 PM, Avi Kivity wrote:

   +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
   +   gpa_t *gpa, struct x86_exception *exception,
   +   bool write)
   +{
   +u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 
  0;
   +
   +if (write)
   +access |= PFERR_WRITE_MASK;
 
   Needs fetch as well so NX/SMEP can work.
 

 This function is only used by the read/write emulator; execute permission is
 not needed for read/write, no?
 
 It's not good to have a function which only implements the functionality 
 partially.  It can later be misused.
 
 You can pass the page-fault-error-code instead of the write parameter, I 
 think it will be simpler.
 

Actually, we will get the cached mmio info in this function; I think it is a
pure waste for accesses other than mmio.  What about changing the function
name to vcpu_gva_to_gpa_mmio?


Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code

2011-06-29 Thread Avi Kivity

On 06/29/2011 02:26 PM, Xiao Guangrong wrote:

On 06/29/2011 07:09 PM, Avi Kivity wrote:
  On 06/29/2011 01:56 PM, Xiao Guangrong wrote:
  On 06/29/2011 04:24 PM, Avi Kivity wrote:

 +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 +   gpa_t *gpa, struct x86_exception *exception,
 +   bool write)
 +{
 +u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK 
: 0;
 +
 +if (write)
 +access |= PFERR_WRITE_MASK;
  
 Needs fetch as well so NX/SMEP can work.
  

  This function is only used by the read/write emulator; execute permission is
  not needed for read/write, no?

  It's not good to have a function which only implements the functionality 
partially.  It can later be misused.

  You can pass the page-fault-error-code instead of the write parameter, I 
think it will be simpler.


Actually, we will get the cached mmio info in this function; I think it is a
pure waste for accesses other than mmio.  What about changing the function
name to vcpu_gva_to_gpa_mmio?


Not too happy, but ok.

--
error compiling committee.c: too many arguments to function



Re: missing compat-ioctl for CDROM_DRIVE_STATUS + FDGETPRM

2011-06-29 Thread Johannes Stezenbach
On Fri, Jun 17, 2011 at 03:02:39PM +0200, Arnd Bergmann wrote:
 On Friday 17 June 2011 11:04:24 Johannes Stezenbach wrote:
  running even a simple qemu-img create -f qcow2 some.img 1G causes
  the following in dmesg on a Linux host with linux-2.6.39.1 x86_64 kernel
  and 32bit userspace:
  
  ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(5326){t:'S';sz:0} 
  arg(7fff) on some.img
  ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} 
  arg(fff77350) on some.img
  
  (The same happens when starting a qemu or kvm vm.)
  
  ioctl 5326 seems to be CDROM_DRIVE_STATUS,
  ioctl 801c0204 is FDGETPRM.  Both are used in
  qemu/block/raw-posix.c in cdrom_probe_device()
  and floppy_probe_device() respectively.
  
  FWIW, I'm using qemu/kvm from Debian unstable
  (qemu-0.14.0+dfsg-5.1, qemu-kvm-0.14.1+dfsg-1)
 
 Both are handled by the kernel for block devices, but not for regular
 files. The messages may be annoying but they are harmless. We could
 silence them either by checking if the file is actually a block device in
 qemu-img, or by adding a nop handler to the kernel for regular files.

Sorry for the very slow reply.  I think qemu's use of these ioctls
to probe whether the device is a cdrom or floppy is valid, so instead
of adding a stat() call to check for a block device in qemu, I think
it is better to silence the warning in the kernel.

Do I get it right that just adding two IGNORE_IOCTL() entries to
the ioctl_pointer array in linux/fs/compat_ioctl.c is sufficient,
like in commit 3f001711?
I.e. these ioctls are handled for block devices earlier
in compat_sys_ioctl()?
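
Concretely, I am thinking of something like this (untested sketch; the
floppy case may need the compat FDGETPRM32 number rather than FDGETPRM):

	/* in the ioctl_pointer array in fs/compat_ioctl.c */
	IGNORE_IOCTL(CDROM_DRIVE_STATUS)
	IGNORE_IOCTL(FDGETPRM32)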


Thanks,
Johannes


[GIT PULL] KVM fix for Linux 3.0-rc5

2011-06-29 Thread Avi Kivity

Linus, please pull from

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm-updates/3.0

to receive a single KVM fix.  Emulated instructions which had both an 
immediate operand and an %rip-relative operand did not compute the 
effective address correctly; this is now fixed.
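
For the curious, the failure mode is roughly this (illustrative
instruction, not taken from the patch):

	/* for an insn such as
	 *	cmpl $0x12345678, 0x10(%rip)
	 * the %rip-relative base is the address of the *end* of the
	 * instruction, i.e. past the four immediate bytes; computing the
	 * effective address before accounting for the immediate leaves
	 * the base four bytes short.
	 */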


Avi Kivity (1):
  KVM: x86 emulator: fix %rip-relative addressing with immediate 
source operand


 arch/x86/kvm/emulate.c |   12 +++-
 1 files changed, 7 insertions(+), 5 deletions(-)

--
error compiling committee.c: too many arguments to function



Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table

2011-06-29 Thread Xiao Guangrong
On 06/29/2011 07:18 PM, Avi Kivity wrote:
 On 06/29/2011 02:16 PM, Xiao Guangrong wrote:
   @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm 
  *kvm,
 
 kvm_flush_remote_tlbs(kvm);
 
   +if (atomic_read(&kvm->arch.reader_counter)) {
   +kvm_mmu_isolate_pages(invalid_list);
   +sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
   +list_del_init(invalid_list);
   +call_rcu(&sp->rcu, free_pages_rcu);
   +return;
   +}
   +
 
   I think we should do this unconditionally.  The cost of ping-ponging the 
  shared cache line containing reader_counter will increase with large smp 
  counts.  On the other hand, zap_page is very rare, so it can be a little 
  slower.  Also, less code paths = easier to understand.
 

 On soft mmu, zap_page happens very frequently; it caused a performance 
 regression in my test.
 
 Any idea what the cause of the regression is?  It seems to me that simply 
 deferring freeing shouldn't have a large impact.
 

I guess it is because pages are freed too frequently; I have done the test,
and it shows that about 3219 pages are freed per second.

Kernbench performance comparison:

the original way: 3m27.723
free all shadow pages in rcu context: 3m30.519


Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code

2011-06-29 Thread Gleb Natapov
On Wed, Jun 29, 2011 at 02:26:14PM +0300, Avi Kivity wrote:
 On 06/29/2011 02:26 PM, Xiao Guangrong wrote:
 On 06/29/2011 07:09 PM, Avi Kivity wrote:
   On 06/29/2011 01:56 PM, Xiao Guangrong wrote:
   On 06/29/2011 04:24 PM, Avi Kivity wrote:
 
  +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long 
  gva,
  +   gpa_t *gpa, struct x86_exception *exception,
  +   bool write)
  +{
  +u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? 
  PFERR_USER_MASK : 0;
  +
  +if (write)
  +access |= PFERR_WRITE_MASK;
   
  Needs fetch as well so NX/SMEP can work.
   
 
   This function is only used by the read/write emulator; execute permission is
   not needed for read/write, no?
 
   It's not good to have a function which only implements the functionality 
  partially.  It can later be misused.
 
   You can pass the page-fault-error-code instead of the write parameter, I 
  think it will be simpler.
 
 
 Actually, we will get the cached mmio info in this function; I think it is
 a pure waste for accesses other than mmio.  What about changing the
 function name to vcpu_gva_to_gpa_mmio?
 
 Not too happy, but ok.
 
I do plan to add fetching from MMIO.

--
Gleb.


Re: [RFC PATCH 17/17] KVM: PPC: Add an ioctl for userspace to select which platform to emulate

2011-06-29 Thread Josh Boyer
On Wed, Jun 29, 2011 at 08:41:03PM +1000, Paul Mackerras wrote:
 Documentation/virtual/kvm/api.txt   |   35 +++
 arch/powerpc/include/asm/kvm.h  |   15 +++
 arch/powerpc/include/asm/kvm_host.h |1 +
 arch/powerpc/kvm/powerpc.c  |   28 
 include/linux/kvm.h |1 +
 5 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index b0e4b9c..3ab012c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1430,6 +1430,41 @@ is supported; 2 if the processor requires all virtual 
machines to have
 an RMA, or 1 if the processor can use an RMA but doesn't require it,
 because it supports the Virtual RMA (VRMA) facility.

+4.64 KVM_PPC_SET_PLATFORM
+
+Capability: none
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_ppc_set_platform (in)
+Returns: 0, or -1 on error
+
+This is used by userspace to tell KVM what sort of platform it should
+emulate.  The return value of the ioctl tells userspace whether the
+emulation it is requesting is supported by KVM.
+
+struct kvm_ppc_set_platform {
+  __u16 platform; /* defines the OS/hypervisor ABI */
+  __u16 guest_arch;   /* e.g. decimal 206 for v2.06 */
+  __u32 flags;
+};
+
+/* Values for platform */
+#define KVM_PPC_PV_NONE   0   /* bare-metal, 
non-paravirtualized */
+#define KVM_PPC_PV_KVM1   /* as defined in kvm_para.h */
+#define KVM_PPC_PV_SPAPR  2   /* IBM Server PAPR (a la PowerVM) */
+
+/* Values for flags */
+#define KVM_PPC_CROSS_ARCH1   /* guest architecture != host */
+
+The KVM_PPC_CROSS_ARCH bit being 1 indicates that the guest is of a
+sufficiently different architecture to the host that the guest cannot
+be permitted to use supervisor mode.  For example, if the host is a
+64-bit machine and the guest is a 32-bit machine, then this bit should
+be set.

This makes me wonder if a similar thing might eventually be usable for
running an i686 or x32 guest on an x86_64 KVM host.  I have no idea if
that is even theoretically possible, but if it is it might be better to
rename the ioctl to be architecture agnostic.
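
(For reference, the userspace side of the proposed ioctl would presumably
look something like this sketch; vmfd is a placeholder:

	struct kvm_ppc_set_platform plat = {
		.platform   = KVM_PPC_PV_SPAPR,
		.guest_arch = 206,	/* ISA v2.06 */
		.flags      = 0,
	};

	if (ioctl(vmfd, KVM_PPC_SET_PLATFORM, &plat) < 0)
		/* the requested emulation is not supported */;
)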

josh


Re: [RFC PATCH 17/17] KVM: PPC: Add an ioctl for userspace to select which platform to emulate

2011-06-29 Thread Alexander Graf

On 29.06.2011, at 13:53, Josh Boyer wrote:

 On Wed, Jun 29, 2011 at 08:41:03PM +1000, Paul Mackerras wrote:
 Documentation/virtual/kvm/api.txt   |   35 
 +++
 arch/powerpc/include/asm/kvm.h  |   15 +++
 arch/powerpc/include/asm/kvm_host.h |1 +
 arch/powerpc/kvm/powerpc.c  |   28 
 include/linux/kvm.h |1 +
 5 files changed, 80 insertions(+), 0 deletions(-)
 
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index b0e4b9c..3ab012c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -1430,6 +1430,41 @@ is supported; 2 if the processor requires all virtual 
 machines to have
 an RMA, or 1 if the processor can use an RMA but doesn't require it,
 because it supports the Virtual RMA (VRMA) facility.
 
 +4.64 KVM_PPC_SET_PLATFORM
 +
 +Capability: none
 +Architectures: powerpc
 +Type: vm ioctl
 +Parameters: struct kvm_ppc_set_platform (in)
 +Returns: 0, or -1 on error
 +
 +This is used by userspace to tell KVM what sort of platform it should
 +emulate.  The return value of the ioctl tells userspace whether the
 +emulation it is requesting is supported by KVM.
 +
 +struct kvm_ppc_set_platform {
 +__u16 platform; /* defines the OS/hypervisor ABI */
 +__u16 guest_arch;   /* e.g. decimal 206 for v2.06 */
 +__u32 flags;
 +};
 +
 +/* Values for platform */
 +#define KVM_PPC_PV_NONE 0   /* bare-metal, 
 non-paravirtualized */
 +#define KVM_PPC_PV_KVM  1   /* as defined in kvm_para.h */
 +#define KVM_PPC_PV_SPAPR2   /* IBM Server PAPR (a la PowerVM) */
 +
 +/* Values for flags */
 +#define KVM_PPC_CROSS_ARCH  1   /* guest architecture != host */
 +
 +The KVM_PPC_CROSS_ARCH bit being 1 indicates that the guest is of a
 +sufficiently different architecture to the host that the guest cannot
 +be permitted to use supervisor mode.  For example, if the host is a
 +64-bit machine and the guest is a 32-bit machine, then this bit should
 +be set.
 
 This makes me wonder if a similar thing might eventually be usable for
 running an i686 or x32 guest on an x86_64 KVM host.  I have no idea if
 that is even theoretically possible, but if it is it might be better to
 rename the ioctl to be architecture agnostic.

On x86 this is not required unless we want to virtualize pre-CPUID CPUs. 
Everything as of Pentium has a full bitmap of feature capabilities that KVM 
gets from user space, including information such as "Can we do 64-bit mode?".


Alex



Re: [RFC PATCH 17/17] KVM: PPC: Add an ioctl for userspace to select which platform to emulate

2011-06-29 Thread Josh Boyer
On Wed, Jun 29, 2011 at 01:56:16PM +0200, Alexander Graf wrote:

On 29.06.2011, at 13:53, Josh Boyer wrote:

 On Wed, Jun 29, 2011 at 08:41:03PM +1000, Paul Mackerras wrote:
 Documentation/virtual/kvm/api.txt   |   35 
 +++
 arch/powerpc/include/asm/kvm.h  |   15 +++
 arch/powerpc/include/asm/kvm_host.h |1 +
 arch/powerpc/kvm/powerpc.c  |   28 
 include/linux/kvm.h |1 +
 5 files changed, 80 insertions(+), 0 deletions(-)
 
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index b0e4b9c..3ab012c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -1430,6 +1430,41 @@ is supported; 2 if the processor requires all 
 virtual machines to have
 an RMA, or 1 if the processor can use an RMA but doesn't require it,
 because it supports the Virtual RMA (VRMA) facility.
 
 +4.64 KVM_PPC_SET_PLATFORM
 +
 +Capability: none
 +Architectures: powerpc
 +Type: vm ioctl
 +Parameters: struct kvm_ppc_set_platform (in)
 +Returns: 0, or -1 on error
 +
 +This is used by userspace to tell KVM what sort of platform it should
 +emulate.  The return value of the ioctl tells userspace whether the
 +emulation it is requesting is supported by KVM.
 +
 +struct kvm_ppc_set_platform {
 +   __u16 platform; /* defines the OS/hypervisor ABI */
 +   __u16 guest_arch;   /* e.g. decimal 206 for v2.06 */
 +   __u32 flags;
 +};
 +
 +/* Values for platform */
 +#define KVM_PPC_PV_NONE0   /* bare-metal, 
 non-paravirtualized */
 +#define KVM_PPC_PV_KVM 1   /* as defined in kvm_para.h */
 +#define KVM_PPC_PV_SPAPR   2   /* IBM Server PAPR (a la PowerVM) */
 +
 +/* Values for flags */
 +#define KVM_PPC_CROSS_ARCH 1   /* guest architecture != host */
 +
 +The KVM_PPC_CROSS_ARCH bit being 1 indicates that the guest is of a
 +sufficiently different architecture to the host that the guest cannot
 +be permitted to use supervisor mode.  For example, if the host is a
 +64-bit machine and the guest is a 32-bit machine, then this bit should
 +be set.
 
 This makes me wonder if a similar thing might eventually be usable for
 running an i686 or x32 guest on an x86_64 KVM host.  I have no idea if
 that is even theoretically possible, but if it is it might be better to
 rename the ioctl to be architecture agnostic.

On x86 this is not required unless we want to virtualize pre-CPUID CPUs. 
Everything as of Pentium has a full bitmap of feature capabilities that KVM 
gets from user space, including information such as "Can we do 64-bit mode?".

Ah.  Thank you for the explanation.

josh


Re: [PATCH 0/5] perf support for amd guest/host-only bits v2

2011-06-29 Thread Paul Mackerras
On Wed, Jun 29, 2011 at 11:02:54AM +0200, Peter Zijlstra wrote:
 On Tue, 2011-06-28 at 18:10 +0200, Joerg Roedel wrote:
  On Fri, Jun 17, 2011 at 03:37:29PM +0200, Joerg Roedel wrote:
   this is the second version of the patch-set to support the AMD
   guest-/host only bits in the performance counter MSRs. Due to lack of
   time I havn't looked into emulating support for this feature on Intel or
   other architectures, but the other comments should be worked in. The
   changes to v1 include:
   
 * Rebased patches to v3.0-rc3
 * Allow exclude_guest and exclude_host set at the same time
 * Reworked event-parse logic for the new exclude-bits
 * Only count guest-events per default from perf-kvm
  
  Hi Peter, Ingo,
  
  have you had a chance to look at this patch-set? Are any changes
  required?
 
 I would feel a lot more comfortable having it implemented on all of
 x86 as well as at least one !x86 platform. Avi graciously volunteered
 for the Intel bits.  
 
 Paulus, I hear from benh that you're also responsible for the ppc-kvm
 bits, could you possibly find some time to implement this feature for
 ppc?

I'll have a look at it, but I don't know how quickly I'll be able to
produce a patch.

We have two styles of KVM on PowerPC (at least as far as server
processors are concerned), one where the guest runs entirely in
usermode and the privileged facilities are emulated, and another that
uses hypervisor mode in the host and can allow the guest to use
supervisor mode.  In the latter case, the PMU is considered a guest
resource, that is, the hardware allows the guest to manipulate the PMU
directly, and PMU interrupts go directly to the guest.  In that mode
it's not really possible to count or profile guest activity from the
host.  There are some hypervisor-only counters in the PMU but they
have limited event selection compared to the counters available to the
guest.

Paul.



Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table

2011-06-29 Thread Avi Kivity

On 06/29/2011 02:50 PM, Xiao Guangrong wrote:

  
 I think we should do this unconditionally.  The cost of ping-ponging 
the shared cache line containing reader_counter will increase with large smp counts.  On 
the other hand, zap_page is very rare, so it can be a little slower.  Also, less code 
paths = easier to understand.
  

  On soft mmu, zap_page happens very frequently; it caused a performance 
regression in my test.

  Any idea what the cause of the regression is?  It seems to me that simply 
deferring freeing shouldn't have a large impact.


I guess it is because pages are freed too frequently; I have done the test,
and it shows that about 3219 pages are freed per second.

Kernbench performance comparison:

the original way: 3m27.723
free all shadow pages in rcu context: 3m30.519


I don't recall seeing such a high free rate.  Who is doing all this zapping?

You may be able to find out with the function tracer + call graph.

--
error compiling committee.c: too many arguments to function


