[PATCH v3 6/9] KVM-GST: Add a pv_ops stub for steal time
This patch adds a function pointer in one of the many paravirt_ops
structs, to allow guests to register a steal time function.

Signed-off-by: Glauber Costa <glom...@redhat.com>
CC: Rik van Riel <r...@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardi...@citrix.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Avi Kivity <a...@redhat.com>
CC: Anthony Liguori <aligu...@us.ibm.com>
CC: Eric B Munson <emun...@mgebm.net>
---
 arch/x86/include/asm/paravirt.h       |    9 +
 arch/x86/include/asm/paravirt_types.h |    1 +
 arch/x86/kernel/paravirt.c            |    9 +
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ebbc4d8..a7d2db9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void)
 	return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
 }
 
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+	return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+}
+
 static inline unsigned long long paravirt_read_pmc(int counter)
 {
 	return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8288509..2c76521 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -89,6 +89,7 @@ struct pv_lazy_ops {
 struct pv_time_ops {
 	unsigned long long (*sched_clock)(void);
+	unsigned long long (*steal_clock)(int cpu);
 	unsigned long (*get_tsc_khz)(void);
 };
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 869e1ae..613a793 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr)
 	__native_flush_tlb_single(addr);
 }
 
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+	return 0;
+}
+
 /* These are in entry.S */
 extern void native_iret(void);
 extern void native_irq_enable_sysexit(void);
@@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = {
 
 struct pv_time_ops pv_time_ops = {
 	.sched_clock = native_sched_clock,
+	.steal_clock = native_steal_clock,
 };
 
 struct pv_irq_ops pv_irq_ops = {
-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
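The pattern above - a struct of function pointers whose default entry is a native no-op stub that a hypervisor backend can override - can be sketched in plain userspace C. This is only an illustrative analogy, not kernel code; all names and the 1000ns figure are made up:

```c
/* Userspace sketch of the pv_ops stub pattern: steal_clock defaults to
 * a native stub returning 0, and a hypervisor-specific implementation
 * can replace it at setup time. Names are illustrative. */
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

struct pv_time_ops_sketch {
	u64 (*steal_clock)(int cpu);
};

/* Bare metal: no hypervisor, so no time is ever stolen. */
static u64 native_steal_clock_sketch(int cpu)
{
	(void)cpu;
	return 0;
}

/* Hypothetical guest backend reporting 1000 ns stolen per query. */
static u64 kvm_steal_clock_sketch(int cpu)
{
	(void)cpu;
	return 1000;
}

static struct pv_time_ops_sketch pv_time_ops_sketch = {
	.steal_clock = native_steal_clock_sketch,
};

/* Callers always go through the indirection, like paravirt_steal_clock(). */
u64 paravirt_steal_clock_sketch(int cpu)
{
	return pv_time_ops_sketch.steal_clock(cpu);
}

/* What a guest setup path would do to register its implementation. */
void register_steal_clock_sketch(u64 (*fn)(int))
{
	pv_time_ops_sketch.steal_clock = fn;
}
```

The point of the stub is that callers never need to check whether a hypervisor registered anything; the indirection always resolves to something valid.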
[PATCH v3 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit
This patch is simple, put in a different commit so it can be more easily
shared between guest and hypervisor. It just defines a named constant to
indicate the enable bit for KVM-specific MSRs.

Signed-off-by: Glauber Costa <glom...@redhat.com>
CC: Rik van Riel <r...@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardi...@citrix.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Avi Kivity <a...@redhat.com>
CC: Anthony Liguori <aligu...@us.ibm.com>
CC: Eric B Munson <emun...@mgebm.net>
---
 arch/x86/include/asm/kvm_para.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..d6cd79b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -30,6 +30,7 @@
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
+#define KVM_MSR_ENABLED 1
 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
-- 
1.7.3.4
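A brief sketch of how a KVM-specific MSR value combines a guest-physical address with an enable bit like the one defined above: since the registered area is aligned, the low bit of the address is free and can carry the "enabled" flag. The address value and helper names below are made up for illustration:

```c
/* Sketch: packing an enable bit into the low bit of an aligned
 * guest-physical address written to a KVM paravirt MSR.
 * Helper names and the example address are illustrative. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KVM_MSR_ENABLED 1ULL

/* Build the MSR value the guest would write: gpa must be aligned so
 * that its low bit is clear. */
static uint64_t msr_value(uint64_t gpa)
{
	return gpa | KVM_MSR_ENABLED;
}

/* What the hypervisor checks on the WRMSR path. */
static bool msr_is_enabled(uint64_t val)
{
	return (val & KVM_MSR_ENABLED) != 0;
}

/* Recover the registered address by masking the flag bit back off. */
static uint64_t msr_address(uint64_t val)
{
	return val & ~KVM_MSR_ENABLED;
}
```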
Re: KVM call agenda for June 28
On Wed, Jun 29, 2011 at 11:08:23AM +0100, Stefan Hajnoczi wrote:
> On Wed, Jun 29, 2011 at 8:57 AM, Kevin Wolf <kw...@redhat.com> wrote:
> > Am 28.06.2011 21:41, schrieb Marcelo Tosatti:
> > > stream
> > > ------
> > > 1) base -> remote
> > > 2) base -> remote -> local
> > > 3) base -> local
> > >
> > > local image is always valid. Requires backing file support.
> >
> > With the above, this restriction wouldn't apply any more. Also I don't
> > think we should mix approaches. Either both block copy and image
> > streaming use backing files, or none of them do. Mixing means
> > duplicating more code, and even worse, that you can't stop a block copy
> > in the middle and continue with streaming (which I believe is a really
> > valuable feature to have).
>
> Here is how the image streaming feature is used from HMP/QMP:
>
> The guest is running from an image file with a backing file. The aim
> is to pull the data from the backing file and populate the image file
> so that the dependency on the backing file can be eliminated.
>
> 1. Start a background streaming operation:
>
>    (qemu) block_stream -a ide0-hd
>
> 2. Check the status of the operation:
>
>    (qemu) info block-stream
>    Streaming device ide0-hd: Completed 512 of 34359738368 bytes
>
> 3. The status changes when the operation completes:
>
>    (qemu) info block-stream
>    No active stream
>
> On completion the image file no longer has a backing file dependency.
> When streaming completes, QEMU updates the image file metadata to
> indicate that no backing file is used.
>
> The QMP interface is similar but provides QMP events to signal
> streaming completion and failure. Polling to query the streaming
> status is only used when the management application wishes to refresh
> progress information.
>
> If guest execution is interrupted by a power failure or QEMU crash,
> then the image file is still valid but streaming may be incomplete.
> When QEMU is launched again, the block_stream command can be issued to
> resume streaming.
>
> In the future we could add a 'base' argument to block_stream. If base
> is specified then data contained in the base image will not be copied.

This is a present requirement.

> This can be used to merge data from an intermediate image without
> merging the base image. When streaming completes, the backing file
> will be set to the base image. The backing file relationship would
> typically look like this:
>
> 1. Before block_stream -a -b base.img ide0-hd completion:
>
>    base.img -> sn1 -> ... -> ide0-hd.qed
>
> 2. After streaming completes:
>
>    base.img -> ide0-hd.qed
>
> This describes the image streaming use cases that I, Adam, and Anthony
> propose to support. In the course of the discussion we've sometimes
> been distracted with the internals of what a unified live block
> copy/image streaming implementation should do. I wanted to post this
> summary of image streaming to refocus us on the use cases and the APIs
> that users will see.
>
> Stefan

OK, with an external COW file for formats that do not support it, the
interface can be similar. Also there is no need to mirror writes and no
switch operation: always use the destination image.
[PATCH 0/3] Preparatory perf patches for KVM PMU support
The following three patches pave the way for KVM in-guest performance
monitoring. One is a perf API improvement, another fixes the constraints
for the version 1 architectural PMU (which we will emulate), and the
third adds an export that KVM will use. Please consider for merging;
this will make further work on the KVM PMU easier.

Avi Kivity (3):
  perf: add context field to perf_event
  x86, perf: add constraints for architectural PMU v1
  perf: export perf_event_refresh() to modules

 arch/arm/kernel/ptrace.c                |    3 ++-
 arch/powerpc/kernel/ptrace.c            |    2 +-
 arch/sh/kernel/ptrace_32.c              |    3 ++-
 arch/x86/kernel/cpu/perf_event_intel.c  |   23 ++-
 arch/x86/kernel/kgdb.c                  |    2 +-
 arch/x86/kernel/ptrace.c                |    3 ++-
 drivers/oprofile/oprofile_perf.c        |    2 +-
 include/linux/hw_breakpoint.h           |   10 --
 include/linux/perf_event.h              |    9 -
 kernel/events/core.c                    |   24 +---
 kernel/events/hw_breakpoint.c           |   10 +++---
 kernel/watchdog.c                       |    2 +-
 samples/hw_breakpoint/data_breakpoint.c |    2 +-
 13 files changed, 69 insertions(+), 26 deletions(-)

-- 
1.7.5.3
[PATCH 1/3] perf: add context field to perf_event
The perf_event overflow handler does not receive any caller-derived
argument, so many callers need to resort to looking up the perf_event
in their local data structure. This is ugly and doesn't scale if a
single callback services many perf_events.

Fix by adding a context parameter to perf_event_create_kernel_counter()
(and derived hardware breakpoint APIs) and storing it in the perf_event.
The field can be accessed from the callback as
event->overflow_handler_context. All callers are updated.

Signed-off-by: Avi Kivity <a...@redhat.com>
---
 arch/arm/kernel/ptrace.c                |    3 ++-
 arch/powerpc/kernel/ptrace.c            |    2 +-
 arch/sh/kernel/ptrace_32.c              |    3 ++-
 arch/x86/kernel/kgdb.c                  |    2 +-
 arch/x86/kernel/ptrace.c                |    3 ++-
 drivers/oprofile/oprofile_perf.c        |    2 +-
 include/linux/hw_breakpoint.h           |   10 --
 include/linux/perf_event.h              |    4 +++-
 kernel/events/core.c                    |   21 +++--
 kernel/events/hw_breakpoint.c           |   10 +++---
 kernel/watchdog.c                       |    2 +-
 samples/hw_breakpoint/data_breakpoint.c |    2 +-
 12 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 9726006..4911c94 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -479,7 +479,8 @@ static struct perf_event *ptrace_hbp_create(struct task_struct *tsk, int type)
 	attr.bp_type	= type;
 	attr.disabled	= 1;
 
-	return register_user_hw_breakpoint(&attr, ptrace_hbptriggered, tsk);
+	return register_user_hw_breakpoint(&attr, ptrace_hbptriggered, NULL,
+					   tsk);
 }
 
 static int ptrace_gethbpregs(struct task_struct *tsk, long num,
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index cb22024..5249308 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -973,7 +973,7 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 							attr.bp_type);
 	thread->ptrace_bps[0] = bp = register_user_hw_breakpoint(&attr,
-							ptrace_triggered, task);
+						ptrace_triggered, NULL, task);
 	if (IS_ERR(bp)) {
 		thread->ptrace_bps[0] = NULL;
 		ptrace_put_breakpoints(task);
diff --git a/arch/sh/kernel/ptrace_32.c b/arch/sh/kernel/ptrace_32.c
index 3d7b209..930312f 100644
--- a/arch/sh/kernel/ptrace_32.c
+++ b/arch/sh/kernel/ptrace_32.c
@@ -91,7 +91,8 @@ static int set_single_step(struct task_struct *tsk, unsigned long addr)
 	attr.bp_len = HW_BREAKPOINT_LEN_2;
 	attr.bp_type = HW_BREAKPOINT_R;
 
-	bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
+	bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
+					 NULL, tsk);
 	if (IS_ERR(bp))
 		return PTR_ERR(bp);
 
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 5f9ecff..473ab53 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -638,7 +638,7 @@ void kgdb_arch_late(void)
 	for (i = 0; i < HBP_NUM; i++) {
 		if (breakinfo[i].pev)
 			continue;
-		breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL);
+		breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL, NULL);
 		if (IS_ERR((void * __force)breakinfo[i].pev)) {
 			printk(KERN_ERR "kgdb: Could not allocate hw breakpoints\nDisabling the kernel debugger\n");
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 807c2a2..28092ae 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -715,7 +715,8 @@ static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr,
 	attr.bp_type = HW_BREAKPOINT_W;
 	attr.disabled = 1;
 
-	bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
+	bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
+					 NULL, tsk);
 
 	/*
 	 * CHECKME: the previous code returned -EIO if the addr wasn't
diff --git a/drivers/oprofile/oprofile_perf.c b/drivers/oprofile/oprofile_perf.c
index 9046f7b..59acf9e 100644
--- a/drivers/oprofile/oprofile_perf.c
+++ b/drivers/oprofile/oprofile_perf.c
@@ -79,7 +79,7 @@ static int op_create_counter(int cpu, int event)
 
 	pevent = perf_event_create_kernel_counter(&counter_config[event].attr,
 						  cpu, NULL,
-						  op_overflow_handler);
+
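The "context" idea in this patch - store an opaque pointer at counter creation and hand it back to the overflow callback, instead of forcing the callback to search its own data structures - can be shown as a small userspace sketch. All names here are illustrative, not the kernel's API:

```c
/* Userspace sketch of the callback-with-context pattern added by this
 * patch: the creator passes an opaque pointer, the event stores it,
 * and the overflow handler receives it directly. */
#include <assert.h>
#include <stddef.h>

struct sketch_event;
typedef void (*overflow_fn)(struct sketch_event *event, void *context);

struct sketch_event {
	overflow_fn handler;
	void *overflow_handler_context; /* the new field */
};

/* Analog of perf_event_create_kernel_counter(..., context): the
 * caller-derived argument is captured once, at creation time. */
struct sketch_event create_counter(overflow_fn handler, void *context)
{
	struct sketch_event ev = {
		.handler = handler,
		.overflow_handler_context = context,
	};
	return ev;
}

/* On overflow, the stored context is passed straight to the handler;
 * no lookup in caller data structures is needed. */
void fire_overflow(struct sketch_event *ev)
{
	ev->handler(ev, ev->overflow_handler_context);
}

/* Example callback: bumps the caller-owned counter it was given. */
static void count_overflows(struct sketch_event *ev, void *context)
{
	(void)ev;
	(*(int *)context)++;
}
```

This scales to one callback servicing many events, since each event carries its own context.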
[PATCH 3/3] perf: export perf_event_refresh() to modules
KVM needs one-shot samples, since a PMC programmed to -X will fire after
X events and then again after 2^40 events (i.e. a variable period).

Signed-off-by: Avi Kivity <a...@redhat.com>
---
 include/linux/perf_event.h |    5 +
 kernel/events/core.c       |    3 ++-
 2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 40264b5..91342ac 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -973,6 +973,7 @@ extern void perf_pmu_disable(struct pmu *pmu);
 extern void perf_pmu_enable(struct pmu *pmu);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
+extern int perf_event_refresh(struct perf_event *event, int refresh);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
@@ -1168,6 +1169,10 @@ static inline void perf_event_delayed_put(struct task_struct *task)	{ }
 static inline void perf_event_print_debug(void)				{ }
 static inline int perf_event_task_disable(void)				{ return -EINVAL; }
 static inline int perf_event_task_enable(void)				{ return -EINVAL; }
+static inline int perf_event_refresh(struct perf_event *event, int refresh)
+{
+	return -EINVAL;
+}
 
 static inline void
 perf_sw_event(u32 event_id, u64 nr, int nmi,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6dd4819..f69cc9f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1739,7 +1739,7 @@ out:
 	raw_spin_unlock_irq(&ctx->lock);
 }
 
-static int perf_event_refresh(struct perf_event *event, int refresh)
+int perf_event_refresh(struct perf_event *event, int refresh)
 {
 	/*
 	 * not supported on inherited events
@@ -1752,6 +1752,7 @@ static int perf_event_refresh(struct perf_event *event, int refresh)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(perf_event_refresh);
 
 static void ctx_sched_out(struct perf_event_context *ctx,
 			  struct perf_cpu_context *cpuctx,
-- 
1.7.5.3
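The period arithmetic in the commit message above can be made concrete. A 40-bit PMC programmed to -X (two's complement within 40 bits) overflows after X increments; if it keeps running, the next overflow only arrives a full 2^40 increments later, which is why KVM wants one-shot behaviour via perf_event_refresh(). The counter width is the 40 bits the message assumes:

```c
/* Sketch of 40-bit PMC wraparound arithmetic. */
#include <assert.h>
#include <stdint.h>

#define PMC_WIDTH 40
#define PMC_MASK  ((1ULL << PMC_WIDTH) - 1)

/* Programming the counter to -X in 40-bit two's complement. */
uint64_t program_minus(uint64_t x)
{
	return (0 - x) & PMC_MASK;
}

/* Increments remaining until the counter wraps past 2^40. */
uint64_t events_until_overflow(uint64_t pmc_value)
{
	return (1ULL << PMC_WIDTH) - (pmc_value & PMC_MASK);
}
```

So after the intended X-event sample fires, a free-running counter's next period is the full 2^40, not X - the "variable period" the patch avoids by letting KVM rearm explicitly.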
[PATCH 2/3] x86, perf: add constraints for architectural PMU v1
The v1 PMU does not have any fixed counters. Using the v2 constraints,
which do have fixed counters, causes an additional choice to be present
in the weight calculation, but not when actually scheduling the event,
leading to an event not being scheduled at all.

Signed-off-by: Avi Kivity <a...@redhat.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c |   23 ++-
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 41178c8..b46b70e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -137,6 +137,11 @@ static struct event_constraint intel_westmere_percore_constraints[] __read_mostl
 	EVENT_CONSTRAINT_END
 };
 
+static struct event_constraint intel_v1_event_constraints[] __read_mostly =
+{
+	EVENT_CONSTRAINT_END
+};
+
 static struct event_constraint intel_gen_event_constraints[] __read_mostly =
 {
 	FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
@@ -1512,11 +1517,19 @@ static __init int intel_pmu_init(void)
 		break;
 
 	default:
-		/*
-		 * default constraints for v2 and up
-		 */
-		x86_pmu.event_constraints = intel_gen_event_constraints;
-		pr_cont("generic architected perfmon, ");
+		switch (x86_pmu.version) {
+		case 1:
+			x86_pmu.event_constraints = intel_v1_event_constraints;
+			pr_cont("generic architected perfmon v1, ");
+			break;
+		default:
+			/*
+			 * default constraints for v2 and up
+			 */
+			x86_pmu.event_constraints = intel_gen_event_constraints;
+			pr_cont("generic architected perfmon, ");
+			break;
+		}
 	}
 
 	return 0;
 }
-- 
1.7.5.3
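To see why the bogus constraint matters: a constraint carries a bitmask of counters an event may use, and the scheduler sorts events by constraint weight (the popcount of that mask). If the mask advertises a fixed counter the v1 PMU does not physically have, the computed weight says "two choices" while only one counter is actually usable. A minimal sketch of the weight calculation, with illustrative bit positions:

```c
/* Sketch of constraint-weight computation as described above.
 * Bit positions are illustrative (fixed counters conventionally sit in
 * high bits of the index mask). */
#include <assert.h>
#include <stdint.h>

/* Weight of a constraint = number of counters in its index mask. */
static int constraint_weight(uint64_t idxmsk)
{
	int w = 0;
	while (idxmsk) {
		idxmsk &= idxmsk - 1; /* clear lowest set bit */
		w++;
	}
	return w;
}

#define GP_COUNTER0    (1ULL << 0)
#define FIXED_COUNTER0 (1ULL << 32) /* does not exist on a v1 PMU */
```

With the v2 constraint list, an event like INST_RETIRED gets weight 2 (GP plus the phantom fixed counter), so the scheduler may defer it in favour of "more constrained" events and then find no real second counter at placement time; the empty v1 constraint list keeps weight and reality in agreement.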
Re: kvm monitor socket - connection refused
29.06.2011 19:20, Iordan Iordanov wrote:
> On 06/28/11 18:29, Michael Tokarev wrote:
> > The process listening on this socket no longer exists, it finished.
> > With this command line it should stay in foreground till finished
> > (there's no -daemonize etc), so you should see error messages if any.
>
> The kvm command was backgrounded, not -daemonize(d). It was still
> running, and I was accessing the VM via VNC.

So kvm was running at the time you tried to access the monitor.

> > How about checking who is actually listening on this socket before
> > asking?
>
> I thought it's the kvm process that listens on the socket. I haven't
> seen other processes spun off by kvm until now. Is that not the case?

It is the kvm process that listens on the socket; it spawns no other
processes.

The only other explanation I can think of is that you tried to run two
instances of kvm, and when the second instance initialized it re-created
the monitor socket but failed later (e.g. when initializing the network
or something else) and exited, leaving the stray socket behind. (JFYI,
you can remove a unix-domain socket on which some process is listening
and create another one - that will really be a different socket, even if
named the same way, just like you can re-create a plain file the same
way.)

In any case, there haven't been any problems/bugs in that area for ages.

/mjt
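The unix-domain socket behaviour described above is easy to demonstrate: removing a socket path out from under a listening process and binding a new socket to the same name succeeds, yielding a genuinely different socket while the old listener keeps its now-unreachable one. A small Linux demonstration (the /tmp path is arbitrary):

```c
/* Demonstrates re-creating a unix-domain socket path while the
 * original listener is still open. Returns 0 on success. */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int stray_socket_demo(void)
{
	const char *path = "/tmp/kvm_monitor_demo.sock";
	struct sockaddr_un addr;
	int s1, s2;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	snprintf(addr.sun_path, sizeof(addr.sun_path), "%s", path);

	unlink(path);                 /* start from a clean slate */

	s1 = socket(AF_UNIX, SOCK_STREAM, 0);
	if (s1 < 0 || bind(s1, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return -1;
	if (listen(s1, 1) < 0)
		return -1;

	/* Anyone may remove the name; s1 stays open but unreachable. */
	unlink(path);

	/* A second process can now bind a fresh socket to the same name. */
	s2 = socket(AF_UNIX, SOCK_STREAM, 0);
	if (s2 < 0 || bind(s2, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return -1;

	close(s1);
	close(s2);
	unlink(path);
	return 0;
}
```

This is exactly the "stray socket" scenario: a failed second instance can leave a name in the filesystem that no longer corresponds to any live listener.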
Re: [PATCH 1/3] perf: add context field to perf_event
On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote:
> The perf_event overflow handler does not receive any caller-derived
> argument, so many callers need to resort to looking up the perf_event
> in their local data structure. This is ugly and doesn't scale if a
> single callback services many perf_events.
>
> Fix by adding a context parameter to perf_event_create_kernel_counter()
> (and derived hardware breakpoint APIs) and storing it in the
> perf_event. The field can be accessed from the callback as
> event->overflow_handler_context. All callers are updated.
>
> Signed-off-by: Avi Kivity <a...@redhat.com>

I believe it can micro-optimize ptrace through
register_user_hw_breakpoint() because we could store the index of the
breakpoint that way, instead of iterating through 4 slots.

Perhaps it can help in arm too, adding Will in Cc.

But for register_wide_hw_breakpoint, I'm not sure. kgdb is the main
user, maybe Jason could find some use of it.
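The micro-optimization proposed here - storing the breakpoint's slot index as the context pointer so the callback does not have to scan all slots to find which breakpoint fired - can be sketched as follows. All names are illustrative, and stuffing a small integer into a `void *` is the usual trick for such opaque-context APIs:

```c
/* Sketch: slot index carried as the opaque context pointer, so the
 * trigger callback recovers its slot directly instead of scanning. */
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS 4

struct bp_slot {
	uintptr_t addr;
	int hits;
};

static struct bp_slot slots[NUM_SLOTS];

/* What register_user_hw_breakpoint(..., context, ...) would store:
 * the slot index encoded as a pointer-sized integer. */
static void *slot_context(int index)
{
	return (void *)(uintptr_t)index;
}

/* Callback: recover the slot from the context - no iteration over
 * the 4 slots required. */
static void bp_triggered(void *context)
{
	int index = (int)(uintptr_t)context;
	slots[index].hits++;
}
```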
Re: [PATCH 1/3] perf: add context field to perf_event
On 06/29/2011 07:08 PM, Frederic Weisbecker wrote:
> I believe it can micro-optimize ptrace through
> register_user_hw_breakpoint() because we could store the index of the
> breakpoint that way, instead of iterating through 4 slots.

Right, I noticed that while writing the patch.

> Perhaps it can help in arm too, adding Will in Cc.
>
> But for register_wide_hw_breakpoint, I'm not sure. kgdb is the main
> user, maybe Jason could find some use of it.

I think an API should not require its users to iterate in their
callbacks, even if it doesn't affect current users for some reason.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 1/3] perf: add context field to perf_event
Hi Frederic,

Thanks for including me on CC.

On Wed, Jun 29, 2011 at 05:08:45PM +0100, Frederic Weisbecker wrote:
> I believe it can micro-optimize ptrace through
> register_user_hw_breakpoint() because we could store the index of the
> breakpoint that way, instead of iterating through 4 slots.
>
> Perhaps it can help in arm too, adding Will in Cc.

Yes, we could store the breakpoint index in there and it would save us
walking over the breakpoints when one fires. Not sure this helps us for
anything else though.

My main gripe with the ptrace interface to hw_breakpoints is that we
have to convert all the breakpoint information from ARM_BREAKPOINT_* to
HW_BREAKPOINT_* and then convert it all back again in the hw_breakpoint
code. Yuck!

Will
[PATCH 1/9] kvm tools: Don't dynamically allocate threadpool jobs
To allow efficient use of shorter-term threadpool jobs, don't allocate
them dynamically upon creation. Instead, store them within 'job'
structures. This will prevent some overhead creating/destroying jobs
which live for a short time.

Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
---
 tools/kvm/include/kvm/threadpool.h |   29 ++---
 tools/kvm/threadpool.c             |   30 ++
 tools/kvm/virtio/9p.c              |   12 ++--
 tools/kvm/virtio/blk.c             |    8
 tools/kvm/virtio/console.c         |   10 +-
 tools/kvm/virtio/rng.c             |   16
 6 files changed, 51 insertions(+), 54 deletions(-)

diff --git a/tools/kvm/include/kvm/threadpool.h b/tools/kvm/include/kvm/threadpool.h
index 62826a6..768239f 100644
--- a/tools/kvm/include/kvm/threadpool.h
+++ b/tools/kvm/include/kvm/threadpool.h
@@ -1,14 +1,37 @@
 #ifndef KVM__THREADPOOL_H
 #define KVM__THREADPOOL_H
 
+#include "kvm/mutex.h"
+
+#include <linux/list.h>
+
 struct kvm;
 
 typedef void (*kvm_thread_callback_fn_t)(struct kvm *kvm, void *data);
 
-int thread_pool__init(unsigned long thread_count);
+struct thread_pool__job {
+	kvm_thread_callback_fn_t	callback;
+	struct kvm			*kvm;
+	void				*data;
+
+	int				signalcount;
+	pthread_mutex_t			mutex;
 
-void *thread_pool__add_job(struct kvm *kvm, kvm_thread_callback_fn_t callback, void *data);
+	struct list_head		queue;
+};
+
+static inline void thread_pool__init_job(struct thread_pool__job *job, struct kvm *kvm, kvm_thread_callback_fn_t callback, void *data)
+{
+	*job = (struct thread_pool__job) {
+		.kvm		= kvm,
+		.callback	= callback,
+		.data		= data,
+		.mutex		= PTHREAD_MUTEX_INITIALIZER,
+	};
+}
+
+int thread_pool__init(unsigned long thread_count);
 
-void thread_pool__do_job(void *job);
+void thread_pool__do_job(struct thread_pool__job *job);
 
 #endif
diff --git a/tools/kvm/threadpool.c b/tools/kvm/threadpool.c
index 2db02184..fdc5fa7 100644
--- a/tools/kvm/threadpool.c
+++ b/tools/kvm/threadpool.c
@@ -6,17 +6,6 @@
 #include <pthread.h>
 #include <stdbool.h>
 
-struct thread_pool__job {
-	kvm_thread_callback_fn_t	callback;
-	struct kvm			*kvm;
-	void				*data;
-
-	int				signalcount;
-	pthread_mutex_t			mutex;
-
-	struct list_head		queue;
-};
-
 static pthread_mutex_t	job_mutex	= PTHREAD_MUTEX_INITIALIZER;
 static pthread_mutex_t	thread_mutex	= PTHREAD_MUTEX_INITIALIZER;
 static pthread_cond_t	job_cond	= PTHREAD_COND_INITIALIZER;
@@ -139,26 +128,11 @@ int thread_pool__init(unsigned long thread_count)
 	return i;
 }
 
-void *thread_pool__add_job(struct kvm *kvm,
-		kvm_thread_callback_fn_t callback, void *data)
-{
-	struct thread_pool__job *job = calloc(1, sizeof(*job));
-
-	*job = (struct thread_pool__job) {
-		.kvm		= kvm,
-		.data		= data,
-		.callback	= callback,
-		.mutex		= PTHREAD_MUTEX_INITIALIZER
-	};
-
-	return job;
-}
-
-void thread_pool__do_job(void *job)
+void thread_pool__do_job(struct thread_pool__job *job)
 {
 	struct thread_pool__job *jobinfo = job;
 
-	if (jobinfo == NULL)
+	if (jobinfo == NULL || jobinfo->callback == NULL)
 		return;
 
 	mutex_lock(&jobinfo->mutex);
diff --git a/tools/kvm/virtio/9p.c b/tools/kvm/virtio/9p.c
index d2d738d..b1a8c01 100644
--- a/tools/kvm/virtio/9p.c
+++ b/tools/kvm/virtio/9p.c
@@ -46,9 +46,9 @@ struct p9_fid {
 };
 
 struct p9_dev_job {
-	struct virt_queue	*vq;
-	struct p9_dev		*p9dev;
-	void			*job_id;
+	struct virt_queue	*vq;
+	struct p9_dev		*p9dev;
+	struct thread_pool__job	job_id;
 };
 
 struct p9_dev {
@@ -696,7 +696,7 @@ static void ioevent_callback(struct kvm *kvm, void *param)
 {
 	struct p9_dev_job *job = param;
 
-	thread_pool__do_job(job->job_id);
+	thread_pool__do_job(&job->job_id);
 }
 
 static bool virtio_p9_pci_io_out(struct ioport *ioport, struct kvm *kvm,
@@ -731,7 +731,7 @@ static bool virtio_p9_pci_io_out(struct ioport *ioport, struct kvm *kvm,
 			.vq		= queue,
 			.p9dev		= p9dev,
 		};
-		job->job_id = thread_pool__add_job(kvm, virtio_p9_do_io, job);
+		thread_pool__init_job(&job->job_id, kvm, virtio_p9_do_io, job);
 
 		ioevent = (struct ioevent) {
 			.io_addr	= p9dev->base_addr +
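The change above replaces heap-allocated jobs (thread_pool__add_job with calloc) with jobs embedded in their owning device structure and (re)initialized in place. A stripped-down userspace sketch of that intrusive/embedded-job pattern, with illustrative names:

```c
/* Sketch: embedded jobs initialized in place, no malloc/free per job. */
#include <assert.h>
#include <stddef.h>

typedef void (*callback_fn)(void *data);

struct pool_job {
	callback_fn callback;
	void *data;
};

/* Analog of thread_pool__init_job(): caller provides the storage. */
static void pool_init_job(struct pool_job *job, callback_fn cb, void *data)
{
	job->callback = cb;
	job->data = data;
}

/* Analog of thread_pool__do_job(): an uninitialized (zeroed) job is a
 * no-op thanks to the NULL-callback check the patch adds. */
static void pool_do_job(struct pool_job *job)
{
	if (job == NULL || job->callback == NULL)
		return;
	job->callback(job->data);
}

/* A device embeds its job instead of holding a pointer to one. */
struct demo_dev {
	int work_done;
	struct pool_job job; /* embedded: lives and dies with the device */
};

static void demo_work(void *data)
{
	((struct demo_dev *)data)->work_done++;
}
```

Note why the patch also adds the `callback == NULL` check: with embedded jobs there is no NULL pointer to distinguish "never initialized", so a zeroed job must be safely ignorable.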
[PATCH 2/9] kvm tools: Process virtio-blk requests in parallel
Process multiple requests within a virtio-blk device's vring in
parallel. Doing so may improve performance in cases when a request which
can be completed using data which is present in a cache is queued after
a request with un-cached data.

bonnie++ benchmarks have shown a 6% improvement with reads, and a 2%
improvement in writes.

Suggested-by: Anthony Liguori <aligu...@us.ibm.com>
Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
---
 tools/kvm/virtio/blk.c |   74 ---
 1 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/tools/kvm/virtio/blk.c b/tools/kvm/virtio/blk.c
index 1fdfc1e..f2a728c 100644
--- a/tools/kvm/virtio/blk.c
+++ b/tools/kvm/virtio/blk.c
@@ -31,6 +31,8 @@
 struct blk_dev_job {
 	struct virt_queue	*vq;
 	struct blk_dev		*bdev;
+	struct iovec		iov[VIRTIO_BLK_QUEUE_SIZE];
+	u16			out, in, head;
 	struct thread_pool__job	job_id;
 };
 
@@ -51,7 +53,8 @@ struct blk_dev {
 	u16			queue_selector;
 
 	struct virt_queue	vqs[NUM_VIRT_QUEUES];
-	struct blk_dev_job	jobs[NUM_VIRT_QUEUES];
+	struct blk_dev_job	jobs[VIRTIO_BLK_QUEUE_SIZE];
+	u16			job_idx;
 	struct pci_device_header pci_hdr;
 };
 
@@ -118,20 +121,26 @@ static bool virtio_blk_pci_io_in(struct ioport *ioport, struct kvm *kvm, u16 por
 	return ret;
 }
 
-static bool virtio_blk_do_io_request(struct kvm *kvm,
-					struct blk_dev *bdev,
-					struct virt_queue *queue)
+static void virtio_blk_do_io_request(struct kvm *kvm, void *param)
 {
-	struct iovec iov[VIRTIO_BLK_QUEUE_SIZE];
 	struct virtio_blk_outhdr *req;
-	ssize_t block_cnt = -1;
-	u16 out, in, head;
 	u8 *status;
+	ssize_t block_cnt;
+	struct blk_dev_job *job;
+	struct blk_dev *bdev;
+	struct virt_queue *queue;
+	struct iovec *iov;
+	u16 out, in, head;
 
-	head		= virt_queue__get_iov(queue, iov, &out, &in, kvm);
-
-	/* head */
-	req		= iov[0].iov_base;
+	block_cnt	= -1;
+	job		= param;
+	bdev		= job->bdev;
+	queue		= job->vq;
+	iov		= job->iov;
+	out		= job->out;
+	in		= job->in;
+	head		= job->head;
+	req		= iov[0].iov_base;
 
 	switch (req->type) {
 	case VIRTIO_BLK_T_IN:
@@ -153,24 +162,27 @@ static bool virtio_blk_do_io_request(struct kvm *kvm,
 	status			= iov[out + in - 1].iov_base;
 	*status			= (block_cnt < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
 
+	mutex_lock(&bdev->mutex);
 	virt_queue__set_used_elem(queue, head, block_cnt);
+	mutex_unlock(&bdev->mutex);
 
-	return true;
+	virt_queue__trigger_irq(queue, bdev->pci_hdr.irq_line, &bdev->isr, kvm);
 }
 
-static void virtio_blk_do_io(struct kvm *kvm, void *param)
+static void virtio_blk_do_io(struct kvm *kvm, struct virt_queue *vq, struct blk_dev *bdev)
 {
-	struct blk_dev_job *job	= param;
-	struct virt_queue *vq;
-	struct blk_dev *bdev;
+	while (virt_queue__available(vq)) {
+		struct blk_dev_job *job = &bdev->jobs[bdev->job_idx++ % VIRTIO_BLK_QUEUE_SIZE];
 
-	vq		= job->vq;
-	bdev		= job->bdev;
-
-	while (virt_queue__available(vq))
-		virtio_blk_do_io_request(kvm, bdev, vq);
+		*job	= (struct blk_dev_job) {
+			.vq	= vq,
+			.bdev	= bdev,
+		};
+		job->head = virt_queue__get_iov(vq, job->iov, &job->out, &job->in, kvm);
 
-	virt_queue__trigger_irq(vq, bdev->pci_hdr.irq_line, &bdev->isr, kvm);
+		thread_pool__init_job(&job->job_id, kvm, virtio_blk_do_io_request, job);
+		thread_pool__do_job(&job->job_id);
+	}
 }
 
 static bool virtio_blk_pci_io_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size, u32 count)
@@ -190,24 +202,14 @@ static bool virtio_blk_pci_io_out(struct ioport *ioport, struct kvm *kvm, u16 po
 		break;
 	case VIRTIO_PCI_QUEUE_PFN: {
 		struct virt_queue *queue;
-		struct blk_dev_job *job;
 		void *p;
 
-		job	= &bdev->jobs[bdev->queue_selector];
-
 		queue		= &bdev->vqs[bdev->queue_selector];
 		queue->pfn	= ioport__read32(data);
 		p		= guest_pfn_to_host(kvm, queue->pfn);
 
 		vring_init(&queue->vring, VIRTIO_BLK_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);
-
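The dispatch loop above takes one job per in-flight request from a fixed ring of preallocated jobs, indexed by a monotonically increasing counter modulo the queue size, and snapshots the request's descriptors (iov/out/in/head) into that job before handing it to the threadpool. The slot-allocation part of that can be sketched as follows (names and size are illustrative):

```c
/* Sketch: fixed ring of preallocated per-request jobs, as in
 * bdev->jobs[bdev->job_idx++ % VIRTIO_BLK_QUEUE_SIZE]. */
#include <assert.h>
#include <stdint.h>

#define QUEUE_SIZE 16 /* stands in for VIRTIO_BLK_QUEUE_SIZE */

struct req_job {
	uint16_t head; /* descriptor index captured for this request */
};

struct blk_dev_sketch {
	struct req_job jobs[QUEUE_SIZE];
	uint16_t job_idx; /* only ever increments; slot = idx % SIZE */
};

/* Grab the next job slot and snapshot the request's head into it. */
struct req_job *next_job(struct blk_dev_sketch *bdev, uint16_t head)
{
	struct req_job *job = &bdev->jobs[bdev->job_idx++ % QUEUE_SIZE];

	job->head = head;
	return job;
}
```

Because at most QUEUE_SIZE requests can be outstanding in the vring, wrapping the index can only reuse a slot whose request has already completed, so no allocation or free list is needed.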
[PATCH 3/9] kvm tools: Allow giving instance names
This will allow tracking instance names and sending commands to specific instances if multiple instances are running. Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/include/kvm/kvm.h |5 +++- tools/kvm/kvm-run.c |5 +++- tools/kvm/kvm.c | 55 ++- tools/kvm/term.c|3 ++ 4 files changed, 65 insertions(+), 3 deletions(-) diff --git a/tools/kvm/include/kvm/kvm.h b/tools/kvm/include/kvm/kvm.h index 7d90d35..5ad3236 100644 --- a/tools/kvm/include/kvm/kvm.h +++ b/tools/kvm/include/kvm/kvm.h @@ -41,9 +41,11 @@ struct kvm { const char *vmlinux; struct disk_image **disks; int nr_disks; + + const char *name; }; -struct kvm *kvm__init(const char *kvm_dev, u64 ram_size); +struct kvm *kvm__init(const char *kvm_dev, u64 ram_size, const char *name); int kvm__max_cpus(struct kvm *kvm); void kvm__init_ram(struct kvm *kvm); void kvm__delete(struct kvm *kvm); @@ -61,6 +63,7 @@ bool kvm__deregister_mmio(struct kvm *kvm, u64 phys_addr); void kvm__pause(void); void kvm__continue(void); void kvm__notify_paused(void); +int kvm__get_pid_by_instance(const char *name); /* * Debugging diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c index 0dece2d..a4abf76 100644 --- a/tools/kvm/kvm-run.c +++ b/tools/kvm/kvm-run.c @@ -69,6 +69,7 @@ static const char *network; static const char *host_ip_addr; static const char *guest_mac; static const char *script; +static const char *guest_name; static bool single_step; static bool readonly_image[MAX_DISK_IMAGES]; static bool vnc; @@ -132,6 +133,8 @@ static int virtio_9p_rootdir_parser(const struct option *opt, const char *arg, i static const struct option options[] = { OPT_GROUP(Basic options:), + OPT_STRING('\0', name, guest_name, guest name, + A name for the guest), OPT_INTEGER('c', cpus, nrcpus, Number of CPUs), OPT_U64('m', mem, ram_size, Virtual machine memory size in MiB.), OPT_CALLBACK('d', disk, NULL, image, Disk image, img_name_parser), @@ -546,7 +549,7 @@ int kvm_cmd_run(int argc, const char **argv, const char *prefix) 
term_init(); - kvm = kvm__init(kvm_dev, ram_size); + kvm = kvm__init(kvm_dev, ram_size, guest_name); ioeventfd__init(); diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c index c400c70..4f723a6 100644 --- a/tools/kvm/kvm.c +++ b/tools/kvm/kvm.c @@ -113,11 +113,60 @@ static struct kvm *kvm__new(void) return kvm; } +static void kvm__create_pidfile(struct kvm *kvm) +{ + int fd; + char full_name[PATH_MAX], pid[10]; + + if (!kvm-name) + return; + + mkdir(/var/run/kvm-tools, 0777); + sprintf(full_name, /var/run/kvm-tools/%s.pid, kvm-name); + fd = open(full_name, O_CREAT | O_WRONLY, 0666); + sprintf(pid, %u\n, getpid()); + if (write(fd, pid, strlen(pid)) = 0) + die(Failed creating PID file); + close(fd); +} + +static void kvm__remove_pidfile(struct kvm *kvm) +{ + char full_name[PATH_MAX]; + + if (!kvm-name) + return; + + sprintf(full_name, /var/run/kvm-tools/%s.pid, kvm-name); + unlink(full_name); +} + +int kvm__get_pid_by_instance(const char *name) +{ + int fd, pid; + char pid_str[10], pid_file[PATH_MAX]; + + sprintf(pid_file, /var/run/kvm-tools/%s.pid, name); + fd = open(pid_file, O_RDONLY); + if (fd 0) + return -1; + + if (read(fd, pid_str, 10) == 0) + return -1; + + pid = atoi(pid_str); + if (pid 0) + return -1; + + return pid; +} + void kvm__delete(struct kvm *kvm) { kvm__stop_timer(kvm); munmap(kvm-ram_start, kvm-ram_size); + kvm__remove_pidfile(kvm); free(kvm); } @@ -237,7 +286,7 @@ int kvm__max_cpus(struct kvm *kvm) return ret; } -struct kvm *kvm__init(const char *kvm_dev, u64 ram_size) +struct kvm *kvm__init(const char *kvm_dev, u64 ram_size, const char *name) { struct kvm_pit_config pit_config = { .flags = 0, }; struct kvm *kvm; @@ -300,6 +349,10 @@ struct kvm *kvm__init(const char *kvm_dev, u64 ram_size) if (ret 0) die_perror(KVM_CREATE_IRQCHIP ioctl); + kvm-name = name; + + kvm__create_pidfile(kvm); + return kvm; } diff --git a/tools/kvm/term.c b/tools/kvm/term.c index 9947223..a0cb03f 100644 --- a/tools/kvm/term.c +++ b/tools/kvm/term.c @@ -9,7 +9,9 @@ #include 
kvm/read-write.h #include kvm/term.h #include kvm/util.h +#include kvm/kvm.h +extern struct kvm *kvm; static struct termios orig_term; int term_escape_char = 0x01; /* ctrl-a is used for escape */ @@ -32,6 +34,7 @@ int term_getc(int who) if (term_got_escape) { term_got_escape = false; if (c == 'x') { +
[PATCH 4/9] kvm tools: Provide instance name when running 'kvm debug'
Instead of sending a signal to the first instance found, send it to a
specific instance.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/kvm-debug.c |   19 +++++++++++++++----
 1 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/tools/kvm/kvm-debug.c b/tools/kvm/kvm-debug.c
index 58782dd..432ae84 100644
--- a/tools/kvm/kvm-debug.c
+++ b/tools/kvm/kvm-debug.c
@@ -1,11 +1,22 @@
-#include <stdio.h>
-#include <string.h>
-
 #include "kvm/util.h"
 #include "kvm/kvm-cmd.h"
 #include "kvm/kvm-debug.h"
+#include "kvm/kvm.h"
+
+#include <stdio.h>
+#include <string.h>
+#include <signal.h>
 
 int kvm_cmd_debug(int argc, const char **argv, const char *prefix)
 {
-	return system("kill -3 $(pidof kvm)");
+	int pid;
+
+	if (argc != 1)
+		die("Usage: kvm debug [instance name]\n");
+
+	pid = kvm__get_pid_by_instance(argv[0]);
+	if (pid < 0)
+		die("Failed locating instance name");
+
+	return kill(pid, SIGQUIT);
 }
--
1.7.6
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 5/9] kvm tools: Provide instance name when running 'kvm pause'
Instead of sending a signal to the first instance found, send it to a
specific instance.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/kvm-pause.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/tools/kvm/kvm-pause.c b/tools/kvm/kvm-pause.c
index fdf8714..0cb6f29 100644
--- a/tools/kvm/kvm-pause.c
+++ b/tools/kvm/kvm-pause.c
@@ -5,9 +5,18 @@
 #include "kvm/util.h"
 #include "kvm/kvm-cmd.h"
 #include "kvm/kvm-pause.h"
+#include "kvm/kvm.h"
 
 int kvm_cmd_pause(int argc, const char **argv, const char *prefix)
 {
-	signal(SIGUSR2, SIG_IGN);
-	return system("kill -USR2 $(pidof kvm)");
+	int pid;
+
+	if (argc != 1)
+		die("Usage: kvm pause [instance name]\n");
+
+	pid = kvm__get_pid_by_instance(argv[0]);
+	if (pid < 0)
+		die("Failed locating instance name");
+
+	return kill(pid, SIGUSR2);
 }
--
1.7.6
[PATCH 7/9] kvm tools: Advise memory allocated for guest RAM as KSM mergeable
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/kvm.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c
index 4f723a6..15bcf08 100644
--- a/tools/kvm/kvm.c
+++ b/tools/kvm/kvm.c
@@ -345,6 +345,8 @@ struct kvm *kvm__init(const char *kvm_dev, u64 ram_size, const char *name)
 	if (kvm->ram_start == MAP_FAILED)
 		die("out of memory");
 
+	madvise(kvm->ram_start, kvm->ram_size, MADV_MERGEABLE);
+
 	ret = ioctl(kvm->vm_fd, KVM_CREATE_IRQCHIP);
 	if (ret < 0)
 		die_perror("KVM_CREATE_IRQCHIP ioctl");
--
1.7.6
[PATCH 6/9] kvm tools: Add virtio-balloon device
From the virtio spec: The virtio memory balloon device is a primitive device for managing guest memory: the device asks for a certain amount of memory, and the guest supplies it (or withdraws it, if the device has more than it asks for). This allows the guest to adapt to changes in allowance of underlying physical memory. To activate the virtio-balloon device run kvm tools with the '--balloon' command line parameter. Current implementation listens for two signals: - SIGKVMADDMEM: Adds 1M to the balloon driver (inflate). This will decrease available memory within the guest. - SIGKVMDELMEM: Remove 1M from the balloon driver (deflate). This will increase available memory within the guest. Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/Makefile |1 + tools/kvm/include/kvm/kvm.h|3 + tools/kvm/include/kvm/virtio-balloon.h |8 + tools/kvm/include/kvm/virtio-pci-dev.h |1 + tools/kvm/kvm-run.c|6 + tools/kvm/virtio/balloon.c | 265 6 files changed, 284 insertions(+), 0 deletions(-) create mode 100644 tools/kvm/include/kvm/virtio-balloon.h create mode 100644 tools/kvm/virtio/balloon.c diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile index d368c22..a1b2f4c 100644 --- a/tools/kvm/Makefile +++ b/tools/kvm/Makefile @@ -40,6 +40,7 @@ OBJS += virtio/console.o OBJS += virtio/core.o OBJS += virtio/net.o OBJS += virtio/rng.o +OBJS+= virtio/balloon.o OBJS += disk/blk.o OBJS += disk/qcow.o OBJS += disk/raw.o diff --git a/tools/kvm/include/kvm/kvm.h b/tools/kvm/include/kvm/kvm.h index 5ad3236..1fdfcf7 100644 --- a/tools/kvm/include/kvm/kvm.h +++ b/tools/kvm/include/kvm/kvm.h @@ -6,6 +6,7 @@ #include stdbool.h #include linux/types.h #include time.h +#include signal.h #define KVM_NR_CPUS(255) @@ -17,6 +18,8 @@ #define SIGKVMEXIT (SIGRTMIN + 0) #define SIGKVMPAUSE(SIGRTMIN + 1) +#define SIGKVMADDMEM (SIGRTMIN + 2) +#define SIGKVMDELMEM (SIGRTMIN + 3) struct kvm { int sys_fd; /* For system ioctls(), i.e. 
/dev/kvm */ diff --git a/tools/kvm/include/kvm/virtio-balloon.h b/tools/kvm/include/kvm/virtio-balloon.h new file mode 100644 index 000..eb49fd4 --- /dev/null +++ b/tools/kvm/include/kvm/virtio-balloon.h @@ -0,0 +1,8 @@ +#ifndef KVM__BLN_VIRTIO_H +#define KVM__BLN_VIRTIO_H + +struct kvm; + +void virtio_bln__init(struct kvm *kvm); + +#endif /* KVM__BLN_VIRTIO_H */ diff --git a/tools/kvm/include/kvm/virtio-pci-dev.h b/tools/kvm/include/kvm/virtio-pci-dev.h index ca373df..4eee831 100644 --- a/tools/kvm/include/kvm/virtio-pci-dev.h +++ b/tools/kvm/include/kvm/virtio-pci-dev.h @@ -12,6 +12,7 @@ #define PCI_DEVICE_ID_VIRTIO_BLK 0x1001 #define PCI_DEVICE_ID_VIRTIO_CONSOLE 0x1003 #define PCI_DEVICE_ID_VIRTIO_RNG 0x1004 +#define PCI_DEVICE_ID_VIRTIO_BLN 0x1005 #define PCI_DEVICE_ID_VIRTIO_P90x1009 #define PCI_DEVICE_ID_VESA 0x2000 diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c index a4abf76..3b1d586 100644 --- a/tools/kvm/kvm-run.c +++ b/tools/kvm/kvm-run.c @@ -18,6 +18,7 @@ #include kvm/virtio-net.h #include kvm/virtio-console.h #include kvm/virtio-rng.h +#include kvm/virtio-balloon.h #include kvm/disk-image.h #include kvm/util.h #include kvm/pci.h @@ -74,6 +75,7 @@ static bool single_step; static bool readonly_image[MAX_DISK_IMAGES]; static bool vnc; static bool sdl; +static bool balloon; extern bool ioport_debug; extern int active_console; extern int debug_iodelay; @@ -145,6 +147,7 @@ static const struct option options[] = { OPT_STRING('\0', kvm-dev, kvm_dev, kvm-dev, KVM device file), OPT_CALLBACK('\0', virtio-9p, NULL, dirname,tag_name, Enable 9p over virtio, virtio_9p_rootdir_parser), + OPT_BOOLEAN('\0', balloon, balloon, Enable virtio balloon), OPT_BOOLEAN('\0', vnc, vnc, Enable VNC framebuffer), OPT_BOOLEAN('\0', sdl, sdl, Enable SDL framebuffer), @@ -629,6 +632,9 @@ int kvm_cmd_run(int argc, const char **argv, const char *prefix) while (virtio_rng--) virtio_rng__init(kvm); + if (balloon) + virtio_bln__init(kvm); + if (!network) network = DEFAULT_NETWORK; 
diff --git a/tools/kvm/virtio/balloon.c b/tools/kvm/virtio/balloon.c new file mode 100644 index 000..ab9ccb7 --- /dev/null +++ b/tools/kvm/virtio/balloon.c @@ -0,0 +1,265 @@ +#include kvm/virtio-balloon.h + +#include kvm/virtio-pci-dev.h + +#include kvm/disk-image.h +#include kvm/virtio.h +#include kvm/ioport.h +#include kvm/util.h +#include kvm/kvm.h +#include kvm/pci.h +#include kvm/threadpool.h +#include kvm/irq.h +#include kvm/ioeventfd.h + +#include linux/virtio_ring.h +#include linux/virtio_balloon.h + +#include
[PATCH 8/9] kvm tools: Add 'kvm balloon' command
Add a command to allow easily inflate/deflate the balloon driver in running instances. Usage: kvm balloon [command] [instance name] [size] command is either inflate or deflate, and size is represented in MB. Target instance must be named (started with '--name'). Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/Makefile |1 + tools/kvm/include/kvm/kvm-balloon.h |6 ++ tools/kvm/kvm-balloon.c | 34 ++ tools/kvm/kvm-cmd.c | 12 +++- tools/kvm/virtio/balloon.c |8 5 files changed, 52 insertions(+), 9 deletions(-) create mode 100644 tools/kvm/include/kvm/kvm-balloon.h create mode 100644 tools/kvm/kvm-balloon.c diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile index a1b2f4c..4823c77 100644 --- a/tools/kvm/Makefile +++ b/tools/kvm/Makefile @@ -50,6 +50,7 @@ OBJS += kvm-cmd.o OBJS += kvm-debug.o OBJS += kvm-help.o OBJS+= kvm-pause.o +OBJS+= kvm-balloon.o OBJS += kvm-run.o OBJS += mptable.o OBJS += rbtree.o diff --git a/tools/kvm/include/kvm/kvm-balloon.h b/tools/kvm/include/kvm/kvm-balloon.h new file mode 100644 index 000..f5f92b9 --- /dev/null +++ b/tools/kvm/include/kvm/kvm-balloon.h @@ -0,0 +1,6 @@ +#ifndef KVM__BALLOON_H +#define KVM__BALLOON_H + +int kvm_cmd_balloon(int argc, const char **argv, const char *prefix); + +#endif diff --git a/tools/kvm/kvm-balloon.c b/tools/kvm/kvm-balloon.c new file mode 100644 index 000..277cada --- /dev/null +++ b/tools/kvm/kvm-balloon.c @@ -0,0 +1,34 @@ +#include stdio.h +#include string.h +#include signal.h + +#include kvm/util.h +#include kvm/kvm-cmd.h +#include kvm/kvm-balloon.h +#include kvm/kvm.h + +int kvm_cmd_balloon(int argc, const char **argv, const char *prefix) +{ + int pid; + int amount, i; + int inflate = 0; + + if (argc != 3) + die(Usage: kvm balloon [command] [instance name] [amount]\n); + + pid = kvm__get_pid_by_instance(argv[1]); + if (pid 0) + die(Failed locating instance name); + + if (strcmp(argv[0], inflate) == 0) + inflate = 1; + else if (strcmp(argv[0], deflate)) + die(command can be either 
'inflate' or 'deflate'); + + amount = atoi(argv[2]); + + for (i = 0; i amount; i++) + kill(pid, inflate ? SIGKVMADDMEM : SIGKVMDELMEM); + + return 0; +} diff --git a/tools/kvm/kvm-cmd.c b/tools/kvm/kvm-cmd.c index ffbc4ff..1598781 100644 --- a/tools/kvm/kvm-cmd.c +++ b/tools/kvm/kvm-cmd.c @@ -7,16 +7,18 @@ /* user defined header files */ #include kvm/kvm-debug.h #include kvm/kvm-pause.h +#include kvm/kvm-balloon.h #include kvm/kvm-help.h #include kvm/kvm-cmd.h #include kvm/kvm-run.h struct cmd_struct kvm_commands[] = { - { pause, kvm_cmd_pause, NULL, 0 }, - { debug, kvm_cmd_debug, NULL, 0 }, - { help, kvm_cmd_help, NULL, 0 }, - { run, kvm_cmd_run, kvm_run_help, 0 }, - { NULL,NULL, NULL, 0 }, + { pause, kvm_cmd_pause, NULL, 0 }, + { debug, kvm_cmd_debug, NULL, 0 }, + { balloon,kvm_cmd_balloon,NULL, 0 }, + { help, kvm_cmd_help, NULL, 0 }, + { run,kvm_cmd_run,kvm_run_help, 0 }, + { NULL, NULL, NULL, 0 }, }; /* diff --git a/tools/kvm/virtio/balloon.c b/tools/kvm/virtio/balloon.c index ab9ccb7..854d04b 100644 --- a/tools/kvm/virtio/balloon.c +++ b/tools/kvm/virtio/balloon.c @@ -39,7 +39,7 @@ struct bln_dev { /* virtio queue */ u16 queue_selector; struct virt_queue vqs[NUM_VIRT_QUEUES]; - void*jobs[NUM_VIRT_QUEUES]; + struct thread_pool__job jobs[NUM_VIRT_QUEUES]; struct virtio_balloon_config config; }; @@ -174,13 +174,13 @@ static bool virtio_bln_pci_io_out(struct ioport *ioport, struct kvm *kvm, u16 po vring_init(queue-vring, VIRTIO_BLN_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN); - bdev.jobs[bdev.queue_selector] = thread_pool__add_job(kvm, virtio_bln_do_io, queue); + thread_pool__init_job(bdev.jobs[bdev.queue_selector], kvm, virtio_bln_do_io, queue); ioevent = (struct ioevent) { .io_addr= bdev.base_addr + VIRTIO_PCI_QUEUE_NOTIFY, .io_len = sizeof(u16), .fn = ioevent_callback, - .fn_ptr = bdev.jobs[bdev.queue_selector], + .fn_ptr = bdev.jobs[bdev.queue_selector], .datamatch = bdev.queue_selector, .fn_kvm = kvm, .fd =
[PATCH 9/9] kvm tools: Stop VCPUs before freeing struct kvm
Not stopping VCPUs before leads to seg faults and other errors due to
synchronization between threads.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/term.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/tools/kvm/term.c b/tools/kvm/term.c
index a0cb03f..2a3e1f0 100644
--- a/tools/kvm/term.c
+++ b/tools/kvm/term.c
@@ -10,6 +10,7 @@
 #include "kvm/term.h"
 #include "kvm/util.h"
 #include "kvm/kvm.h"
+#include "kvm/kvm-cpu.h"
 
 extern struct kvm *kvm;
 static struct termios orig_term;
@@ -34,6 +35,7 @@ int term_getc(int who)
 	if (term_got_escape) {
 		term_got_escape = false;
 		if (c == 'x') {
+			kvm_cpu__reboot();
 			kvm__delete(kvm);
 			printf("\n # KVM session terminated.\n");
 			exit(1);
--
1.7.6
Re: kvm monitor socket - connection refused
Hi Michael,

On 06/29/11 11:52, Michael Tokarev wrote:
> The only other explanation I can think of is that you tried to run two
> instances of kvm, and when the second instance initialized it re-created
> the monitor socket but failed later (e.g. when initializing the network
> or something else) and exited, but left the stray socket. (JFYI, you can
> remove a unix-domain socket where some process is listening, and create
> another - that one will really be a different socket, even if named the
> same way -- just like you can re-create a plain file the same way.)

This may have been what happened. I'll try to reproduce this scenario.

Is there no way to prevent the accidental overwriting of a monitor socket
that is still being used? I.e. is there no way for kvm to realize that the
socket is in use and complain?

> In any way, there hasn't been any problems/bugs in that area for ages.

This is what I was hoping to hear! :)

Thanks!
Iordan
Any problem if I use ionice on KVM?
I keep running into a situation where a KVM guest will lock up on some kind of disk activity, it seems. System load goes way up but CPU % stays relatively low, based on data a crond script collects before everything goes south. As a result, the host becomes unresponsive as well. Initially it appeared to be due to a routine maintenance script, which I resolved with a combination of noatime and ionice on the script. However, it now appears that some other event/process is also causing a lockup at random points in time. It's practically impossible (or I'm too much of a noob) to troubleshoot and figure out what exactly is causing this. So I'm wondering if it's safe to run ionice on the KVM process so that a runaway guest will not pull down the host with it. That would perhaps in some ways allow me to try to figure out what is going on.
Re: [PATCH v3 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit
On Wed, 29 Jun 2011, Glauber Costa wrote:
> This patch is simple, put in a different commit so it can be more easily
> shared between guest and hypervisor. It just defines a named constant to
> indicate the enable bit for KVM-specific MSRs.
>
> Signed-off-by: Glauber Costa glom...@redhat.com
> CC: Rik van Riel r...@redhat.com
> CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
> CC: Peter Zijlstra pet...@infradead.org
> CC: Avi Kivity a...@redhat.com
> CC: Anthony Liguori aligu...@us.ibm.com
> CC: Eric B Munson emun...@mgebm.net

My mail provider seems to have dropped patch 1 of the series so I can't
reply directly to it, please add my Tested-by there as well.

Tested-by: Eric B Munson emun...@mgebm.net
Re: [PATCH v3 3/9] KVM-HDR: KVM Steal time implementation
On Wed, 29 Jun 2011, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest
> information about how much time was spent running other processes outside
> the VM. This is per-vcpu, and using the kvmclock structure for that is an
> abuse we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time.
>
> This patch contains the headers for it. I am keeping it separate to
> facilitate backports to people who want to backport the kernel part but
> not the hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa glom...@redhat.com
> CC: Rik van Riel r...@redhat.com
> CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
> CC: Peter Zijlstra pet...@infradead.org
> CC: Avi Kivity a...@redhat.com
> CC: Anthony Liguori aligu...@us.ibm.com
> CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net
Re: [PATCH v3 4/9] KVM-HV: KVM Steal time implementation
On Wed, 29 Jun 2011, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest
> information about how much time was spent running other processes outside
> the VM. This is per-vcpu, and using the kvmclock structure for that is an
> abuse we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time.
>
> This patch contains the hypervisor part for it. I am keeping it separate
> from the headers to facilitate backports to people who want to backport
> the kernel part but not the hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa glom...@redhat.com
> CC: Rik van Riel r...@redhat.com
> CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
> CC: Peter Zijlstra pet...@infradead.org
> CC: Avi Kivity a...@redhat.com
> CC: Anthony Liguori aligu...@us.ibm.com
> CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net
Re: [PATCH v3 5/9] KVM-HV: use schedstats to calculate steal time
On Wed, 29 Jun 2011, Glauber Costa wrote:
> SCHEDSTATS provide a precise source of information about time tasks spent
> on a runqueue, but not running (among other things). It is especially
> useful for the steal time implementation, because it doesn't record halt
> time at all.
>
> To avoid a hard dependency on schedstats, since it is possible one won't
> want to record statistics about all processes running, the previous
> method of time measurement on put/load vcpu is kept for !SCHEDSTATS.
>
> Signed-off-by: Glauber Costa glom...@redhat.com
> CC: Rik van Riel r...@redhat.com
> CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
> CC: Peter Zijlstra pet...@infradead.org
> CC: Avi Kivity a...@redhat.com
> CC: Anthony Liguori aligu...@us.ibm.com
> CC: Eric B Munson emun...@mgebm.net
> CC: Marcelo Tosatti mtosa...@redhat.com

Tested-by: Eric B Munson emun...@mgebm.net
Re: [PATCH v3 6/9] KVM-GST: Add a pv_ops stub for steal time
On Wed, 29 Jun 2011, Glauber Costa wrote:
> This patch adds a function pointer in one of the many paravirt_ops
> structs, to allow guests to register a steal time function.
>
> Signed-off-by: Glauber Costa glom...@redhat.com
> CC: Rik van Riel r...@redhat.com
> CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
> CC: Peter Zijlstra pet...@infradead.org
> CC: Avi Kivity a...@redhat.com
> CC: Anthony Liguori aligu...@us.ibm.com
> CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net
Re: [PATCH v3 7/9] KVM-GST: KVM Steal time accounting
On Wed, 29 Jun 2011, Glauber Costa wrote:
> This patch accounts steal time in kernel/sched. I kept it from the last
> proposal, because I still see advantages in it: doing it here gives us
> easier access to scheduler variables such as the cpu rq. The next patch
> shows an example of usage for it.
>
> Since functions like account_idle_time() can be called from multiple
> places, not only account_process_tick(), steal time grabbing is repeated
> in each account function separately.
>
> Signed-off-by: Glauber Costa glom...@redhat.com
> CC: Rik van Riel r...@redhat.com
> CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
> CC: Peter Zijlstra pet...@infradead.org
> CC: Avi Kivity a...@redhat.com
> CC: Anthony Liguori aligu...@us.ibm.com
> CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net
Re: [PATCH v3 8/9] KVM-GST: adjust scheduler cpu power
On Wed, 29 Jun 2011, Glauber Costa wrote:
> This is a first proposal for using steal time information to influence
> the scheduler. There are a lot of optimizations and fine grained
> adjustments to be done, but it is working reasonably well so far for me
> (mostly).
>
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, especially in overcommitted scenarios. Previously, the guest
> scheduler would wrongly think that all cpus have the same ability to run
> processes, lowering the overall throughput.
>
> Signed-off-by: Glauber Costa glom...@redhat.com
> CC: Rik van Riel r...@redhat.com
> CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
> CC: Peter Zijlstra pet...@infradead.org
> CC: Avi Kivity a...@redhat.com
> CC: Anthony Liguori aligu...@us.ibm.com
> CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net
Re: [PATCH v3 9/9] KVM-GST: KVM Steal time registration
On Wed, 29 Jun 2011, Glauber Costa wrote:
> Register steal time within KVM. Every time we sample the steal time
> information, we update a local variable that tells us what the last time
> read was. We then account the difference.
>
> Signed-off-by: Glauber Costa glom...@redhat.com
> CC: Rik van Riel r...@redhat.com
> CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
> CC: Peter Zijlstra pet...@infradead.org
> CC: Avi Kivity a...@redhat.com
> CC: Anthony Liguori aligu...@us.ibm.com
> CC: Eric B Munson emun...@mgebm.net

Tested-by: Eric B Munson emun...@mgebm.net
[PATCH] virt: Cleaning up debug messages
In order to make it easier for people to read KVM autotest logs, went through the virt module and the kvm test, removing some not overly useful debug messages and modified others. Some things that were modified: 1) Removed MAC address management messages 2) Removed ellipses from most of the debug messages, as they're unnecessary Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/kvm.py |2 - client/virt/kvm_vm.py | 15 --- client/virt/virt_env_process.py | 50 ++ client/virt/virt_test_setup.py | 18 +++--- client/virt/virt_test_utils.py | 14 +- client/virt/virt_utils.py | 16 client/virt/virt_vm.py | 13 - 7 files changed, 57 insertions(+), 71 deletions(-) diff --git a/client/tests/kvm/kvm.py b/client/tests/kvm/kvm.py index 84c361e..c69ad46 100644 --- a/client/tests/kvm/kvm.py +++ b/client/tests/kvm/kvm.py @@ -45,8 +45,6 @@ class kvm(test.test): virt_utils.set_log_file_dir(self.debugdir) # Open the environment file -logging.info(Unpickling env. You may see some harmless error - messages.) 
env_filename = os.path.join(self.bindir, params.get(env, env)) env = virt_utils.Env(env_filename, self.env_version) diff --git a/client/virt/kvm_vm.py b/client/virt/kvm_vm.py index b7afeeb..a2f22b4 100644 --- a/client/virt/kvm_vm.py +++ b/client/virt/kvm_vm.py @@ -393,9 +393,6 @@ class VM(virt_vm.BaseVM): qemu_binary = virt_utils.get_path(root_dir, params.get(qemu_binary, qemu)) -# Get the output of 'qemu -help' (log a message in case this call never -# returns or causes some other kind of trouble) -logging.debug(Getting output of 'qemu -help') help = commands.getoutput(%s -help % qemu_binary) # Start constructing the qemu command @@ -877,11 +874,11 @@ class VM(virt_vm.BaseVM): if self.is_dead(): return -logging.debug(Destroying VM with PID %s..., self.get_pid()) +logging.debug(Destroying VM with PID %s, self.get_pid()) if gracefully and self.params.get(shutdown_command): # Try to destroy with shell command -logging.debug(Trying to shutdown VM with shell command...) +logging.debug(Trying to shutdown VM with shell command) try: session = self.login() except (virt_utils.LoginError, virt_vm.VMError), e: @@ -891,7 +888,7 @@ class VM(virt_vm.BaseVM): # Send the shutdown command session.sendline(self.params.get(shutdown_command)) logging.debug(Shutdown command sent; waiting for VM - to go down...) + to go down) if virt_utils.wait_for(self.is_dead, 60, 1, 1): logging.debug(VM is down) return @@ -900,7 +897,7 @@ class VM(virt_vm.BaseVM): if self.monitor: # Try to destroy with a monitor command -logging.debug(Trying to kill VM with monitor command...) +logging.debug(Trying to kill VM with monitor command) try: self.monitor.quit() except kvm_monitor.MonitorError, e: @@ -912,8 +909,8 @@ class VM(virt_vm.BaseVM): return # If the VM isn't dead yet... -logging.debug(Cannot quit normally; sending a kill to close the - deal...) 
+logging.debug(Cannot quit normally, sending a kill to close the + deal) virt_utils.kill_process_tree(self.process.get_pid(), 9) # Wait for the VM to be really dead if virt_utils.wait_for(self.is_dead, 5, 0.5, 0.5): diff --git a/client/virt/virt_env_process.py b/client/virt/virt_env_process.py index b237ed2..b47a9a5 100644 --- a/client/virt/virt_env_process.py +++ b/client/virt/virt_env_process.py @@ -29,11 +29,10 @@ def preprocess_image(test, params): create_image = False if params.get(force_create_image) == yes: -logging.debug('force_create_image' specified; creating image...) +logging.debug(Param 'force_create_image' specified, creating image) create_image = True elif (params.get(create_image) == yes and not os.path.exists(image_filename)): -logging.debug(Creating image...) create_image = True if create_image and not virt_vm.create_image(params, test.bindir): @@ -50,10 +49,10 @@ def preprocess_vm(test, params, env, name): @param env: The environment (a dict-like object). @param name: The name of the VM object.
Re: [PATCH v2 00/11] KVM in-guest performance monitoring
On 06/13/2011 04:34 PM, Avi Kivity wrote:
> This patchset exposes an emulated version 1 architectural performance
> monitoring unit to KVM guests. The PMU is emulated using perf_events, so
> the host kernel can multiplex host-wide, host-user, and the guest on
> available resources.
>
> Caveats:
> - counters that have PMI (interrupt) enabled stop counting after the
>   interrupt is signalled. This is because we need one-shot samples that
>   keep counting, which perf doesn't support yet
> - some combinations of INV and CMASK are not supported
> - counters keep on counting in the host as well as the guest
>
> perf maintainers: please consider the first three patches for merging
> (the first two make sense even without the rest). If you're familiar
> with the Intel PMU, please review patch 5 as well - it effectively undoes
> all your work of abstracting the PMU into perf_events by unabstracting
> perf_events into what is hoped is a very similar PMU.
>
> v2:
> - don't pass perf_event handler context to the callback; extract it via
>   the 'event' parameter instead
> - RDPMC emulation and interception
> - CR4.PCE emulation

Peter, can you look at 1-3 please?

-- 
error compiling committee.c: too many arguments to function
Re: KVM call agenda for June 28
Am 28.06.2011 21:41, schrieb Marcelo Tosatti: On Tue, Jun 28, 2011 at 02:38:15PM +0100, Stefan Hajnoczi wrote: On Mon, Jun 27, 2011 at 3:32 PM, Juan Quintela quint...@redhat.com wrote: Please send in any agenda items you are interested in covering. Live block copy and image streaming: * The differences between Marcelo and Kevin's approaches * Which approach to choose and who can help implement it After more thinking, i dislike the image metadata approach. Management must carry the information anyway, so its pointless to duplicate it inside an image format. After the discussion today, i think the internal mechanism and interface should be different for copy and stream: block copy -- With backing files: 1) base - sn1 - sn2 2) base - copy Without: 1) source 2) destination Copy is only valid after switch has been performed. Same interface and crash recovery characteristics for all image formats. If management wants to support continuation, it must specify blkcopy:sn2:copy on startup. We can use almost the same interface and still have an image that is always valid (assuming that you provide the right format on the command line, which is already a requirement today). base - sn1 - sn2 - copy.raw You just add the file name for an external COW file, like blkcopy:sn2:copy.raw:copy.cow (we can even have a default filename for HMP instead of requiring to specify it, like $IMAGE.cow) and if the destination doesn't support backing files by itself, blkcopy creates the COW overlay BlockDriverState that uses this file. No difference for management at all, except that it needs to allow access to another file. stream -- 1) base - remote 2) base - remote - local 3) base - local local image is always valid. Requires backing file support. With the above, this restriction wouldn't apply any more. Also I don't think we should mix approaches. Either both block copy and image streaming use backing files, or none of them do. 
Mixing means duplicating more code, and even worse, that you can't stop a block copy in the middle and continue with streaming (which I believe is a really valuable feature to have). Kevin
Re: [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary
On 06/22/2011 05:29 PM, Xiao Guangrong wrote: If the range spans a page boundary, the mmio access can be broken; fix it as write emulation does. And we already have the guest physical address, so use it to read guest data directly to avoid walking the guest page table again Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c | 41 - 1 files changed, 32 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0b803f0..eb27be4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3944,14 +3944,13 @@ out: } EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system); -static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt, - unsigned long addr, - void *val, - unsigned int bytes, - struct x86_exception *exception) +static int emulator_read_emulated_onepage(unsigned long addr, + void *val, + unsigned int bytes, + struct x86_exception *exception, + struct kvm_vcpu *vcpu) { - struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); - gpa_t gpa; + gpa_t gpa; int handled; if (vcpu->mmio_read_completed) { @@ -3971,8 +3970,7 @@ static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt, if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE) goto mmio; - if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception) - == X86EMUL_CONTINUE) + if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes)) return X86EMUL_CONTINUE; This doesn't perform the cpl check. I suggest dropping this part for now and doing it later.
Re: virtio scsi host draft specification, v3
On 06/12/2011 09:51 AM, Michael S. Tsirkin wrote: If a device uses more than one queue it is the responsibility of the device to ensure strict request ordering. Maybe I misunderstand - how can this be the responsibility of the device if the device does not get the information about the original ordering of the requests? For example, if the driver is crazy enough to put all write requests on one queue and all barriers on another one, how is the device supposed to ensure ordering? I agree here, in fact I misread Hannes's comment as if a driver uses more than one queue it is responsibility of the driver to ensure strict request ordering. If you send requests to different queues, you know that those requests are independent. I don't think anything else is feasible in the virtio framework. Paolo
Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
On 06/22/2011 05:29 PM, Xiao Guangrong wrote: Introduce vcpu_gva_to_gpa to translate a gva to a gpa; we can use it to clean up the code shared between read emulation and write emulation Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c | 38 +- 1 files changed, 29 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index eb27be4..c29ef96 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3944,6 +3944,27 @@ out: } EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system); +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, + gpa_t *gpa, struct x86_exception *exception, + bool write) +{ + u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + + if (write) + access |= PFERR_WRITE_MASK; Needs fetch as well so NX/SMEP can work. + + *gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
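As an aside, Avi's point about the missing fetch bit can be sketched in plain userspace C. The PFERR_* values follow the x86 page-fault error-code layout; build_access() is a hypothetical stand-in for the access computation inside vcpu_gva_to_gpa, not the KVM code itself:

```c
#include <stdint.h>

/* x86 page-fault error-code bits (layout per the architecture manuals). */
#define PFERR_WRITE_MASK  (1u << 1)
#define PFERR_USER_MASK   (1u << 2)
#define PFERR_FETCH_MASK  (1u << 4)

/*
 * Hypothetical model of the access computation in vcpu_gva_to_gpa(),
 * extended with the fetch case so the page walker can enforce NX/SMEP.
 */
uint32_t build_access(int cpl, int write, int fetch)
{
	uint32_t access = (cpl == 3) ? PFERR_USER_MASK : 0;

	if (write)
		access |= PFERR_WRITE_MASK;
	if (fetch)
		access |= PFERR_FETCH_MASK;
	return access;
}
```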
Re: virtio scsi host draft specification, v3
On 06/14/2011 10:39 AM, Hannes Reinecke wrote: If, however, we decide to expose some details about the backend, we could be using the values from the backend directly. EG we could be forwarding the SCSI target port identifier here (if backed by real hardware) or creating our own SAS-type identifier when backed by qemu block. Then we could just query the backend via a new command on the controlq (eg 'list target ports') and wouldn't have to worry about any protocol specific details here. Besides the controlq command, which I can certainly add, this is actually quite similar to what I had in mind (though my plan likely would not have worked because I was expecting hierarchical LUNs used uniformly). So, list target ports would return a set of LUN values to which you can send REPORT LUNS, or something like that? I suppose that if you're using real hardware as the backing storage the in-kernel target can provide that. For the QEMU backend I'd keep hierarchical LUNs, though of course one could add a FC or SAS bus to QEMU, each implementing its own identifier scheme. If I understand it correctly, it should remain possible to use a single host for both pass-through and emulated targets. Would you draft the command structure, so I can incorporate it into the spec? Of course, when doing so we would lose the ability to freely remap LUNs. But then remapping LUNs doesn't gain you much imho. Plus you could always use qemu block backend here if you want to hide the details. And you could always use the QEMU block backend with scsi-generic if you want to remap LUNs, instead of true passthrough via the kernel target. Paolo
Re: [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation
On 06/22/2011 05:30 PM, Xiao Guangrong wrote: The operations of read emulation and write emulation are very similar, so we can abstract the common operation; in a later patch, it is used to clean up the duplicated code Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c | 72 1 files changed, 72 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c29ef96..887714f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4056,6 +4056,78 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, return 1; } +struct read_write_emulator_ops { + int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val, + int bytes); + int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa, + void *val, int bytes); + int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa, + int bytes, void *val); + int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa, + void *val, int bytes); + bool write; +}; Interesting! This structure combines two unrelated operations, though. One is the internals of the iteration on a virtual address that is split to various physical addresses. The other is the interaction with userspace on mmio exits. They should be split, but I think it's fine to do it in a later patch. This series is long enough already. I was also annoyed by the duplication. The way I thought of fixing it is having gva_to_gpa() return two gpas, and having the access function accept gpa vectors. The reason was so that we can implement locked cross-page operations (which we now emulate as unlocked writes). But I think we can do without it, and instead emulate locked cross-page ops by stalling all other vcpus while we write, or by unmapping the pages involved. It isn't pretty but it doesn't need to be fast since it's a very rare operation. So I think we can go with your approach.
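The ops-table idea above — one iteration loop serving both directions, with the direction-specific work behind function pointers — can be shown with a self-contained toy. Everything here (the 16-byte "page", the byte-array "memory", the names) is a made-up miniature for illustration, not the KVM emulator:

```c
#include <string.h>

struct rw_ops {
	/* called once per physical chunk; returns 1 on success */
	int (*emulate)(unsigned char *mem, unsigned gpa, void *val, int bytes);
	int write;
};

static int do_read(unsigned char *mem, unsigned gpa, void *val, int bytes)
{
	memcpy(val, mem + gpa, bytes);
	return 1;
}

static int do_write(unsigned char *mem, unsigned gpa, void *val, int bytes)
{
	memcpy(mem + gpa, val, bytes);
	return 1;
}

const struct rw_ops read_ops  = { do_read,  0 };
const struct rw_ops write_ops = { do_write, 1 };

/* one loop serves both directions, splitting at a 16-byte "page" boundary */
int emulate_rw(const struct rw_ops *ops, unsigned char *mem,
	       unsigned gpa, void *val, int bytes)
{
	while (bytes) {
		int now = 16 - (gpa & 15);	/* bytes left in this "page" */

		if (now > bytes)
			now = bytes;
		if (!ops->emulate(mem, gpa, val, now))
			return 0;
		gpa += now;
		val = (unsigned char *)val + now;
		bytes -= now;
	}
	return 1;
}
```

The point of the structure is exactly what the review says: the caller never duplicates the page-splitting loop for reads and writes.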
Re: [PATCH v2 00/11] KVM in-guest performance monitoring
On Wed, 2011-06-29 at 10:52 +0300, Avi Kivity wrote: On 06/13/2011 04:34 PM, Avi Kivity wrote: [...] Peter, can you look at 1-3 please? Queued them, thanks! I was more or less waiting for a next iteration of the series because of those problems reported, but those three stand well on their own.
Re: RFT: virtio_net: limit xmit polling
On Tue, Jun 28, 2011 at 11:08:07AM -0500, Tom Lendacky wrote: On Sunday, June 19, 2011 05:27:00 AM Michael S. Tsirkin wrote: OK, different people seem to test different trees. In the hope to get everyone on the same page, I created several variants of this patch so they can be compared. Whoever's interested, please check out the following, and tell me how these compare: kernel: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git virtio-net-limit-xmit-polling/base - this is net-next baseline to test against virtio-net-limit-xmit-polling/v0 - fixes checks on out of capacity virtio-net-limit-xmit-polling/v1 - previous revision of the patch this does xmit,free,xmit,2*free,free virtio-net-limit-xmit-polling/v2 - new revision of the patch this does free,xmit,2*free,free Here's a summary of the results. I've also attached an ODS format spreadsheet (30 KB in size) that might be easier to analyze and also has some pinned VM results data. I broke the tests down into a local guest-to-guest scenario and a remote host-to-guest scenario. Within the local guest-to-guest scenario I ran: - TCP_RR tests using two different message sizes and four different instance counts among 1 pair of VMs and 2 pairs of VMs. - TCP_STREAM tests using four different message sizes and two different instance counts among 1 pair of VMs and 2 pairs of VMs. Within the remote host-to-guest scenario I ran: - TCP_RR tests using two different message sizes and four different instance counts to 1 VM and 4 VMs. - TCP_STREAM and TCP_MAERTS tests using four different message sizes and two different instance counts to 1 VM and 4 VMs. over a 10GbE link. roprabhu, Tom, Thanks very much for the testing. So at first glance one seems to see a significant performance gain in V0 here, and a slightly less significant one in V2, with V1 being worse than base. But I'm afraid that's not the whole story, and we'll need to work some more to know what really goes on, please see below.
Some comments on the results: I found out that V0, because of a mistake on my part, was actually almost identical to base. I pushed out virtio-net-limit-xmit-polling/v1a instead that actually does what I intended to check. However, the fact that we get such a huge distribution in the results by Tom most likely means that the noise factor is very large. From my experience, one way to get stable results is to divide the throughput by the host CPU utilization (measured by something like mpstat). Sometimes throughput doesn't increase (e.g. guest-to-host) but CPU utilization does decrease. So it's interesting. Another issue is that we are trying to improve the latency of a busy queue here. However, STREAM/MAERTS tests ignore the latency (more or less) while TCP_RR by default runs a single packet per queue. Without arguing about whether these are practically interesting workloads, these results are thus unlikely to be significantly affected by the optimization in question. What we are interested in, thus, is either TCP_RR with a -b flag (configure with --enable-burst) or multiple concurrent TCP_RRs.
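The normalization Michael suggests above is trivial to apply to the raw numbers; a minimal helper (the inputs are whatever netperf and mpstat report, and the names are purely illustrative):

```c
/*
 * Divide throughput by host CPU utilization, as suggested above, to get
 * a "work per CPU" figure that is more stable across noisy runs.
 * host_cpu_percent is the mpstat-style utilization in the range (0, 100].
 */
double normalized_tput(double mbps, double host_cpu_percent)
{
	if (host_cpu_percent <= 0.0)
		return 0.0;	/* guard against a bogus sample */
	return mbps / (host_cpu_percent / 100.0);
}
```

With this, a run that moves 1000 Mbps at 50% host CPU scores the same as one moving 2000 Mbps at 100%, which is the comparison that matters when throughput saturates but CPU cost drops.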
*** Local Guest-to-Guest *** Here's the local guest-to-guest summary for 1 VM pair doing TCP_RR with 256/256 request/response message size in transactions per second:

Instances   Base       V0         V1         V2
1           8,151.56   8,460.72   8,439.16   9,990.37
25          48,761.74  51,032.62  51,103.25  49,533.52
50          55,687.38  55,974.18  56,854.10  54,888.65
100         58,255.06  58,255.86  60,380.90  59,308.36

Here's the local guest-to-guest summary for 2 VM pairs doing TCP_RR with 256/256 request/response message size in transactions per second:

Instances   Base       V0         V1         V2
1           18,758.48  19,112.50  18,597.07  19,252.04
25          80,500.50  78,801.78  80,590.68  78,782.07
50          80,594.20  77,985.44  80,431.72  77,246.90
100         82,023.23  81,325.96  81,303.32  81,727.54

Here's the local guest-to-guest summary for 1 VM pair doing TCP_STREAM with 256, 1K, 4K and 16K message size in Mbps:

256:
Instances   Base       V0         V1         V2
1           961.78     1,115.92   794.02     740.37
4           2,498.33   2,541.82   2,441.60   2,308.26
1K:
1           3,476.61   3,522.02   2,170.86   1,395.57
4           6,344.30   7,056.57   7,275.16   7,174.09
4K:
1           9,213.57   10,647.44  9,883.42   9,007.29
4           11,070.66  11,300.37  11,001.02  12,103.72
16K:
1           12,065.94  9,437.78
Re: virtio scsi host draft specification, v3
On Wed, Jun 29, 2011 at 10:23:26AM +0200, Paolo Bonzini wrote: On 06/12/2011 09:51 AM, Michael S. Tsirkin wrote: If a device uses more than one queue it is the responsibility of the device to ensure strict request ordering. Maybe I misunderstand - how can this be the responsibility of the device if the device does not get the information about the original ordering of the requests? For example, if the driver is crazy enough to put all write requests on one queue and all barriers on another one, how is the device supposed to ensure ordering? I agree here, in fact I misread Hannes's comment as if a driver uses more than one queue it is responsibility of the driver to ensure strict request ordering. If you send requests to different queues, you know that those requests are independent. I don't think anything else is feasible in the virtio framework. Paolo Like this then? If a driver uses more than one queue it is the responsibility of the driver to ensure strict request ordering: the device does not supply any guarantees about the ordering of requests between different virtqueues.
Re: [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path
On 06/22/2011 05:31 PM, Xiao Guangrong wrote: If the page fault is caused by mmio, we can cache the mmio info; later, we do not need to walk the guest page table and can quickly know it is an mmio fault while we emulate the mmio instruction Does this work if the mmio spans two pages?
Re: [PATCH 0/5] perf support for amd guest/host-only bits v2
On Tue, 2011-06-28 at 18:10 +0200, Joerg Roedel wrote: On Fri, Jun 17, 2011 at 03:37:29PM +0200, Joerg Roedel wrote: this is the second version of the patch-set to support the AMD guest-/host only bits in the performance counter MSRs. Due to lack of time I haven't looked into emulating support for this feature on Intel or other architectures, but the other comments should be worked in. The changes to v1 include: * Rebased patches to v3.0-rc3 * Allow exclude_guest and exclude_host set at the same time * Reworked event-parse logic for the new exclude-bits * Only count guest-events per default from perf-kvm Hi Peter, Ingo, have you had a chance to look at this patch-set? Are any changes required? I would feel a lot more comfortable by having it implemented on all of x86 as well as at least one !x86 platform. Avi graciously volunteered for the Intel bits. Paulus, I hear from benh that you're also responsible for the ppc-kvm bits, could you possibly find some time to implement this feature for ppc?
Re: [PATCH V7 4/4 net-next] vhost: vhost TX zero-copy support
On Sat, May 28, 2011 at 12:34:27PM -0700, Shirley Ma wrote: Hello Michael, In order to use wait-for-completion in shutting down, it seems to me another work thread is needed to call vhost_zerocopy_add_used, Hmm, I don't see vhost_zerocopy_add_used here. it seems too much work to address a minor issue here. Do we really need it? Assuming you mean vhost_zerocopy_signal_used, here's how I would do it: add a kref and a completion, signal the completion in the kref_put callback; when the backend is set - kref_get; on cleanup, kref_put and then wait_for_completion_interruptible. Where's the need for another thread coming from? If you like, post a patch with busywait + a FIXME comment, and I can write up a patch on top. (BTW, ideally the function that does the signalling should be in core networking bits so that it's still around even if the vhost module gets removed). Right now, the approach I am using is to ignore outstanding userspace buffers during shutting down if any; the device might have DMAed some wrong data to the wire, do we really care? Thanks Shirley I think so, yes, the guest is told that memory can be reused so it might put the credit card number or whatever there :) This patch maintains the outstanding userspace buffers in the sequence they are delivered to vhost. The outstanding userspace buffers will be marked as done once the lower device's DMA has finished. This is monitored through the last-reference kfree_skb callback. Two buffer indexes are used for this purpose. The vhost passes the userspace buffer info to the lower device's skb through msg control. Since there will be some completed DMAs when entering vhost handle_tx, the worst case is that all buffers in the vq are in pending/done status, so we need to notify the guest to release DMA-done buffers first before getting any new buffers from the vq.
Signed-off-by: Shirley x...@us.ibm.com --- drivers/vhost/net.c | 46 +- drivers/vhost/vhost.c | 47 +++ drivers/vhost/vhost.h | 15 +++ 3 files changed, 107 insertions(+), 1 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 2f7c76a..e2eaba6 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -32,6 +32,11 @@ * Using this limit prevents one virtqueue from starving others. */ #define VHOST_NET_WEIGHT 0x80000 +/* MAX number of TX used buffers for outstanding zerocopy */ +#define VHOST_MAX_PEND 128 +/* change it to 256 when small message size performance issue is addressed */ +#define VHOST_GOODCOPY_LEN 2048 + enum { VHOST_NET_VQ_RX = 0, VHOST_NET_VQ_TX = 1, @@ -151,6 +156,10 @@ static void handle_tx(struct vhost_net *net) hdr_size = vq->vhost_hlen; for (;;) { + /* Release DMAs done buffers first */ + if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND) + vhost_zerocopy_signal_used(vq, false); + head = vhost_get_vq_desc(&net->dev, vq, vq->iov, ARRAY_SIZE(vq->iov), &out, &in, @@ -166,6 +175,12 @@ static void handle_tx(struct vhost_net *net) set_bit(SOCK_ASYNC_NOSPACE, &sock->flags); break; } + /* If more outstanding DMAs, queue the work */ + if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND) { + tx_poll_start(net, sock); + set_bit(SOCK_ASYNC_NOSPACE, &sock->flags); + break; + } if (unlikely(vhost_enable_notify(vq))) { vhost_disable_notify(vq); continue; @@ -188,6 +203,26 @@ static void handle_tx(struct vhost_net *net) iov_length(vq->hdr, s), hdr_size); break; } + /* use msg_control to pass vhost zerocopy ubuf info to skb */ + if (sock_flag(sock->sk, SOCK_ZEROCOPY)) { + vq->heads[vq->upend_idx].id = head; + if (len < VHOST_GOODCOPY_LEN) + /* copy don't need to wait for DMA done */ + vq->heads[vq->upend_idx].len = + VHOST_DMA_DONE_LEN; + else { + struct ubuf_info *ubuf = &vq->ubuf_info[head]; + + vq->heads[vq->upend_idx].len = len; + ubuf->callback = vhost_zerocopy_callback; + ubuf->arg = vq; + ubuf->desc = vq->upend_idx; + msg.msg_control = ubuf; + msg.msg_controllen = sizeof(ubuf); + } +
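Michael's kref+completion shutdown scheme above can be modelled in userspace C. zcopy_get/zcopy_put stand in for kref_get/kref_put, and the "done" flag stands in for the completion that cleanup would block on with wait_for_completion_interruptible; none of these names are the vhost code:

```c
#include <stdatomic.h>

/*
 * Userspace model of the suggested shutdown scheme: one reference per
 * outstanding zerocopy buffer plus a base reference taken when the
 * backend is set.  When the last reference drops, the "completion" fires.
 */
struct zcopy_state {
	atomic_int refs;	/* kref stand-in */
	atomic_int done;	/* completion stand-in */
};

void zcopy_init(struct zcopy_state *s)
{
	atomic_store(&s->refs, 1);	/* base ref: backend is set */
	atomic_store(&s->done, 0);
}

void zcopy_get(struct zcopy_state *s)	/* zerocopy DMA submitted */
{
	atomic_fetch_add(&s->refs, 1);
}

void zcopy_put(struct zcopy_state *s)	/* DMA callback, or cleanup */
{
	if (atomic_fetch_sub(&s->refs, 1) == 1)
		atomic_store(&s->done, 1);	/* complete() */
}
```

Cleanup would drop the base reference and then spin/sleep until done is set, so no buffer the guest still owns is reported reusable early.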
Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
On 06/22/2011 05:35 PM, Xiao Guangrong wrote: Use rcu to protect the shadow page tables to be freed, so we can safely walk them; it should run fast and is needed by mmio page fault static void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list) { @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, kvm_flush_remote_tlbs(kvm); + if (atomic_read(&kvm->arch.reader_counter)) { + kvm_mmu_isolate_pages(invalid_list); + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link); + list_del_init(invalid_list); + call_rcu(&sp->rcu, free_pages_rcu); + return; + } + I think we should do this unconditionally. The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts. On the other hand, zap_page is very rare, so it can be a little slower. Also, fewer code paths = easier to understand.
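The trade-off Avi describes — always deferring the free instead of branching on a shared reader counter — reduces to a simple deferred-free pattern. This toy model stands in for the kernel primitives (rcu_free() for call_rcu(), flush_deferred() for the grace period expiring); it is an illustration of the pattern, not the KVM MMU code:

```c
#include <stdlib.h>

/* Toy deferred-free list: nothing is reclaimed until the "grace period"
 * ends, so concurrent lockless walkers never see freed memory. */
struct deferred {
	void *page;
	struct deferred *next;
};

static struct deferred *pending;	/* zero-initialized at load */
static int freed;

void rcu_free(void *page)		/* call_rcu() stand-in */
{
	struct deferred *d = malloc(sizeof(*d));

	d->page = page;
	d->next = pending;
	pending = d;
}

void flush_deferred(void)		/* grace period expires */
{
	while (pending) {
		struct deferred *d = pending;

		pending = d->next;
		free(d->page);
		free(d);
		freed++;
	}
}
```

Doing this unconditionally trades a small, rare delay in reclamation for never touching the shared counter on the hot walk path.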
Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
On 06/22/2011 05:36 PM, Xiao Guangrong wrote: The idea is from Avi: | We could cache the result of a miss in an spte by using a reserved bit, and | checking the page fault error code (or seeing if we get an ept violation or | ept misconfiguration), so if we get repeated mmio on a page, we don't need to | search the slot list/tree. | (https://lkml.org/lkml/2011/2/22/221) When the page fault is caused by mmio, we cache the info in the shadow page table and also set the reserved bits in the shadow page table, so if the mmio is caused again, we can quickly identify it and emulate it directly Searching an mmio gfn in memslots is heavy since we need to walk all memslots; it can be reduced by this feature, which also avoids walking the guest page table for soft mmu. diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 1319050..e69a47a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -197,6 +197,41 @@ static u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */ static u64 __read_mostly shadow_user_mask; static u64 __read_mostly shadow_accessed_mask; static u64 __read_mostly shadow_dirty_mask; +static u64 __read_mostly shadow_mmio_mask = (0xffull << 49 | 1ULL); One bit is shifted out. And it will fail with 52-bit MAXPHYADDR. Please, in addition, set the xwr bits to an invalid pattern on EPT (there is an MSR which specifies which patterns are valid; for example execute-only or write-only are invalid). If all patterns are valid AND MAXPHYADDR == 52, then just set the mask to 0 and the optimization will be disabled.
Re: [PATCH v2 0/22] KVM: optimize for MMIO handled
On 06/22/2011 05:27 PM, Xiao Guangrong wrote: In this version, we fix the bugs in v1: - fix broken read emulation that spans a page boundary - fix getting an invalid spte pointer when we walk the shadow page table outside the mmu lock And we also introduce some rules to modify sptes in this version, so it no longer needs to atomically clear/set sptes on x86_32 hosts; the performance report for the x86_32 host is in a later section. Avi, I have sampled the operation of lockless shadow page walking with the steps below: - mark walk_shadow_page_get_mmio_spte as 'noinline' - do the netperf test; the client is on the guest (NIC is e1000) and the server is on the host, which generates heavy mmio access pressure - use perf to sample it; the result of 'perf report' is attached The ratio of walk_shadow_page_get_mmio_spte is 0.09%, the ratio of handle_ept_misconfig is 0.11%, and the ratio of handle_mmio_page_fault_common is 0.07%. I think this is acceptable; your opinion? Yes. The patchset scares me, but it is nice work! Good optimization and good cleanup.
Re: [PATCH v2 00/11] KVM in-guest performance monitoring
On 06/29/2011 11:38 AM, Peter Zijlstra wrote: Peter, can you look at 1-3 please? Queued them, thanks! I was more or less waiting for a next iteration of the series because of those problems reported, but those three stand well on their own. Thanks. I'm mired in other work but will return to investigate and fix those issues.
Re: [PATCH 0/5] perf support for amd guest/host-only bits v2
On 06/29/2011 12:02 PM, Peter Zijlstra wrote: have you had a chance to look at this patch-set? Are any changes required? I would feel a lot more comfortable by having it implemented on all of x86 as well as at least one !x86 platform. Avi graciously volunteered for the Intel bits. Silly me. Joerg, can you post the git tree publicly please?
Re: virtio scsi host draft specification, v3
On Wed, Jun 29, 2011 at 9:33 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 06/14/2011 10:39 AM, Hannes Reinecke wrote: If, however, we decide to expose some details about the backend, we could be using the values from the backend directly. EG we could be forwarding the SCSI target port identifier here (if backed by real hardware) or creating our own SAS-type identifier when backed by qemu block. Then we could just query the backend via a new command on the controlq (eg 'list target ports') and wouldn't have to worry about any protocol specific details here. Besides the controlq command, which I can certainly add, this is actually quite similar to what I had in mind (though my plan likely would not have worked because I was expecting hierarchical LUNs used uniformly). So, list target ports would return a set of LUN values to which you can send REPORT LUNS, or something like that? I think we're missing a level of addressing. We need the ability to talk to multiple target ports in order for list target ports to make sense. Right now there is one implicit target that handles all commands. That means there is one fixed I_T Nexus. If we introduce list target ports we also need a way to say This CDB is destined for target port #0. Then it is possible to enumerate target ports and address targets independently of the LUN field in the CDB. I'm pretty sure this is also how SAS and other transports work. In their framing they include the target port. The question is whether we really need to support multiple targets on a virtio-scsi adapter or not. If you are selectively mapping LUNs that the guest may access, then multiple targets are not necessary. If we want to do pass-through of the entire SCSI bus then we need multiple targets but I'm not sure if there are other challenges like dependencies on the transport (Fibre Channel, SAS, etc) which make it impossible to pass through bus-level access?
If I understand it correctly, it should remain possible to use a single host for both pass-through and emulated targets. Yes. Of course, when doing so we would lose the ability to freely remap LUNs. But then remapping LUNs doesn't gain you much imho. Plus you could always use qemu block backend here if you want to hide the details. And you could always use the QEMU block backend with scsi-generic if you want to remap LUNs, instead of true passthrough via the kernel target. IIUC the in-kernel target always does remapping. It passes through individual LUNs rather than entire targets and you pick LU Numbers to map to the backing storage (which may or may not be a SCSI pass-through device). Nicholas Bellinger can confirm whether this is correct. Stefan
Re: [PATCHv4] qemu-img: Add cache command line option
Am 20.06.2011 18:48, schrieb Federico Simoncelli: qemu-img currently writes disk images using writeback and filling up the cache buffers which are then flushed by the kernel preventing other processes from accessing the storage. This is particularly bad in cluster environments where time-based algorithms might be in place and accessing the storage within certain timeouts is critical. This patch adds the option to choose a cache method when writing disk images. Signed-off-by: Federico Simoncelli fsimo...@redhat.com Thanks, applied to the block branch. Kevin
Re: [PATCH 0/5] perf support for amd guest/host-only bits v2
On Wed, Jun 29, 2011 at 12:27:48PM +0300, Avi Kivity wrote: On 06/29/2011 12:02 PM, Peter Zijlstra wrote: have you had a chance to look at this patch-set? Are any changes required? I would feel a lot more comfortable by having it implemented on all of x86 as well as at least one !x86 platform. Avi graciously volunteered for the Intel bits. Silly me. Joerg, can you post the git tree publicly please? Okay, I pushed it to git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux-2.6-kvm.git perf-guest-counting It probably takes some time until it appears on the mirrors. Thanks, Joerg
Re: [PATCH 0/5] perf support for amd guest/host-only bits v2
On Wed, Jun 29, 2011 at 05:02:54AM -0400, Peter Zijlstra wrote: On Tue, 2011-06-28 at 18:10 +0200, Joerg Roedel wrote: On Fri, Jun 17, 2011 at 03:37:29PM +0200, Joerg Roedel wrote: this is the second version of the patch-set to support the AMD guest-/host only bits in the performance counter MSRs. Due to lack of time I haven't looked into emulating support for this feature on Intel or other architectures, but the other comments should be worked in. The changes to v1 include: * Rebased patches to v3.0-rc3 * Allow exclude_guest and exclude_host set at the same time * Reworked event-parse logic for the new exclude-bits * Only count guest-events per default from perf-kvm Hi Peter, Ingo, have you had a chance to look at this patch-set? Are any changes required? I would feel a lot more comfortable by having it implemented on all of x86 as well as at least one !x86 platform. Avi graciously volunteered for the Intel bits. Ok, since no changes are required from my side then, how about adding support for more hardware successively like it was done for perf-kvm? Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
Re: virtio scsi host draft specification, v3
On Tue, Jun 14, 2011 at 05:30:24PM +0200, Hannes Reinecke wrote: Which is exactly the problem I was referring to. When using more than one channel the request ordering _as seen by the initiator_ has to be preserved. This is quite hard to do from a device's perspective; it might be able to process the requests _in the order_ they've arrived, but it won't be able to figure out the latency of each request, ie how long it'll take the request to be delivered to the initiator. What we need to do here is to ensure that virtio will deliver the requests in-order across all virtqueues. Not sure whether it does this already. This only matters for ordered tags, or implicit or explicit HEAD OF QUEUE tags. For everything else there's no ordering requirement. Given that ordered tags don't matter in practice and we don't have to support them this just leaves HEAD OF QUEUE. I suspect the HEAD OF QUEUE semantics need to be implemented using underlying draining of all queues, which should be okay given that it's usually used in slow path commands.
Re: virtio scsi host draft specification, v3
On Sun, Jun 12, 2011 at 10:51:41AM +0300, Michael S. Tsirkin wrote: For example, if the driver is crazy enough to put all write requests on one queue and all barriers on another one, how is the device supposed to ensure ordering? There is no such thing as barriers in SCSI. The thing that comes closest is ordered tags, which neither Linux nor any mainstream OS uses, and which we don't have to (and generally don't want to) implement.
Re: virtio scsi host draft specification, v3
On Wed, Jun 29, 2011 at 10:23:26AM +0200, Paolo Bonzini wrote: I agree here, in fact I misread Hannes's comment as if a driver uses more than one queue it is responsibility of the driver to ensure strict request ordering. If you send requests to different queues, you know that those requests are independent. I don't think anything else is feasible in the virtio framework. That doesn't really fit very well with the SAM model. If we want to use multiple queues for a single LUN it has to be transparent to the SCSI command stream. Then again I don't quite see the use for that anyway.
Re: virtio scsi host draft specification, v3
On 06/29/2011 12:03 PM, Christoph Hellwig wrote: I agree here, in fact I misread Hannes's comment as if a driver uses more than one queue it is responsibility of the driver to ensure strict request ordering. If you send requests to different queues, you know that those requests are independent. I don't think anything else is feasible in the virtio framework. That doesn't really fit very well with the SAM model. If we want to use multiple queues for a single LUN it has to be transparent to the SCSI command stream. Then again I don't quite see the use for that anyway. Agreed, I see a use for multiple queues (MSI-X), but not for multiple queues shared by a single LUN. Paolo
Re: virtio scsi host draft specification, v3
On Wed, Jun 29, 2011 at 10:39:42AM +0100, Stefan Hajnoczi wrote: I think we're missing a level of addressing. We need the ability to talk to multiple target ports in order for list target ports to make sense. Right now there is one implicit target that handles all commands. That means there is one fixed I_T Nexus. If we introduce list target ports we also need a way to say This CDB is destined for target port #0. Then it is possible to enumerate target ports and address targets independently of the LUN field in the CDB. I'm pretty sure this is also how SAS and other transports work. In their framing they include the target port. Yes, exactly. Hierarchical LUNs are a nasty fringe feature that we should avoid as much as possible, that is for everything but IBM vSCSI which is braindead enough to force them. The question is whether we really need to support multiple targets on a virtio-scsi adapter or not. If you are selectively mapping LUNs that the guest may access, then multiple targets are not necessary. If we want to do pass-through of the entire SCSI bus then we need multiple targets but I'm not sure if there are other challenges like dependencies on the transport (Fibre Channel, SAS, etc) which make it impossible to pass through bus-level access? I don't think bus-level pass through is either easily possible nor desirable. What multiple targets are useful for is allowing more virtual disks than we have virtual PCI slots. We could do this by supporting multiple LUNs, but given that many SCSI resources are target-based doing multiple targets most likely is the more scalable and more logical variant. E.g. we could much more easily have one virtqueue per target than per LUN.
Re: KVM call agenda for June 28
On Wed, Jun 29, 2011 at 8:57 AM, Kevin Wolf kw...@redhat.com wrote: Am 28.06.2011 21:41, schrieb Marcelo Tosatti: stream -- 1) base -> remote 2) base -> remote -> local 3) base -> local local image is always valid. Requires backing file support. With the above, this restriction wouldn't apply any more. Also I don't think we should mix approaches. Either both block copy and image streaming use backing files, or none of them do. Mixing means duplicating more code, and even worse, that you can't stop a block copy in the middle and continue with streaming (which I believe is a really valuable feature to have). Here is how the image streaming feature is used from HMP/QMP: The guest is running from an image file with a backing file. The aim is to pull the data from the backing file and populate the image file so that the dependency on the backing file can be eliminated.

1. Start a background streaming operation: (qemu) block_stream -a ide0-hd
2. Check the status of the operation: (qemu) info block-stream Streaming device ide0-hd: Completed 512 of 34359738368 bytes
3. The status changes when the operation completes: (qemu) info block-stream No active stream

On completion the image file no longer has a backing file dependency. When streaming completes QEMU updates the image file metadata to indicate that no backing file is used. The QMP interface is similar but provides QMP events to signal streaming completion and failure. Polling to query the streaming status is only used when the management application wishes to refresh progress information. If guest execution is interrupted by a power failure or QEMU crash, then the image file is still valid but streaming may be incomplete. When QEMU is launched again the block_stream command can be issued to resume streaming. In the future we could add a 'base' argument to block_stream. If base is specified then data contained in the base image will not be copied.
This can be used to merge data from an intermediate image without merging the base image. When streaming completes the backing file will be set to the base image. The backing file relationship would typically look like this: 1. Before block_stream -a -b base.img ide0-hd completion: base.img -> sn1 -> ... -> ide0-hd.qed 2. After streaming completes: base.img -> ide0-hd.qed This describes the image streaming use cases that I, Adam, and Anthony propose to support. In the course of the discussion we've sometimes been distracted with the internals of what a unified live block copy/image streaming implementation should do. I wanted to post this summary of image streaming to refocus us on the use case and the APIs that users will see. Stefan
Re: virtio scsi host draft specification, v3
On 06/29/2011 12:07 PM, Christoph Hellwig wrote: On Wed, Jun 29, 2011 at 10:39:42AM +0100, Stefan Hajnoczi wrote: I think we're missing a level of addressing. We need the ability to talk to multiple target ports in order for list target ports to make sense. Right now there is one implicit target that handles all commands. That means there is one fixed I_T Nexus. If we introduce list target ports we also need a way to say This CDB is destined for target port #0. Then it is possible to enumerate target ports and address targets independently of the LUN field in the CDB. I'm pretty sure this is also how SAS and other transports work. In their framing they include the target port. Yes, exactly. Hierarchical LUNs are a nasty fringe feature that we should avoid as much as possible, that is for everything but IBM vSCSI which is braindead enough to force them. Yep. The question is whether we really need to support multiple targets on a virtio-scsi adapter or not. If you are selectively mapping LUNs that the guest may access, then multiple targets are not necessary. If we want to do pass-through of the entire SCSI bus then we need multiple targets but I'm not sure if there are other challenges like dependencies on the transport (Fibre Channel, SAS, etc) which make it impossible to pass through bus-level access? I don't think bus-level pass through is either easily possible nor desirable. What multiple targets are useful for is allowing more virtual disks than we have virtual PCI slots. We could do this by supporting multiple LUNs, but given that many SCSI resources are target-based doing multiple targets most likely is the more scalable and more logical variant. E.g. we could much more easily have one virtqueue per target than per LUN. The general idea here is that we can support NPIV. With NPIV we'll have several scsi_hosts, each of which is assigned a different set of LUNs by the array.
With virtio we need to be able to react on LUN remapping on the array side, ie we need to be able to issue a 'REPORT LUNS' command and add/remove LUNs on the fly. This means we have to expose the scsi_host in some way via virtio. This is impossible with a one-to-one mapping between targets and LUNs. The actual bus-level pass-through will be just on the SCSI layer, ie 'REPORT LUNS' should be possible. If and how we do a LUN remapping internally on the host is a totally different matter. Same goes for the transport details; I doubt we will expose all the dingy details of the various transports, but rather restrict ourselves to an abstract transport. Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
[PATCH kvm-unit-tests v2] access: check SMEP on prefetch pte path
This patch adds SMEP to all test cases and checks SMEP on prefetch pte path when cr0.wp=0. changes since v1: Add SMEP to all test cases and verify it before setting cr4 Signed-off-by: Yang, Wei wei.y.y...@intel.com Signed-off-by: Shan, Haitao haitao.s...@intel.com Signed-off-by: Li, Xin xin...@intel.com --- x86/access.c | 108 ++-- x86/cstart64.S |1 + 2 files changed, 106 insertions(+), 3 deletions(-) diff --git a/x86/access.c b/x86/access.c index 7c8b9a5..22e5988 100644 --- a/x86/access.c +++ b/x86/access.c @@ -27,6 +27,7 @@ typedef unsigned long pt_element_t; #define PT_NX_MASK ((pt_element_t)1 63) #define CR0_WP_MASK (1UL 16) +#define CR4_SMEP_MASK (1UL 20) #define PFERR_PRESENT_MASK (1U 0) #define PFERR_WRITE_MASK (1U 1) @@ -70,6 +71,7 @@ enum { AC_CPU_EFER_NX, AC_CPU_CR0_WP, +AC_CPU_CR4_SMEP, NR_AC_FLAGS }; @@ -96,6 +98,7 @@ const char *ac_names[] = { [AC_ACCESS_TWICE] = twice, [AC_CPU_EFER_NX] = efer.nx, [AC_CPU_CR0_WP] = cr0.wp, +[AC_CPU_CR4_SMEP] = cr4.smep, }; static inline void *va(pt_element_t phys) @@ -130,6 +133,14 @@ typedef struct { static void ac_test_show(ac_test_t *at); +int write_cr4_checking(unsigned long val) +{ +asm volatile(ASM_TRY(1f) +mov %0,%%cr4\n\t +1:: : r (val)); +return exception_vector(); +} + void set_cr0_wp(int wp) { unsigned long cr0 = read_cr0(); @@ -140,6 +151,16 @@ void set_cr0_wp(int wp) write_cr0(cr0); } +void set_cr4_smep(int smep) +{ +unsigned long cr4 = read_cr4(); + +cr4 = ~CR4_SMEP_MASK; +if (smep) + cr4 |= CR4_SMEP_MASK; +write_cr4(cr4); +} + void set_efer_nx(int nx) { unsigned long long efer; @@ -187,7 +208,12 @@ int ac_test_bump_one(ac_test_t *at) _Bool ac_test_legal(ac_test_t *at) { -if (at-flags[AC_ACCESS_FETCH] at-flags[AC_ACCESS_WRITE]) +/* + * Since we convert current page to kernel page when cr4.smep=1, + * we can't switch to user mode. 
+ */ +if ((at-flags[AC_ACCESS_FETCH] at-flags[AC_ACCESS_WRITE]) || +(at-flags[AC_ACCESS_USER] at-flags[AC_CPU_CR4_SMEP])) return false; return true; } @@ -287,6 +313,9 @@ void ac_set_expected_status(ac_test_t *at) if (at-flags[AC_PDE_PSE]) { if (at-flags[AC_ACCESS_WRITE] !at-expected_fault) at-expected_pde |= PT_DIRTY_MASK; + if (at-flags[AC_ACCESS_FETCH] at-flags[AC_PDE_USER] +at-flags[AC_CPU_CR4_SMEP]) + at-expected_fault = 1; goto no_pte; } @@ -306,7 +335,11 @@ void ac_set_expected_status(ac_test_t *at) (at-flags[AC_CPU_CR0_WP] || at-flags[AC_ACCESS_USER])) at-expected_fault = 1; -if (at-flags[AC_ACCESS_FETCH] at-flags[AC_PTE_NX]) +if (at-flags[AC_ACCESS_FETCH] +(at-flags[AC_PTE_NX] + || (at-flags[AC_CPU_CR4_SMEP] +at-flags[AC_PDE_USER] +at-flags[AC_PTE_USER]))) at-expected_fault = 1; if (at-expected_fault) @@ -320,7 +353,7 @@ no_pte: fault: if (!at-expected_fault) at-ignore_pde = 0; -if (!at-flags[AC_CPU_EFER_NX]) +if (!at-flags[AC_CPU_EFER_NX] !at-flags[AC_CPU_CR4_SMEP]) at-expected_error = ~PFERR_FETCH_MASK; } @@ -469,6 +502,14 @@ int ac_test_do_access(ac_test_t *at) unsigned r = unique; set_cr0_wp(at-flags[AC_CPU_CR0_WP]); set_efer_nx(at-flags[AC_CPU_EFER_NX]); +if (at-flags[AC_CPU_CR4_SMEP] !(cpuid(7).b (1 7))) { + unsigned long cr4 = read_cr4(); + if (write_cr4_checking(cr4 | CR4_SMEP_MASK) == GP_VECTOR) + goto done; + printf(Set SMEP in CR4 - expect #GP: FAIL!\n); + return 0; +} +set_cr4_smep(at-flags[AC_CPU_CR4_SMEP]); if (at-flags[AC_ACCESS_TWICE]) { asm volatile ( @@ -544,6 +585,7 @@ int ac_test_do_access(ac_test_t *at) !pt_match(*at-pdep, at-expected_pde, at-ignore_pde), pde %x expected %x, *at-pdep, at-expected_pde); +done: if (success verbose) { printf(PASS\n); } @@ -645,6 +687,59 @@ err: return 0; } +static int check_smep_on_prefetch_pte(ac_pool_t *pool) +{ + ac_test_t at1; + int err_prepare_notwp, err_smep_notwp; + extern u64 ptl2[]; + + ac_test_init(at1, (void *)(0x123406001000)); + + at1.flags[AC_PDE_PRESENT] = 1; + at1.flags[AC_PTE_PRESENT] = 
1; + at1.flags[AC_PDE_USER] = 1; + at1.flags[AC_PTE_USER] = 1; + at1.flags[AC_PDE_ACCESSED] = 1; + at1.flags[AC_PTE_ACCESSED] = 1; + at1.flags[AC_CPU_CR4_SMEP] = 1; + at1.flags[AC_CPU_CR0_WP] = 0; + at1.flags[AC_ACCESS_WRITE] = 1; + ac_test_setup_pte(at1, pool); + ptl2[2] -= 0x4; + + /* +* Here we write the ro user page when +* cr0.wp=0, then we execute it and SMEP +* fault should happen. +*/ + err_prepare_notwp
Re: virtio scsi host draft specification, v3
On Wed, Jun 29, 2011 at 12:23:38PM +0200, Hannes Reinecke wrote: The general idea here is that we can support NPIV. With NPIV we'll have several scsi_hosts, each of which is assigned a different set of LUNs by the array. With virtio we need to be able to react on LUN remapping on the array side, ie we need to be able to issue a 'REPORT LUNS' command and add/remove LUNs on the fly. This means we have to expose the scsi_host in some way via virtio. This is impossible with a one-to-one mapping between targets and LUNs. The actual bus-level pass-through will be just on the SCSI layer, ie 'REPORT LUNS' should be possible. If and how we do a LUN remapping internally on the host is a totally different matter. Same goes for the transport details; I doubt we will expose all the dingy details of the various transports, but rather restrict ourselves to an abstract transport. If we want to support traditional NPIV that's what we have to do. I still hope we'll see broad SR-IOV support for FC adapters soon, which would ease all this greatly.
Re: virtio scsi host draft specification, v3
On Wed, Jun 29, 2011 at 12:06:29PM +0200, Paolo Bonzini wrote: On 06/29/2011 12:03 PM, Christoph Hellwig wrote: I agree here, in fact I misread Hannes's comment as if a driver uses more than one queue it is responsibility of the driver to ensure strict request ordering. If you send requests to different queues, you know that those requests are independent. I don't think anything else is feasible in the virtio framework. That doesn't really fit very well with the SAM model. If we want to use multiple queues for a single LUN it has to be transparent to the SCSI command stream. Then again I don't quite see the use for that anyway. Agreed, I see a use for multiple queues (MSI-X), but not for multiple queues shared by a single LUN. Paolo Then let's make it explicit in the spec? -- MST
Re: virtio scsi host draft specification, v3
On 06/29/2011 12:31 PM, Michael S. Tsirkin wrote: On Wed, Jun 29, 2011 at 12:06:29PM +0200, Paolo Bonzini wrote: On 06/29/2011 12:03 PM, Christoph Hellwig wrote: I agree here, in fact I misread Hannes's comment as if a driver uses more than one queue it is responsibility of the driver to ensure strict request ordering. If you send requests to different queues, you know that those requests are independent. I don't think anything else is feasible in the virtio framework. That doesn't really fit very well with the SAM model. If we want to use multiple queues for a single LUN it has to be transparent to the SCSI command stream. Then again I don't quite see the use for that anyway. Agreed, I see a use for multiple queues (MSI-X), but not for multiple queues shared by a single LUN. Then let's make it explicit in the spec? What, forbid it or say ordering is not guaranteed? The latter is already explicit with the wording suggested in the thread. Paolo
[PATCH 06/17] KVM: PPC: Pass init/destroy vm and prepare/commit memory region ops down
This arranges for the top-level arch/powerpc/kvm/powerpc.c file to pass down some of the calls it gets to the lower-level subarchitecture specific code. The lower-level implementations (in booke.c and book3s.c) are no-ops. The coming book3s_hv.c will need this. Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/kvm_ppc.h |7 +++ arch/powerpc/kvm/book3s_pr.c | 20 arch/powerpc/kvm/booke.c | 20 arch/powerpc/kvm/powerpc.c |9 ++--- 4 files changed, 53 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index c662f14..9b6f3f9 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -111,6 +111,13 @@ extern void kvmppc_core_destroy_mmu(struct kvm_vcpu *vcpu); extern int kvmppc_kvm_pv(struct kvm_vcpu *vcpu); extern void kvmppc_map_magic(struct kvm_vcpu *vcpu); +extern int kvmppc_core_init_vm(struct kvm *kvm); +extern void kvmppc_core_destroy_vm(struct kvm *kvm); +extern int kvmppc_core_prepare_memory_region(struct kvm *kvm, + struct kvm_userspace_memory_region *mem); +extern void kvmppc_core_commit_memory_region(struct kvm *kvm, + struct kvm_userspace_memory_region *mem); + /* * Cuts out inst bits with ordering according to spec. * That means the leftmost bit is zero. All given bits are included. 
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index fcdc97e..72b20b8 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -984,6 +984,26 @@ int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) return ret; } +int kvmppc_core_prepare_memory_region(struct kvm *kvm, + struct kvm_userspace_memory_region *mem) +{ + return 0; +} + +void kvmppc_core_commit_memory_region(struct kvm *kvm, + struct kvm_userspace_memory_region *mem) +{ +} + +int kvmppc_core_init_vm(struct kvm *kvm) +{ + return 0; +} + +void kvmppc_core_destroy_vm(struct kvm *kvm) +{ +} + static int kvmppc_book3s_init(void) { int r; diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 9f2e4a5..9066325 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -865,6 +865,26 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) return -ENOTSUPP; } +int kvmppc_core_prepare_memory_region(struct kvm *kvm, + struct kvm_userspace_memory_region *mem) +{ + return 0; +} + +void kvmppc_core_commit_memory_region(struct kvm *kvm, + struct kvm_userspace_memory_region *mem) +{ +} + +int kvmppc_core_init_vm(struct kvm *kvm) +{ + return 0; +} + +void kvmppc_core_destroy_vm(struct kvm *kvm) +{ +} + int __init kvmppc_booke_init(void) { unsigned long ivor[16]; diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 24e2b64..0c80e15 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -148,7 +148,7 @@ void kvm_arch_check_processor_compat(void *rtn) int kvm_arch_init_vm(struct kvm *kvm) { - return 0; + return kvmppc_core_init_vm(kvm); } void kvm_arch_destroy_vm(struct kvm *kvm) @@ -164,6 +164,9 @@ void kvm_arch_destroy_vm(struct kvm *kvm) kvm-vcpus[i] = NULL; atomic_set(kvm-online_vcpus, 0); + + kvmppc_core_destroy_vm(kvm); + mutex_unlock(kvm-lock); } @@ -212,7 +215,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, struct kvm_userspace_memory_region *mem, int 
user_alloc) { - return 0; + return kvmppc_core_prepare_memory_region(kvm, mem); } void kvm_arch_commit_memory_region(struct kvm *kvm, @@ -220,7 +223,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, struct kvm_memory_slot old, int user_alloc) { - return; + kvmppc_core_commit_memory_region(kvm, mem); } -- 1.7.5.4
[PATCH 04/17] powerpc, KVM: Rework KVM checks in first-level interrupt handlers
Instead of branching out-of-line with the DO_KVM macro to check if we are in a KVM guest at the time of an interrupt, this moves the KVM check inline in the first-level interrupt handlers. This speeds up the non-KVM case and makes sure that none of the interrupt handlers are missing the check. Because the first-level interrupt handlers are now larger, some things had to be move out of line in exceptions-64s.S. This all necessitated some minor changes to the interrupt entry code in KVM. This also streamlines the book3s_32 KVM test. Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/exception-64s.h | 121 -- arch/powerpc/kernel/exceptions-64s.S | 189 +--- arch/powerpc/kvm/book3s_rmhandlers.S | 78 ++-- arch/powerpc/kvm/book3s_segment.S |7 + arch/powerpc/platforms/iseries/exception.S |2 +- arch/powerpc/platforms/iseries/exception.h |4 +- 6 files changed, 247 insertions(+), 154 deletions(-) diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h index f5dfe34..b6a3a44 100644 --- a/arch/powerpc/include/asm/exception-64s.h +++ b/arch/powerpc/include/asm/exception-64s.h @@ -61,19 +61,22 @@ #define EXC_HV H #define EXC_STD -#define EXCEPTION_PROLOG_1(area) \ +#define __EXCEPTION_PROLOG_1(area, extra, vec) \ GET_PACA(r13); \ std r9,area+EX_R9(r13); /* save r9 - r12 */ \ std r10,area+EX_R10(r13); \ - std r11,area+EX_R11(r13); \ - std r12,area+EX_R12(r13); \ BEGIN_FTR_SECTION_NESTED(66); \ mfspr r10,SPRN_CFAR; \ std r10,area+EX_CFAR(r13); \ END_FTR_SECTION_NESTED(CPU_FTR_CFAR, CPU_FTR_CFAR, 66); \ - GET_SCRATCH0(r9); \ - std r9,area+EX_R13(r13);\ - mfcrr9 + mfcrr9; \ + extra(vec); \ + std r11,area+EX_R11(r13); \ + std r12,area+EX_R12(r13); \ + GET_SCRATCH0(r10); \ + std r10,area+EX_R13(r13) +#define EXCEPTION_PROLOG_1(area, extra, vec) \ + __EXCEPTION_PROLOG_1(area, extra, vec) #define __EXCEPTION_PROLOG_PSERIES_1(label, h) \ ld r12,PACAKBASE(r13); /* get high part of label */ \ @@ -85,13 +88,54 @@ mtspr 
SPRN_##h##SRR1,r10; \ h##rfid;\ b . /* prevent speculative execution */ -#define EXCEPTION_PROLOG_PSERIES_1(label, h) \ +#define EXCEPTION_PROLOG_PSERIES_1(label, h) \ __EXCEPTION_PROLOG_PSERIES_1(label, h) -#define EXCEPTION_PROLOG_PSERIES(area, label, h) \ - EXCEPTION_PROLOG_1(area); \ +#define EXCEPTION_PROLOG_PSERIES(area, label, h, extra, vec) \ + EXCEPTION_PROLOG_1(area, extra, vec); \ EXCEPTION_PROLOG_PSERIES_1(label, h); +#define __KVMTEST(n) \ + lbz r10,PACA_KVM_SVCPU+SVCPU_IN_GUEST(r13); \ + cmpwi r10,0; \ + bne do_kvm_##n + +#define __KVM_HANDLER(area, h, n) \ +do_kvm_##n:\ + ld r10,area+EX_R10(r13); \ + stw r9,PACA_KVM_SVCPU+SVCPU_SCRATCH1(r13); \ + ld r9,area+EX_R9(r13); \ + std r12,PACA_KVM_SVCPU+SVCPU_SCRATCH0(r13); \ + li r12,n; \ + b kvmppc_interrupt + +#define __KVM_HANDLER_SKIP(area, h, n) \ +do_kvm_##n:\ + cmpwi r10,KVM_GUEST_MODE_SKIP;\ + ld r10,area+EX_R10(r13); \ + beq 89f;\ + stw r9,PACA_KVM_SVCPU+SVCPU_SCRATCH1(r13); \ + ld r9,area+EX_R9(r13);
[PATCH 07/17] KVM: PPC: Move guest enter/exit down into subarch-specific code
Instead of doing the kvm_guest_enter/exit() and local_irq_dis/enable() calls in powerpc.c, this moves them down into the subarch-specific book3s_pr.c and booke.c. This eliminates an extra local_irq_enable() call in book3s_pr.c, and will be needed for when we do SMT4 guest support in the book3s hypervisor mode code. Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/kvm_ppc.h |1 + arch/powerpc/kvm/book3s_interrupts.S |2 +- arch/powerpc/kvm/book3s_pr.c | 12 ++-- arch/powerpc/kvm/booke.c | 13 + arch/powerpc/kvm/powerpc.c |6 +- 5 files changed, 22 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 9b6f3f9..48b7ab7 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -42,6 +42,7 @@ enum emulation_result { EMULATE_AGAIN,/* something went wrong. go again */ }; +extern int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu); extern int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu); extern char kvmppc_handlers_start[]; extern unsigned long kvmppc_handler_len; diff --git a/arch/powerpc/kvm/book3s_interrupts.S b/arch/powerpc/kvm/book3s_interrupts.S index 2f0bc92..8c5e0e1 100644 --- a/arch/powerpc/kvm/book3s_interrupts.S +++ b/arch/powerpc/kvm/book3s_interrupts.S @@ -85,7 +85,7 @@ * r3: kvm_run pointer * r4: vcpu pointer */ -_GLOBAL(__kvmppc_vcpu_entry) +_GLOBAL(__kvmppc_vcpu_run) kvm_start_entry: /* Write correct stack frame */ diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index 72b20b8..0c0d3f2 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -891,8 +891,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) vfree(vcpu_book3s); } -extern int __kvmppc_vcpu_entry(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu); -int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) +int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) { int 
ret; double fpr[32][TS_FPRWIDTH]; @@ -944,14 +943,15 @@ int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) /* Remember the MSR with disabled extensions */ ext_msr = current-thread.regs-msr; - /* XXX we get called with irq disabled - change that! */ - local_irq_enable(); - /* Preload FPU if it's enabled */ if (vcpu-arch.shared-msr MSR_FP) kvmppc_handle_ext(vcpu, BOOK3S_INTERRUPT_FP_UNAVAIL, MSR_FP); - ret = __kvmppc_vcpu_entry(kvm_run, vcpu); + kvm_guest_enter(); + + ret = __kvmppc_vcpu_run(kvm_run, vcpu); + + kvm_guest_exit(); local_irq_disable(); diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 9066325..ee45fa0 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -312,6 +312,19 @@ void kvmppc_core_deliver_interrupts(struct kvm_vcpu *vcpu) vcpu-arch.shared-int_pending = 0; } +int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) +{ + int ret; + + local_irq_disable(); + kvm_guest_enter(); + ret = __kvmppc_vcpu_run(kvm_run, vcpu); + kvm_guest_exit(); + local_irq_enable(); + + return ret; +} + /** * kvmppc_handle_exit * diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 0c80e15..026036e 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -500,11 +500,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) kvmppc_core_deliver_interrupts(vcpu); - local_irq_disable(); - kvm_guest_enter(); - r = __kvmppc_vcpu_run(run, vcpu); - kvm_guest_exit(); - local_irq_enable(); + r = kvmppc_vcpu_run(run, vcpu); if (vcpu-sigset_active) sigprocmask(SIG_SETMASK, sigsaved, NULL); -- 1.7.5.4
[PATCH 05/17] KVM: PPC: Deliver program interrupts right away instead of queueing them
Doing so means that we don't have to save the flags anywhere and gets rid of the last reference to to_book3s(vcpu) in arch/powerpc/kvm/book3s.c. Doing so is OK because a program interrupt won't be generated at the same time as any other synchronous interrupt. If a program interrupt and an asynchronous interrupt (external or decrementer) are generated at the same time, the program interrupt will be delivered, which is correct because it has a higher priority, and then the asynchronous interrupt will be masked. We don't ever generate system reset or machine check interrupts to the guest, but if we did, then we would need to make sure they got delivered rather than the program interrupt. The current code would be wrong in this situation anyway since it would deliver the program interrupt as well as the reset/machine check interrupt.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 163e3e1..f68a34d 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -129,8 +129,8 @@ void kvmppc_book3s_queue_irqprio(struct kvm_vcpu *vcpu, unsigned int vec)
 
 void kvmppc_core_queue_program(struct kvm_vcpu *vcpu, ulong flags)
 {
-	to_book3s(vcpu)->prog_flags = flags;
-	kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_PROGRAM);
+	/* might as well deliver this straight away */
+	kvmppc_inject_interrupt(vcpu, BOOK3S_INTERRUPT_PROGRAM, flags);
 }
 
 void kvmppc_core_queue_dec(struct kvm_vcpu *vcpu)
@@ -170,7 +170,6 @@ int kvmppc_book3s_irqprio_deliver(struct kvm_vcpu *vcpu, unsigned int priority)
 {
 	int deliver = 1;
 	int vec = 0;
-	ulong flags = 0ULL;
 	bool crit = kvmppc_critical_section(vcpu);
 
 	switch (priority) {
@@ -206,7 +205,6 @@ int kvmppc_book3s_irqprio_deliver(struct kvm_vcpu *vcpu, unsigned int priority)
 		break;
 	case BOOK3S_IRQPRIO_PROGRAM:
 		vec = BOOK3S_INTERRUPT_PROGRAM;
-		flags = to_book3s(vcpu)->prog_flags;
 		break;
 	case BOOK3S_IRQPRIO_VSX:
 		vec = BOOK3S_INTERRUPT_VSX;
@@ -237,7 +235,7 @@ int kvmppc_book3s_irqprio_deliver(struct kvm_vcpu *vcpu, unsigned int priority)
 #endif
 
 	if (deliver)
-		kvmppc_inject_interrupt(vcpu, vec, flags);
+		kvmppc_inject_interrupt(vcpu, vec, 0);
 
 	return deliver;
 }
-- 
1.7.5.4
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/17] KVM: PPC: Handle some PAPR hcalls in the kernel
This adds the infrastructure for handling PAPR hcalls in the kernel, either early in the guest exit path while we are still in real mode, or later once the MMU has been turned back on and we are in the full kernel context. The advantage of handling hcalls in real mode if possible is that we avoid two partition switches -- and this will become more important when we support SMT4 guests, since a partition switch means we have to pull all of the threads in the core out of the guest. The disadvantage is that we can only access the kernel linear mapping, not anything vmalloced or ioremapped, since the MMU is off.

This also adds code to handle the following hcalls in real mode:

H_ENTER        Add an HPTE to the hashed page table
H_REMOVE       Remove an HPTE from the hashed page table
H_READ         Read HPTEs from the hashed page table
H_PROTECT      Change the protection bits in an HPTE
H_BULK_REMOVE  Remove up to 4 HPTEs from the hashed page table
H_SET_DABR     Set the data address breakpoint register

Plus code to handle the following hcalls in the kernel:

H_CEDE          Idle the vcpu until an interrupt or H_PROD hcall arrives
H_PROD          Wake up a ceded vcpu
H_REGISTER_VPA  Register a virtual processor area (VPA)

The code that runs in real mode has to be in the base kernel, not in the module, if KVM is compiled as a module. The real-mode code can only access the kernel linear mapping, not vmalloc or ioremap space.
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/hvcall.h       |    5 +
 arch/powerpc/include/asm/kvm_host.h     |   11 +
 arch/powerpc/include/asm/kvm_ppc.h      |    1 +
 arch/powerpc/kernel/asm-offsets.c       |    2 +
 arch/powerpc/kvm/Makefile               |    8 +-
 arch/powerpc/kvm/book3s_hv.c            |  170 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c     |  368 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  158 +-
 arch/powerpc/kvm/powerpc.c              |    2 +-
 9 files changed, 718 insertions(+), 7 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_rm_mmu.c

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index fd8201d..1c324ff 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -29,6 +29,10 @@
 #define H_LONG_BUSY_ORDER_100_SEC	9905  /* Long busy, hint that 100sec \
 						 is a good time to retry */
 #define H_LONG_BUSY_END_RANGE		9905  /* End of long busy range */
+
+/* Internal value used in book3s_hv kvm support; not returned to guests */
+#define H_TOO_HARD
+
 #define H_HARDWARE	-1	/* Hardware error */
 #define H_FUNCTION	-2	/* Function not supported */
 #define H_PRIVILEGE	-3	/* Caller not privileged */
@@ -100,6 +104,7 @@
 #define H_PAGE_SET_ACTIVE	H_PAGE_STATE_CHANGE
 #define H_AVPN			(1UL << (63-32))  /* An avpn is provided as a sanity test */
 #define H_ANDCOND		(1UL << (63-33))
+#define H_LOCAL			(1UL << (63-35))
 #define H_ICACHE_INVALIDATE	(1UL << (63-40))  /* icbi, etc.  (ignored for IO pages) */
 #define H_ICACHE_SYNCHRONIZE	(1UL << (63-41))  /* dcbst, icbi, etc (ignored for IO pages) */
 #define H_COALESCE_CAND		(1UL << (63-42))  /* page is a good candidate for coalescing */

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 4a3f790..6ebf172 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -59,6 +59,10 @@
 struct kvm;
 struct kvm_run;
 struct kvm_vcpu;
+struct lppaca;
+struct slb_shadow;
+struct dtl;
+
 struct kvm_vm_stat {
 	u32 remote_tlb_flush;
 };
@@ -344,7 +348,14 @@ struct kvm_vcpu_arch {
 	u64 dec_expires;
 	unsigned long pending_exceptions;
 	u16 last_cpu;
+	u8 ceded;
+	u8 prodded;
 	u32 last_inst;
+
+	struct lppaca *vpa;
+	struct slb_shadow *slb_shadow;
+	struct dtl *dtl;
+	struct dtl *dtl_end;
 	int trap;
 	struct kvm_vcpu_arch_shared *shared;
 	unsigned long magic_page_pa; /* phys addr to map the magic page to */

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0dafd53..2afe92e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -118,6 +118,7 @@
 extern long kvmppc_prepare_vrma(struct kvm *kvm,
 				struct kvm_userspace_memory_region *mem);
 extern void kvmppc_map_vrma(struct kvm *kvm,
 			    struct kvm_userspace_memory_region *mem);
+extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern int kvmppc_core_init_vm(struct kvm *kvm);
 extern void kvmppc_core_destroy_vm(struct kvm *kvm);
 extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,

diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 9362674..c70d106
[PATCH 01/17] KVM: PPC: Fix machine checks on 32-bit Book3S
Commit 69acc0d3ba (KVM: PPC: Resolve real-mode handlers through function exports) resulted in vcpu->arch.trampoline_lowmem and vcpu->arch.trampoline_enter ending up with kernel virtual addresses rather than physical addresses. This is OK on 64-bit Book3S machines, which ignore the top 4 bits of the effective address in real mode, but on 32-bit Book3S machines, accessing these addresses in real mode causes machine check interrupts, as the hardware uses the whole effective address as the physical address in real mode. This fixes the problem by using __pa() to convert these addresses to physical addresses.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 73fdab8..83500fb 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -28,6 +28,7 @@
 #include <asm/kvm_ppc.h>
 #include <asm/kvm_book3s.h>
 #include <asm/mmu_context.h>
+#include <asm/page.h>
 #include <linux/gfp.h>
 #include <linux/sched.h>
 #include <linux/vmalloc.h>
@@ -1342,8 +1343,8 @@
 	vcpu_book3s->slb_nr = 64;
 
 	/* remember where some real-mode handlers are */
-	vcpu->arch.trampoline_lowmem = (ulong)kvmppc_handler_lowmem_trampoline;
-	vcpu->arch.trampoline_enter = (ulong)kvmppc_handler_trampoline_enter;
+	vcpu->arch.trampoline_lowmem = __pa(kvmppc_handler_lowmem_trampoline);
+	vcpu->arch.trampoline_enter = __pa(kvmppc_handler_trampoline_enter);
 	vcpu->arch.highmem_handler = (ulong)kvmppc_handler_highmem;
 #ifdef CONFIG_PPC_BOOK3S_64
 	vcpu->arch.rmcall = *(ulong*)kvmppc_rmcall;
-- 
1.7.5.4
[PATCH 15/17] powerpc, KVM: Split HVMODE_206 cpu feature bit into separate HV and architecture bits
This replaces the single CPU_FTR_HVMODE_206 bit with two bits, one to indicate that we have a usable hypervisor mode, and another to indicate that the processor conforms to PowerISA version 2.06. We also add another bit to indicate that the processor conforms to ISA version 2.01 and set that for PPC970 and derivatives. Some PPC970 chips (specifically those in Apple machines) have a hypervisor mode in that MSR[HV] is always 1, but the hypervisor mode is not useful in the sense that there is no way to run any code in supervisor mode (HV=0 PR=0). On these processors, the LPES0 and LPES1 bits in HID4 are always 0, and we use that as a way of detecting that hypervisor mode is not useful. Where we have a feature section in assembly code around code that only applies on POWER7 in hypervisor mode, we use a construct like END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) The definition of END_FTR_SECTION_IFSET is such that the code will be enabled (not overwritten with nops) only if all bits in the provided mask are set. Note that the CPU feature check in __tlbie() only needs to check the ARCH_206 bit, not the HVMODE bit, because __tlbie() can only get called if we are running bare-metal, i.e. in hypervisor mode. 
Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/cputable.h| 14 -- arch/powerpc/include/asm/reg.h | 16 arch/powerpc/kernel/cpu_setup_power7.S |4 ++-- arch/powerpc/kernel/cpu_setup_ppc970.S | 26 ++ arch/powerpc/kernel/exceptions-64s.S |4 ++-- arch/powerpc/kernel/paca.c |2 +- arch/powerpc/kvm/book3s_64_mmu_hv.c|3 ++- arch/powerpc/kvm/book3s_hv.c |3 ++- arch/powerpc/kvm/book3s_hv_builtin.c |4 ++-- arch/powerpc/kvm/book3s_segment.S |2 +- arch/powerpc/mm/hash_native_64.c |4 ++-- 11 files changed, 56 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h index c0d842c..e30442c 100644 --- a/arch/powerpc/include/asm/cputable.h +++ b/arch/powerpc/include/asm/cputable.h @@ -179,8 +179,9 @@ extern const char *powerpc_base_platform; #define LONG_ASM_CONST(x) 0 #endif - -#define CPU_FTR_HVMODE_206 LONG_ASM_CONST(0x0008) +#define CPU_FTR_HVMODE LONG_ASM_CONST(0x0002) +#define CPU_FTR_ARCH_201 LONG_ASM_CONST(0x0004) +#define CPU_FTR_ARCH_206 LONG_ASM_CONST(0x0008) #define CPU_FTR_CFAR LONG_ASM_CONST(0x0010) #define CPU_FTR_IABR LONG_ASM_CONST(0x0020) #define CPU_FTR_MMCRA LONG_ASM_CONST(0x0040) @@ -401,9 +402,10 @@ extern const char *powerpc_base_platform; CPU_FTR_MMCRA | CPU_FTR_CP_USE_DCBTZ | \ CPU_FTR_STCX_CHECKS_ADDRESS) #define CPU_FTRS_PPC970(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ - CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ + CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_201 | \ CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA | \ - CPU_FTR_CP_USE_DCBTZ | CPU_FTR_STCX_CHECKS_ADDRESS) + CPU_FTR_CP_USE_DCBTZ | CPU_FTR_STCX_CHECKS_ADDRESS | \ + CPU_FTR_HVMODE) #define CPU_FTRS_POWER5(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ CPU_FTR_MMCRA | CPU_FTR_SMT | \ @@ -417,13 +419,13 @@ extern const char *powerpc_base_platform; CPU_FTR_DSCR | CPU_FTR_UNALIGNED_LD_STD | \ CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_CFAR) #define CPU_FTRS_POWER7 
(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ - CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_HVMODE_206 |\ + CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_206 |\ CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT | \ CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \ - CPU_FTR_ICSWX | CPU_FTR_CFAR) + CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE) #define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \ diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 20a053c..ddbe57a 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -307,6 +307,7 @@ #define SPRN_HASH1 0x3D2 /* Primary Hash Address Register */ #define SPRN_HASH2 0x3D3 /* Secondary Hash Address Resgister */ #define SPRN_HID0 0x3F0 /* Hardware Implementation Register 0 */ +#define HID0_HDICE_SH (63 - 23) /*
[RFC PATCH 17/17] KVM: PPC: Add an ioctl for userspace to select which platform to emulate
This new ioctl allows userspace to specify what paravirtualization interface (if any) KVM should implement, what architecture version the guest virtual processors should conform to, and whether the guest can be permitted to use a real supervisor mode. At present the only effect of the ioctl is to indicate whether the requested emulation is available, but in future it may be used to select between different emulation techniques (book3s_pr vs. book3s_hv) or set the CPU compatibility mode for the guest. If book3s_pr KVM is enabled in the kernel config, then this new ioctl accepts platform values of KVM_PPC_PV_NONE and KVM_PPC_PV_KVM, but not KVM_PPC_PV_SPAPR. If book3s_hv KVM is enabled, then this ioctl requires that the platform is KVM_PPC_PV_SPAPR and the guest_arch field contains one of 201 or 206 (for architecture versions 2.01 and 2.06) -- when running on a PPC970, it must contain 201, and when running on a POWER7, it must contain 206. Signed-off-by: Paul Mackerras pau...@samba.org --- Documentation/virtual/kvm/api.txt | 35 +++ arch/powerpc/include/asm/kvm.h | 15 +++ arch/powerpc/include/asm/kvm_host.h |1 + arch/powerpc/kvm/powerpc.c | 28 include/linux/kvm.h |1 + 5 files changed, 80 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index b0e4b9c..3ab012c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1430,6 +1430,41 @@ is supported; 2 if the processor requires all virtual machines to have an RMA, or 1 if the processor can use an RMA but doesn't require it, because it supports the Virtual RMA (VRMA) facility. +4.64 KVM_PPC_SET_PLATFORM + +Capability: none +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_ppc_set_platform (in) +Returns: 0, or -1 on error + +This is used by userspace to tell KVM what sort of platform it should +emulate. The return value of the ioctl tells userspace whether the +emulation it is requesting is supported by KVM. 
+ +struct kvm_ppc_set_platform { + __u16 platform; /* defines the OS/hypervisor ABI */ + __u16 guest_arch; /* e.g. decimal 206 for v2.06 */ + __u32 flags; +}; + +/* Values for platform */ +#define KVM_PPC_PV_NONE0 /* bare-metal, non-paravirtualized */ +#define KVM_PPC_PV_KVM 1 /* as defined in kvm_para.h */ +#define KVM_PPC_PV_SPAPR 2 /* IBM Server PAPR (a la PowerVM) */ + +/* Values for flags */ +#define KVM_PPC_CROSS_ARCH 1 /* guest architecture != host */ + +The KVM_PPC_CROSS_ARCH bit being 1 indicates that the guest is of a +sufficiently different architecture to the host that the guest cannot +be permitted to use supervisor mode. For example, if the host is a +64-bit machine and the guest is a 32-bit machine, then this bit should +be set. + +The return value is 0 if KVM supports the requested emulation, or -1 +with errno == EINVAL if not. + 5. The kvm_run structure Application code obtains a pointer to the kvm_run structure by diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h index a4f6c85..0dd5cfb 100644 --- a/arch/powerpc/include/asm/kvm.h +++ b/arch/powerpc/include/asm/kvm.h @@ -287,4 +287,19 @@ struct kvm_allocate_rma { __u64 rma_size; }; +/* for KVM_PPC_SET_PLATFORM */ +struct kvm_ppc_set_platform { + __u16 platform; /* defines the OS/hypervisor ABI */ + __u16 guest_arch; /* e.g. 
decimal 206 for v2.06 */
+	__u32 flags;
+};
+
+/* Values for platform */
+#define KVM_PPC_PV_NONE		0	/* bare-metal, non-paravirtualized */
+#define KVM_PPC_PV_KVM		1	/* as defined in kvm_para.h */
+#define KVM_PPC_PV_SPAPR	2	/* IBM Server PAPR (a la PowerVM) */
+
+/* Values for flags */
+#define KVM_PPC_CROSS_ARCH	1	/* guest architecture != host */
+
 #endif /* __LINUX_KVM_POWERPC_H */

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index cc22b28..00e7f1b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -167,6 +167,7 @@
 };
 
 struct kvm_arch {
+	struct kvm_ppc_set_platform platform;
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	unsigned long hpt_virt;
 	unsigned long ram_npages;

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index a107c9b..83265cd 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -690,6 +690,34 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
+	case KVM_PPC_SET_PLATFORM: {
+		struct kvm_ppc_set_platform plat;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&plat, argp,
[PATCH 08/17] powerpc: Set up LPCR for running guest partitions
In hypervisor mode, the LPCR controls several aspects of guest partitions, including virtual partition memory mode, and also controls whether the hypervisor decrementer interrupts are enabled. This sets up LPCR at boot time so that guest partitions will use a virtual real memory area (VRMA) composed of 16MB large pages, and hypervisor decrementer interrupts are disabled.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/reg.h         |    4 ++++
 arch/powerpc/kernel/cpu_setup_power7.S |   18 +++++++++---------
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index c5cae0d..d879a6b 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -232,10 +232,12 @@
 #define   LPCR_VPM0	(1ul << (63-0))
 #define   LPCR_VPM1	(1ul << (63-1))
 #define   LPCR_ISL	(1ul << (63-2))
+#define   LPCR_VC_SH	(63-2)
 #define   LPCR_DPFD_SH	(63-11)
 #define   LPCR_VRMA_L	(1ul << (63-12))
 #define   LPCR_VRMA_LP0	(1ul << (63-15))
 #define   LPCR_VRMA_LP1	(1ul << (63-16))
+#define   LPCR_VRMASD_SH (63-16)
 #define   LPCR_RMLS	0x1C00	/* impl dependent rmo limit sel */
 #define   LPCR_ILE	0x0200	/* !HV irqs set MSR:LE */
 #define   LPCR_PECE	0x7000	/* powersave exit cause enable */
@@ -243,8 +245,10 @@
 #define   LPCR_PECE1	0x2000	/* decrementer can cause exit */
 #define   LPCR_PECE2	0x1000	/* machine check etc can cause exit */
 #define   LPCR_MER	0x0800	/* Mediated External Exception */
+#define   LPCR_LPES	0x000c
 #define   LPCR_LPES0	0x0008	/* LPAR Env selector 0 */
 #define   LPCR_LPES1	0x0004	/* LPAR Env selector 1 */
+#define   LPCR_LPES_SH	2
 #define   LPCR_RMI	0x0002	/* real mode is cache inhibit */
 #define   LPCR_HDICE	0x0001	/* Hyp Decr enable (HV,PR,EE) */
 #define SPRN_LPID	0x13F	/* Logical Partition Identifier */

diff --git a/arch/powerpc/kernel/cpu_setup_power7.S b/arch/powerpc/kernel/cpu_setup_power7.S
index 4f9a93f..2ef6749 100644
--- a/arch/powerpc/kernel/cpu_setup_power7.S
+++ b/arch/powerpc/kernel/cpu_setup_power7.S
@@ -61,19 +61,23 @@ __init_LPCR:
 	 *   LPES = 0b01 (HSRR0/1 used for 0x500)
 	 *   PECE = 0b111
 	 *   DPFD = 4
+	 *   HDICE = 0
+	 *   VC = 0b100 (VPM0=1, VPM1=0, ISL=0)
+	 *   VRMASD = 0b10000 (L=1, LP=00)
 	 *
 	 * Other bits untouched for now
 	 */
 	mfspr	r3,SPRN_LPCR
-	ori	r3,r3,(LPCR_LPES0|LPCR_LPES1)
-	xori	r3,r3,LPCR_LPES0
+	li	r5,1
+	rldimi	r3,r5,LPCR_LPES_SH,64-LPCR_LPES_SH-2
 	ori	r3,r3,(LPCR_PECE0|LPCR_PECE1|LPCR_PECE2)
-	li	r5,7
-	sldi	r5,r5,LPCR_DPFD_SH
-	andc	r3,r3,r5
 	li	r5,4
-	sldi	r5,r5,LPCR_DPFD_SH
-	or	r3,r3,r5
+	rldimi	r3,r5,LPCR_DPFD_SH,64-LPCR_DPFD_SH-3
+	clrrdi	r3,r3,1		/* clear HDICE */
+	li	r5,4
+	rldimi	r3,r5,LPCR_VC_SH,0
+	li	r5,0x10
+	rldimi	r3,r5,LPCR_VRMASD_SH,64-LPCR_VRMASD_SH-5
 	mtspr	SPRN_LPCR,r3
 	isync
 	blr
-- 
1.7.5.4
[PATCH 16/17] KVM: PPC: book3s_hv: Add support for PPC970-family processors
This adds support for running KVM guests in supervisor mode on those PPC970 processors that have a usable hypervisor mode. Unfortunately, Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to 1), but the YDL PowerStation does have a usable hypervisor mode. There are several differences between the PPC970 and POWER7 in how guests are managed. These differences are accommodated using the CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature bits. Notably, on PPC970: * The LPCR, LPID or RMOR registers don't exist, and the functions of those registers are provided by bits in HID4 and one bit in HID0. * External interrupts can be directed to the hypervisor, but unlike POWER7 they are masked by MSR[EE] in non-hypervisor modes and use SRR0/1 not HSRR0/1. * There is no virtual RMA (VRMA) mode; the guest must use an RMO (real mode offset) area. * The TLB entries are not tagged with the LPID, so it is necessary to flush the whole TLB on partition switch. Furthermore, when switching partitions we have to ensure that no other CPU is executing the tlbie or tlbsync instructions in either the old or the new partition, otherwise undefined behaviour can occur. * The PMU has 8 counters (PMC registers) rather than 6. * The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist. * The SLB has 64 entries rather than 32. * There is no mediated external interrupt facility, so if we switch to a guest that has a virtual external interrupt pending but the guest has MSR[EE] = 0, we have to arrange to have an interrupt pending for it so that we can get control back once it re-enables interrupts. We do that by sending ourselves an IPI with smp_send_reschedule after hard-disabling interrupts. 
Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/exception-64s.h |4 + arch/powerpc/include/asm/kvm_book3s_asm.h |2 +- arch/powerpc/include/asm/kvm_host.h |2 +- arch/powerpc/kernel/asm-offsets.c |1 + arch/powerpc/kernel/exceptions-64s.S |2 +- arch/powerpc/kvm/Kconfig | 13 +- arch/powerpc/kvm/book3s_64_mmu_hv.c | 30 +++- arch/powerpc/kvm/book3s_hv.c | 60 ++-- arch/powerpc/kvm/book3s_hv_builtin.c | 11 +- arch/powerpc/kvm/book3s_hv_interrupts.S | 30 arch/powerpc/kvm/book3s_hv_rm_mmu.c |6 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 230 - arch/powerpc/kvm/powerpc.c|3 + arch/powerpc/mm/hash_native_64.c |2 +- 14 files changed, 354 insertions(+), 42 deletions(-) diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h index 69435da..8057f4f 100644 --- a/arch/powerpc/include/asm/exception-64s.h +++ b/arch/powerpc/include/asm/exception-64s.h @@ -246,6 +246,10 @@ label##_hv: \ KVMTEST(vec); \ _SOFTEN_TEST(EXC_HV) +#define SOFTEN_TEST_HV_201(vec) \ + KVMTEST(vec); \ + _SOFTEN_TEST(EXC_STD) + #define __MASKABLE_EXCEPTION_PSERIES(vec, label, h, extra) \ HMT_MEDIUM; \ SET_SCRATCH0(r13);/* save r13 */\ diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h index 9cfd543..ef7b368 100644 --- a/arch/powerpc/include/asm/kvm_book3s_asm.h +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h @@ -82,7 +82,7 @@ struct kvmppc_host_state { unsigned long xics_phys; u64 dabr; u64 host_mmcr[3]; - u32 host_pmc[6]; + u32 host_pmc[8]; u64 host_purr; u64 host_spurr; u64 host_dscr; diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index f572d9c..cc22b28 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -353,7 +353,7 @@ struct kvm_vcpu_arch { u32 dbsr; u64 mmcr[3]; - u32 pmc[6]; + u32 pmc[8]; #ifdef CONFIG_KVM_EXIT_TIMING struct mutex exit_timing_lock; diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c index f4aba93..54b935f 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -128,6 +128,7 @@ int main(void) DEFINE(ICACHEL1LINESPERPAGE, offsetof(struct ppc64_caches, ilines_per_page)); /* paca */ DEFINE(PACA_SIZE, sizeof(struct paca_struct)); + DEFINE(PACA_LOCK_TOKEN, offsetof(struct paca_struct, lock_token)); DEFINE(PACAPACAINDEX, offsetof(struct paca_struct, paca_index)); DEFINE(PACAPROCSTART, offsetof(struct paca_struct, cpu_start)); DEFINE(PACAKSAVE, offsetof(struct
[PATCH 09/17] KVM: PPC: Split host-state fields out of kvmppc_book3s_shadow_vcpu
There are several fields in struct kvmppc_book3s_shadow_vcpu that temporarily store bits of host state while a guest is running, rather than anything relating to the particular guest or vcpu. This splits them out into a new kvmppc_host_state structure and modifies the definitions in asm-offsets.c to suit. On 32-bit, we have a kvmppc_host_state structure inside the kvmppc_book3s_shadow_vcpu since the assembly code needs to be able to get to them both with one pointer. On 64-bit they are separate fields in the PACA. This means that on 64-bit we don't need to copy the kvmppc_host_state in and out on vcpu load/unload, and in future will mean that the book3s_hv code doesn't need a shadow_vcpu struct in the PACA at all. That does mean that we have to be careful not to rely on any values persisting in the hstate field of the paca across any point where we could block or get preempted. Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/exception-64s.h | 10 ++-- arch/powerpc/include/asm/kvm_book3s_asm.h | 27 ++--- arch/powerpc/include/asm/paca.h |1 + arch/powerpc/kernel/asm-offsets.c | 94 ++-- arch/powerpc/kernel/exceptions-64s.S |2 +- arch/powerpc/kvm/book3s_interrupts.S | 19 ++ arch/powerpc/kvm/book3s_rmhandlers.S | 18 +++--- arch/powerpc/kvm/book3s_segment.S | 76 --- 8 files changed, 127 insertions(+), 120 deletions(-) diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h index b6a3a44..296c9b6 100644 --- a/arch/powerpc/include/asm/exception-64s.h +++ b/arch/powerpc/include/asm/exception-64s.h @@ -96,16 +96,16 @@ EXCEPTION_PROLOG_PSERIES_1(label, h); #define __KVMTEST(n) \ - lbz r10,PACA_KVM_SVCPU+SVCPU_IN_GUEST(r13); \ + lbz r10,HSTATE_IN_GUEST(r13); \ cmpwi r10,0; \ bne do_kvm_##n #define __KVM_HANDLER(area, h, n) \ do_kvm_##n:\ ld r10,area+EX_R10(r13); \ - stw r9,PACA_KVM_SVCPU+SVCPU_SCRATCH1(r13); \ + stw r9,HSTATE_SCRATCH1(r13);\ ld r9,area+EX_R9(r13); \ - std r12,PACA_KVM_SVCPU+SVCPU_SCRATCH0(r13); 
\ + std r12,HSTATE_SCRATCH0(r13); \ li r12,n; \ b kvmppc_interrupt @@ -114,9 +114,9 @@ do_kvm_##n: \ cmpwi r10,KVM_GUEST_MODE_SKIP;\ ld r10,area+EX_R10(r13); \ beq 89f;\ - stw r9,PACA_KVM_SVCPU+SVCPU_SCRATCH1(r13); \ + stw r9,HSTATE_SCRATCH1(r13);\ ld r9,area+EX_R9(r13); \ - std r12,PACA_KVM_SVCPU+SVCPU_SCRATCH0(r13); \ + std r12,HSTATE_SCRATCH0(r13); \ li r12,n; \ b kvmppc_interrupt; \ 89:mtocrf 0x80,r9;\ diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h index d5a8a38..3126175 100644 --- a/arch/powerpc/include/asm/kvm_book3s_asm.h +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h @@ -60,6 +60,22 @@ kvmppc_resume_\intno: #else /*__ASSEMBLY__ */ +/* + * This struct goes in the PACA on 64-bit processors. It is used + * to store host state that needs to be saved when we enter a guest + * and restored when we exit, but isn't specific to any particular + * guest or vcpu. It also has some scratch fields used by the guest + * exit code. + */ +struct kvmppc_host_state { + ulong host_r1; + ulong host_r2; + ulong vmhandler; + ulong scratch0; + ulong scratch1; + u8 in_guest; +}; + struct kvmppc_book3s_shadow_vcpu { ulong gpr[14]; u32 cr; @@ -73,17 +89,12 @@ struct kvmppc_book3s_shadow_vcpu { ulong shadow_srr1; ulong fault_dar; - ulong host_r1; - ulong host_r2; - ulong handler; - ulong scratch0; - ulong scratch1; - ulong vmhandler; - u8 in_guest; - #ifdef CONFIG_PPC_BOOK3S_32 u32 sr[16]; /* Guest SRs */ + + struct kvmppc_host_state hstate; #endif + #ifdef CONFIG_PPC_BOOK3S_64 u8 slb_max; /*
[PATCH 0/17] Hypervisor-mode KVM on POWER7 and PPC970
The first patch of the following series is a pure bug-fix for 32-bit kernels. The remainder of the following series of patches enable KVM to exploit the hardware hypervisor mode on 64-bit Power ISA Book3S machines. At present, POWER7 and PPC970 processors are supported. (Note that the PPC970 processors in Apple G5 machines don't have a usable hypervisor mode and are not supported by these patches.) Running the KVM host in hypervisor mode means that the guest can use both supervisor mode and user mode. That means that the guest can execute supervisor-privilege instructions and access supervisor-privilege registers. In addition the hardware directs most exceptions to the guest. Thus we don't need to emulate any instructions in the host. Generally, the only times we need to exit the guest are when it does a hypercall or when an external interrupt or host timer (decrementer) interrupt occurs. The focus of this KVM implementation is to run guests that use the PAPR (Power Architecture Platform Requirements) paravirtualization interface, which is the interface supplied by PowerVM on IBM pSeries machines. Currently the pseries machine type in qemu is only supported by book3s_hv KVM, and book3s_hv KVM only supports the pseries machine type. That will hopefully change in future. These patches are against the master branch of the kvm tree.

Paul.
[PATCH 12/17] KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode
From: David Gibson d...@au1.ibm.com This improves I/O performance for guests using the PAPR paravirtualization interface by making the H_PUT_TCE hcall faster, by implementing it in real mode. H_PUT_TCE is used for updating virtual IOMMU tables, and is used both for virtual I/O and for real I/O in the PAPR interface. Since this moves the IOMMU tables into the kernel, we define a new KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables. The ioctl returns a file descriptor which can be used to mmap the newly created table. The qemu driver models use them in the same way as userspace managed tables, but they can be updated directly by the guest with a real-mode H_PUT_TCE implementation, reducing the number of host/guest context switches during guest IO. There are certain circumstances where it is useful for userland qemu to write to the TCE table even if the kernel H_PUT_TCE path is used most of the time. Specifically, allowing this will avoid awkwardness when we need to reset the table. More importantly, we will in the future need to write the table in order to restore its state after a checkpoint resume or migration. 
Signed-off-by: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Paul Mackerras pau...@samba.org --- Documentation/virtual/kvm/api.txt| 35 + arch/powerpc/include/asm/kvm.h |9 +++ arch/powerpc/include/asm/kvm_book3s_64.h |2 + arch/powerpc/include/asm/kvm_host.h |9 +++ arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/kvm/Makefile|3 +- arch/powerpc/kvm/book3s_64_vio_hv.c | 73 +++ arch/powerpc/kvm/book3s_hv.c | 116 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S |2 +- arch/powerpc/kvm/powerpc.c | 18 + include/linux/kvm.h |2 + 11 files changed, 268 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/kvm/book3s_64_vio_hv.c diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index e8875fe..a1d344d 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1350,6 +1350,41 @@ The following flags are defined: If datamatch flag is set, the event will be signaled only if the written value to the registered address is equal to datamatch in struct kvm_ioeventfd. +4.62 KVM_CREATE_SPAPR_TCE + +Capability: KVM_CAP_SPAPR_TCE +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce (in) +Returns: file descriptor for manipulating the created TCE table + +This creates a virtual TCE (translation control entry) table, which +is an IOMMU for PAPR-style virtual I/O. It is used to translate +logical addresses used in virtual I/O into guest physical addresses, +and provides a scatter/gather capability for PAPR virtual I/O. + +/* for KVM_CAP_SPAPR_TCE */ +struct kvm_create_spapr_tce { + __u64 liobn; + __u32 window_size; +}; + +The liobn field gives the logical IO bus number for which to create a +TCE table. The window_size field specifies the size of the DMA window +which this TCE table will translate - the table will contain one 64 +bit TCE entry for every 4kiB of the DMA window. 
+ +When the guest issues an H_PUT_TCE hcall on a liobn for which a TCE +table has been created using this ioctl(), the kernel will handle it +in real mode, updating the TCE table. H_PUT_TCE calls for other +liobns will cause a vm exit and must be handled by userspace. + +The return value is a file descriptor which can be passed to mmap(2) +to map the created TCE table into userspace. This lets userspace read +the entries written by kernel-handled H_PUT_TCE calls, and also lets +userspace update the TCE table directly which is useful in some +circumstances. + 5. The kvm_run structure Application code obtains a pointer to the kvm_run structure by diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h index d2ca5ed..c3ec990 100644 --- a/arch/powerpc/include/asm/kvm.h +++ b/arch/powerpc/include/asm/kvm.h @@ -22,6 +22,9 @@ #include linux/types.h +/* Select powerpc specific features in linux/kvm.h */ +#define __KVM_HAVE_SPAPR_TCE + struct kvm_regs { __u64 pc; __u64 cr; @@ -272,4 +275,10 @@ struct kvm_guest_debug_arch { #define KVM_INTERRUPT_UNSET-2U #define KVM_INTERRUPT_SET_LEVEL-3U +/* for KVM_CAP_SPAPR_TCE */ +struct kvm_create_spapr_tce { + __u64 liobn; + __u32 window_size; +}; + #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 5f73388..e43fe42 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -27,4 +27,6 @@ static inline struct kvmppc_book3s_shadow_vcpu *to_svcpu(struct kvm_vcpu *vcpu) } #endif +#define SPAPR_TCE_SHIFT12 + #endif /* __ASM_KVM_BOOK3S_64_H__ */ diff --git
[PATCH 13/17] KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one hardware thread per core, and allows them to use up to 4 threads per core on POWER7. The host still has to run single-threaded. This capability is advertised to qemu through a new KVM_CAP_PPC_SMT capability. The return value of the ioctl querying this capability is the number of vcpus per virtual CPU core (vcore), currently 4. To use this, the host kernel should be booted with all threads active, and then all the secondary threads should be offlined. This will put the secondary threads into nap mode. KVM will then wake them from nap mode and use them for running guest code (while they are still offline). To wake the secondary threads, we send them an IPI using a new xics_wake_cpu() function, implemented in arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage we assume that the platform has a XICS interrupt controller and we are using icp-native.c to drive it. Since the woken thread will need to acknowledge and clear the IPI, we also export the base physical address of the XICS registers using kvmppc_set_xics_phys() for use in the low-level KVM book3s code. When a vcpu is created, it is assigned to a virtual CPU core. The vcore number is obtained by dividing the vcpu number by the number of threads per core in the host. This number is exported to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes to run the guest in single-threaded mode, it should make all vcpu numbers be multiples of the number of threads per core. We distinguish three states of a vcpu: runnable (i.e., ready to execute the guest), blocked (that is, idle), and busy in host. We currently implement a policy that the vcore can run only when all its threads are runnable or blocked. This way, if a vcpu needs to execute elsewhere in the kernel or in qemu, it can do so without being starved of CPU by the other vcpus. When a vcore starts to run, it executes in the context of one of the vcpu threads. 
The other vcpu threads all go to sleep and stay asleep until something happens requiring the vcpu thread to return to qemu, or to wake up to run the vcore (this can happen when another vcpu thread goes from busy in host state to blocked). It can happen that a vcpu goes from blocked to runnable state (e.g. because of an interrupt), and the vcore it belongs to is already running. In that case it can start to run immediately as long as none of the vcpus in the vcore have started to exit the guest. We send the next free thread in the vcore an IPI to get it to start to execute the guest. It synchronizes with the other threads via the vcore->entry_exit_count field to make sure that it doesn't go into the guest if the other vcpus are exiting by the time that it is ready to actually enter the guest. Note that there is no fixed relationship between the hardware thread number and the vcpu number. Hardware threads are assigned to vcpus as they become runnable, so we will always use the lower-numbered hardware threads in preference to higher-numbered threads if not all the vcpus in the vcore are runnable, regardless of which vcpus are runnable.
Signed-off-by: Paul Mackerras pau...@samba.org --- Documentation/virtual/kvm/api.txt | 13 ++ arch/powerpc/include/asm/kvm.h|1 + arch/powerpc/include/asm/kvm_book3s_asm.h |2 + arch/powerpc/include/asm/kvm_host.h | 46 - arch/powerpc/include/asm/kvm_ppc.h| 13 ++ arch/powerpc/kernel/asm-offsets.c |6 + arch/powerpc/kernel/exceptions-64s.S | 31 ++- arch/powerpc/kernel/idle_power7.S |2 - arch/powerpc/kvm/book3s_hv.c | 316 ++--- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 168 +++- arch/powerpc/kvm/powerpc.c|4 + arch/powerpc/sysdev/xics/icp-native.c |9 + include/linux/kvm.h |1 + 13 files changed, 567 insertions(+), 45 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index a1d344d..6818713 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -180,6 +180,19 @@ KVM_CHECK_EXTENSION ioctl() to determine the value for max_vcpus at run-time. If the KVM_CAP_NR_VCPUS does not exist, you should assume that max_vcpus is 4 cpus max. +On powerpc using book3s_hv mode, the vcpus are mapped onto virtual +threads in one or more virtual CPU cores. (This is because the +hardware requires all the hardware threads in a CPU core to be in the +same partition.) The KVM_CAP_PPC_SMT capability indicates the number +of vcpus per virtual core (vcore). The vcore id is obtained by +dividing the vcpu id by the number of vcpus per vcore. The vcpus in a +given vcore will always be in the same physical core as each other +(though that might be a different physical core from time to time). +Userspace can control the threading (SMT) mode of the guest by its
[PATCH 02/17] KVM: PPC: Move fields between struct kvm_vcpu_arch and kvmppc_vcpu_book3s
This moves the slb field, which represents the state of the emulated SLB, from the kvmppc_vcpu_book3s struct to the kvm_vcpu_arch, and the hpte_hash_[v]pte[_long] fields from kvm_vcpu_arch to kvmppc_vcpu_book3s. This is in accord with the principle that the kvm_vcpu_arch struct represents the state of the emulated CPU, and the kvmppc_vcpu_book3s struct holds the auxiliary data structures used in the emulation. Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/kvm_book3s.h | 35 +--- arch/powerpc/include/asm/kvm_host.h | 34 +++- arch/powerpc/kvm/book3s.c |9 ++-- arch/powerpc/kvm/book3s_64_mmu.c | 54 +++- arch/powerpc/kvm/book3s_mmu_hpte.c| 71 +++- arch/powerpc/kvm/trace.h |2 +- 6 files changed, 107 insertions(+), 98 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h index 70c409b..f7b2baf 100644 --- a/arch/powerpc/include/asm/kvm_book3s.h +++ b/arch/powerpc/include/asm/kvm_book3s.h @@ -24,20 +24,6 @@ #include linux/kvm_host.h #include asm/kvm_book3s_asm.h -struct kvmppc_slb { - u64 esid; - u64 vsid; - u64 orige; - u64 origv; - bool valid : 1; - bool Ks : 1; - bool Kp : 1; - bool nx : 1; - bool large : 1;/* PTEs are 16MB */ - bool tb : 1;/* 1TB segment */ - bool class : 1; -}; - struct kvmppc_bat { u64 raw; u32 bepi; @@ -67,11 +53,22 @@ struct kvmppc_sid_map { #define VSID_POOL_SIZE (SID_CONTEXTS * 16) #endif +struct hpte_cache { + struct hlist_node list_pte; + struct hlist_node list_pte_long; + struct hlist_node list_vpte; + struct hlist_node list_vpte_long; + struct rcu_head rcu_head; + u64 host_va; + u64 pfn; + ulong slot; + struct kvmppc_pte pte; +}; + struct kvmppc_vcpu_book3s { struct kvm_vcpu vcpu; struct kvmppc_book3s_shadow_vcpu *shadow_vcpu; struct kvmppc_sid_map sid_map[SID_MAP_NUM]; - struct kvmppc_slb slb[64]; struct { u64 esid; u64 vsid; @@ -81,7 +78,6 @@ struct kvmppc_vcpu_book3s { struct kvmppc_bat dbat[8]; u64 hid[6]; u64 gqr[8]; - int slb_nr; u64 sdr1; u64 hior; u64 
msr_mask; @@ -94,6 +90,13 @@ struct kvmppc_vcpu_book3s { #endif int context_id[SID_CONTEXTS]; ulong prog_flags; /* flags to inject when giving a 700 trap */ + + struct hlist_head hpte_hash_pte[HPTEG_HASH_NUM_PTE]; + struct hlist_head hpte_hash_pte_long[HPTEG_HASH_NUM_PTE_LONG]; + struct hlist_head hpte_hash_vpte[HPTEG_HASH_NUM_VPTE]; + struct hlist_head hpte_hash_vpte_long[HPTEG_HASH_NUM_VPTE_LONG]; + int hpte_cache_count; + spinlock_t mmu_lock; }; #define CONTEXT_HOST 0 diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 6e05b2d..069eb9f 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -163,16 +163,18 @@ struct kvmppc_mmu { bool (*is_dcbz32)(struct kvm_vcpu *vcpu); }; -struct hpte_cache { - struct hlist_node list_pte; - struct hlist_node list_pte_long; - struct hlist_node list_vpte; - struct hlist_node list_vpte_long; - struct rcu_head rcu_head; - u64 host_va; - u64 pfn; - ulong slot; - struct kvmppc_pte pte; +struct kvmppc_slb { + u64 esid; + u64 vsid; + u64 orige; + u64 origv; + bool valid : 1; + bool Ks : 1; + bool Kp : 1; + bool nx : 1; + bool large : 1;/* PTEs are 16MB */ + bool tb : 1;/* 1TB segment */ + bool class : 1; }; struct kvm_vcpu_arch { @@ -187,6 +189,9 @@ struct kvm_vcpu_arch { ulong highmem_handler; ulong rmcall; ulong host_paca_phys; + struct kvmppc_slb slb[64]; + int slb_max;/* # valid entries in slb[] */ + int slb_nr; /* total number of entries in SLB */ struct kvmppc_mmu mmu; #endif @@ -305,15 +310,6 @@ struct kvm_vcpu_arch { struct kvm_vcpu_arch_shared *shared; unsigned long magic_page_pa; /* phys addr to map the magic page to */ unsigned long magic_page_ea; /* effect. 
addr to map the magic page to */ - -#ifdef CONFIG_PPC_BOOK3S - struct hlist_head hpte_hash_pte[HPTEG_HASH_NUM_PTE]; - struct hlist_head hpte_hash_pte_long[HPTEG_HASH_NUM_PTE_LONG]; - struct hlist_head hpte_hash_vpte[HPTEG_HASH_NUM_VPTE]; - struct hlist_head hpte_hash_vpte_long[HPTEG_HASH_NUM_VPTE_LONG]; - int hpte_cache_count; - spinlock_t mmu_lock; -#endif }; #endif /* __POWERPC_KVM_HOST_H__ */ diff --git a/arch/powerpc/kvm/book3s.c
[PATCH 14/17] KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests
This adds infrastructure which will be needed to allow book3s_hv KVM to run on older POWER processors, including PPC970, which don't support the Virtual Real Mode Area (VRMA) facility, but only the Real Mode Offset (RMO) facility. These processors require a physically contiguous, aligned area of memory for each guest. When the guest does an access in real mode (MMU off), the address is compared against a limit value, and if it is lower, the address is ORed with an offset value (from the Real Mode Offset Register (RMOR)) and the result becomes the real address for the access. The size of the RMA has to be one of a set of supported values, which usually includes 64MB, 128MB, 256MB and some larger powers of 2. Since we are unlikely to be able to allocate 64MB or more of physically contiguous memory after the kernel has been running for a while, we allocate a pool of RMAs at boot time using the bootmem allocator. The size and number of the RMAs can be set using the kvm_rma_size=xx and kvm_rma_count=xx kernel command line options. KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability of the pool of preallocated RMAs. The capability value is 1 if the processor can use an RMA but doesn't require one (because it supports the VRMA facility), or 2 if the processor requires an RMA for each guest. This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the pool and returns a file descriptor which can be used to map the RMA. It also returns the size of the RMA in the argument structure. Having an RMA means we will get multiple KVM_SET_USER_MEMORY_REGION ioctl calls from userspace. To cope with this, we now preallocate the kvm->arch.ram_pginfo array when the VM is created with a size sufficient for up to 64GB of guest memory. Subsequently we will get rid of this array and use memory associated with each memslot instead.
This moves most of the code that translates the user addresses into host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level to kvmppc_core_prepare_memory_region. Also, instead of having to look up the VMA for each page in order to check the page size, we now check that the pages we get are compound pages of 16MB. However, if we are adding memory that is mapped to an RMA, we don't bother with calling get_user_pages_fast and instead just offset from the base pfn for the RMA. Typically the RMA gets added after vcpus are created, which makes it inconvenient to have the LPCR (logical partition control register) value in the vcpu->arch struct, since the LPCR controls whether the processor uses RMA or VRMA for the guest. This moves the LPCR value into the kvm->arch struct and arranges for the MER (mediated external request) bit, which is the only bit that varies between vcpus, to be set in assembly code when going into the guest if there is a pending external interrupt request. Signed-off-by: Paul Mackerras pau...@samba.org --- Documentation/virtual/kvm/api.txt | 32 arch/powerpc/include/asm/kvm.h |5 + arch/powerpc/include/asm/kvm_book3s.h |8 - arch/powerpc/include/asm/kvm_host.h | 15 ++- arch/powerpc/include/asm/kvm_ppc.h | 10 ++ arch/powerpc/include/asm/reg.h |1 + arch/powerpc/kernel/asm-offsets.c |4 +- arch/powerpc/kernel/setup_64.c |3 + arch/powerpc/kvm/Makefile |3 +- arch/powerpc/kvm/book3s_64_mmu_hv.c | 97 +--- arch/powerpc/kvm/book3s_hv.c| 259 +- arch/powerpc/kvm/book3s_hv_builtin.c| 152 ++ arch/powerpc/kvm/book3s_hv_rmhandlers.S | 19 ++- arch/powerpc/kvm/powerpc.c | 13 ++ include/linux/kvm.h |3 + 15 files changed, 505 insertions(+), 119 deletions(-) create mode 100644 arch/powerpc/kvm/book3s_hv_builtin.c diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 6818713..b0e4b9c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1398,6 +1398,38 @@ the entries written by
kernel-handled H_PUT_TCE calls, and also lets userspace update the TCE table directly which is useful in some circumstances. +4.63 KVM_ALLOCATE_RMA + +Capability: KVM_CAP_PPC_RMA +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_allocate_rma (out) +Returns: file descriptor for mapping the allocated RMA + +This allocates a Real Mode Area (RMA) from the pool allocated at boot +time by the kernel. An RMA is a physically-contiguous, aligned region +of memory used on older POWER processors to provide the memory which +will be accessed by real-mode (MMU off) accesses in a KVM guest. +POWER processors support a set of sizes for the RMA that usually +includes 64MB, 128MB, 256MB and some larger powers of two. + +/* for KVM_ALLOCATE_RMA */ +struct kvm_allocate_rma { + __u64 rma_size; +}; + +The return value is a file descriptor which can be passed to mmap(2)
Re: [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary
On 06/29/2011 04:21 PM, Avi Kivity wrote: -if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception) -== X86EMUL_CONTINUE) +if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes)) return X86EMUL_CONTINUE; This doesn't perform the cpl check. Firstly, it calls kvm_mmu_gva_to_gpa_read to translate gva to gpa, and cpl is checked in that function; is that not enough? I suggest dropping this part for now and doing it later. OK, I will post this part in a separate patchset. :-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
On 06/29/2011 04:24 PM, Avi Kivity wrote: +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, + gpa_t *gpa, struct x86_exception *exception, + bool write) +{ +u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + +if (write) +access |= PFERR_WRITE_MASK; Needs fetch as well so NX/SMEP can work. This function is only used by the read/write emulator; execute permission is not needed for read/write, no?
Re: [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation
On 06/29/2011 04:37 PM, Avi Kivity wrote: +struct read_write_emulator_ops { +int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val, + int bytes); +int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa, + void *val, int bytes); +int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa, + int bytes, void *val); +int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa, +void *val, int bytes); +bool write; +}; Interesting! This structure combines two unrelated operations, though. One is the internals of the iteration on a virtual address that is split to various physical addresses. The other is the interaction with userspace on mmio exits. They should be split, but I think it's fine to do it in a later patch. This series is long enough already. I was also annoyed by the duplication. The way I thought of fixing it is having gva_to_gpa() return two gpas, and having the access function accept gpa vectors. The reason was so that we could implement locked cross-page operations (which we now emulate as unlocked writes). But I think we can do without it, and instead emulate locked cross-page ops by stalling all other vcpus while we write, or by unmapping the pages involved. It isn't pretty but it doesn't need to be fast since it's a very rare operation. So I think we can go with your approach. OK, I'll post it in a separate patchset, thanks, Avi.
Re: [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path
On 06/29/2011 04:48 PM, Avi Kivity wrote: On 06/22/2011 05:31 PM, Xiao Guangrong wrote: If the page fault is caused by mmio, we can cache the mmio info; later, we do not need to walk the guest page table and can quickly know it is a mmio fault while we emulate the mmio instruction Does this work if the mmio spans two pages? If the mmio spans two pages, we already split the emulation into two parts, and the mmio cache info is only matched for one page, so I think it works well :-)
Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
On 06/29/2011 01:56 PM, Xiao Guangrong wrote: On 06/29/2011 04:24 PM, Avi Kivity wrote: +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, + gpa_t *gpa, struct x86_exception *exception, + bool write) +{ +u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + +if (write) +access |= PFERR_WRITE_MASK; Needs fetch as well so NX/SMEP can work. This function is only used by the read/write emulator; execute permission is not needed for read/write, no? It's not good to have a function which only implements the functionality partially. It can later be misused. You can pass the page-fault error code instead of the write parameter; I think it will be simpler. -- error compiling committee.c: too many arguments to function
Re: [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path
On 06/29/2011 02:09 PM, Xiao Guangrong wrote: On 06/29/2011 04:48 PM, Avi Kivity wrote: On 06/22/2011 05:31 PM, Xiao Guangrong wrote: If the page fault is caused by mmio, we can cache the mmio info; later, we do not need to walk the guest page table and can quickly know it is a mmio fault while we emulate the mmio instruction Does this work if the mmio spans two pages? If the mmio spans two pages, we already split the emulation into two parts, and the mmio cache info is only matched for one page, so I think it works well :-) Ok, thanks.
Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
On 06/29/2011 05:16 PM, Avi Kivity wrote: On 06/22/2011 05:35 PM, Xiao Guangrong wrote: Use rcu to protect the shadow page tables being freed, so we can safely walk them; it should run fast and is needed by the mmio page fault static void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list) { @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, kvm_flush_remote_tlbs(kvm); +if (atomic_read(&kvm->arch.reader_counter)) { +kvm_mmu_isolate_pages(invalid_list); +sp = list_first_entry(invalid_list, struct kvm_mmu_page, link); +list_del_init(invalid_list); +call_rcu(&sp->rcu, free_pages_rcu); +return; +} + I think we should do this unconditionally. The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts. On the other hand, zap_page is very rare, so it can be a little slower. Also, fewer code paths = easier to understand. On soft mmu, zap_page is very frequent; it can cause a performance regression in my test.
Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
On 06/29/2011 02:16 PM, Xiao Guangrong wrote: @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, kvm_flush_remote_tlbs(kvm); +if (atomic_read(&kvm->arch.reader_counter)) { +kvm_mmu_isolate_pages(invalid_list); +sp = list_first_entry(invalid_list, struct kvm_mmu_page, link); +list_del_init(invalid_list); +call_rcu(&sp->rcu, free_pages_rcu); +return; +} + I think we should do this unconditionally. The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts. On the other hand, zap_page is very rare, so it can be a little slower. Also, fewer code paths = easier to understand. On soft mmu, zap_page is very frequent; it can cause a performance regression in my test. Any idea what the cause of the regression is? It seems to me that simply deferring freeing shouldn't have a large impact.
Re: [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary
On 06/29/2011 01:53 PM, Xiao Guangrong wrote: On 06/29/2011 04:21 PM, Avi Kivity wrote: -if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception) -== X86EMUL_CONTINUE) +if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes)) return X86EMUL_CONTINUE; This doesn't perform the cpl check. Firstly, it calls kvm_mmu_gva_to_gpa_read to translate gva to gpa, and cpl is checked in that function; is that not enough? You are right, it is enough. I don't know how I missed it.
Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
On 06/29/2011 07:09 PM, Avi Kivity wrote: On 06/29/2011 01:56 PM, Xiao Guangrong wrote: On 06/29/2011 04:24 PM, Avi Kivity wrote: +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, + gpa_t *gpa, struct x86_exception *exception, + bool write) +{ +u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + +if (write) +access |= PFERR_WRITE_MASK; Needs fetch as well so NX/SMEP can work. This function is only used by the read/write emulator; execute permission is not needed for read/write, no? It's not good to have a function which only implements the functionality partially. It can later be misused. You can pass the page-fault error code instead of the write parameter; I think it will be simpler. Actually, we will get the cached mmio info in this function; I think it is a pure waste for any access except mmio. What about changing the function name to vcpu_gva_to_gpa_mmio?
Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
On 06/29/2011 02:26 PM, Xiao Guangrong wrote: On 06/29/2011 07:09 PM, Avi Kivity wrote: On 06/29/2011 01:56 PM, Xiao Guangrong wrote: On 06/29/2011 04:24 PM, Avi Kivity wrote: +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, + gpa_t *gpa, struct x86_exception *exception, + bool write) +{ +u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + +if (write) +access |= PFERR_WRITE_MASK; Needs fetch as well so NX/SMEP can work. This function is only used by the read/write emulator; execute permission is not needed for read/write, no? It's not good to have a function which only implements the functionality partially. It can later be misused. You can pass the page-fault error code instead of the write parameter; I think it will be simpler. Actually, we will get the cached mmio info in this function; I think it is a pure waste for any access except mmio. What about changing the function name to vcpu_gva_to_gpa_mmio? Not too happy, but ok.
Re: missing compat-ioctl for CDROM_DRIVE_STATUS + FDGETPRM
On Fri, Jun 17, 2011 at 03:02:39PM +0200, Arnd Bergmann wrote: On Friday 17 June 2011 11:04:24 Johannes Stezenbach wrote: running even a simple qemu-img create -f qcow2 some.img 1G causes the following in dmesg on a Linux host with linux-2.6.39.1 x86_64 kernel and 32bit userspace: ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(5326){t:'S';sz:0} arg(7fff) on some.img ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img (The same happens when starting a qemu or kvm vm.) ioctl 5326 seems to be CDROM_DRIVE_STATUS, ioctl 801c0204 is FDGETPRM. Both are used in qemu/block/raw-posix.c in cdrom_probe_device() and floppy_probe_device() respectively. FWIW, I'm using qemu/kvm from Debian unstable (qemu-0.14.0+dfsg-5.1, qemu-kvm-0.14.1+dfsg-1) Both are handled by the kernel for block devices, but not for regular files. The messages may be annoying but they are harmless. We could silence them either by checking if the file is actually a block device in qemu-img, or by adding a nop handler to the kernel for regular files. Sorry for the very slow reply. I think qemu's use of these ioctls to probe whether the device is a cdrom or floppy is valid, so instead of adding a stat() call to check for a block device in qemu, I think it is better to silence the warning in the kernel. Do I get it right that just adding two IGNORE_IOCTL() entries to the ioctl_pointer array in linux/fs/compat_ioctl.c is sufficient, like in commit 3f001711? I.e. these ioctls are handled for block devices earlier in compat_sys_ioctl()? Thanks, Johannes
[GIT PULL] KVM fix for Linux 3.0-rc5
Linus, please pull from the git repository at: git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm-updates/3.0 to receive a single KVM fix. Emulated instructions which had both an immediate operand and an %rip-relative operand did not compute the effective address correctly; this is now fixed. Avi Kivity (1): KVM: x86 emulator: fix %rip-relative addressing with immediate source operand arch/x86/kvm/emulate.c | 12 +++- 1 files changed, 7 insertions(+), 5 deletions(-)
Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
On 06/29/2011 07:18 PM, Avi Kivity wrote: On 06/29/2011 02:16 PM, Xiao Guangrong wrote: @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, kvm_flush_remote_tlbs(kvm); +if (atomic_read(&kvm->arch.reader_counter)) { +kvm_mmu_isolate_pages(invalid_list); +sp = list_first_entry(invalid_list, struct kvm_mmu_page, link); +list_del_init(invalid_list); +call_rcu(&sp->rcu, free_pages_rcu); +return; +} + I think we should do this unconditionally. The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts. On the other hand, zap_page is very rare, so it can be a little slower. Also, fewer code paths = easier to understand. On soft mmu, zap_page is very frequent; it can cause a performance regression in my test. Any idea what the cause of the regression is? It seems to me that simply deferring freeing shouldn't have a large impact. I guess it is because pages are freed too frequently. I have done the test; it shows about 3219 pages are freed per second. Kernbench performance comparison: the original way: 3m27.723; freeing all shadow pages in RCU context: 3m30.519.
Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
On Wed, Jun 29, 2011 at 02:26:14PM +0300, Avi Kivity wrote: On 06/29/2011 02:26 PM, Xiao Guangrong wrote: On 06/29/2011 07:09 PM, Avi Kivity wrote: On 06/29/2011 01:56 PM, Xiao Guangrong wrote: On 06/29/2011 04:24 PM, Avi Kivity wrote: +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, + gpa_t *gpa, struct x86_exception *exception, + bool write) +{ +u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + +if (write) +access |= PFERR_WRITE_MASK; Needs fetch as well so NX/SMEP can work. This function is only used by the read/write emulator; execute permission is not needed for read/write, no? It's not good to have a function which only implements the functionality partially. It can later be misused. You can pass the page-fault error code instead of the write parameter; I think it will be simpler. Actually, we will get the cached mmio info in this function; I think it is a pure waste for any access except mmio. What about changing the function name to vcpu_gva_to_gpa_mmio? Not too happy, but ok. I do plan to add fetching from MMIO. -- Gleb.
Re: [RFC PATCH 17/17] KVM: PPC: Add an ioctl for userspace to select which platform to emulate
On Wed, Jun 29, 2011 at 08:41:03PM +1000, Paul Mackerras wrote:

> Documentation/virtual/kvm/api.txt    | 35 +++
> arch/powerpc/include/asm/kvm.h       | 15 +++
> arch/powerpc/include/asm/kvm_host.h  |  1 +
> arch/powerpc/kvm/powerpc.c           | 28
> include/linux/kvm.h                  |  1 +
> 5 files changed, 80 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index b0e4b9c..3ab012c 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1430,6 +1430,41 @@ is supported; 2 if the processor requires all virtual machines to have
>  an RMA, or 1 if the processor can use an RMA but doesn't require it,
>  because it supports the Virtual RMA (VRMA) facility.
>
> +4.64 KVM_PPC_SET_PLATFORM
> +
> +Capability: none
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: struct kvm_ppc_set_platform (in)
> +Returns: 0, or -1 on error
> +
> +This is used by userspace to tell KVM what sort of platform it should
> +emulate. The return value of the ioctl tells userspace whether the
> +emulation it is requesting is supported by KVM.
> +
> +struct kvm_ppc_set_platform {
> +	__u16 platform;   /* defines the OS/hypervisor ABI */
> +	__u16 guest_arch; /* e.g. decimal 206 for v2.06 */
> +	__u32 flags;
> +};
> +
> +/* Values for platform */
> +#define KVM_PPC_PV_NONE    0 /* bare-metal, non-paravirtualized */
> +#define KVM_PPC_PV_KVM     1 /* as defined in kvm_para.h */
> +#define KVM_PPC_PV_SPAPR   2 /* IBM Server PAPR (a la PowerVM) */
> +
> +/* Values for flags */
> +#define KVM_PPC_CROSS_ARCH 1 /* guest architecture != host */
> +
> +The KVM_PPC_CROSS_ARCH bit being 1 indicates that the guest is of a
> +sufficiently different architecture to the host that the guest cannot
> +be permitted to use supervisor mode. For example, if the host is a
> +64-bit machine and the guest is a 32-bit machine, then this bit should
> +be set.

This makes me wonder if a similar thing might eventually be usable for running an i686 or x32 guest on an x86_64 KVM host.

I have no idea if that is even theoretically possible, but if it is it might be better to rename the ioctl to be architecture agnostic.

josh
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 17/17] KVM: PPC: Add an ioctl for userspace to select which platform to emulate
On 29.06.2011, at 13:53, Josh Boyer wrote:

> On Wed, Jun 29, 2011 at 08:41:03PM +1000, Paul Mackerras wrote:
>> Documentation/virtual/kvm/api.txt    | 35 +++
>> arch/powerpc/include/asm/kvm.h       | 15 +++
>> arch/powerpc/include/asm/kvm_host.h  |  1 +
>> arch/powerpc/kvm/powerpc.c           | 28
>> include/linux/kvm.h                  |  1 +
>> 5 files changed, 80 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index b0e4b9c..3ab012c 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -1430,6 +1430,41 @@ is supported; 2 if the processor requires all virtual machines to have
>>  an RMA, or 1 if the processor can use an RMA but doesn't require it,
>>  because it supports the Virtual RMA (VRMA) facility.
>>
>> +4.64 KVM_PPC_SET_PLATFORM
>> +
>> +Capability: none
>> +Architectures: powerpc
>> +Type: vm ioctl
>> +Parameters: struct kvm_ppc_set_platform (in)
>> +Returns: 0, or -1 on error
>> +
>> +This is used by userspace to tell KVM what sort of platform it should
>> +emulate. The return value of the ioctl tells userspace whether the
>> +emulation it is requesting is supported by KVM.
>> +
>> +struct kvm_ppc_set_platform {
>> +	__u16 platform;   /* defines the OS/hypervisor ABI */
>> +	__u16 guest_arch; /* e.g. decimal 206 for v2.06 */
>> +	__u32 flags;
>> +};
>> +
>> +/* Values for platform */
>> +#define KVM_PPC_PV_NONE    0 /* bare-metal, non-paravirtualized */
>> +#define KVM_PPC_PV_KVM     1 /* as defined in kvm_para.h */
>> +#define KVM_PPC_PV_SPAPR   2 /* IBM Server PAPR (a la PowerVM) */
>> +
>> +/* Values for flags */
>> +#define KVM_PPC_CROSS_ARCH 1 /* guest architecture != host */
>> +
>> +The KVM_PPC_CROSS_ARCH bit being 1 indicates that the guest is of a
>> +sufficiently different architecture to the host that the guest cannot
>> +be permitted to use supervisor mode. For example, if the host is a
>> +64-bit machine and the guest is a 32-bit machine, then this bit should
>> +be set.
>
> This makes me wonder if a similar thing might eventually be usable for
> running an i686 or x32 guest on an x86_64 KVM host.
>
> I have no idea if that is even theoretically possible, but if it is it
> might be better to rename the ioctl to be architecture agnostic.

On x86 this is not required unless we want to virtualize pre-CPUID CPUs. Everything as of the Pentium has a full bitmap of feature capabilities that KVM gets from user space, including information such as "can we do 64-bit mode?".

Alex
Re: [RFC PATCH 17/17] KVM: PPC: Add an ioctl for userspace to select which platform to emulate
On Wed, Jun 29, 2011 at 01:56:16PM +0200, Alexander Graf wrote:

> On 29.06.2011, at 13:53, Josh Boyer wrote:
>> On Wed, Jun 29, 2011 at 08:41:03PM +1000, Paul Mackerras wrote:
>>> Documentation/virtual/kvm/api.txt    | 35 +++
>>> arch/powerpc/include/asm/kvm.h       | 15 +++
>>> arch/powerpc/include/asm/kvm_host.h  |  1 +
>>> arch/powerpc/kvm/powerpc.c           | 28
>>> include/linux/kvm.h                  |  1 +
>>> 5 files changed, 80 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>> index b0e4b9c..3ab012c 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -1430,6 +1430,41 @@ is supported; 2 if the processor requires all virtual machines to have
>>>  an RMA, or 1 if the processor can use an RMA but doesn't require it,
>>>  because it supports the Virtual RMA (VRMA) facility.
>>>
>>> +4.64 KVM_PPC_SET_PLATFORM
>>> +
>>> +Capability: none
>>> +Architectures: powerpc
>>> +Type: vm ioctl
>>> +Parameters: struct kvm_ppc_set_platform (in)
>>> +Returns: 0, or -1 on error
>>> +
>>> +This is used by userspace to tell KVM what sort of platform it should
>>> +emulate. The return value of the ioctl tells userspace whether the
>>> +emulation it is requesting is supported by KVM.
>>> +
>>> +struct kvm_ppc_set_platform {
>>> +	__u16 platform;   /* defines the OS/hypervisor ABI */
>>> +	__u16 guest_arch; /* e.g. decimal 206 for v2.06 */
>>> +	__u32 flags;
>>> +};
>>> +
>>> +/* Values for platform */
>>> +#define KVM_PPC_PV_NONE    0 /* bare-metal, non-paravirtualized */
>>> +#define KVM_PPC_PV_KVM     1 /* as defined in kvm_para.h */
>>> +#define KVM_PPC_PV_SPAPR   2 /* IBM Server PAPR (a la PowerVM) */
>>> +
>>> +/* Values for flags */
>>> +#define KVM_PPC_CROSS_ARCH 1 /* guest architecture != host */
>>> +
>>> +The KVM_PPC_CROSS_ARCH bit being 1 indicates that the guest is of a
>>> +sufficiently different architecture to the host that the guest cannot
>>> +be permitted to use supervisor mode. For example, if the host is a
>>> +64-bit machine and the guest is a 32-bit machine, then this bit should
>>> +be set.
>>
>> This makes me wonder if a similar thing might eventually be usable for
>> running an i686 or x32 guest on an x86_64 KVM host.
>>
>> I have no idea if that is even theoretically possible, but if it is it
>> might be better to rename the ioctl to be architecture agnostic.
>
> On x86 this is not required unless we want to virtualize pre-CPUID CPUs.
> Everything as of the Pentium has a full bitmap of feature capabilities
> that KVM gets from user space, including information such as "can we do
> 64-bit mode?".

Ah. Thank you for the explanation.

josh
Re: [PATCH 0/5] perf support for amd guest/host-only bits v2
On Wed, Jun 29, 2011 at 11:02:54AM +0200, Peter Zijlstra wrote:

> On Tue, 2011-06-28 at 18:10 +0200, Joerg Roedel wrote:
>> On Fri, Jun 17, 2011 at 03:37:29PM +0200, Joerg Roedel wrote:
>>> this is the second version of the patch-set to support the AMD
>>> guest-/host-only bits in the performance counter MSRs. Due to lack of
>>> time I haven't looked into emulating support for this feature on Intel
>>> or other architectures, but the other comments should be worked in.
>>> The changes to v1 include:
>>>
>>>  * Rebased patches to v3.0-rc3
>>>  * Allow exclude_guest and exclude_host set at the same time
>>>  * Reworked event-parse logic for the new exclude-bits
>>>  * Only count guest-events per default from perf-kvm
>>
>> Hi Peter, Ingo, have you had a chance to look at this patch-set? Are
>> any changes required?
>
> I would feel a lot more comfortable by having it implemented on all of
> x86 as well as at least one !x86 platform. Avi graciously volunteered
> for the Intel bits. Paulus, I hear from benh that you're also
> responsible for the ppc-kvm bits, could you possibly find some time to
> implement this feature for ppc?

I'll have a look at it, but I don't know how quickly I'll be able to produce a patch.

We have two styles of KVM on PowerPC (at least as far as server processors are concerned): one where the guest runs entirely in usermode and the privileged facilities are emulated, and another that uses hypervisor mode in the host and can allow the guest to use supervisor mode. In the latter case, the PMU is considered a guest resource; that is, the hardware allows the guest to manipulate the PMU directly, and PMU interrupts go directly to the guest. In that mode it's not really possible to count or profile guest activity from the host. There are some hypervisor-only counters in the PMU, but they have limited event selection compared to the counters available to the guest.

Paul.
Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
On 06/29/2011 02:50 PM, Xiao Guangrong wrote:

>>>> I think we should do this unconditionally. The cost of ping-ponging
>>>> the shared cache line containing reader_counter will increase with
>>>> large smp counts. On the other hand, zap_page is very rare, so it can
>>>> be a little slower. Also, less code paths = easier to understand.
>>>
>>> On soft mmu, zap_page is called very frequently; it can cause a
>>> performance regression in my test.
>>
>> Any idea what the cause of the regression is? It seems to me that
>> simply deferring freeing shouldn't have a large impact.
>
> I guess it is because the page is freed too frequently. I have done the
> test, and it shows about 3219 pages are freed per second.
>
> Kernbench performance comparison:
>
>   the original way:                      3m27.723
>   free all shadow pages in rcu context:  3m30.519

I don't recall seeing such a high free rate. Who is doing all this zapping? You may be able to find out with the function tracer + call graph.

--
error compiling committee.c: too many arguments to function