On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> * Joerg Roedel <j...@8bytes.org> wrote:
> 
> > On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
> > > On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> > > > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> > > > 
> > > > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > > >    host.
> > > > 
> > > > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> > > > configured to count only when in guest mode. Perf needs to be aware of
> > > > that and fetch the rip from a different place when monitoring a guest.
> > 
> > > The idea is we want to measure both host and guest at the same time, and
> > > compare all the hot functions fairly.
> > 
> > So you want to measure while the guest vcpu is running and the vmexit
> > path of that vcpu (including qemu userspace part) together? The
> > challenge here is to find out if a performance event originated in guest
> > mode or in host mode.
> > But we can check for that in the nmi-protected part of the vmexit path.
> 
> As far as instrumentation goes, virtualization is simply another 'PID 
> dimension' of measurement.
> 
> Today we can isolate system performance measurements/events to the following 
> domains:
> 
>  - per system
>  - per cpu
>  - per task
> 
> ( Note that PowerPC already supports certain sorts of 'hypervisor/kernel/user'
>   domain separation, and we have some ABI details for all that but it's by no 
>   means complete. Anton is using the PowerPC bits AFAIK, so it already works 
>   to a certain degree. )
> 
> When extending measurements to KVM, we want two things:
> 
>  - user friendliness: instead of having to check 'ps' and figure out which 
>    Qemu thread is the KVM thread we want to profile, just give a convenience
>    namespace to access guest profiling info. -G ought to map to the first
>    currently running KVM guest it can find. (which would match like 90% of the
>    cases) - etc. No ifs and when. If 'perf kvm top' doesnt show something 
>    useful by default the whole effort is for naught.
> 
>  - Extend core facilities and enable the following measurement dimensions:
> 
>      host-kernel-space
>      host-user-space
>      guest-kernel-space
>      guest-user-space
> 
>    on a per guest basis. We want to be able to measure just what the guest 
>    does, and we want to be able to measure just what the host does.
> 
>    Some of this the hardware helps us with (say only measuring host kernel 
>    events is possible), some has to be done by fiddling with event 
>    enable/disable at vm-exit / vm-entry time.
> 
> My suggestion, as always, would be to start very simple and very minimal:
> 
> Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image 
> both as a host and as guest (for testing), to not have to deal with the symbol
> space transport problem initially. Enable 'perf kvm record' to only record 
> guest events by default. Etc.
> 
> This alone will be a quite useful result already - and gives a basis for 
> further work. No need to spend months to do the big grand design straight 
> away, all of this can be done gradually and in the order of usefulness - and 
> you'll always have something that actually works (and helps your other KVM 
> projects) along the way.
It took me a couple of hours to read the emails on this topic.
Based on the above idea, I worked out a prototype. It is ugly, but it does work
with top/record when the guest and the host use the same kernel image, with
most of the needed modules compiled directly into the kernel.

The commands are:
perf kvm top
perf kvm record
perf kvm report

They just collect guest kernel hot functions.
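
For illustration, the way the "kvm" prefix is handled in the perf.c part of
the patch below can be sketched as a standalone C helper. This is only a
sketch: strip_kvm_prefix() is a made-up name (the patch does this inline in
handle_internal_command() and sets the global sample_kvm), so "perf kvm top"
simply becomes "perf top" with the KVM flag set.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the perf.c hunk: if the first argument is
 * "kvm" and a subcommand follows, shift argc/argv past it and return 1.
 * Otherwise leave them untouched and return 0.
 */
static int strip_kvm_prefix(int *argc, const char ***argv)
{
	assert(argc && argv && *argv);

	if (*argc > 1 && !strcmp((*argv)[0], "kvm")) {
		(*argv)++;
		(*argc)--;
		return 1;	/* caller would set sample_kvm = 1 */
	}
	return 0;
}
```

With argv = { "kvm", "top" } the helper returns 1 and leaves argv pointing at
"top"; the caller would then set sample_kvm and dispatch the command as usual.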

> 
> [ And, as so often, once you walk that path, that grand scheme you are 
>   thinking about right now might easily become last year's really bad idea ;-) ]
> 
> So please start walking the path and experience the challenges first-hand.
With my patch, I collected dbench data on a Nehalem machine (2*4*2 logical CPUs).
1) Vanilla host kernel (6G memory):
------------------------------------------------------------------------------------------------------------------------
   PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)
------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                        DSO
             _______ _____ _______________________________ ________________________________________

            99376.00 40.5% ext3_test_allocatable           /lib/modules/2.6.33-kvmymz/build/vmlinux
            41239.00 16.8% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
             7019.00  2.9% __ticket_spin_lock              /lib/modules/2.6.33-kvmymz/build/vmlinux
             5350.00  2.2% copy_user_generic_string        /lib/modules/2.6.33-kvmymz/build/vmlinux
             5208.00  2.1% do_get_write_access             /lib/modules/2.6.33-kvmymz/build/vmlinux
             4484.00  1.8% journal_dirty_metadata          /lib/modules/2.6.33-kvmymz/build/vmlinux
             4078.00  1.7% ext3_free_blocks_sb             /lib/modules/2.6.33-kvmymz/build/vmlinux
             3856.00  1.6% ext3_new_blocks                 /lib/modules/2.6.33-kvmymz/build/vmlinux
             3485.00  1.4% journal_get_undo_access         /lib/modules/2.6.33-kvmymz/build/vmlinux
             2803.00  1.1% ext3_try_to_allocate            /lib/modules/2.6.33-kvmymz/build/vmlinux
             2241.00  0.9% __find_get_block                /lib/modules/2.6.33-kvmymz/build/vmlinux
             1957.00  0.8% find_revoke_record              /lib/modules/2.6.33-kvmymz/build/vmlinux

2) Guest OS (one guest started with 4GB memory):
------------------------------------------------------------------------------------------------------------------------
   PerfTop:     827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 CPUs)
------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                        DSO
             _______ _____ _______________________________ ________________________________________

            41701.00 28.1% __ticket_spin_lock              /lib/modules/2.6.33-kvmymz/build/vmlinux
            33843.00 22.8% ext3_test_allocatable           /lib/modules/2.6.33-kvmymz/build/vmlinux
            16862.00 11.4% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
             3278.00  2.2% native_flush_tlb_others         /lib/modules/2.6.33-kvmymz/build/vmlinux
             3200.00  2.2% copy_user_generic_string        /lib/modules/2.6.33-kvmymz/build/vmlinux
             3009.00  2.0% do_get_write_access             /lib/modules/2.6.33-kvmymz/build/vmlinux
             2834.00  1.9% journal_dirty_metadata          /lib/modules/2.6.33-kvmymz/build/vmlinux
             1965.00  1.3% journal_get_undo_access         /lib/modules/2.6.33-kvmymz/build/vmlinux
             1907.00  1.3% ext3_new_blocks                 /lib/modules/2.6.33-kvmymz/build/vmlinux
             1790.00  1.2% ext3_free_blocks_sb             /lib/modules/2.6.33-kvmymz/build/vmlinux
             1741.00  1.2% find_revoke_record              /lib/modules/2.6.33-kvmymz/build/vmlinux


With the vanilla host kernel, the perf top data is stable and the spinlock
doesn't take much CPU time.
With the guest OS, __ticket_spin_lock consumes 28% of CPU time, and it
sometimes fluctuates between 9% and 28%.

Another interesting finding is with aim7. If I start the aim7 tmpfs test in a
guest OS with 1GB memory, the login hangs and the CPU is busy. With the new
patch, I could check what happens in the guest OS: the spinlock is busy and
the kernel is mostly shrinking memory from slab.
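
To make the idea of the patch easier to follow, here is a rough userspace
sketch of the data flow: the vmexit path saves the guest RIP and privilege
level before the NMI is re-delivered on the host, the sample code reads them
back, and a zero ip means "no guest ip saved, drop the sample". It assumes one
global slot instead of the per-CPU variable, and guest_user_mode(),
save_virt_ip(), reset_virt_ip() and sample_guest_ip() are illustrative names,
not the kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the per-CPU struct virt_ip_info in the patch. */
struct virt_ip_info {
	int      user_mode;
	uint64_t ip;
};

static struct virt_ip_info virt_ip;	/* a single "CPU" in this sketch */

/* The guest runs in user mode when the RPL (low two bits of CS) is non-zero. */
static int guest_user_mode(uint16_t cs_selector)
{
	return !!(cs_selector & 3);
}

/* Would run on the vmexit path, before the NMI is re-injected on the host. */
static void save_virt_ip(uint16_t cs_selector, uint64_t rip)
{
	virt_ip.user_mode = guest_user_mode(cs_selector);
	virt_ip.ip = rip;
}

/* Would run right after the NMI handler returns. */
static void reset_virt_ip(void)
{
	virt_ip.user_mode = 0;
	virt_ip.ip = 0;
}

/*
 * Would run from the NMI handler when the event asked for guest samples.
 * Returns 0 and fills in *ip and *user_mode, or -1 when no guest ip is
 * saved, in which case the sample is dropped.
 */
static int sample_guest_ip(uint64_t *ip, int *user_mode)
{
	assert(ip && user_mode);

	if (!virt_ip.ip)
		return -1;
	*ip = virt_ip.ip;
	*user_mode = virt_ip.user_mode;
	return 0;
}
```

An NMI that arrives outside of the save/reset window sees ip == 0, so events
opened with the KVM sample flag only record samples taken in guest mode.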



--- linux-2.6.33/arch/x86/kernel/cpu/perf_event.c       2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/arch/x86/kernel/cpu/perf_event.c       2010-03-01 15:57:51.672990615 +0800
@@ -1621,6 +1621,7 @@ static void intel_pmu_drain_bts_buffer(s
        struct perf_event_header header;
        struct perf_sample_data data;
        struct pt_regs regs;
+       int ret;
 
        if (!event)
                return;
@@ -1647,7 +1648,9 @@ static void intel_pmu_drain_bts_buffer(s
         * We will overwrite the from and to address before we output
         * the sample.
         */
-       perf_prepare_sample(&header, &data, event, &regs);
+       ret = perf_prepare_sample(&header, &data, event, &regs);
+       if (ret)
+               return;
 
        if (perf_output_begin(&handle, event,
                              header.size * (top - at), 1, 1))
--- linux-2.6.33/arch/x86/kvm/vmx.c     2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/arch/x86/kvm/vmx.c     2010-03-02 10:21:57.588586179 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
 
@@ -3553,8 +3554,19 @@ static void vmx_complete_interrupts(stru
 
        /* We need to handle NMIs before interrupts are enabled */
        if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
-           (exit_intr_info & INTR_INFO_VALID_MASK))
+           (exit_intr_info & INTR_INFO_VALID_MASK)) {
+               u64 rip = vmcs_readl(GUEST_RIP);
+               int user_mode = vmcs_read16(GUEST_CS_SELECTOR);
+
+#ifdef CONFIG_X86_32
+               user_mode = (user_mode & SEGMENT_RPL_MASK) == USER_RPL;
+#else
+               user_mode = !!(user_mode & 3);
+#endif
+               perf_save_virt_ip(user_mode, rip);
                asm("int $2");
+               perf_reset_virt_ip();
+       }
 
        idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
 
--- linux-2.6.33/include/linux/perf_event.h     2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/include/linux/perf_event.h     2010-03-02 12:26:15.050947780 +0800
@@ -125,8 +125,9 @@ enum perf_event_sample_format {
        PERF_SAMPLE_PERIOD                      = 1U << 8,
        PERF_SAMPLE_STREAM_ID                   = 1U << 9,
        PERF_SAMPLE_RAW                         = 1U << 10,
+        PERF_SAMPLE_KVM                         = 1U << 11,
 
-       PERF_SAMPLE_MAX = 1U << 11,             /* non-ABI */
+        PERF_SAMPLE_MAX = 1U << 12,             /* non-ABI */
 };
 
 /*
@@ -798,7 +799,7 @@ extern void perf_output_sample(struct pe
                               struct perf_event_header *header,
                               struct perf_sample_data *data,
                               struct perf_event *event);
-extern void perf_prepare_sample(struct perf_event_header *header,
+extern int perf_prepare_sample(struct perf_event_header *header,
                                struct perf_sample_data *data,
                                struct perf_event *event,
                                struct pt_regs *regs);
@@ -858,7 +859,6 @@ extern void perf_bp_event(struct perf_ev
 #ifndef perf_misc_flags
 #define perf_misc_flags(regs)  (user_mode(regs) ? PERF_RECORD_MISC_USER : \
                                 PERF_RECORD_MISC_KERNEL)
-#define perf_instruction_pointer(regs) instruction_pointer(regs)
 #endif
 
 extern int perf_output_begin(struct perf_output_handle *handle,
@@ -905,6 +905,34 @@ static inline void perf_event_enable(str
 static inline void perf_event_disable(struct perf_event *event)                { }
 #endif
 
+//#if defined(CONFIG_PERF_EVENTS && CONFIG_PERF_HAS_VIRT_IP)
+#if defined(CONFIG_PERF_EVENTS)
+struct virt_ip_info {
+       int     user_mode;
+       u64     ip;
+};
+
+DECLARE_PER_CPU(struct virt_ip_info, perf_virt_ip);
+extern void perf_save_virt_ip(int user_mode, u64 ip);
+extern void perf_reset_virt_ip(void);
+extern int perf_get_virt_user_mode(void);
+static inline u64 perf_instruction_pointer(struct perf_event *event, struct pt_regs *regs)
+{
+       u64 ip;
+       if (event->attr.sample_type & PERF_SAMPLE_KVM)
+               ip = percpu_read(perf_virt_ip.ip);
+       else
+               ip = instruction_pointer(regs);
+       return ip;
+}
+#else
+static inline void perf_save_virt_ip(int user_mode, u64 ip)    { }
+static inline void perf_reset_virt_ip(void)    { }
+static inline int perf_get_virt_user_mode(void)        { return -1; }
+#define perf_instruction_pointer(event, regs)  instruction_pointer(regs)
+#endif
+
+
 #define perf_output_put(handle, x) \
        perf_output_copy((handle), &(x), sizeof(x))
 
--- linux-2.6.33/kernel/perf_event.c    2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/kernel/perf_event.c    2010-03-02 12:30:41.236003180 +0800
@@ -3077,7 +3077,38 @@ void perf_output_sample(struct perf_outp
        }
 }
 
-void perf_prepare_sample(struct perf_event_header *header,
+//#ifdef CONFIG_PERF_VIRT_IP
+DEFINE_PER_CPU(struct virt_ip_info, perf_virt_ip) = {0,0};
+EXPORT_PER_CPU_SYMBOL(perf_virt_ip);
+
+void perf_save_virt_ip(int user_mode, u64 ip)
+{
+       if (!atomic_read(&nr_events))
+               return;
+       percpu_write(perf_virt_ip.user_mode, user_mode);
+       percpu_write(perf_virt_ip.ip, ip);
+}
+EXPORT_SYMBOL_GPL(perf_save_virt_ip);
+
+void perf_reset_virt_ip(void)
+{
+       if (!percpu_read(perf_virt_ip.ip))
+               return;
+       percpu_write(perf_virt_ip.user_mode, 0);
+       percpu_write(perf_virt_ip.ip, 0);
+}
+EXPORT_SYMBOL_GPL(perf_reset_virt_ip);
+
+int perf_get_virt_user_mode(void)
+{
+       if (!percpu_read(perf_virt_ip.ip))
+               return -1;
+       return percpu_read(perf_virt_ip.user_mode);
+}
+
+//#endif
+
+int perf_prepare_sample(struct perf_event_header *header,
                         struct perf_sample_data *data,
                         struct perf_event *event,
                         struct pt_regs *regs)
@@ -3090,10 +3121,15 @@ void perf_prepare_sample(struct perf_eve
        header->size = sizeof(*header);
 
        header->misc = 0;
-       header->misc |= perf_misc_flags(regs);
+       if (event->attr.sample_type & PERF_SAMPLE_KVM)
+               header->misc |= percpu_read(perf_virt_ip.user_mode) ? PERF_RECORD_MISC_USER : PERF_RECORD_MISC_KERNEL;
+       else
+               header->misc |= perf_misc_flags(regs);
 
        if (sample_type & PERF_SAMPLE_IP) {
-               data->ip = perf_instruction_pointer(regs);
+               data->ip = perf_instruction_pointer(event, regs);
+               if (!data->ip)
+                       return -1;
 
                header->size += sizeof(data->ip);
        }
@@ -3162,6 +3198,8 @@ void perf_prepare_sample(struct perf_eve
                WARN_ON_ONCE(size & (sizeof(u64)-1));
                header->size += size;
        }
+
+       return 0;
 }
 
 static void perf_event_output(struct perf_event *event, int nmi,
@@ -3170,8 +3208,11 @@ static void perf_event_output(struct per
 {
        struct perf_output_handle handle;
        struct perf_event_header header;
+       int ret;
 
-       perf_prepare_sample(&header, data, event, regs);
+       ret = perf_prepare_sample(&header, data, event, regs);
+       if (ret)
+               return;
 
        if (perf_output_begin(&handle, event, header.size, nmi, 1))
                return;
--- linux-2.6.33/tools/perf/builtin-record.c    2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/builtin-record.c    2010-03-02 13:19:53.564376291 +0800
@@ -251,6 +251,8 @@ static void create_counter(int counter, 
                                  PERF_FORMAT_ID;
 
        attr->sample_type       |= PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+       if (sample_kvm)
+               attr->sample_type       |= PERF_SAMPLE_KVM;
 
        if (freq) {
                attr->sample_type       |= PERF_SAMPLE_PERIOD;
--- linux-2.6.33/tools/perf/builtin-top.c       2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/builtin-top.c       2010-03-01 16:35:41.972067501 +0800
@@ -1091,6 +1091,8 @@ static void start_counter(int i, int cou
        attr = attrs + counter;
 
        attr->sample_type       = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+       if (sample_kvm)
+               attr->sample_type       |= PERF_SAMPLE_KVM;
 
        if (freq) {
                attr->sample_type       |= PERF_SAMPLE_PERIOD;
--- linux-2.6.33/tools/perf/perf.c      2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/perf.c      2010-03-02 09:57:03.164001069 +0800
@@ -28,6 +28,8 @@ struct pager_config {
        int val;
 };
 
+int sample_kvm = 0;
+
 static char debugfs_mntpt[MAXPATHLEN];
 
 static int pager_command_config(const char *var, const char *value, void *data)
@@ -320,6 +322,13 @@ static void handle_internal_command(int 
                argv[0] = cmd = "help";
        }
 
+       if (argc > 1 && !strcmp(argv[0], "kvm")) {
+               sample_kvm = 1;
+               argv++;
+               argc--;
+               cmd = argv[0];
+       }
+
        for (i = 0; i < ARRAY_SIZE(commands); i++) {
                struct cmd_struct *p = commands+i;
                if (strcmp(p->cmd, cmd))
--- linux-2.6.33/tools/perf/perf.h      2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/perf.h      2010-03-01 16:12:42.470082418 +0800
@@ -131,4 +131,6 @@ struct ip_callchain {
        u64 ips[0];
 };
 
+extern int sample_kvm;
+
 #endif

