Hi - Good stuff overall; comments below.
Thanks,
Barret

On 2015-12-02 at 15:57 "'Davide Libenzi' via Akaros" <[email protected]> wrote:
> As part of this CL the /prof/kpctl interface has been changed back to
> split timer+profile.
> Documentation has been updated.
>
> https://github.com/brho/akaros/compare/master...dlibenzi:devarch_msr_perf
>
> The following changes since commit 2fa42319139e4cc5ca853546363f84443d0ead00:
>
>   Rename 'reallocarray' to 'kreallocarray'. (2015-11-25 18:02:04 -0500)
>
> are available in the git repository at:
>
>   [email protected]:dlibenzi/akaros devarch_msr_perf

> From 69be5b1c189c29e35317dfe37d11b5ae3bc605ae Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <[email protected]>
> Date: Wed, 2 Dec 2015 09:20:41 -0800
> Subject: Fixed the sys_proc_create API to use const pointers
>
> Fixed the sys_proc_create API to use const pointers.

Need (XCC) in the subject, since this requires a toolchain/glibc rebuild.

> --- a/tools/compilers/gcc-glibc/glibc-2.19-akaros/sysdeps/akaros/serialize.c
> +++ b/tools/compilers/gcc-glibc/glibc-2.19-akaros/sysdeps/akaros/serialize.c
> @@ -8,8 +8,8 @@
>  #include <stddef.h>
>  #include <stdint.h>
>
> -struct serialized_data* serialize_argv_envp(char* const* argv,
> -                                            char* const* envp)
> +struct serialized_data* serialize_argv_envp(const char * const *argv,
> +                                            const char * const *envp)

The pointer * should be moved to be adjacent to the function name. There are
a few other warnings like this in this patch and in some of the others.
Please run checkpatch to catch them.

> From 3e76d624cf8c4240c39c8db5020401ac14774f4b Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <[email protected]>
> Date: Mon, 16 Nov 2015 07:03:55 -0800
> Subject: Added libpfm4 library support
>
> Added libpfm4 library support. The libpfm4 library provides a database
> of counters available on different Intel platforms.

Where did this library come from? (URL, package name, etc.)
> diff --git a/Makefile b/Makefile
> index 8defe95de83e..7832ecbef7f3 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -573,7 +573,7 @@ endif #ifeq ($(mixed-targets),1)
>  # List all userspace directories here, and state any dependencies between them,
>  # such as how pthread depends on parlib.
>
> -user-dirs = parlib pthread benchutil iplib ndblib vmm
> +user-dirs = parlib pthread benchutil iplib ndblib vmm perfmon

Does perfmon have any dependencies, such as on parlib or pthread? If so,
we'll need entries like this:

	benchutil: parlib
	pthread: parlib benchutil
	iplib: parlib

> From 031a13e37081495c987b414b4b07902058fc5810 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <[email protected]>
> Date: Mon, 16 Nov 2015 14:53:23 -0800
> Subject: Added ability to declare local per CPU variables

> +#define DEFINE_PERCPU(type, var) \
> +	type PERCPU_VARNAME(var) __attribute__ ((section (PERCPU_SECTION_STR)))
> +#define DECLARE_PERCPU(type, var) \
> +	extern type PERCPU_VARNAME(var) \
> +		__attribute__ ((section (PERCPU_SECTION_STR)))

Mostly curious, does the declaration also need the attribute? I tried with
and without on a PERCPU that I dropped in init.c and it didn't seem to care
one way or the other.

> From 115665ce501cc4805cee552858c4cb35986bcbc5 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <[email protected]>
> Date: Tue, 24 Nov 2015 15:56:48 -0800
> Subject: Moved MSR read/write APIs out of devarch and into a dedicated file

> diff --git a/kern/arch/x86/msr.h b/kern/arch/x86/msr.h
> new file mode 100644
> index 000000000000..f7c5ccd094f4
> --- /dev/null
> +++ b/kern/arch/x86/msr.h
> @@ -0,0 +1,116 @@
> +/* Copyright (c) 2015 Google Inc
> + * Davide Libenzi <[email protected]>
> + * See LICENSE for details.
> + */
> +
> +#pragma once
> +
> +#include <sys/types.h>
> +#include <ros/errno.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <core_set.h>
> +
> +struct msr_address {
> +	uint32_t addr;
> +	const uint32_t *addresses;
> +	size_t num_addresses;
> +};

So a struct msr_address has both an overall address, as well as an address
per core (based on its usage later)? When does each core need a different
address for a particular MSR (e.g. MSR_IA32_PERFCTR0)? Or is this for
something else, like "I want to read different MSRs on different cores, but
read them all at the same time"? I'm trying to picture how this is used and
what it's used for. If it's cryptic, we can put something in this header
file's comments at the top.

> From fecb0cd575a71d8fa2fe5c19728337b3a6246801 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <[email protected]>
> Date: Fri, 27 Nov 2015 10:57:19 -0800
> Subject: Migrated devarch MSR access to new MSR API

> @@ -362,22 +363,28 @@ static long archread(struct chan *c, void *a, long n, int64_t offset)
> +		msr_set_values(&msrv, values, num_cores);
> +
> +		err = msr_cores_read(&cset, &msra, &msrv);
> +
> +		n = -1;
> +		if (likely(!err)) {
> +			if (n >= num_cores * sizeof(uint64_t)) {
> +				if (!memcpy_to_user_errno(current, a, values,
> +							  num_cores * sizeof(uint64_t)))
> +					n = num_cores * sizeof(uint64_t);
>			} else {

What is going on here? n == -1, then we see if it is large enough to cover
the whole range of our output (which is always positive)?
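For what it's worth, I'd expect the tail of that hunk to check the caller's
buffer size before clobbering n. Standalone sketch of the flow I have in
mind, with a stub standing in for memcpy_to_user_errno() (names made up):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stub standing in for memcpy_to_user_errno(); returns 0 on success. */
static int copy_to_user_stub(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

/* Sketch of the intended archread() tail: validate the caller's buffer
 * size *before* touching n, and only report bytes on full success. */
static long copy_out_values(void *a, long n, const uint64_t *values,
			    size_t num_cores)
{
	size_t need = num_cores * sizeof(uint64_t);

	if (n < 0 || (size_t) n < need)
		return -1;	/* caller's buffer too small */
	if (copy_to_user_stub(a, values, need))
		return -1;	/* copy-out failed */
	return need;		/* bytes written */
}
```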
> @@ -458,13 +467,19 @@ static long archwrite(struct chan *c, void *a, long n, int64_t offset)
>		if (!address_range_find(msr_wr_wlist, ARRAY_SIZE(msr_wr_wlist),
>					(uintptr_t) offset))
>			error(EPERM, NULL);
> -		core_set_init(&cset);
> -		core_set_fill_available(&cset);
>		if (n != sizeof(uint64_t))
>			error(EINVAL, NULL);
>		if (memcpy_from_user_errno(current, &value, a, sizeof(value)))
>			return -1;
> -		err = msr_cores_write(&cset, (uint32_t) offset, value);
> +
> +		core_set_init(&cset);
> +		core_set_fill_available(&cset);
> +		msr_init_address(&msra);
> +		msr_set_address(&msra, (uint32_t) offset);
> +		msr_init_value(&msrv);
> +		msr_set_value(&msrv, value);

It's a little odd to have to init_value() and then set_value(). I looked at
the functions, so I know that msr_init_value() zeros the msr_value struct,
while msr_set_value() sets the value field of the msr_value struct.

> From 820e28e014e6e413e6d43c1941d5d45bd91b096f Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <[email protected]>
> Date: Mon, 16 Nov 2015 07:13:13 -0800
> Subject: Added perfmon interrupt handling to allow overflow based profiling

> --- a/kern/arch/x86/perfmon.c
> +++ b/kern/arch/x86/perfmon.c
> +static void perfmon_do_cores_alloc(void *opaque)
> +{
> +		perfmon_enable_fix_event(i, TRUE);
> +
> +		write_msr(MSR_CORE_PERF_FIXED_CTR0 + i,
> +			  -(int64_t) pa->ev.trigger_count);
> +		write_msr(MSR_CORE_PERF_FIXED_CTR_CTRL, tmp);
> +	}

Should the enabling of the event happen before or after setting the event?
If it happens before (as is here), would we collect perf events from the old
setting of the fixed counter? Same goes for down below in the non-fixed
counter. perfmon_do_cores_free() looks okay (disables the event before
changing the fields).

> +	} else {
> +		for (i = 0; (i < (int) cpu_caps.counters_x_proc) &&
> +			     ((cctx->counters[i].u.v != 0) ||
> +			      !perfmon_event_available(i)); i++)
> +			;

That's pretty nasty. I get the impression we want to find the first free
counter (i = idx).
It'd be much clearer with the loop just going over i, and adding breaks or
continues or something. Is there ever a time that the counters[i] value is
0, but the perfmon_event is not available? I.e., is that a kernel bug, and
thus should be a panic?

> +static void perfmon_do_cores_free(void *opaque)
> +{
> +	struct perfmon_alloc *pa = (struct perfmon_alloc *) opaque;
> +	struct perfmon_cpu_context *cctx = PERCPU_VARPTR(counters_env);
> +	int err = 0, coreno = core_id();
> +	counter_t ccno = pa->cores_counters[coreno];
> +
> +	spin_lock_irqsave(&counters_lock);

What is the lock protecting? (Both here and in the other perfmon_do_
functions.) I imagine we need to disable interrupts, in case a PMI IRQ
fires, but do we need to serialize these accesses across the entire machine?

> +	if (perfmon_is_fixed_event(&pa->ev)) {
> +		uint64_t fxctrl_value = read_msr(MSR_CORE_PERF_FIXED_CTR_CTRL);
> +
> +		if (!(fxctrl_value & (FIXCNTR_MASK << ccno)) ||
> +		    (ccno >= cpu_caps.fix_counters_x_proc)) {
> +			err = -ENOENT;
> +		} else {
> +			perfmon_init_event(&cctx->fixed_counters[ccno]);
> +
> +			perfmon_enable_fix_event((int) ccno, FALSE);
> +
> +			write_msr(MSR_CORE_PERF_FIXED_CTR_CTRL,
> +				  fxctrl_value & ~(FIXCNTR_MASK << ccno));
> +			write_msr(MSR_CORE_PERF_FIXED_CTR0 + ccno, 0);

For the (fxctrl_value & (FIXCNTR_MASK << ccno)) checks, should "ccno" be
multiplied by FIXCNTR_NBITS? You're trying to select a 4-bit chunk of the
register, right? (BTW, I'm looking at the SDM V3 18.2, Fig 18-2 for the
layout of the register.)

> +void perfmon_init(void)
> +{
> +	int i;
>
>	/* Enable user level access to the performance counters */
>	lcr4(rcr4() | CR4_PCE);
> +
> +	/* This will be called from every core, no need to execute more than
> +	 * once.
> +	 */
> +	if (cpu_caps.perfmon_version == 0)
> +		perfmon_read_cpu_caps(&cpu_caps);

Technically this is racy. We might be okay, since they are all writing the
same values.
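Going back to the counter-allocation loop flagged above, here's the kind of
shape I had in mind: loop over i, skip busy or unavailable slots with
explicit continues, and return the first free index. Standalone sketch with
a stubbed availability check (helper names made up):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_COUNTERS 4

struct counter {
	uint64_t v;	/* nonzero means the slot is in use */
};

/* Stub for perfmon_event_available(); pretend counter 1 can't count
 * this event on the local core. */
static int event_available(int i)
{
	return i != 1;
}

/* Clearer free-counter search: each skip condition is its own statement,
 * so the reader sees "busy" and "unavailable" as separate cases. */
static int find_free_counter(const struct counter *ctrs, int num)
{
	for (int i = 0; i < num; i++) {
		if (ctrs[i].v != 0)
			continue;	/* slot already in use */
		if (!event_available(i))
			continue;	/* HW can't count this event here */
		return i;
	}
	return -1;	/* no free counter */
}
```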
> +	write_mmreg32(LAPIC_LVT_PERFMON, IdtLAPIC_PCINT);
> +}

This will break when we move to the x2APIC. We can deal with it later. =)

> +void perfmon_interrupt(struct hw_trapframe *hw_tf, void *data)
> +{
> +	int i;
> +	struct perfmon_cpu_context *cctx = PERCPU_VARPTR(counters_env);
> +	uint64_t gctrl, status;
> +
> +	profiler_add_hw_sample(hw_tf);

So it looks like we generate a trace whenever any of the counters overflow,
but we lose track of which one overflowed. I guess that's what userland perf
expects?

> +	spin_lock_irqsave(&counters_lock);
> +	/* We need to save the global control status, because we need to disable
> +	 * counters in order to be able to reset their values.
> +	 * We will restore the global control status on exit.
> +	 */
> +	gctrl = read_msr(MSR_CORE_PERF_GLOBAL_CTRL);
> +	write_msr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> +	status = read_msr(MSR_CORE_PERF_GLOBAL_STATUS);
> +	for (i = 0; i < (int) cpu_caps.counters_x_proc; i++) {
> +		if (status & ((uint64_t) 1 << i)) {
> +			if (cctx->counters[i].u.v)
> +				write_msr(MSR_IA32_PERFCTR0 + i,
> +					  -(int64_t) cctx->counters[i].trigger_count);
> +		}
> +	}
> +	for (i = 0; i < (int) cpu_caps.fix_counters_x_proc; i++) {
> +		if (status & ((uint64_t) 1 << (32 + i))) {
> +			if (cctx->fixed_counters[i].u.v)
> +				write_msr(MSR_CORE_PERF_FIXED_CTR0 + i,
> +					  -(int64_t) cctx->fixed_counters[i].trigger_count);
> +		}
> +	}
> +	write_msr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, status);
> +	write_msr(MSR_CORE_PERF_GLOBAL_CTRL, gctrl);
> +	spin_unlock_irqsave(&counters_lock);
> +
> +	write_mmreg32(LAPIC_LVT_PERFMON, IdtLAPIC_PCINT);

^This probably deserves a comment. Unlike the other LAPIC IRQs, the perf IRQ
gets masked in the LAPIC on interrupt.

> +static int perfmon_alloc_get(struct perfmon_session *ps, int ped, bool reset,
> +			     struct perfmon_alloc **ppa)
> +{
> +	struct perfmon_alloc *pa;
> +
> +	if (unlikely((ped < 0) || (ped >= ARRAY_SIZE(ps->allocs))))
> +		return -EBADFD;

What is ped, an FD?
If it's just an index that can't be < 0, you could make it an unsigned type.
I see the user of this (devarch) was casting it to an int anyways.

> +	spin_lock(&ps->lock);
> +	pa = ps->allocs[ped];
> +	if (likely(pa)) {
> +		if (reset)
> +			ps->allocs[ped] = NULL;
> +		else
> +			kref_get(&pa->ref, 1);
> +	}
> +	spin_unlock(&ps->lock);
> +	if (unlikely(!pa))
> +		return -ENOENT;
> +	*ppa = pa;
> +
> +	return 0;
> +}
> +
> +int perfmon_close_event(struct perfmon_session *ps, int ped)
> +{
> +	int err;
> +	struct perfmon_alloc *pa;
> +
> +	err = perfmon_alloc_get(ps, ped, TRUE, &pa);
> +	if (likely(!err))
> +		kref_put(&pa->ref);
> +
> +	return err;
> +}
> +
> +struct perfmon_status *perfmon_get_event_status(struct perfmon_session *ps,
> +						int ped)
> +{
> +	int err;
> +	struct core_set cset;
> +	struct perfmon_status_env env;
> +
> +	err = perfmon_alloc_get(ps, ped, FALSE, &env.pa);
> +	if (unlikely(err))
> +		return ERR_PTR(err);

If you can avoid it, please don't use ERR_PTR. It makes it a lot harder to
tell how to use a function by looking at its signature. For instance, if I
didn't look into the C code, I'd think the following is fine:

	foo = perfmon_get_event_status(ps, ped);
	if (!foo)
		handle_the_error;

In general, we opt to set errno if possible (which I think is the case here)
instead of using ERR_PTR. ERR_PTR and IS_ERR are pretty much only for Linux
compatibility (and ancient memcpy code from 7 years ago). Throwing with
error() is another option. If you do have to use it here, which could be the
case if we're called from IRQ context, then please comment the function
accordingly.

> +	env.pef = perfmon_alloc_status();
> +	perfmon_setup_alloc_core_set(env.pa, &cset);
> +
> +	smp_do_in_cores(&cset, perfmon_do_cores_status, &env);

Are there any weird cases where multiple threads are using the same pa?
(env.pa here). Is 'pa' basically read-only once we install_session_alloc in
perfmon_open_event? If not, how do we protect from having multiple users?
If that's true and it's the invariant that keeps us from clobbering pa's
state, then please put that in the comments somewhere. Otherwise someone in
the future will modify this and enter a nightmare.

> +struct perfmon_event {
> +	union {
> +		struct {
> +			uint64_t event:8;
> +			uint64_t mask:8;
> +			uint64_t usr:1;
> +			uint64_t os:1;
> +			uint64_t edge:1;
> +			uint64_t pc:1;
> +			uint64_t inten:1;
> +			uint64_t anyth:1;
> +			uint64_t en:1;
> +			uint64_t invcmsk:1;
> +			uint64_t cmask:8;
> +			uint64_t res:32;
> +		} b;
> +		uint64_t v;
> +	} u;
> +	uint64_t flags;
> +	uint64_t trigger_count;
> +};

If you can avoid the bitfields in the future, that'd be nice. We try to
avoid using them where possible, in favor of masks and shifts. If this is
hard to change here, then don't bother.

> From 907da2682383ca581a58352456c30c4dab7cd762 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <[email protected]>
> Date: Sat, 28 Nov 2015 18:35:57 -0800
> Subject: Created the new devarch perf file using the perfmon infrastructure
>
> --- a/kern/arch/x86/devarch.c
> +++ b/kern/arch/x86/devarch.c
> +static void arch_free_perf_context(struct perf_context *pc)
> +{
> +	if (likely(pc)) {

From the usage of this function, it looks like a bug if there is !pc.
> +		perfmon_close_session(pc->ps);
> +		kfree(pc->resp);
> +		kfree(pc);
> +	}
> +}

> +static long arch_perf_write(struct perf_context *pc, const void *udata,
> +			    long usize)
> +{
> +	ERRSTACK(1);
> +	void *kdata;
> +	const uint8_t *kptr, *ktop;
> +
> +	kfree(pc->resp);
> +	pc->resp = NULL;
> +	pc->resp_size = 0;
> +
> +	kdata = user_memdup_errno(current, udata, usize);
> +	if (unlikely(!kdata))
> +		return -1;
> +	if (waserror()) {
> +		kfree(kdata);
> +		nexterror();
> +	}
> +	kptr = kdata;
> +	ktop = kptr + usize;
> +	error_assert(EBADMSG, (kptr + 1) <= ktop);
> +	switch (*kptr) {

Minor thing, but if you read *kptr and incremented it, like you do later
with the get_le_ helpers, you wouldn't need to remember to +1 it the first
time you use it in every case {}. And if you read *kptr into something like
uint8_t cmd, you can use 'cmd' later in an error message.

> +	case PERFMON_CMD_COUNTER_STATUS: {
> +		uint32_t ped;
> +		struct perfmon_status *pef;
> +
> +		error_assert(EBADMSG, (kptr + 1 + sizeof(uint32_t)) <= ktop);
> +		kptr = get_le_u32(kptr + 1, &ped);
> +
> +		pef = perfmon_get_event_status(pc->ps, (int) ped);
> +		if (!IS_ERR(pef)) {
> +			int i;
> +			uint8_t *rptr;
> +			uint64_t *mvalues = kzmalloc(num_cores * sizeof(mvalues),
> +						     KMALLOC_WAIT);
> +
> +			for (i = 0; i < num_cores; i++)
> +				mvalues[get_os_coreid(i)] = pef->cores_values[i];

This usage of get_os_coreid is probably wrong. The range of i is the list of
os_coreids already, not hw_coreids. Probably just want mvalues[i].

> +			pc->resp_size = 3 * sizeof(uint64_t) + sizeof(uint16_t) +
> +					num_cores * sizeof(uint64_t);
> +			pc->resp = kmalloc(pc->resp_size, KMALLOC_WAIT);
> +
> +			rptr = put_le_u64(pc->resp, pef->ev.u.v);
> +			rptr = put_le_u64(rptr, pef->ev.flags);
> +			rptr = put_le_u64(rptr, pef->ev.trigger_count);
> +			rptr = put_le_u16(rptr, num_cores);

Not that we actually support more than 2^16 cores yet, but num_cores is an
int.
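To show what I meant about consuming the command byte up front: with a
get_u8-style helper, each case body starts at its own payload and 'cmd' is
around for error messages. Standalone sketch; the helpers below are toy
versions for illustration, not the kernel's actual get_le_ implementations:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy readers in the style of the get_le_ helpers. */
static const uint8_t *get_u8(const uint8_t *p, uint8_t *v)
{
	*v = *p;
	return p + 1;
}

static const uint8_t *get_le_u32(const uint8_t *p, uint32_t *v)
{
	*v = (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
	     ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
	return p + 4;
}

/* Parse "cmd byte, then u32 payload": the command byte is consumed once,
 * so no case needs to remember the +1 offset. Returns 0 on success. */
static int parse_cmd(const uint8_t *kptr, const uint8_t *ktop,
		     uint8_t *cmd, uint32_t *ped)
{
	if (kptr + 1 > ktop)
		return -1;	/* short buffer: no command byte */
	kptr = get_u8(kptr, cmd);
	if (kptr + sizeof(uint32_t) > ktop)
		return -1;	/* short buffer: truncated payload */
	kptr = get_le_u32(kptr, ped);
	return 0;
}
```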
> +			for (i = 0; i < num_cores; i++)
> +				rptr = put_le_u64(rptr, mvalues[i]);
> +			kfree(mvalues);
> +			perfmon_free_event_status(pef);
> +		} else {
> +			error(-PTR_ERR(pef), NULL);

Can we just have perfmon_get_event_status() throw? Same goes for
perfmon_open_event() and all of these functions.

> +	}
> +	default:
> +		error(EINVAL, NULL);

We should take advantage of error() and have a message, like:

	error(EINVAL, "Bad perfmon command 0x%x", cmd);

> From 14650311e81d001bd6cc08d57424f4bcc24d2336 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <[email protected]>
> Date: Mon, 16 Nov 2015 07:05:47 -0800
> Subject: Added new perf utility to access CPU counters from userspace
>
> Added new perf utility to access CPU counters from userspace.

Like the other tools such as snc and kprof2perf, this can't be built by
cd-ing into the directory and running make. I'd like to fix that before it
gets out of hand, though maybe the cure is worse than the symptoms for now.

> diff --git a/tools/profile/perf/perf.c b/tools/profile/perf/perf.c
> new file mode 100644
> index 000000000000..0b5ef5d950b1
> --- /dev/null
> +++ b/tools/profile/perf/perf.c
> +static struct perf_context_config perf_cfg = {
> +	.perf_file = "#arch/perf",
> +	.kpctl_file = "/prof/kpctl",

It's safer to use #kprof instead of /prof. We happen to have it bound in our
current ifconfig, but there's no guarantee that'll always be the case. In
general, if you know the device by #name, use it directly.
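On the earlier bitfield comment: the mask/shift style we usually prefer
would look roughly like this for a PERFEVTSEL-style event word (field
offsets from the SDM's architectural perfmon layout; macro names made up):

```c
#include <assert.h>
#include <stdint.h>

/* IA32_PERFEVTSELx-style fields via masks and shifts instead of
 * bitfields: event select [7:0], umask [15:8], USR 16, OS 17, EN 22,
 * cmask [31:24]. */
#define PMEV_EVENT_SHIFT	0
#define PMEV_EVENT_MASK		0xffull
#define PMEV_UMASK_SHIFT	8
#define PMEV_UMASK_MASK		0xffull
#define PMEV_USR_BIT		(1ull << 16)
#define PMEV_OS_BIT		(1ull << 17)
#define PMEV_EN_BIT		(1ull << 22)
#define PMEV_CMASK_SHIFT	24
#define PMEV_CMASK_MASK		0xffull

/* Clear the field, then OR in the masked value. */
#define PMEV_SET_EVENT(v, x) \
	((v) = ((v) & ~(PMEV_EVENT_MASK << PMEV_EVENT_SHIFT)) | \
	       (((uint64_t) (x) & PMEV_EVENT_MASK) << PMEV_EVENT_SHIFT))
#define PMEV_GET_EVENT(v) (((v) >> PMEV_EVENT_SHIFT) & PMEV_EVENT_MASK)
```

The upside over bitfields is that the on-wire/MSR layout is explicit and
doesn't depend on the compiler's bitfield ordering.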
> +static void run_process_and_wait(int argc, const char * const *argv,
> +				 const struct core_set *cores)
> +{
> +	int pid, status;
> +	size_t max_cores = ros_total_cores();
> +	struct core_set pvcores;
> +
> +	pid = sys_proc_create(argv[0], strlen(argv[0]), argv, NULL, 0);
> +	if (pid < 0) {
> +		perror(argv[0]);
> +		exit(1);
> +	}
> +
> +	ros_get_low_latency_core_set(&pvcores);
> +	ros_not_core_set(&pvcores);
> +	ros_and_core_sets(&pvcores, cores);
> +	for (size_t i = 0; i < max_cores; i++) {
> +		if (ros_get_bit(&pvcores, i)) {
> +			if (sys_provision(pid, RES_CORES, i)) {
> +				fprintf(stderr,
> +					"Unable to provision CPU %lu to PID %d: cmd='%s'\n",
> +					i, pid, argv[0]);
> +				exit(1);
> +			}
> +		}
> +	}
> +
> +	sys_proc_run(pid);

As an FYI, the process will only run on the provisioned cores once it
becomes an MCP. If you are running an app that is an SCP, then it will never
run on those cores. If it will be an MCP, you won't capture the events from
when it was on core 0 either.

-- 
You received this message because you are subscribed to the Google Groups
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.
