On Fri, Dec 11, 2015 at 9:05 AM, Barret Rhoden <[email protected]> wrote:

> Need (XCC) in the subject, since this requires a toolchain/glibc
> rebuild.
>

Sorry, forgot about that.


> -struct serialized_data* serialize_argv_envp(char* const* argv,
> > -                                            char* const* envp)
> > +struct serialized_data* serialize_argv_envp(const char * const *argv,
> > +                                            const char * const *envp)
>                          ->
> The pointer * should be moved to be adjacent to the function name.
>
> There are a few other warnings like this in this patch and in some of
> the others.  Please run checkpatch to catch them.
>

I left it how it was. But I do have checkpatch in my pre-commit hook, and I
wonder why it did not trigger.
This is the pre-commit I have (given to me by Kevin):

https://gist.github.com/dlibenzi/6fbcd64c2e81eb430226

WARNING: Do not use whitespace before Signed-off-by:

Hmm, this is git's doing.
I fixed the others.



> > Added libpfm4 library support. The libpfm4 library provides a database
> > of counters available on different Intel platforms.
>
> Where did this library come from?  (URL, package name, etc).
>

Apparently from a guy sitting next to me 😀
I did not explicitly add the links, as there is little room for confusion:

https://www.google.com/search?q=libpfm4



> --- a/Makefile
> > +++ b/Makefile
> > @@ -573,7 +573,7 @@ endif #ifeq ($(mixed-targets),1)
> >  # List all userspace directories here, and state any dependencies
> between them,
> >  # such as how pthread depends on parlib.
> >
> > -user-dirs = parlib pthread benchutil iplib ndblib vmm
> > +user-dirs = parlib pthread benchutil iplib ndblib vmm perfmon
>
> Does perfmon have any dependencies, such as on parlib or pthead?  If so,
> we'll need entries like this:
>
>    benchutil: parlib
>    pthread: parlib benchutil
>    iplib: parlib
>

The library has no dependencies.
The code was taken verbatim (minus some Linux perf-specific files) and
added to the perfmon folder with its own Makefile.



> From 031a13e37081495c987b414b4b07902058fc5810 Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Mon, 16 Nov 2015 14:53:23 -0800
> > Subject: Added ability to declare local per CPU variables
>
> > +#define DEFINE_PERCPU(type, var)
>      \
> > +     type PERCPU_VARNAME(var) __attribute__ ((section
> (PERCPU_SECTION_STR)))
> > +#define DECLARE_PERCPU(type, var)
>                       \
> > +     extern type PERCPU_VARNAME(var)
>                      \
> > +             __attribute__ ((section (PERCPU_SECTION_STR)))
>
> Mostly curious, does the declaration also need the attribute?  I tried
> with and without on a PERCPU that I dropped in init.c and it didn't seem
> to care one way or the other.
>

Apparently it doesn't.
I would slightly prefer to leave it there, in case GCC might decide to care
later on 😀




> > +struct msr_address {
> > +     uint32_t addr;
> > +     const uint32_t *addresses;
> > +     size_t num_addresses;
> > +};
>
> So a struct msr_address has both an overall address, as well as an address
> per
> core (based on its usage later)?  When does each core need a different
> address
> for a particular MSR (e.g. MSR_IA32_PERFCTR0)?
>
> Or is this for something else, like "I want to read different MSRs on
> different
> cores, but read them all at the same time."  I'm trying to picture how
> this is
> used and what it's used for.  If it's cryptic, we can put something in this
> header file's comments at the top.
>

Yes. You can choose to use the same address for every CPU or, if you need
to, a different address for each CPU.
Similarly for values: you can use one value for all CPUs, or a dedicated one
for each.
This allows one-shot programming instead of multiple rounds.
And yes, it is because programming a single logical (from a user POV) counter
might involve issuing MSR writes at different addresses.
Added a comment.
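Roughly, the selection logic looks like this (the struct is the one quoted above; msr_get_core_address() is just an illustrative name for the sketch, not the real helper):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: an msr_address carries either a single address shared by every
 * core (addr), or a per-core table (addresses/num_addresses).  The accessor
 * below is hypothetical, shown only to illustrate the intended usage. */
struct msr_address {
	uint32_t addr;
	const uint32_t *addresses;
	size_t num_addresses;
};

static uint32_t msr_get_core_address(const struct msr_address *msra,
				     size_t coreno)
{
	/* Prefer the per-core table when one was supplied. */
	if (msra->addresses && coreno < msra->num_addresses)
		return msra->addresses[coreno];
	return msra->addr;
}
```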



> @@ -362,22 +363,28 @@ static long archread(struct chan *c, void *a, long
> n, int64_t offset)
>
> > +                     msr_set_values(&msrv, values, num_cores);
> > +
> > +                     err = msr_cores_read(&cset, &msra, &msrv);
> > +
> > +                     n = -1;
> > +                     if (likely(!err)) {
> > +                             if (n >= num_cores * sizeof(uint64_t)) {
> > +                                     if (!memcpy_to_user_errno(current,
> a, values,
> > +
>                num_cores * sizeof(uint64_t)))
> > +                                             n = num_cores *
> sizeof(uint64_t);
> >                               } else {
>
> What is going on here?  n == -1, then we see if it is great enough to
> cover the
> whole range of our output (which is always positive)?
>

A bit cryptic, I admit 😎
Changed.




> @@ -458,13 +467,19 @@ static long archwrite(struct chan *c, void *a, long
> n, int64_t offset)
> >                       if (!address_range_find(msr_wr_wlist,
> ARRAY_SIZE(msr_wr_wlist),
> >
>  (uintptr_t) offset))
> >                               error(EPERM, NULL);
> > -                     core_set_init(&cset);
> > -                     core_set_fill_available(&cset);
> >                       if (n != sizeof(uint64_t))
> >                               error(EINVAL, NULL);
> >                       if (memcpy_from_user_errno(current, &value, a,
> sizeof(value)))
> >                               return -1;
> > -                     err = msr_cores_write(&cset, (uint32_t) offset,
> value);
> > +
> > +                     core_set_init(&cset);
> > +                     core_set_fill_available(&cset);
> > +                     msr_init_address(&msra);
> > +                     msr_set_address(&msra, (uint32_t) offset);
> > +                     msr_init_value(&msrv);
> > +                     msr_set_value(&msrv, value);
>
> It's a little odd to have to init_value() and then set_value().  I looked
> at the
> functions, so I know that msr_init_value() zero's the msr_value struct,
> while
> msr_set_value() sets the value field of the msr_value struct.
>

OK, dropping the init, and leaving the set, which will do init+set.
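So the merged helper ends up shaped roughly like this (the field names are my assumption, mirroring the msr_address struct; this is just a sketch of the change):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: msr_set_value() now zeroes the struct itself, so callers drop the
 * separate msr_init_value() call.  Field names here are assumptions. */
struct msr_value {
	uint64_t value;
	const uint64_t *values;
	size_t num_values;
};

static void msr_set_value(struct msr_value *msrv, uint64_t value)
{
	memset(msrv, 0, sizeof(*msrv));	/* the old msr_init_value() */
	msrv->value = value;
}
```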



> --- a/kern/arch/x86/perfmon.c
> > +++ b/kern/arch/x86/perfmon.c
>
> > +static void perfmon_do_cores_alloc(void *opaque)
> > +{
>
> > +                     perfmon_enable_fix_event(i, TRUE);
> > +
> > +                     write_msr(MSR_CORE_PERF_FIXED_CTR0 + i,
> > +                                       -(int64_t) pa->ev.trigger_count);
> > +                     write_msr(MSR_CORE_PERF_FIXED_CTR_CTRL, tmp);
> > +             }
>
> Should the enabling of the event happen before or after setting the
> event?  If
> it happens before (as is here) would we collect perf events from the old
> setting
> of the fixed counter?  Same goes for down below in the non-fixed counter.
> perfmon_do_cores_free() looks okay (disables the event before changing the
> fields).
>

It does not change much either way, as the real programming happens when we
write the CTRL register.




>
> > +     } else {
> > +             for (i = 0; (i < (int) cpu_caps.counters_x_proc) &&
> > +                              ((cctx->counters[i].u.v != 0) ||
> > +                               !perfmon_event_available(i)); i++)
> > +                     ;
>
> That's pretty nasty.  I get the impression we want to find the first free
> counter (i = idx).  It'd be much clearer with the loop just going over i,
> and
> adding breaks or continues or something.
>

OK, moving some break conditions within the loop.
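The rewritten scan has roughly this shape (counter_config and perfmon_event_available() below are stand-ins for the kernel state in the quoted code, just to show the structure, not real Akaros symbols):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the per-cpu counter state (hypothetical values). */
#define NUM_COUNTERS 4

static const uint64_t counter_config[NUM_COUNTERS] = { 7, 0, 0, 3 };

static bool perfmon_event_available(int i)
{
	return i != 1;	/* pretend counter 1 is reserved elsewhere */
}

static int find_free_counter(void)
{
	/* Same search as the original empty-body for loop, with the exit
	 * conditions moved into the loop body as the review suggested. */
	for (int i = 0; i < NUM_COUNTERS; i++) {
		if (counter_config[i] != 0)
			continue;	/* already programmed */
		if (perfmon_event_available(i))
			return i;	/* first free, available counter */
	}
	return -1;
}
```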




>
> Is there ever a time that the counters[i] value is 0, but the
> perfmon_event is
> not available?  I.e., is that a kernel bug, and thus should be a panic?
>

Not sure we should panic; added a warn_once() instead.
Which, I noticed, is broken when placed inside an if/else (fixed in a new
commit in the stream).
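The usual cure for a multi-statement macro breaking inside an if/else is the do { } while (0) wrapper; a generic sketch of the pattern (not the actual Akaros warn_once()):

```c
/* A warn-once macro written as a bare if-statement misparses as the body of
 * an if/else; wrapping it in do { } while (0) makes it behave like a single
 * statement.  Generic sketch, counting warnings instead of printing. */
static int warn_count;

#define WARN_ONCE()					\
	do {						\
		static int __warned;			\
		if (!__warned) {			\
			__warned = 1;			\
			warn_count++;			\
		}					\
	} while (0)

static void maybe_warn(int cond)
{
	/* The if/else usage that broke the bare-if version of the macro. */
	if (cond)
		WARN_ONCE();
	else
		warn_count += 100;	/* some other path */
}
```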




> +static void perfmon_do_cores_free(void *opaque)
> > +{
> > +     struct perfmon_alloc *pa = (struct perfmon_alloc *) opaque;
> > +     struct perfmon_cpu_context *cctx = PERCPU_VARPTR(counters_env);
> > +     int err = 0, coreno = core_id();
> > +     counter_t ccno = pa->cores_counters[coreno];
> > +
> > +     spin_lock_irqsave(&counters_lock);
>
> What is the lock protecting?  (Both here and in the other perfmon_do_
> functions).  I imagine we need to disable interrupts, in case a PMI IRQ
> fires,
> but do we need to serialize these accesses across the entire machine?
>

They could be per-CPU locks, but given that this is not an extremely hot
path, I initially chose not to do that.
Turned them into per-CPU locks now.




>
> > +     if (perfmon_is_fixed_event(&pa->ev)) {
> > +             uint64_t fxctrl_value =
> read_msr(MSR_CORE_PERF_FIXED_CTR_CTRL);
> > +
> > +             if (!(fxctrl_value & (FIXCNTR_MASK << ccno)) ||
> > +                     (ccno >= cpu_caps.fix_counters_x_proc)) {
> > +                     err = -ENOENT;
> > +             } else {
> > +                     perfmon_init_event(&cctx->fixed_counters[ccno]);
> > +
> > +                     perfmon_enable_fix_event((int) ccno, FALSE);
> > +
> > +                     write_msr(MSR_CORE_PERF_FIXED_CTR_CTRL,
> > +                                       fxctrl_value & ~(FIXCNTR_MASK <<
> ccno));
> > +                     write_msr(MSR_CORE_PERF_FIXED_CTR0 + ccno, 0);
>
> For the (fxctrl_value & (FIXCNTR_MASK << ccno)) checks, should "ccno" be
> multipled by FIXCNTR_NBITS?  You're trying to select a 4-bit chunk of the
> register, right?  (btw, I'm looking at the SDM V3 18.2, Fig 18-2 for the
> layout
> of the register).
>

Duh! Yes.
I was sure I had fixed that, but I think I did the fix only in
perfmon_get_fixevent_mask().
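For the record, the corrected selection scales the shift by the field width (constant names follow the quoted code; the helper functions are illustrative sketches, per SDM Vol. 3 Fig. 18-2's 4-bit-per-counter layout):

```c
#include <stdint.h>

/* MSR_CORE_PERF_FIXED_CTR_CTRL packs a 4-bit control field per fixed
 * counter, so selecting counter ccno means shifting by ccno * FIXCNTR_NBITS,
 * not by ccno alone. */
#define FIXCNTR_NBITS 4
#define FIXCNTR_MASK ((1ULL << FIXCNTR_NBITS) - 1)

static uint64_t fixcntr_field(uint64_t fxctrl_value, int ccno)
{
	return (fxctrl_value >> (ccno * FIXCNTR_NBITS)) & FIXCNTR_MASK;
}

static uint64_t fixcntr_clear(uint64_t fxctrl_value, int ccno)
{
	return fxctrl_value & ~(FIXCNTR_MASK << (ccno * FIXCNTR_NBITS));
}
```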



> +void perfmon_init(void)
> > +{
> > +     int i;
> >
> >       /* Enable user level access to the performance counters */
> >       lcr4(rcr4() | CR4_PCE);
> > +
> > +     /* This will be called from every core, no need to execute more
> than once.
> > +      */
> > +     if (cpu_caps.perfmon_version == 0)
> > +             perfmon_read_cpu_caps(&cpu_caps);
>
> Technically this is racy.  We might be okay, since they are all writing
> the same
> values.
>

Yes, in theory, but as you noted, they all end up writing the same value.




>
> > +     write_mmreg32(LAPIC_LVT_PERFMON, IdtLAPIC_PCINT);
> > +}
>
> This will break when we move to the x2APIC.  We can deal with it later.  =)
>

Yeah, I can't program for things which do not exist yet 😎



> +void perfmon_interrupt(struct hw_trapframe *hw_tf, void *data)
> > +{
> > +     int i;
> > +     struct perfmon_cpu_context *cctx = PERCPU_VARPTR(counters_env);
> > +     uint64_t gctrl, status;
> > +
> > +     profiler_add_hw_sample(hw_tf);
>
> So it looks like we generate a trace whenever any of the counters
> overflow, but
> we lose track of which one overflowed.  I guess that's what userland perf
> expects?
>

I need to add an "ID", and figure out how to pass that to userland perf.
Will do in follow up CL.




>
> > +     spin_lock_irqsave(&counters_lock);
> > +     /* We need to save the global control status, because we need to
> disable
> > +      * counters in order to be able to reset their values.
> > +      * We will restore the global control status on exit.
> > +      */
> > +     gctrl = read_msr(MSR_CORE_PERF_GLOBAL_CTRL);
> > +     write_msr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> > +     status = read_msr(MSR_CORE_PERF_GLOBAL_STATUS);
> > +     for (i = 0; i < (int) cpu_caps.counters_x_proc; i++) {
> > +             if (status & ((uint64_t) 1 << i)) {
> > +                     if (cctx->counters[i].u.v)
> > +                             write_msr(MSR_IA32_PERFCTR0 + i,
> > +                                               -(int64_t)
> cctx->counters[i].trigger_count);
> > +             }
> > +     }
> > +     for (i = 0; i < (int) cpu_caps.fix_counters_x_proc; i++) {
> > +             if (status & ((uint64_t) 1 << (32 + i))) {
> > +                     if (cctx->fixed_counters[i].u.v)
> > +                             write_msr(MSR_CORE_PERF_FIXED_CTR0 + i,
> > +                                               -(int64_t)
> cctx->fixed_counters[i].trigger_count);
> > +             }
> > +     }
> > +     write_msr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, status);
> > +     write_msr(MSR_CORE_PERF_GLOBAL_CTRL, gctrl);
> > +     spin_unlock_irqsave(&counters_lock);
> > +
> > +     write_mmreg32(LAPIC_LVT_PERFMON, IdtLAPIC_PCINT);
>
> ^This probably deserves a comment.  Unlike the other LAPIC IRQs, the perf
> IRQ
> gets masked in the LAPIC on interrupt.
>

Added.






>
>
> > +static int perfmon_alloc_get(struct perfmon_session *ps, int ped, bool
> reset,
> > +                                                      struct
> perfmon_alloc **ppa)
> > +{
> > +     struct perfmon_alloc *pa;
> > +
> > +     if (unlikely((ped < 0) || (ped >= ARRAY_SIZE(ps->allocs))))
> > +             return -EBADFD;
>
> What is ped, an FD?  If it's just an index that can't be < 0, you could
> make it
> an unsigned type.  I see the user of this (devarch) was casting it to an
> int
> anyways.
>

Like an FD: a signed value is needed to signal errors in APIs which return
it as a value.


> +struct perfmon_status *perfmon_get_event_status(struct perfmon_session
> *ps,
> > +
>                      int ped)
> > +{
> > +     int err;
> > +     struct core_set cset;
> > +     struct perfmon_status_env env;
> > +
> > +     err = perfmon_alloc_get(ps, ped, FALSE, &env.pa);
> > +     if (unlikely(err))
> > +             return ERR_PTR(err);
>
> If you can avoid it, please don't use ERR_PTR.  It makes it a lot harder
> to tell
> how to use a function by looking at its signature.  For instance, if I
> didn't
> look into the C code, I'd think the following is fine:
>
>         foo = perfmon_get_event_status(ps, ped);
>         if (!foo)
>                 handle_the_error.
>
> In general, we opt to set errno if possible (which i think is the case
> here)
> instead of using ERR_PTR.  ERR_PTR and IS_ERR is pretty much only for linux
> compatibility (and ancient memcpy code from 7 years ago).  Throwing with
> error()
> is another option.
>
> If you do have to use it here, which could be the case if we're called
> from IRQ
> context, then please comment the function accordingly.
>

Yes, setting errno or throwing makes the APIs user-context only.
Even the "does it throw or not?" question is not exactly clear by simply
looking at the API signatures.
Should we have a pointer tag to signal those?

#define __errptr

struct foo __errptr *my_foo();





>
> > +     env.pef = perfmon_alloc_status();
> > +     perfmon_setup_alloc_core_set(env.pa, &cset);
> > +
> > +     smp_do_in_cores(&cset, perfmon_do_cores_status, &env);
>
> Are there any weird cases where multiple threads are using the same pa?  (
> env.pa
> here).
>
> Is 'pa' basically read-only once we install_session_alloc in
> perfmon_open_event?
> If not, how do we protect from having multiple users?
>
> If that's true and it's the invariant that keeps us from clobbering pa's
> state,
> then please put that in the comments somewhere.  Otherwise someone in the
> future
> will modify this and enter a nightmare.
>

It does not matter: perfmon allocs (pa) are immutable, and
perfmon_alloc_get() takes a reference on them.





>
> > +struct perfmon_event {
> > +     union {
> > +             struct {
> > +                     uint64_t event:8;
> > +                     uint64_t mask:8;
> > +                     uint64_t usr:1;
> > +                     uint64_t os:1;
> > +                     uint64_t edge:1;
> > +                     uint64_t pc:1;
> > +                     uint64_t inten:1;
> > +                     uint64_t anyth:1;
> > +                     uint64_t en:1;
> > +                     uint64_t invcmsk:1;
> > +                     uint64_t cmask:8;
> > +                     uint64_t res:32;
> > +             } b;
> > +             uint64_t v;
> > +     } u;
> > +     uint64_t flags;
> > +     uint64_t trigger_count;
> > +};
>
> If you can avoid the bitfields in the future, that'd be nice.  We try to
> avoid
> using them where possible, in favor of masks and shifts.  If this is hard
> to
> change here, then don't bother.
>

It makes the code much cleaner IMO.
You'd have to create macros (or open-code them) for getting/setting every
one of the fields, which is pretty tedious when you have that many.
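For contrast, here is what just two of those eleven fields would cost in mask/shift form (hypothetical macros, none of this is in the patch):

```c
#include <stdint.h>

/* Hypothetical mask/shift accessors for just the 'event' and 'mask' fields
 * of the perfmon_event layout quoted above, to illustrate what replacing the
 * bitfields would require, times eleven fields. */
#define PMEV_EVENT_SHIFT 0
#define PMEV_EVENT_MASK  0xffULL
#define PMEV_MASK_SHIFT  8
#define PMEV_MASK_MASK   0xffULL

#define PMEV_GET(v, name) \
	(((v) >> PMEV_##name##_SHIFT) & PMEV_##name##_MASK)
#define PMEV_SET(v, name, x)						\
	(((v) & ~(PMEV_##name##_MASK << PMEV_##name##_SHIFT)) |		\
	 (((uint64_t)(x) & PMEV_##name##_MASK) << PMEV_##name##_SHIFT))
```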




>
>
>
> > From 907da2682383ca581a58352456c30c4dab7cd762 Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sat, 28 Nov 2015 18:35:57 -0800
> > Subject: Created the new devarch perf file using the perfmon
> infrastructure
> >
>
> > --- a/kern/arch/x86/devarch.c
> > +++ b/kern/arch/x86/devarch.c
>
> > +static void arch_free_perf_context(struct perf_context *pc)
> > +{
> > +     if (likely(pc)) {
>
> From the usage of this function, it looks like a bug if there is !pc.
>

Typically, all the "pointer free" APIs should accept NULL inputs.




> +static long arch_perf_write(struct perf_context *pc, const void *udata,
> > +                                                     long usize)
> > +{
> > +     ERRSTACK(1);
> > +     void *kdata;
> > +     const uint8_t *kptr, *ktop;
> > +
> > +     kfree(pc->resp);
> > +     pc->resp = NULL;
> > +     pc->resp_size = 0;
> > +
> > +     kdata = user_memdup_errno(current, udata, usize);
> > +     if (unlikely(!kdata))
> > +             return -1;
> > +     if (waserror()) {
> > +             kfree(kdata);
> > +             nexterror();
> > +     }
> > +     kptr = kdata;
> > +     ktop = kptr + usize;
> > +     error_assert(EBADMSG, (kptr + 1) <= ktop);
> > +     switch (*kptr) {
>
> Minor thing, but if you read *kptr and incremented it, like you do later
> with
> the get_le_ helpers, you wouldn't need to remember to +1 it the first time
> you
> use it in every case {}.
>

I would have loved that, had you (and Kevin) not corrected me previously for
doing *ptr++ things 😀
Changed to sane C code now.
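The cursor style, for reference (get_le_u32() mimics the shape of the helper named in the quoted code; parse_cmd() and its return convention are illustrative):

```c
#include <stdint.h>

/* Every read returns the advanced pointer, so no switch case has to
 * remember a "+1" for the command byte. */
static const uint8_t *get_le_u32(const uint8_t *ptr, uint32_t *value)
{
	*value = (uint32_t) ptr[0] | ((uint32_t) ptr[1] << 8) |
		((uint32_t) ptr[2] << 16) | ((uint32_t) ptr[3] << 24);
	return ptr + 4;
}

static int parse_cmd(const uint8_t *kptr, const uint8_t *ktop, uint32_t *arg)
{
	uint8_t cmd;

	if (kptr + 1 + sizeof(uint32_t) > ktop)
		return -1;	/* stand-in for error_assert(EBADMSG, ...) */
	cmd = *kptr++;		/* read and advance in one step */
	kptr = get_le_u32(kptr, arg);
	return cmd;
}
```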



> +             case PERFMON_CMD_COUNTER_STATUS: {
> > +                     uint32_t ped;
> > +                     struct perfmon_status *pef;
> > +
> > +                     error_assert(EBADMSG, (kptr + 1 +
> sizeof(uint32_t)) <= ktop);
> > +                     kptr = get_le_u32(kptr + 1, &ped);
> > +
> > +                     pef = perfmon_get_event_status(pc->ps, (int) ped);
> > +                     if (!IS_ERR(pef)) {
> > +                             int i;
> > +                             uint8_t *rptr;
> > +                             uint64_t *mvalues = kzmalloc(num_cores *
> sizeof(mvalues),
> > +
>               KMALLOC_WAIT);
> > +
> > +                             for (i = 0; i < num_cores; i++)
> > +                                     mvalues[get_os_coreid(i)] =
> pef->cores_values[i];
>
> This usage of get_os_coreid is probably wrong.  The range of i is the list
> of
> os_coreids already, not hw_coreids.  Probably just want mvalues[i].
>

Yes, that is a leftover from a patch which was trying to remap logical to HW
core IDs.
But then I realized that send_ipi() does the conversion.
Dropped the get_os_coreid() now.





>
> > +                             pc->resp_size = 3 * sizeof(uint64_t) +
> sizeof(uint16_t) +
> > +                                     num_cores * sizeof(uint64_t);
> > +                             pc->resp = kmalloc(pc->resp_size,
> KMALLOC_WAIT);
> > +
> > +                             rptr = put_le_u64(pc->resp, pef->ev.u.v);
> > +                             rptr = put_le_u64(rptr, pef->ev.flags);
> > +                             rptr = put_le_u64(rptr,
> pef->ev.trigger_count);
> > +                             rptr = put_le_u16(rptr, num_cores);
>
> Not that we actually support more than 2^16 cores yet, but num_cores is an
> int.
>

Yeah, I figured that by the time we hit 65K cores, Intel will have changed
the perf API 10 times, so this code will need revising anyway 😎
In any case, turned it into a u32.




>
> > +                             for (i = 0; i < num_cores; i++)
> > +                                     rptr = put_le_u64(rptr,
> mvalues[i]);
> > +                             kfree(mvalues);
> > +                             perfmon_free_event_status(pef);
> > +                     } else {
> > +                             error(-PTR_ERR(pef), NULL);
>
>
> Can we just have perfmon_get_event_status() throw?  Same goes for
> perfmon_open_event() and all of these functions.
>

OK, done.




>
>
> > +             }
> > +             default:
> > +                     error(EINVAL, NULL);
>
> We should take advantage of error() and have a message, like
>         error(EINVAL, "Bad perfmon command 0x%x", cmd);
>

Done.



> Added new perf utility to access CPU counters from userspace.
>
> Like the other tools such as snc and kprof2perf, this can't be built by
> cd-ing
> into the directory and running make.  I'd like to fix that before it gets
> out of
> hand, though maybe the cure is worse than the symptoms for now.
>

I have asked about the revised build system a few times, but got no answers.





>
>
> > diff --git a/tools/profile/perf/perf.c b/tools/profile/perf/perf.c
> > new file mode 100644
> > index 000000000000..0b5ef5d950b1
> > --- /dev/null
> > +++ b/tools/profile/perf/perf.c
>
> > +static struct perf_context_config perf_cfg = {
> > +     .perf_file = "#arch/perf",
> > +     .kpctl_file = "/prof/kpctl",
>
> It's safer to use #kprof instead of /prof.  We happen to have it bound in
> our
> current ifconfig, but there's no guarantee that'll always be the case.  In
> general, if you know the device by #name, use it directly.
>

OK, done.




>
>
> > +static void run_process_and_wait(int argc, const char * const *argv,
> > +                                                              const
> struct core_set *cores)
> > +{
> > +     int pid, status;
> > +     size_t max_cores = ros_total_cores();
> > +     struct core_set pvcores;
> > +
> > +     pid = sys_proc_create(argv[0], strlen(argv[0]), argv, NULL, 0);
> > +     if (pid < 0) {
> > +             perror(argv[0]);
> > +             exit(1);
> > +     }
> > +
> > +     ros_get_low_latency_core_set(&pvcores);
> > +     ros_not_core_set(&pvcores);
> > +     ros_and_core_sets(&pvcores, cores);
> > +     for (size_t i = 0; i < max_cores; i++) {
> > +             if (ros_get_bit(&pvcores, i)) {
> > +                     if (sys_provision(pid, RES_CORES, i)) {
> > +                             fprintf(stderr,
> > +                                             "Unable to provision CPU
> %lu to PID %d: cmd='%s'\n",
> > +                                             i, pid, argv[0]);
> > +                             exit(1);
> > +                     }
> > +             }
> > +     }
> > +
> > +     sys_proc_run(pid);
>
> As an FYI, the process will only run on the provisioned cores once it
> becomes an
> MCP.  If you are running an app that is an SCP, then it will never run on
> those
> cores.  If it will be an MCP, you won't capture the events from when it
> was on
> core 0 either.
>

So, are there better options?

Changes have been posted in the same branch.
