Branch pushed.

On Thu, Nov 12, 2015 at 2:00 PM, Davide Libenzi <[email protected]> wrote:

> On Thu, Nov 12, 2015 at 12:02 PM, Barret Rhoden <[email protected]>
> wrote:
>
>> > diff --git a/kern/include/profiler.h b/kern/include/profiler.h
>> > new file mode 100644
>> > index 000000000000..2ec9d4308632
>> > --- /dev/null
>> > +++ b/kern/include/profiler.h
>>
>> New, non-trivial files need a copyright header.
>>
>
> It already has one, if you look at the tip of the branch.
>
>
>
>>
>> > @@ -0,0 +1,18 @@
>> > +
>> > +#ifndef ROS_KERN_INC_PROFILER_H
>> > +#define ROS_KERN_INC_PROFILER_H
>>
>> Can also do the pragma once now.
>>
>
> Already has.
>
>
>
>>
>> > diff --git a/kern/src/profiler.c b/kern/src/profiler.c
>> > new file mode 100644
>> > index 000000000000..ca7197bbc20b
>> > --- /dev/null
>> > +++ b/kern/src/profiler.c
>>
>> Needs a copyright.  It looks like this might be new code, but if there's
>> old stuff from oprofile in it, we need to put that.  I see there are a
>> couple things like "op_entry" still around.
>>
>
> Already has.
>
>
>
>
>>
>> > +static inline int profiler_add_sample(struct profiler_cpu_context *cpu_buf,
>> > +                                      uintptr_t pc, unsigned long event)
>>
>> Do you want this to return a bool (TRUE for success) or an int (0 for
>> success)?   See below:
>>
>
> This function is gone on tip.
> Since this was v2, and v1 was never reviewed, you likely commented on the
> v1 version.
>
>
>
>> > From 0a919619324782e873ddf5bbf9bd19e989f25162 Mon Sep 17 00:00:00 2001
>> > From: Davide Libenzi <[email protected]>
>> > Date: Tue, 20 Oct 2015 16:34:26 -0700
>> > Subject: Do not race when multiple init happen at the same time
>>
>> > @@ -65,53 +64,63 @@ static inline size_t
>> profiler_cpu_buffer_add_data(struct op_entry *entry,
>> >  static void free_cpu_buffers(void)
>> >  {
>> >       kfree(profiler_percpu_ctx);
>> > +     profiler_percpu_ctx = NULL;
>> > +
>> > +     qclose(profiler_queue);
>> > +     profiler_queue = NULL;
>> >  }
>>
>> Isn't this stuff possibly used from IRQ context?  What's the plan for
>> cleaning up such that we are sure that there are no interrupts going off
>> while you are freeing things?
>>
>
> That function is only called when there are no more users.
> Timers are stopped before calling profiler_cleanup().
> I am not sure how synchronous the removal of a timer is, though.
>
>
>
>>
>> In:
>> >  static int alloc_cpu_buffers(void)
>> >  {
>> > -     if (!profiler_queue) {
>> > -             profiler_queue = qopen(profiler_queue_limit, 0, NULL, NULL);
>> > -             if (!profiler_queue)
>> > -                     goto fail;
>> > -     }
>> > +     int i;
>> > +
>> > +     profiler_queue = qopen(profiler_queue_limit, 0, NULL, NULL);
>> > +     if (!profiler_queue)
>> > +             return -ENOMEM;
>> >
>> > -     /* we *really* don't want to block. Losing data is better. */
>> >       qdropoverflow(profiler_queue, 1);
>>
>> Then later:
>>
>> > +     qnonblock(profiler_queue, 1);
>>
>> This is fragile.  qnonblock means "error with EAGAIN if we're full".
>> Right now, you'll get by, since you still have qdropoverflow above.  But
>> this could break with an error() thrown from IRQ context somewhere down
>> the line.
>>
>
> Yes, they are paired. I am not sure what kind of bad condition could
> happen.
>
>
>
>>
>> >  void profiler_cleanup(void)
>> >  {
>> > -     free_cpu_buffers();
>> > +     sem_down(&mtx);
>> > +     profiler_users--;
>> > +     if (profiler_users == 0)
>> > +             free_cpu_buffers();
>> > +     sem_up(&mtx);
>> >  }
>>
>> This might be a candidate for a kref, depending on how cleanup ends up
>> working.  No need to change it now though.
>>
>
> krefs don't make later users block while the 0->1 transitioner is still
> initializing, no?
>
>
>
>> > From fd096815dc200a550ea5e6b6d7f133df75e29ed9 Mon Sep 17 00:00:00 2001
>> > From: Davide Libenzi <[email protected]>
>> > Date: Thu, 22 Oct 2015 14:30:26 -0700
>> > Subject: Added full binary path into the proc structure
>> >
>>
>> > @@ -791,6 +763,7 @@ static int sys_exec(struct proc *p, char *path,
>> size_t path_l,
>> >       t_path = copy_in_path(p, path, path_l);
>> >       if (!t_path)
>> >               return -1;
>> > +     proc_replace_binary_path(p, t_path);
>>
>> If we error out after this point, we'll have changed the path of the
>> binary to
>> the new one even though the exec failed.  I'm okay with that, just more
>> of an
>> FYI.
>>
>
> I am fixing this.
>
>
>
>
>>
>> > From 47bf34b1bc96b78a8091726cb3016b1f0d964847 Mon Sep 17 00:00:00 2001
>> > From: Davide Libenzi <[email protected]>
>> > Date: Thu, 22 Oct 2015 10:56:37 -0700
>> > Subject: Added APIs to access process startup information
>> >
>> > +char *get_startup_argv(struct proc *p, size_t idx, char *argp,
>> > +                                        size_t max_size)
>> > +{
>> > +     size_t stack_space = (const char *) USTACKTOP - (const char *) p->args_base;
>> > +     const char *sptr = (const char *) p->args_base + sizeof(size_t) +
>> > +             idx * sizeof(char *);
>> > +     const char *argv = NULL;
>> > +
>> > +     /* TODO,DL: Use copy_from_user() when available.
>> > +      */
>>
>> You've got these now, though memcpy_from_user will do it. (or will soon).
>> Either way, you don't need the TODO.
>>
>
> The strcpy-style copy is still needed. Currently it copies up to max_size,
> which is kind of a hack.
> But I will not be fixing it here (in this CR, that is).
>
>
>
>
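The difference under discussion can be sketched in plain user-space C (hypothetical helper names; the kernel's memcpy_from_user plumbing is omitted): the quoted code always copies max_size bytes and then forces a NUL, whereas a strncpy_from_user()-style copy stops at the string's terminator.

```c
#include <stddef.h>
#include <string.h>

/* The "copy max_size bytes, then force a NUL" hack from the quoted code:
 * it reads past the string's terminator, so the source must have at least
 * max_size readable bytes. */
static char *copy_arg_hack(char *dst, const char *src, size_t max_size)
{
	memcpy(dst, src, max_size);
	dst[max_size - 1] = 0;
	return dst;
}

/* A strncpy_from_user()-style bounded copy: stops at the NUL and never
 * reads past the end of the source string. */
static char *copy_arg_bounded(char *dst, const char *src, size_t max_size)
{
	size_t i;

	for (i = 0; i + 1 < max_size && src[i]; i++)
		dst[i] = src[i];
	dst[i] = 0;
	return dst;
}
```

Both truncate to max_size - 1 characters; the difference is only in how much of the source they touch, which matters when the source is an unmapped or hostile user buffer.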
>>
>> > +     if (memcpy_from_user(p, &argv, sptr, sizeof(char *)))
>> > +             return NULL;
>> > +
>> > +     /* TODO,DL: Use strncpy_from_user() when available.
>> > +      */
>> > +     max_size = MIN(max_size, stack_space);
>> > +     if (memcpy_from_user(p, argp, argv, max_size))
>> > +             return NULL;
>> > +     argp[max_size - 1] = 0;
>> > +
>> > +     return argp;
>> > +}
>>
>> How important is it that the command line data hasn't changed?  We're
>> trusting
>> the user to not have changed it since they were invoked.
>
>
>> Also, I guess we're going to want to parse the argv stuff later.  Kevin
>> put
>> together some code to pack and parse the args earlier.  It might be
>> easier and
>> to keep that blob around and parse it instead of this.  Not a huge deal
>> though.
>>
>
> We could keep the kernel buffer copy, but I'd prefer to eventually do that
> in a separate CL.
>
>
>
>
>> > From 1aede03303debfe3075aa7055c6f1dd843f806bc Mon Sep 17 00:00:00 2001
>> > From: Davide Libenzi <[email protected]>
>> > Date: Tue, 27 Oct 2015 15:29:10 -0700
>> > Subject: Added API to enumerate the VM regions of a process
>>
>> > --- a/kern/include/mm.h
>> > +++ b/kern/include/mm.h
>> > @@ -51,6 +51,9 @@ void isolate_vmrs(struct proc *p, uintptr_t va,
>> size_t len);
>> >  void unmap_and_destroy_vmrs(struct proc *p);
>> >  int duplicate_vmrs(struct proc *p, struct proc *new_p);
>> >  void print_vmrs(struct proc *p);
>> > +void enumerate_vrms(struct proc *p,
>>        ^enumerate_vmrs
>>
>
> Fixed. Took me a little while to understand what was wrong ☺
>
>
>
>
>> > diff --git a/kern/src/mm.c b/kern/src/mm.c
>> > index ca91e63620b6..06a349840404 100644
>> > --- a/kern/src/mm.c
>> > +++ b/kern/src/mm.c
>> > @@ -359,6 +359,18 @@ void print_vmrs(struct proc *p)
>> >                      vmr->vm_file, vmr->vm_foff);
>> >  }
>> >
>> > +void enumerate_vrms(struct proc *p,
>>        ^enumerate_vmrs
>>
>
> Ditto.
>
>
>
>>
>> > From 2df52e6bc39bf90a1c90e5382298fa2c2a8316f5 Mon Sep 17 00:00:00 2001
>> > From: Davide Libenzi <[email protected]>
>> > Date: Thu, 5 Nov 2015 17:30:24 -0800
>> > Subject: Added API to append data into a circular buffer
>> >
>> > Added API to append data into a circular buffer. The buffer data is
>> > allocated once at init time, and no more allocations happen after that.
>>
>> > diff --git a/kern/include/circular_buffer.h
>> b/kern/include/circular_buffer.h
>> > new file mode 100644
>> > index 000000000000..4ab8c42e2673
>> > --- /dev/null
>> > +++ b/kern/include/circular_buffer.h
>>
>> Can you put in a little info about how to use this and what the rules
>> are?  For
>> instance, what happens to writers when the buffer is full (overwrite,
>> block, or
>> drop?) or to readers when the buffer is empty? (looks like you just get
>> 0).  If
>> a reader uses an offset, does that just drop a chunk from the buffer?
>> What is
>> the cbuf_size_t for?
>>
>
> Locking is up to the reader.
> In circular-buffer fashion, a write on a full buffer discards the oldest
> block and writes the new one.
> No, reading data does not remove it. The only way the read ptr moves
> forward is when the write ptr pushes it forward in order to fit new data.
> Every block in the buffer is prepended with its size; cbuf_size_t is the
> type of that size.
>
>
>
>
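The drop-oldest behavior described above can be modeled with a toy user-space ring (fixed-size slots here, rather than the kernel's size-prefixed variable records; all names are illustrative): a write on a full buffer pushes the read pointer past the oldest entry, and reads never consume data.

```c
#include <stddef.h>

#define RING_SLOTS 4

/* Toy model: a ring of fixed-size slots with drop-oldest-on-full
 * semantics, mirroring the circular_buffer write/read contract. */
struct ring {
	int slots[RING_SLOTS];
	size_t rd, wr, count;
};

static void ring_write(struct ring *r, int v)
{
	if (r->count == RING_SLOTS) {
		/* Full: advance the read pointer, dropping the oldest entry. */
		r->rd = (r->rd + 1) % RING_SLOTS;
		r->count--;
	}
	r->slots[r->wr] = v;
	r->wr = (r->wr + 1) % RING_SLOTS;
	r->count++;
}

/* Read the idx-th oldest entry without consuming it; 0 on success. */
static int ring_read(const struct ring *r, size_t idx, int *v)
{
	if (idx >= r->count)
		return -1;
	*v = r->slots[(r->rd + idx) % RING_SLOTS];
	return 0;
}
```

Writing six values into a four-slot ring leaves the last four readable, with the first two silently dropped, which is the "write pushes the read ptr forward" behavior.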
>>
>> > --- /dev/null
>> > +++ b/kern/lib/circular_buffer.c
>>
>> > +void circular_buffer_clear(struct circular_buffer *cb)
>> > +{
>> > +     if (cb->base) {
>> > +             if (cb->mem)
>> > +                     kfree(cb->mem);
>>
>> It's a little surprising that clear also frees the allocation, instead of
>> just
>> resetting everything back to 0.
>>
>
> This clears all the resources of the buffer.
> I used "clear" as there is an "init".
> I would have used "free" if there was an "alloc".
>
>
>
>
>>
>> > +             cb->rdptr = cb->wrptr = cb->base = cb->mem = NULL;
>> > +             cb->size = cb->allocated = 0;
>> > +     }
>> > +}
>> > +
>> > +static bool circular_buffer_is_overlap(const struct circular_buffer *cb,
>> > +                                       const char *rptr, const char *wptr,
>> > +                                       size_t size)
>> > +{
>> > +     return (cb->size > 0) && (rptr >= wptr) && (rptr < (wptr + size));
>> > +}
>> > +
>> > +static void circular_buffer_write_skip(struct circular_buffer *cb, char *wrptr,
>> > +                                       size_t size)
>>
>> What are these helpers doing?
>>
>
> The overlap check tests whether a write operation is going to overwrite
> the block at which rdptr resides.
> The skip moves the rdptr forward until it no longer overlaps with the
> incoming write.
>
>
>
>
>>
>> > +size_t circular_buffer_read(struct circular_buffer *cb, char *data, size_t size,
>> > +                                                     size_t off)
>> > +{
>> > +     size_t asize = cb->size, rsize = 0;
>> > +     const char *rdptr = cb->rdptr;
>> > +
>> > +     while (asize > 0 && size > 0) {
>> > +             size_t bsize = *(const cbuf_size_t *) rdptr;
>> > +
>> > +             if (likely(bsize)) {
>> > +                     size_t esize = bsize - sizeof(cbuf_size_t);
>> > +
>> > +                     if (off >= esize) {
>> > +                             off -= esize;
>> > +                     } else {
>> > +                             size_t csize = MIN(esize - off, size);
>> > +
>> > +                     memcpy(data, rdptr + sizeof(cbuf_size_t) + off, csize);
>>
>> So every block of data has the size first, then the data, all packed
>> together?
>>
>
> Yup.
>
>
>
>
>>
>> > From c3c2bb0701f747e5c5c3f3d30cdf033137ae26e2 Mon Sep 17 00:00:00 2001
>> > From: Davide Libenzi <[email protected]>
>> > Date: Wed, 21 Oct 2015 16:39:04 -0700
>> > Subject: Implemented the new profiler
>> >
>> > Implemented the new profiler format and added simple userspace
>> > stack trace (waiting for copy_from_user()).
>>
>> This is a huge commit that would have been easier to understand if it was
>> broken
>> up a bit.  For instance, I see there's a user backtrace list, syscall
>> tracing
>> isn't using a qio queue anymore, the old Plan 9 profiler is gone
>> (great!), and a
>> bunch of other things.  It also looks like you squeezed in the change
>> that the
>> start command also turns on the timers for all cores, which isn't
>> something I
>> wanted.
>>
>
> Yes, it is a big change 😑
> Breaking it down would be a PITA at this point, since it has accumulated
> changes over the last 4 weeks while sitting there.
> As for the timers, for Ron's case, you can still turn them on apart from
> the profiler.
> From a profiler POV, I think the default behavior should be to have them on
> everywhere, otherwise you lose the full picture of what the system is doing.
> We could add a selective per-CPU timer and profiler turn-on, but IMO the
> default should be all on.
>
>
>
>
>>
>> > --- a/kern/arch/x86/kdebug.c
>> > +++ b/kern/arch/x86/kdebug.c
>> > @@ -365,6 +365,25 @@ size_t backtrace_list(uintptr_t pc, uintptr_t fp,
>> uintptr_t *pcs,
>> >       return nr_pcs;
>> >  }
>> >
>> > +size_t user_backtrace_list(uintptr_t pc, uintptr_t fp, uintptr_t *pcs,
>> > +                                                size_t nr_slots)
>> > +{
>> > +     size_t nr_pcs = 0;
>> > +
>> > +     for (;;) {
>> > +             if (unlikely(fp >= UMAPTOP) || unlikely(fp < BRK_END) ||
>>
>> Why not is_user_raddr()?  And once you use copy_from, those checks aren't
>> needed
>> either.  Also, I think it is possible for frame pointers to be below
>> BRK_END,
>> which could happen if a user stack was malloc'd instead of mmapped.
>>
>> > +                     unlikely(nr_pcs >= nr_slots))
>> > +                     break;
>> > +
>> > +             /* For now straight memory access, waiting for copy_from_user(). */
>>
>> copy_from is now available.  =)
>>
>
> I will do that in a separate CL.
>
>
>
> > diff --git a/kern/drivers/dev/kprof.c b/kern/drivers/dev/kprof.c
>> > index da11b7e44d32..76e53d7f9d30 100644
>>
>> > +static int oprof_timer_period = 1000;
>>              ^ rename prof_timer_period
>>
>
> Done.
>
>
>
>
>>
>> > +static void kprof_start_profiler(void)
>> > +{
>> > +     ERRSTACK(2);
>> >
>> > +     sem_down(&kprof.lock);
>>
>> If you're using a semaphore as a binary mutex (lock), please use a
>> qlock.  This
>> happens a bunch in this patch set.
>>
>
> Done.
>
>
>
>>
>> > +     if (waserror()) {
>> > +             sem_up(&kprof.lock);
>> > +             nexterror();
>> > +     }
>> > +     if (!kprof.profiling) {
>> > +             profiler_init();
>> > +             if (waserror()) {
>> > +                     profiler_cleanup();
>> > +                     nexterror();
>> > +             }
>> > +
>> > +             profiler_control_trace(1);
>> > +
>> > +             for (int i = 0; i < num_cores; i++)
>> > +                     kprof_enable_timer(i, 1);
>>
>> This looks like starting the profiler turns on the timer.  I wanted to
>> keep the
>> timer separate from the collection, such that the timer is one of
>> potentially
>> many sources of data, as discussed in another email.
>>
>
> For future uses, yes. But today, with no timers, there is no data.
> kptrace is a different data path, which is always on.
> When we add counter-overflow-based interrupt triggers, in the code I am
> working on now, we will need to rework this file.
>
>
>
>
>
>>
>> > +static void kprof_usage_fail(void)
>> > +{
>> > +     static const char *ctlstring = "clear|start|stop|timer";
>> > +     const char* const * cmds = profiler_configure_cmds();
>> > +     size_t i, size, csize;
>> > +     char msgbuf[128];
>> > +
>> > +     strncpy(msgbuf, ctlstring, sizeof(msgbuf));
>> > +     size = MIN(strlen(ctlstring), sizeof(msgbuf) - 1);
>> > +     for (i = 0; cmds[i]; i++) {
>> > +             csize = strlen(cmds[i]);
>> > +             if ((csize + size + 2) > sizeof(msgbuf))
>> > +                     break;
>> > +             msgbuf[size] = '|';
>> > +             memcpy(msgbuf + size + 1, cmds[i], csize);
>> > +             size += csize + 1;
>> > +     }
>> > +     msgbuf[size] = 0;
>> > +
>> > +     error(EFAIL, msgbuf);
>> > +}
>>
>> This is cool, and the pattern tends to come up a lot, where we'd like to
>> generate usage information.  Perhaps there's a Plan 9 helper for this
>> already?
>> (I don't know off the top of my head, maybe the Plan 9 guys do).  We
>> already
>> have stuff that parses command strings.  If we don't have a helper, we
>> can make
>> one, so that every device doesn't have to redo this logic.
>>
>> Another part to it is that a device has several layers of arguments and
>> commands
>> - it'd be neat if that was clearly written down somewhere and then we
>> could
>> generate these switches and error messages more easily.  This isn't
>> something
>> that needs to be done now, but I like your kprof_usage_fail and it
>> brought the
>> topic up.  =)
>>
>
> OK ☺
>
>
>
>
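For reference, the joining logic under discussion could be pulled into a small, reusable helper along these lines (a user-space sketch with illustrative names, not an existing Plan 9 API):

```c
#include <stddef.h>
#include <string.h>

/* Join a fixed control string and a NULL-terminated command list with '|',
 * truncating so the result (plus NUL) always fits in bufsz bytes. */
static void build_usage(char *msgbuf, size_t bufsz, const char *ctlstring,
			const char *const *cmds)
{
	size_t size = strlen(ctlstring);

	if (size > bufsz - 1)
		size = bufsz - 1;
	memcpy(msgbuf, ctlstring, size);
	for (size_t i = 0; cmds[i]; i++) {
		size_t csize = strlen(cmds[i]);

		/* Need room for '|', the command, and the trailing NUL. */
		if (csize + size + 2 > bufsz)
			break;
		msgbuf[size] = '|';
		memcpy(msgbuf + size + 1, cmds[i], csize);
		size += csize + 1;
	}
	msgbuf[size] = 0;
}
```

A device would then just pass its static command table, and the truncation rule stays in one place instead of being redone per device.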
>>
>> > +static struct trace_printk_buffer *kprof_get_printk_buffer(void)
>> > +{
>> > +     static struct trace_printk_buffer boot_tpb;
>> > +     static struct trace_printk_buffer *cpu_tpbs;
>> > +
>> > +     if (unlikely(booting))
>> > +             return &boot_tpb;
>> > +     if (unlikely(!cpu_tpbs)) {
>> > +             /* Poor man per-CPU data structure. I really do no like
>> littering global
>> > +              * data structures with module specific data.
>> > +              */
>> > +             spin_lock_irqsave(&ktrace_lock);
>> > +             if (!cpu_tpbs)
>> > +                     cpu_tpbs = kzmalloc(num_cores * sizeof(struct trace_printk_buffer),
>> > +                                                             0);
>>
>> If this alloc fails, (flags == 0), then we'll return 0 (+coreid) and PF
>> later.
>>
>
> We could panic, or return the boot tpb.
> Today an OOM crashes the box, so I am not sure what to do here.
>
>
>
> > --- a/kern/include/kdebug.h
>> > +++ b/kern/include/kdebug.h
>> > @@ -46,7 +48,7 @@ int printdump(char *buf, int buflen, uint8_t *data);
>> >  extern bool printx_on;
>> >  void set_printx(int mode);
>> >  #define printx(args...) if (printx_on) printk(args)
>> > -#define trace_printx(args...) if (printx_on) trace_printk(args)
>> > +#define trace_printx(args...) if (printx_on) trace_printk(TRUE, args)
>>                                                             ^FALSE?
>> We just want to print text, not a backtrace.  That was also my intent for
>> trace_printk (it's just the print, not a backtrace).
>>
>
> Seems pretty useful to have the option to get a BT, to understand which
> path you are coming from.
> A "grep" of the string simply tells you where the print is.
>
>
>
>>
>> > --- a/kern/src/mm.c
>> > +++ b/kern/src/mm.c
>> > @@ -24,6 +24,7 @@
>> >  #include <kmalloc.h>
>> >  #include <vfs.h>
>> >  #include <smp.h>
>> > +#include <profiler.h>
>> >
>> >  struct kmem_cache *vmr_kcache;
>> >
>> > @@ -692,6 +693,9 @@ void *do_mmap(struct proc *p, uintptr_t addr,
>> size_t len, int prot, int flags,
>> >               }
>> >       }
>> >       spin_unlock(&p->vmr_lock);
>> > +
>> > +     profiler_notify_mmap(p, addr, len, prot, flags, file, offset);
>> > +
>>
>> Do you need to know when something was unmapped?
>>
>
> Linux perf doesn't seem to care, as it does not have any command to report
> that.
>
>
> > diff --git a/kern/src/process.c b/kern/src/process.c
>> > index bd26789e3ef1..2e4c69518e8a 100644
>> > --- a/kern/src/process.c
>> > +++ b/kern/src/process.c
>> > @@ -331,6 +331,8 @@ error_t proc_alloc(struct proc **pp, struct proc
>> *parent, int flags)
>> >               kmem_cache_free(proc_cache, p);
>> >               return -ENOFREEPID;
>> >       }
>> > +     if (parent && parent->binary_path)
>> > +             kstrdup(&p->binary_path, parent->binary_path);
>>
>> Is this to fix a leak where paths weren't set for forked processes?  If
>> so, this
>> is an example of a fixup that could have been squashed into a previous
>> commit.
>>
>
> I lost track of what it was 😀
> But yes, could have been possibly squashed.
>
>
>
>
>>
>> > --- a/kern/src/profiler.c
>> > +++ b/kern/src/profiler.c
>>
>> > +static inline char* vb_encode_uint64(char* data, uint64_t n)
>> > +{
>> > +     for (; n >= 0x80; n >>= 7)
>> > +             *data++ = (char) (n | 0x80);
>> > +     *data++ = (char) n;
>> > +
>> > +     return data;
>> > +}
>>
>> This probably should be in a library or header (so we can reuse it) and
>> with a
>> little explanation.  It looks like we take 7 bits of n at a time and push
>> it
>> into *data, with bit 0x80 signifying we're done or not.
>>
>
> Yep, that's what it does.
> I wasn't sure if variable-byte encoding would be something the kernel would
> be doing elsewhere.
>
>
>
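That reading of vb_encode_uint64() is easy to check with a matching decoder (the decoder below is hypothetical, not part of the patch): each output byte carries 7 bits of n, least-significant group first, with 0x80 set on every byte except the last.

```c
#include <stdint.h>

/* Encoder from the patch: emit 7 bits of n per byte, setting the 0x80
 * continuation bit on all but the final byte. */
static char *vb_encode_uint64(char *data, uint64_t n)
{
	for (; n >= 0x80; n >>= 7)
		*data++ = (char) (n | 0x80);
	*data++ = (char) n;

	return data;
}

/* Hypothetical matching decoder: accumulate 7-bit groups until a byte
 * without the continuation bit terminates the value. */
static const char *vb_decode_uint64(const char *data, uint64_t *n)
{
	unsigned int shift = 0;

	*n = 0;
	while (*data & 0x80) {
		*n |= (uint64_t) (*data++ & 0x7f) << shift;
		shift += 7;
	}
	*n |= (uint64_t) (*data++ & 0x7f) << shift;

	return data;
}
```

Small values cost one byte and a full uint64_t at most ten, which is why this kind of varint suits packed profiler records.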
>> > +static struct block *profiler_buffer_write(struct profiler_cpu_context *cpu_buf,
>> > +                                           struct block *b)
>> >  {
>> > -     return (((uint64_t) 0xee01) << 48) | ((uint64_t) cpu << 16) |
>> > -             (uint64_t) nbt;
>> > +     if (b) {
>> > +             qibwrite(profiler_queue, b);
>> > +
>> > +             if (qlen(profiler_queue) > profiler_queue_limit) {
>> > +                     b = qget(profiler_queue);
>> > +                     if (likely(b)) {
>> > +                             cpu_buf->dropped_data_size += BLEN(b);
>> > +                             freeb(b);
>> > +                     }
>>
>> This seems like a candidate for a feature added to qio, such as a "drop
>> from the
>> front" mode.
>>
>
> OK, but can we please do that in a separate CL? This branch has been
> sitting for some time already.
>
>
>
>
>>
>> > +             }
>> > +     }
>> > +
>> > +     return iallocb(profiler_cpu_buffer_size);
>> >  }
>>
>> This function seems a little weird.  If you're given a block, you write
>> it.
>> That makes sense.  But regardless, you alloc a new block?
>>
>> Seems like we should not be too cute by having write return a fresh
>> block, and
>> just do the allocation separately.
>>
>
> I drop a block from the tail. I *could* be reusing it, but I have found no
> reinit API to re-initialize a block.
>
>
>
>
>>
>> > +static void profiler_emit_current_system_status(void)
>> >  {
>> > -     struct block *b = cpu_buf->block;
>> > -    size_t totalsize = sizeof(struct op_sample) +
>> > -             size * sizeof(entry->sample->data[0]);
>> > -
>> > -     if (unlikely((!b) || (b->lim - b->wp) < totalsize)) {
>> > -             if (b)
>> > -                     qibwrite(profiler_queue, b);
>> > -             /* For now. Later, we will grab a block off the
>> > -              * emptyblock queue.
>> > -              */
>> > -             cpu_buf->block = b = iallocb(profiler_cpu_buffer_size);
>> > -        if (unlikely(!b)) {
>> > -                     printk("%s: fail\n", __func__);
>> > -                     return NULL;
>> > +     void enum_proc(struct vm_region *vmr, void *opaque)
>> > +     {
>> > +             struct proc *p = (struct proc *) opaque;
>> > +
>> > +             profiler_notify_mmap(p, vmr->vm_base, vmr->vm_end -
>> vmr->vm_base,
>> > +                                                      vmr->vm_prot,
>> vmr->vm_flags, vmr->vm_file,
>> > +                                                      vmr->vm_foff);
>> > +     }
>> > +
>> > +     ERRSTACK(2);
>>
>> I think you only need 1 here.  It's based on the number of waserrors.
>>
>
> Fixed.
>
>
>
>
>>
>> > +     struct process_set pset;
>> > +
>> > +     proc_get_set(&pset);
>> > +     if (waserror()) {
>> > +             proc_free_set(&pset);
>> > +             nexterror();
>> > +     }
>> > +
>> > +     for (size_t i = 0; i < pset.num_processes; i++)
>> > +             enumerate_vrms(pset.procs[i], enum_proc, pset.procs[i]);
>>                  ^vmrs
>>
>
> Done.
>
>
>
>>
>> > +     poperror();
>> > +     proc_free_set(&pset);
>> > +}
>> > +
>> > +static inline int profiler_is_tracing(struct profiler_cpu_context
>> *cpu_buf)
>>
>> Since this is named "is_tracing", it should be a bool, and return
>> TRUE/False to
>> avoid any confusion.
>>
>
> Done.
>
>
>
>> > -static inline int profiler_add_sample(struct profiler_cpu_context *cpu_buf,
>> > -                                      uintptr_t pc, unsigned long event)
>> > +static void alloc_cpu_buffers(void)
>> >  {
>> >       ERRSTACK(1);
>> > -     struct op_entry entry;
>> > -     struct block *b;
>> > +     int i;
>> >
>> > +     profiler_queue = qopen(profiler_queue_limit, 0, NULL, NULL);
>> > +     if (!profiler_queue)
>> > +             error(-ENOMEM, NULL);
>>               ^just ENOMEM
>>
>
> Done.
>
>
> > +     qdropoverflow(profiler_queue, 1);
>> > +     qnonblock(profiler_queue, 1);
>>
>> As mentioned in a previous patch, you might not want qnonblock.
>>
>
> Not sure why not?
>
>
>
> Incidentally, qnonblock and qdropoverflow take a bool, not an int.  So TRUE
>> instead of 1.
>>
>
> Done.
>
>
>
>>
>> > +     profiler_percpu_ctx =
>> > +             kzmalloc(sizeof(*profiler_percpu_ctx) * num_cores, KMALLOC_WAIT);
>> > +     if (!profiler_percpu_ctx)
>> > +             error(-ENOMEM, NULL);
>>
>> Since you did a KMALLOC_WAIT, you will never have !profiler_percpu_ctx
>>
>
> I am not sure waiting is a guarantee that you will get memory anyway.
>
>
> > -static inline void profiler_begin_trace(struct profiler_cpu_context
>> *cpu_buf)
>> > +int profiler_configure(struct cmdbuf *cb)
>> >  {
>> > -     cpu_buf->tracing = 1;
>> > +     if (!strcmp(cb->f[0], "prof_qlimit")) {
>> > +             if (cb->nf < 2)
>> > +                     error(EFAIL, "prof_qlimit KB");
>> > +             profiler_queue_limit = atoi(cb->f[1]) * 1024;
>>
>> Probably want a sanity check.  Also, what happens when this changes after
>> the
>> queue was already allocated?
>>
>
> The configuration must be done before the profiler is started.
>
>
>
>
>>
>> > +     } else if (!strcmp(cb->f[0], "prof_cpubufsz")) {
>> > +             if (cb->nf < 2)
>> > +                     error(EFAIL, "prof_cpubufsz KB");
>> > +             profiler_cpu_buffer_size = atoi(cb->f[1]) * 1024;
>>
>> Is there any danger with the user setting this to be a very small value
>> (like
>> 0)?   It looks like the assumption in profiler_cpu_buffer_write_reserve()
>> is
>> that a fresh allocation (done by profiler_buffer_write()) is enough for
>> size
>> bytes.
>>
>> > +     } else if (!strcmp(cb->f[0], "prof_btdepth")) {
>> > +             if (cb->nf < 2)
>> > +                     error(EFAIL, "prof_btdepth DEPTH");
>> > +             profiler_backtrace_depth = atoi(cb->f[1]);
>>
>> It is dangerous to have the user control this.  It's a stack allocation.
>> Even
>> if it wasn't, we'd need a sanity check of some sort.
>>
>
> I will drop the BT depth config, and limit the others.
>
>
>
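The sanity check asked for above could look like this (the bounds and names are made up for illustration; the real limits would be a judgment call):

```c
#include <stdlib.h>

/* Illustrative bounds, not from the patch. */
#define PROF_QLIMIT_MIN (64 * 1024)
#define PROF_QLIMIT_MAX (64 * 1024 * 1024)

/* Parse a "KB" argument like prof_qlimit's and clamp it to a sane range,
 * so a user cannot configure a zero-sized or absurdly large queue. */
static long parse_kb_setting(const char *arg)
{
	long v = (long) atoi(arg) * 1024;

	if (v < PROF_QLIMIT_MIN)
		v = PROF_QLIMIT_MIN;
	else if (v > PROF_QLIMIT_MAX)
		v = PROF_QLIMIT_MAX;

	return v;
}
```

The same clamp pattern would cover prof_cpubufsz, and replaces the dropped prof_btdepth knob's unbounded stack allocation concern with a hard ceiling.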
>>
>> > +void profiler_cleanup(void)
>> > +{
>> > +     sem_down(&profiler_mtx);
>> > +     profiler_users--;
>> > +     if (profiler_users == 0)
>> > +             free_cpu_buffers();
>> > +     sem_up(&profiler_mtx);
>> >  }
>>
>> I'm still concerned about this.  If the only source of profiling data is
>> from
>> the timer IRQs, then your current stuff is seems fine.  (You disable the
>> timers,
>> and thus the handlers) before freeing this stuff).  But if we ever have
>> any
>> other use of this, then we'll need to be careful.
>>
>
> As I said, when I come in with the overflow-triggered sampling, this will
> need some retuning.
>
>
>
>>
>> But there's more to it than that.  The assumption here is that once
>> profiler_users == 0, then there is no code, (RKM, pending IRQ, etc.) that
>> has
>> access to profiler_percpu_ctx or profiler_queue.
>>
>> Just as an example, say someone is calling profiler_notify_mmap(), which
>> checks
>> profiler_percpu_ctx.  They get past the if () check, then concurrently
>> someone
>> calls profiler_cleanup and triggers free_cpu_buffers.  Then the original
>> thread
>> eventually calls profiler_push_pid_mmap, then qwrite, then page faults.
>>
>> So a user could theoretically trigger a PF.  This is part of the reason
>> why I
>> was reluctant to have you try and free the buffers.
>>
>
> Well, we have no RCU ☺
> In such a case we would swap the ptr with NULL and RCU-free it.
>
>
>
>
> > @@ -572,6 +571,7 @@ static int sys_proc_create(struct proc *p, char
>> *path, size_t path_l,
>> >       user_memdup_free(p, kargenv);
>> >       __proc_ready(new_p);
>> >       pid = new_p->pid;
>> > +     profiler_notify_new_process(new_p);
>> >       proc_decref(new_p);     /* give up the reference created in
>> proc_create() */
>> >       return pid;
>> >  error_load_elf:
>> > @@ -728,6 +728,7 @@ static ssize_t sys_fork(env_t* e)
>> >
>> >       printd("[PID %d] fork PID %d\n", e->pid, env->pid);
>> >       ret = env->pid;
>> > +     profiler_notify_new_process(env);
>> >       proc_decref(env);       /* give up the reference created in
>> proc_alloc() */
>> >       return ret;
>> >  }
>>
>> Do you need to update things when a process changes its binary path?
>> (exec)
>>
>
> That should lead to new mmaps.
>
>
>
> > From c0ab4ec0d729ad8e0555b3beef72340f90c23712 Mon Sep 17 00:00:00 2001
>> > From: Davide Libenzi <[email protected]>
>> > Date: Sat, 7 Nov 2015 18:47:16 -0800
>> > Subject: Enabled /prof/kptrace collection of anything which goes into
>> cprintf
>> >
>> > Enabled /prof/kptrace collection of anything which goes into cprintf,
>> > printk, and its associates.
>> > The kptrace collector is a circular buffer whose default size is 128KB.
>>
>> What happens when the ring is full?  New stuff drops?  My main concern
>> here is
>> that trace_vprintk could somehow cause an issue and hold up the real
>> printk,
>> which would be hard to debug.
>>
>
> A full ring drops the old stuff; it never blocks.
>
>
>
>
>>
>> > --- a/kern/drivers/dev/kprof.c
>> > +++ b/kern/drivers/dev/kprof.c
>> > @@ -19,6 +19,7 @@
>> >  #include <error.h>
>> >  #include <pmap.h>
>> >  #include <smp.h>
>> > +#include <time.h>
>> >  #include <circular_buffer.h>
>> >  #include <umem.h>
>> >  #include <profiler.h>
>> > @@ -62,7 +63,7 @@ struct dirtab kproftab[] = {
>> >       {"mpstat-raw",  {Kmpstatrawqid},        0,      0600},
>> >  };
>> >
>> > -extern int booting;
>> > +extern system_timing_t system_timing;
>>
>> Should be able to get system_timing from a header.
>>
>
> Yup, arch/time.h, done.
>
>
>
>
>>
>> >  static struct kprof kprof;
>> >  static bool ktrace_init_done = FALSE;
>> >  static spinlock_t ktrace_lock = SPINLOCK_INITIALIZER_IRQSAVE;
>> > @@ -567,7 +568,7 @@ static struct trace_printk_buffer
>> *kprof_get_printk_buffer(void)
>> >       static struct trace_printk_buffer boot_tpb;
>> >       static struct trace_printk_buffer *cpu_tpbs;
>> >
>> > -     if (unlikely(booting))
>> > +     if (unlikely(!num_cores))
>> >               return &boot_tpb;
>>
>> That seems a little odd.  I'd think if we're still booting, we'd use the
>> boot_tpb.  Was there some corner case that triggered this?  I understand
>> this
>> one:
>>
>> > -     if (likely(!booting))
>> > +     if (likely(system_timing.tsc_freq))
>> >               tsc2timespec(read_tsc(), &ts_now);
>>
>
> The booting flag is cleared really late, and I would not want stuff
> starting to spin on other cores, colliding on the boot tpb.
>
>
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.
