On 2015-11-12 at 14:00 'Davide Libenzi' via Akaros wrote:
> This function is gone on tip.
> Since this was v2, and v1 was never reviewed, you likely commented
> the v1 version.

I went off of:

> The following changes since commit
> 6f3723cd8f883260a78fdf411911d7469464caa5:
> 
>   Update file-posix.c utest (2015-10-15 12:07:00 -0400)
> 
> are available in the git repository at:
> 
>   [email protected]:dlibenzi/akaros profiler_dev_v2
> 
> for you to fetch changes up to
> 6e40b75927c808dd945cb3ad0ee6e52a5a2dc495:
> 
>   Added simple netcat client for Akaros (2015-10-28 14:49:56 -0700)

> > > @@ -65,53 +64,63 @@ static inline size_t  
> > profiler_cpu_buffer_add_data(struct op_entry *entry,  
> > >  static void free_cpu_buffers(void)
> > >  {
> > >       kfree(profiler_percpu_ctx);
> > > +     profiler_percpu_ctx = NULL;
> > > +
> > > +     qclose(profiler_queue);
> > > +     profiler_queue = NULL;
> > >  }  
> >
> > Isn't this stuff possibly used from IRQ context?  What's the plan
> > for cleaning up such that we are sure that there are no interrupts
> > going off while you are freeing things?
> >  
> 
> That function is only called when there are no more users.
> Timers are stopped before calling profiler_cleanup().
> Not sure how synchronous the removal of a timer is.

> > > */ qdropoverflow(profiler_queue, 1);  
> >
> > Then later:
> >  
> > > +     qnonblock(profiler_queue, 1);  
> >
> > This is fragile.  qnonblock means "error with EAGAIN if we're full".
> > Right now, you'll get by, since you still have qdropoverflow
> > above.  But this could break with an error() thrown from IRQ
> > context somewhere down the line.
> >  
> 
> Yes, they are paired. I am not sure what kind of bad condition can
> happen.

The bad condition is that you may unexpectedly get an error().  If you
have dropoverflow, nonblock does nothing, so do not set it.  Things may
change in qio in the future.

> > >  void profiler_cleanup(void)
> > >  {
> > > -     free_cpu_buffers();
> > > +     sem_down(&mtx);
> > > +     profiler_users--;
> > > +     if (profiler_users == 0)
> > > +             free_cpu_buffers();
> > > +     sem_up(&mtx);
> > >  }  
> >
> > This might be a candidate for a kref, depending on how cleanup ends
> > up working.  No need to change it now though.
> >  
> 
> kref don't block until the 0->1 transitioner is initializing, no?

krefs never block, so I don't fully understand.  My comment was that
cleanup seems dangerous here.  This code is hand-rolling the kref
pattern, so in the end a kref might be the right tool here.
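For reference, here's the pattern I mean, as a userspace sketch (names,
the demo release hook, and the non-atomic count are illustrative only;
real kref implementations use atomics): the put that takes the count
from 1 to 0 runs the release function, which is exactly what
profiler_cleanup() is hand-rolling with the sem and profiler_users.

```c
#include <assert.h>
#include <stdbool.h>

struct kref {
	int refcount;
	void (*release)(struct kref *kref);
};

static void kref_init(struct kref *k, void (*release)(struct kref *), int init)
{
	k->refcount = init;
	k->release = release;
}

static void kref_get(struct kref *k)
{
	k->refcount++;
}

/* Returns true if this put dropped the last reference (and ran release). */
static bool kref_put(struct kref *k)
{
	if (--k->refcount == 0) {
		k->release(k);
		return true;
	}
	return false;
}

/* Demo release hook, standing in for free_cpu_buffers(). */
static int demo_released;

static void demo_release(struct kref *k)
{
	demo_released = 1;
}
```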

> > How important is it that the command line data hasn't changed?
> > We're trusting
> > the user to not have changed it since they were invoked.  
> 
> 
> > Also, I guess we're going to want to parse the argv stuff later.
> > Kevin put together some code to pack and parse the args earlier.
> > It might be easier and
> > to keep that blob around and parse it instead of this.  Not a huge
> > deal though.
> >  
> 
> We could keep the kernel buffer copy, but I'd prefer to eventually do
> that in a separate CL.

sounds fine.

> Locking is up to the reader.
> In a circular buffer fashion, a write on a full buffer discards the
> oldest block and writes the new one.
> No, reading data does not remove data. The only way the read ptr
> moves forward is when the write ptr pushes it forward in order to
> fit new data. Every block in the buffer is prepended with its size.
> That is the type of the size.

Great.  Please put that in the source too.
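For what it's worth, here's how I read those semantics, as a userspace
sketch (the names, the 64-byte size, and the uint32_t size prefix are
all made up, not the actual code): records are size-prefixed, reads
never consume data, and a write on a full buffer skips the read pointer
forward past whole old blocks until the new record fits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RB_SIZE 64

struct ring {
	uint8_t data[RB_SIZE];
	size_t rd, wr, used;
};

/* Byte-at-a-time copies so blocks may wrap around the end of the buffer. */
static void rb_copy_out(struct ring *r, size_t pos, void *dst, size_t len)
{
	for (size_t i = 0; i < len; i++)
		((uint8_t *) dst)[i] = r->data[(pos + i) % RB_SIZE];
}

static void rb_copy_in(struct ring *r, size_t pos, const void *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		r->data[(pos + i) % RB_SIZE] = ((const uint8_t *) src)[i];
}

/* Push rd forward, one whole size-prefixed block at a time, until 'need'
 * bytes are free: the writer discards the oldest records to make room. */
static void rb_skip(struct ring *r, size_t need)
{
	while (RB_SIZE - r->used < need) {
		uint32_t blk;

		rb_copy_out(r, r->rd, &blk, sizeof(blk));
		r->rd = (r->rd + sizeof(blk) + blk) % RB_SIZE;
		r->used -= sizeof(blk) + blk;
	}
}

/* Prepend each record with its size, then write the payload. */
static void rb_write(struct ring *r, const void *buf, uint32_t len)
{
	rb_skip(r, sizeof(len) + len);
	rb_copy_in(r, r->wr, &len, sizeof(len));
	rb_copy_in(r, (r->wr + sizeof(len)) % RB_SIZE, buf, len);
	r->wr = (r->wr + sizeof(len) + len) % RB_SIZE;
	r->used += sizeof(len) + len;
}
```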

> > It's a little surprising that clear also frees the allocation,
> > instead of just
> > resetting everything back to 0.
> >  
> 
> This clears all the resources of the buffer.
> I used "clear" as there is an "init".
> I would have used "free" if there was an "alloc".

This was originally confusing because it wasn't clear how the contents
of the ring buffer get discarded.  We have another ring buffer in the
kernel that takes data, then stops taking *new* data until the old
contents are 'cleared.'

Either way, 'clear' doesn't sound like destroy, free, deinit, or
shutdown, all of which have a more 'final' sound to them.

> > What are these helpers doing?
> >  
> 
> The overlap check tests whether a write operation is going to
> overwrite the block at which rdptr resides.
> The skip moves rdptr forward until it no longer overlaps with the
> incoming write.

Thanks.  Please put that in a comment right above the functions.


> > This is a huge commit that would have been easier to understand if
> > it was broken
> > up a bit.  For instance, I see there's a user backtrace list,
> > syscall tracing
> > isn't using a qio queue anymore, the old Plan 9 profiler is gone
> > (great!), and a
> > bunch of other things.  It also looks like you squeezed in the
> > change that the
> > start command also turns on the timers for all cores, which isn't
> > something I
> > wanted.
> >  
> 
> Yes, it is a big change ____
> Breaking it down would be a PITA at this point, given the changes
> that have accumulated over the last 4 weeks while this sat.

Agreed, this is more of a "please try to avoid this in the future"
comment.

> As for the timers, for Ron's case, you can still turn them on aside
> from the profiler.
> For a profiler POV, I think the default behavior should be them on
> everywhere, otherwise you lose the full picture of what the system is
> doing. We could add a selective CPU timer and profiler turn on, but
> IMO the default should be all on.

One problem is I don't see how to turn on the profiler on a single
core.  If you start collecting, you start the timer on all cores.  I
don't see why it can't be the way it was before.  "Pick the timers you
want, then turn on the accounting."  

I think the fix is to have multiple commands.  One is the timer (as
is), one starts just the collection, and one is the aggregate: all
timers plus the collection.  That combines all the approaches: our
existing style (timer, then collection) as well as your desired style
(just turn the whole thing on damnit).

> > > +size_t user_backtrace_list(uintptr_t pc, uintptr_t fp, uintptr_t
> > > *pcs,
> > > +                                                size_t nr_slots)
> > > +{
> > > +     size_t nr_pcs = 0;
> > > +
> > > +     for (;;) {
> > > +             if (unlikely(fp >= UMAPTOP) || unlikely(fp <
> > > BRK_END) ||  
> >
> > Why not is_user_raddr()?  And once you use copy_from, those checks
> > aren't needed
> > either.  Also, I think it is possible for frame pointers to be below
> > BRK_END,
> > which could happen if a user stack was malloc'd instead of mmapped.
> >  
> > > +                     unlikely(nr_pcs >= nr_slots))
> > > +                     break;
> > > +
> > > +             /* For now straight memory access, waiting for  
> > copy_from_user(). */
> >
> > copy_from is now available.  =)
> >  
> 
> I will do that in a separate CL.

The beauty of Git is that you can make that small change (just use
copy_from and remove the UMAPTOP and BRK_END checks) as a "WIP-commit",
then do a rebase -i and meld that fixup into this existing commit.
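Concretely, something like this (shown in a scratch repo; the file name
and commit subjects are illustrative, not from your branch):

```shell
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email you@example.com && git config user.name you

echo base > profiler.c && git add . && git commit -qm "base"
echo profiler > profiler.c && git commit -qam "kprof: add profiler"

# The small follow-up change, committed as a fixup of the earlier
# commit: the "fixup!" subject must match the target's subject.
echo profiler+copy_from > profiler.c
git commit -qam "fixup! kprof: add profiler"

# Non-interactively accept the generated todo list; --autosquash melds
# the fixup commit into "kprof: add profiler".
GIT_SEQUENCE_EDITOR=true git rebase -i --autosquash HEAD~2
git log --oneline
```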

> > > +             for (int i = 0; i < num_cores; i++)
> > > +                     kprof_enable_timer(i, 1);  
> >
> > This looks like starting the profiler turns on the timer.  I wanted
> > to keep the
> > timer separate from the collection, such that the timer is one of
> > potentially
> > many sources of data, as discussed in another email.
> >  
> 
> For future uses, yes. But today, with no timers, there is no data.
> kptrace is a different data path, which is always on.
> When we add the counter-overflow-based interrupt triggers (in the
> code I am working on now), we will need to rework this file.

I think multiple commands should still do the trick.  But if not, we
can talk about it.  =)

> > > +static struct trace_printk_buffer *kprof_get_printk_buffer(void)
> > > +{
> > > +     static struct trace_printk_buffer boot_tpb;
> > > +     static struct trace_printk_buffer *cpu_tpbs;
> > > +
> > > +     if (unlikely(booting))
> > > +             return &boot_tpb;
> > > +     if (unlikely(!cpu_tpbs)) {
> > > +             /* Poor man per-CPU data structure. I really do no
> > > like  
> > littering global  
> > > +              * data structures with module specific data.
> > > +              */
> > > +             spin_lock_irqsave(&ktrace_lock);
> > > +             if (!cpu_tpbs)
> > > +                     cpu_tpbs = kzmalloc(num_cores *
> > > sizeof(struct  
> > trace_printk_buffer),  
> > > +
> > > 0);  
> >
> > If this alloc fails, (flags == 0), then we'll return 0 (+coreid)
> > and PF later.
> >  
> 
> We could panic, or return the boot tpb.
> Today an OOM crashes the box, so I am not sure what to do here.

Could do the alloc with KMALLOC_WAIT outside the lock.  Then lock and
try to do the assign.  If someone else already assigned cpu_tpbs, then
we free the one we just got.
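Roughly this shape, as a userspace sketch (pthreads stands in for the
spinlock and calloc for kzmalloc/KMALLOC_WAIT; names are illustrative):

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct trace_printk_buffer {
	char buf[256];
};

static pthread_mutex_t tpb_lock = PTHREAD_MUTEX_INITIALIZER;
static struct trace_printk_buffer *cpu_tpbs;

static struct trace_printk_buffer *get_cpu_tpbs(int num_cores)
{
	struct trace_printk_buffer *tpbs;

	if (cpu_tpbs)
		return cpu_tpbs;
	/* Allocate outside the lock: a waiting allocation may sleep,
	 * which we cannot do while holding a spinlock. */
	tpbs = calloc(num_cores, sizeof(struct trace_printk_buffer));
	pthread_mutex_lock(&tpb_lock);
	if (!cpu_tpbs)
		cpu_tpbs = tpbs;	/* we won the race: publish ours */
	else
		free(tpbs);		/* someone beat us: drop ours */
	pthread_mutex_unlock(&tpb_lock);
	return cpu_tpbs;
}
```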

> > > +#define trace_printx(args...) if (printx_on) trace_printk(TRUE,
> > > args)  
> >                                                             ^FALSE?
> > We just want to print text, not a backtrace.  That was also my
> > intent for trace_printk (it's just the print, not a backtrace).
> >  
> 
> Seems pretty useful to have the option to get a BT, to understand
> which path you are coming from.
> A "grep" of the string simply tells where the print is.

I like the option for the backtrace.  I'm just saying that I'd like a
simple trace_printk/printx that doesn't trigger the backtrace.  We
already have TRACEME for the backtrace trace_printk.

Anyway, if you don't want to fix it, that's fine.  If I get annoyed at
a long backtrace in the future when trace_printxing I'll change it.

> > > +     profiler_notify_mmap(p, addr, len, prot, flags, file,
> > > offset);
> > > +  
> >
> > Do you need to know when something was unmapped?
> >  
> 
> Linux perf doesn't seem to care, as it does not have any command to
> report that.

Sounds good, just checking.  =)

> > This probably should be in a library or header (so we can reuse it)
> > and with a
> > little explanation.  It looks like we take 7 bits of n at a time
> > and push it
> > into *data, with bit 0x80 signifying we're done or not.
> >  
> 
> Yep, that's what it does.

Cool, please note that in the code.  I have a bad memory.  =)

> I wasn't sure if variable byte encoding would be something a kernel
> would be doing elsewhere.

I wasn't sure either.  It sounds pretty cool.  We can keep it here for
now, or move it elsewhere if you'd like.
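For the comment, here's the scheme as I read it, sketched with made-up
names (not the actual code): emit 7 bits of n per byte, low bits first,
with 0x80 set on every byte except the last to say "more follows".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode n into *data, 7 bits per byte; returns the number of bytes
 * written.  0x80 marks a continuation byte. */
static size_t varint_encode(uint64_t n, uint8_t *data)
{
	size_t i = 0;

	while (n >= 0x80) {
		data[i++] = (uint8_t) (n & 0x7f) | 0x80;
		n >>= 7;
	}
	data[i++] = (uint8_t) n;
	return i;
}

/* Decode a varint from data into *pn; returns the bytes consumed. */
static size_t varint_decode(const uint8_t *data, uint64_t *pn)
{
	uint64_t n = 0;
	size_t i = 0;
	int shift = 0;

	do {
		n |= (uint64_t) (data[i] & 0x7f) << shift;
		shift += 7;
	} while (data[i++] & 0x80);
	*pn = n;
	return i;
}
```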

> > +static struct block *profiler_buffer_write(struct
> > profiler_cpu_context *cpu_buf,  
> > > +  
> >         struct block *b)  
> > >  {
> > > -     return (((uint64_t) 0xee01) << 48) | ((uint64_t) cpu << 16)
> > > |
> > > -             (uint64_t) nbt;
> > > +     if (b) {
> > > +             qibwrite(profiler_queue, b);
> > > +
> > > +             if (qlen(profiler_queue) > profiler_queue_limit) {
> > > +                     b = qget(profiler_queue);
> > > +                     if (likely(b)) {
> > > +                             cpu_buf->dropped_data_size +=
> > > BLEN(b);
> > > +                             freeb(b);
> > > +                     }  
> >
> > This seems like a candidate for a feature added to qio, such as a
> > "drop from the
> > front" mode.
> >  
> 
> OK, but can we please do that in a separate CL? This branch has been
> sitting for some time already.

Ok.  This is a minor thing, but I mentioned it because it's an example
of a place where, if we slowed down a little and saw the opportunity to
change qio, the whole kernel could be better.

As far as the age of this branch goes, it's a huge change that takes a
while to get through.  Maybe I'm just slow, but I've spent 2 days on
this and counting.

> > > +     return iallocb(profiler_cpu_buffer_size);
> > >  }  
> >
> > This function seems a little weird.  If you're given a block, you
> > write it. That makes sense.  But regardless, you alloc a new block?
> >
> > Seems like we should not be too cute by having write return a fresh
> > block, and
> > just do the allocation separately.
> >  
> 
> I drop a block from the tail. I *could* reuse it, but I have
> found no reinit API to re-initialize a block.

Ah, I see.  In the future, you know you were given a block and you want
to return that block in a ready-to-go manner.

This is another case where we can make the function you need and add it
to the block code.  That way you also don't need to worry about
checking the return value of this function (since iallocb can fail, but
if you return the block you got originally, that can't fail).

Not to harp on this too much, but I would rather have had that block
code done as a support patch before this commit, and then used here,
than doing it the way it is.  That also would have been clearer to me,
would have improved the kernel, and would have saved some time.  =)

But feel free to leave it as is now, and we can address that with a
future TODO.

> > +     qdropoverflow(profiler_queue, 1);  
> > > +     qnonblock(profiler_queue, 1);  
> >
> > As mentioned in a previous patch, you might not want qnonblock.
> >  
> 
> Not sure why not?

Because they do similar things in different ways.

> > > +     profiler_percpu_ctx =
> > > +             kzmalloc(sizeof(*profiler_percpu_ctx) * num_cores,  
> > KMALLOC_WAIT);  
> > > +     if (!profiler_percpu_ctx)
> > > +             error(-ENOMEM, NULL);  
> >
> > Since you did a KMALLOC_WAIT, you will never
> > have !profiler_percpu_ctx 
> 
> I am not sure waiting is a guarantee that you will get memory anyway.

I think it is in other OSs.  In Akaros, the implementation isn't done
yet, but the design is that if you say KMALLOC_WAIT (or whatever we
call it later), that your thread will block until the memory allocation
succeeds.

> > -static inline void profiler_begin_trace(struct profiler_cpu_context
> > *cpu_buf)  
> > > +int profiler_configure(struct cmdbuf *cb)
> > >  {
> > > -     cpu_buf->tracing = 1;
> > > +     if (!strcmp(cb->f[0], "prof_qlimit")) {
> > > +             if (cb->nf < 2)
> > > +                     error(EFAIL, "prof_qlimit KB");
> > > +             profiler_queue_limit = atoi(cb->f[1]) * 1024;  
> >
> > Probably want a sanity check.  Also, what happens when this changes
> > after the
> > queue was already allocated?
> >  
> 
> The configuration must be done before the profiler is started.

Ok, then we need a check to return an error for this case.  Otherwise a
buggy user could change that setting on the fly.

> > > +     } else if (!strcmp(cb->f[0], "prof_cpubufsz")) {
> > > +             if (cb->nf < 2)
> > > +                     error(EFAIL, "prof_cpubufsz KB");
> > > +             profiler_cpu_buffer_size = atoi(cb->f[1]) * 1024;  
> >
> > Is there any danger with the user setting this to be a very small
> > value (like
> > 0)?   It looks like the assumption in
> > profiler_cpu_buffer_write_reserve() is
> > that a fresh allocation (done by profiler_buffer_write()) is enough
> > for size
> > bytes.
> >  
> > > +     } else if (!strcmp(cb->f[0], "prof_btdepth")) {
> > > +             if (cb->nf < 2)
> > > +                     error(EFAIL, "prof_btdepth DEPTH");
> > > +             profiler_backtrace_depth = atoi(cb->f[1]);  
> >
> > It is dangerous to have the user control this.  It's a stack
> > allocation. Even
> > if it wasn't, we'd need a sanity check of some sort.
> >  
> 
> I will drop the BT depth config, and limit the others.

Great.

> > > +void profiler_cleanup(void)
> > > +{
> > > +     sem_down(&profiler_mtx);
> > > +     profiler_users--;
> > > +     if (profiler_users == 0)
> > > +             free_cpu_buffers();
> > > +     sem_up(&profiler_mtx);
> > >  }  
> >
> > I'm still concerned about this.  If the only source of profiling
> > data is from
> > the timer IRQs, then your current stuff is seems fine.  (You
> > disable the timers,
> > and thus the handlers) before freeing this stuff).  But if we ever
> > have any other use of this, then we'll need to be careful.
> >  
> 
> As I said, when I come in with the overflow-triggered sampling, this
> will need some retuning.

Sounds good.  TODO.  =)

> > But there's more to it than that.  The assumption here is that once
> > profiler_users == 0, then there is no code, (RKM, pending IRQ,
> > etc.) that has
> > access to profiler_percpu_ctx or profiler_queue.
> >
> > Just as an example, say someone is calling profiler_notify_mmap(),
> > which checks
> > profiler_percpu_ctx.  They get past the if () check, then
> > concurrently someone
> > calls profiler_cleanup and triggers free_cpu_buffers.  Then the
> > original thread
> > eventually calls profiler_push_pid_mmap, then qwrite, then page
> > faults.
> >
> > So a user could theoretically trigger a PF.  This is part of the
> > reason why I
> > was reluctant to have you try and free the buffers.
> >  
> 
> Well, we have no RCU ___
> In such a case we should swap the ptr with NULL and RCU-free it.

RCU wouldn't help as much here either, since the profiler_notify_mmap()
thread could have blocked at some point.  And sleeping RCU sounds like
a pain.

And I imagine there are other places than this one example where it is
dangerous to free the buffers.

So I'm not convinced that this assumption is true:

> > The assumption here is that once profiler_users == 0, then there
> > is no code, (RKM, pending IRQ, etc.) that has access to
> > profiler_percpu_ctx or profiler_queue.

One simple "fix" would be to not free the buffers, and put a TODO on
it.  That would alleviate the risk of the page fault (possibly caused
by a malicious user) at the expense of a small amount of memory.

> > > @@ -728,6 +728,7 @@ static ssize_t sys_fork(env_t* e)
> > >
> > >       printd("[PID %d] fork PID %d\n", e->pid, env->pid);
> > >       ret = env->pid;
> > > +     profiler_notify_new_process(env);
> > >       proc_decref(env);       /* give up the reference created
> > > in  
> > proc_alloc() */  
> > >       return ret;
> > >  }  
> >
> > Do you need to update things when a process changes its binary path?
> > (exec)
> >  
> 
> That should lead to new mmaps.

I guess the broader question is when the profiler needs to know things
like the binary path and its mmaps.  But if you're happy with it, then
I'm ok.  =)

> > > -     if (unlikely(booting))
> > > +     if (unlikely(!num_cores))
> > >               return &boot_tpb;  
> >
> > That seems a little odd.  I'd think if we're still booting, we'd
> > use the boot_tpb.  Was there some corner case that triggered this?
> > I understand this
> > one:
> >  
> > > -     if (likely(!booting))
> > > +     if (likely(system_timing.tsc_freq))
> > >               tsc2timespec(read_tsc(), &ts_now);  
> >  
> 
> The booting flag is cleared really late, and I would not want stuff
> starting to spin on other cores, colliding on boot_tpb.

Ah, ok, makes sense.

Barret

 
