Here is a link to the top of the branch where the two WIP commits are:

https://github.com/dlibenzi/akaros/commits/profiler_dev_v2


On Fri, Nov 13, 2015 at 7:54 PM, Davide Libenzi <[email protected]> wrote:

> On Fri, Nov 13, 2015 at 9:19 AM, barret rhoden <[email protected]>
> wrote:
>
>> > > > */ qdropoverflow(profiler_queue, 1);
>> > >
>> > > Then later:
>> > >
>> > > > +     qnonblock(profiler_queue, 1);
>> > >
>> > > This is fragile.  qnonblock means "error with EAGAIN if we're full".
>> > > Right now, you'll get by, since you still have qdropoverflow
>> > > above.  But this could break with an error() thrown from IRQ
>> > > context somewhere down the line.
>> > >
>> >
>> > Yes, they are paired. I am not sure what kind of bad condition can
>> > happen.
>>
>> The bad condition is that you may unexpectedly get an error().  If you
>> have dropoverflow, nonblock does nothing, so do not set it.  things may
>> change in qio in the future.
>>
>
> OK, removing nonblock.
> Coming to error(), should I always waserror() in my code, even if I have
> nothing to undo?
> Because that kind of defeats the exception-look-alike error model.
>
>
>> > > >  void profiler_cleanup(void)
>> > > >  {
>> > > > -     free_cpu_buffers();
>> > > > +     sem_down(&mtx);
>> > > > +     profiler_users--;
>> > > > +     if (profiler_users == 0)
>> > > > +             free_cpu_buffers();
>> > > > +     sem_up(&mtx);
>> > > >  }
>> > >
>> > > This might be a candidate for a kref, depending on how cleanup ends
>> > > up working.  No need to change it now though.
>> > >
>> >
>> > A kref doesn't make others block while the 0->1 transitioner is
>> > initializing, no?
>>
>> krefs never block, so I don't fully understand.  My comment was that
>> cleanup seems dangerous here.  This code is hand-rolling the kref
>> pattern, so in the end a kref might be the right tool here.
>>
>
> It is not hand-rolling, because a kref will not make the immediately
> following thread (the one about to up it) wait for the first one to finish
> properly initializing the resources.
> But I think a blocking lock for init, coupled with a kref, might solve the
> buffers free issue. See down below ...
>
>
>
>
>> > Locking is up to the reader.
>> > In a circular-buffer fashion, a write on a full buffer discards the
>> > oldest block and writes the new one.
>> > No, reading data does not remove data. The only way the read ptr is
>> > moved forward is when the write ptr pushes it forward in order to
>> > fit new data. Every block in the buffer is prepended with its size.
>> > That is the type of the size.
>>
>> Great.  Please put that in the source too.
>>
>
> circular_buffer_write already has comments in its body which describe
> the operation.
> I am adding some more to the skip API.
>
>
>
>
>>
>> > > It's a little surprising that clear also frees the allocation,
>> > > instead of just
>> > > resetting everything back to 0.
>> > >
>> >
>> > This clears all the resources of the buffer.
>> > I used "clear" as there is an "init".
>> > I would have used "free" if there was an "alloc".
>>
>> This was originally confusing since it was not clear how the contents of
>> the ring buffer get discarded.  We have another ring buffer in the
>> kernel that takes data, then stops taking *new* data until the old
>> contents are 'cleared.'
>>
>> Either way, 'clear' doesn't sound like destroy, free, deinit, or
>> shutdown, all of which have a more 'final' sound to them.
>>
>
> Renamed clear->destroy, and made clear do a simple reset without freeing
> the resources.
>
>
>
>
>>
>> > > What are these helpers doing?
>> > >
>> >
>> > The overlap check tests whether a write operation is going to overwrite
>> > the block at which rdptr resides.
>> > The skip moves rdptr forward until it no longer overlaps with the
>> > incoming write.
>>
>> Thanks.  Please put that in a comment right above the functions.
>>
>
> Done.
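The overwrite-the-oldest scheme with size-prefixed blocks and a read-pointer skip, as described in the exchange above, can be sketched roughly like this (a minimal userspace model, not the Akaros circular_buffer code; the uint32_t size prefix and all names are assumptions for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the ring described above: every block is prepended with its
 * size, and a write on a full buffer pushes the read pointer forward,
 * skipping whole blocks, until the new block fits.  Indices only grow
 * and are reduced modulo CB_SIZE on access.  Assumes each block
 * (size prefix + payload) fits in CB_SIZE. */
#define CB_SIZE 48

struct cbuf {
	uint8_t data[CB_SIZE];
	size_t rd;	/* monotonically increasing read index */
	size_t wr;	/* monotonically increasing write index */
};

static void cb_copy_in(struct cbuf *cb, size_t pos, const void *src, size_t n)
{
	const uint8_t *p = src;

	for (size_t i = 0; i < n; i++)
		cb->data[(pos + i) % CB_SIZE] = p[i];
}

static void cb_copy_out(struct cbuf *cb, size_t pos, void *dst, size_t n)
{
	uint8_t *p = dst;

	for (size_t i = 0; i < n; i++)
		p[i] = cb->data[(pos + i) % CB_SIZE];
}

static uint32_t cb_block_size(struct cbuf *cb, size_t pos)
{
	uint32_t sz;

	cb_copy_out(cb, pos, &sz, sizeof(sz));
	return sz;
}

/* The "skip" helper: advance rd past whole blocks until the incoming
 * write of 'need' bytes no longer overlaps them. */
static void cb_skip_overlap(struct cbuf *cb, size_t need)
{
	while (cb->rd < cb->wr && cb->wr + need - cb->rd > CB_SIZE)
		cb->rd += sizeof(uint32_t) + cb_block_size(cb, cb->rd);
}

static void cb_write(struct cbuf *cb, const void *buf, uint32_t n)
{
	size_t need = sizeof(n) + n;

	cb_skip_overlap(cb, need);
	cb_copy_in(cb, cb->wr, &n, sizeof(n));
	cb_copy_in(cb, cb->wr + sizeof(n), buf, n);
	cb->wr += need;
}
```

Note that the reader never moves rd itself; only cb_skip_overlap() does, and only by whole blocks, which is why the size prefix is needed.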
>
>
>
>> Agreed, this is more of a "please try to avoid this in the future"
>> comment.
>>
>
> This code fell into three categories.
> First, totally new code files, for which it makes little sense to add
> single functions in single commits, especially when, along the way, you
> totally refactor older ones because the new APIs require it (I had been
> told to squash commits touching the same code, to avoid commenting on
> code which is no longer there).
> Second, totally changed files, which really had nothing left in common
> with the previous ones.
> Third, the removal of a lot of files which were no longer used.
> Bottom line, you were looking at a multi-thousand-line CR whose nature
> and size would have gained little from sugar coating in the form of
> micro commits 😐
>
>
>
>
>>
>> > As for the timers, for Ron's case, you can still turn them on apart
>> > from the profiler.
>> > From a profiler POV, I think the default behavior should be to have
>> > them on everywhere; otherwise you lose the full picture of what the
>> > system is doing. We could add a selective per-CPU timer and profiler
>> > turn-on, but IMO the default should be all on.
>>
>> One problem is I don't see how to turn on the profiler on a single
>> core.  If you start collecting, you start the timer on all cores.  I
>> don't see why it can't be the way it was before.  "Pick the timers you
>> want, then turn on the accounting."
>>
>> I think the fix is to have multiple commands.  One is the timer (as is),
>> one is to start just the collection, and one is the aggregate: all
>> timers and start the collection.  This is a combination of all the
>> approaches.  There's our existing style (timer, then collection) as
>> well as your desired style (just turn the whole thing on damnit).
>>
>
> Let's please wait to see how this shapes up once the new overflow IRQ
> stuff comes in.
> If you can, please review the MSR stuff; I am almost done with the new
> profiler tool, which uses libpfm4 to list CPU event counters and resolve
> them by name, allowing one to start/stop/read/write the counters.
> So the sequence is going to be:
>
> 1) this CR (code in place, CR under review)
> 2) MSR (code in place, CR posted)
> 3) New prof tool with libpfm4 (code at 75%, no CR yet)
> 4) New overflow based sampling (no code yet)
>
>
>
>
>>
>> > > > +size_t user_backtrace_list(uintptr_t pc, uintptr_t fp, uintptr_t
>> > > > *pcs,
>> > > > +                                                size_t nr_slots)
>> > > > +{
>> > > > +     size_t nr_pcs = 0;
>> > > > +
>> > > > +     for (;;) {
>> > > > +             if (unlikely(fp >= UMAPTOP) || unlikely(fp <
>> > > > BRK_END) ||
>> > >
>> > > Why not is_user_raddr()?  And once you use copy_from, those checks
>> > > aren't needed
>> > > either.  Also, I think it is possible for frame pointers to be below
>> > > BRK_END,
>> > > which could happen if a user stack was malloc'd instead of mmapped.
>> > >
>> > > > +                     unlikely(nr_pcs >= nr_slots))
>> > > > +                     break;
>> > > > +
>> > > > +             /* For now straight memory access, waiting for
>> > > copy_from_user(). */
>> > >
>> > > copy_from is now available.  =)
>> > >
>> >
>> > I will do that in a separate CL.
>>
>> The beauty of Git is that you can make that small change (just use
>> copy_from and remove the UMAPTOP and BRK_END checks) as a "WIP-commit",
>> then do a rebase -i and meld that fixup into this existing commit.
>>
>
> Yes, but this branch became older than Methuselah, which is why I wanted
> to defer.
> Defer, for me, is not a month ahead; it is more like 2 days (or even
> sooner, if this gets merged).
> Anyway, doing it in this CR.
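The frame-pointer walk under review can be illustrated with a checked copy in place of the raw UMAPTOP/BRK_END loads. This is a userspace model only: copy_from here is a stand-in for the kernel's copy_from_user, and fake_stack plus its bounds exist purely as test scaffolding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace sketch of user_backtrace_list(): each frame is
 * [saved fp][return pc], and every dereference goes through a checked
 * copy instead of a raw load, so bad pointers end the walk instead of
 * faulting.  fake_stack is demo scaffolding only. */
static uintptr_t fake_stack[64];

static int copy_from(void *dst, uintptr_t src, size_t n)
{
	uintptr_t base = (uintptr_t)fake_stack;

	if (src < base || src + n > base + sizeof(fake_stack))
		return -1;	/* would be -EFAULT in the kernel */
	memcpy(dst, (void *)src, n);
	return 0;
}

static size_t backtrace_list(uintptr_t pc, uintptr_t fp, uintptr_t *pcs,
			     size_t nr_slots)
{
	size_t nr_pcs = 0;

	while (nr_pcs < nr_slots) {
		uintptr_t frame[2];	/* [0] = next fp, [1] = return pc */

		pcs[nr_pcs++] = pc;
		if (copy_from(frame, fp, sizeof(frame)))
			break;		/* unmapped/bad frame: stop */
		pc = frame[1];
		fp = frame[0];
		if (!fp)
			break;		/* conventional end-of-chain marker */
	}
	return nr_pcs;
}
```

With copy_from doing the validation, the separate UMAPTOP and BRK_END range checks become unnecessary, which is the point being made in the review.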
>
>
>
>> > > > +static struct trace_printk_buffer *kprof_get_printk_buffer(void)
>> > > > +{
>> > > > +     static struct trace_printk_buffer boot_tpb;
>> > > > +     static struct trace_printk_buffer *cpu_tpbs;
>> > > > +
>> > > > +     if (unlikely(booting))
>> > > > +             return &boot_tpb;
>> > > > +     if (unlikely(!cpu_tpbs)) {
>> > > > +             /* Poor man per-CPU data structure. I really do no
>> > > > like
>> > > littering global
>> > > > +              * data structures with module specific data.
>> > > > +              */
>> > > > +             spin_lock_irqsave(&ktrace_lock);
>> > > > +             if (!cpu_tpbs)
>> > > > +                     cpu_tpbs = kzmalloc(num_cores *
>> > > > sizeof(struct
>> > > trace_printk_buffer),
>> > > > +
>> > > > 0);
>> > >
>> > > If this alloc fails, (flags == 0), then we'll return 0 (+coreid)
>> > > and PF later.
>> > >
>> >
>> > We could panic, or return the boot tpb.
>> > Today an OOM crashes the box, so I am not sure what to do here.
>>
>> Could do the alloc with KMALLOC_WAIT outside the lock.  Then lock and
>> try to do the assign.  If someone else already assigned cpu_tpbs, then
>> we free the one we just got.
>>
>
> I think it is better not to KMALLOC_WAIT, because this might be called
> from printk(), from places where we can't sleep.
> Consider also that this is called *very* early in the boot process (due
> to linkage with printk), so if we do not have memory at that point, we
> can pretty much panic() there.
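The allocate-outside-the-lock shape suggested above (whether or not it fits the printk path) can be sketched in userspace C; the pthread mutex stands in for the spinlock, calloc for kzmalloc, and get_tpbs/NUM_CORES are illustrative names:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Sketch of the pattern: the allocation, which may sleep, happens
 * before taking the lock; the lock only covers the pointer assignment,
 * and the loser of a race frees its copy. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int *cpu_tpbs;
#define NUM_CORES 8

static int *get_tpbs(void)
{
	int *tpbs = cpu_tpbs;

	if (tpbs)
		return tpbs;
	/* May sleep: done outside the lock. */
	tpbs = calloc(NUM_CORES, sizeof(*tpbs));
	pthread_mutex_lock(&lock);
	if (!cpu_tpbs)
		cpu_tpbs = tpbs;	/* we won the race */
	else
		free(tpbs);		/* someone beat us: drop ours */
	tpbs = cpu_tpbs;
	pthread_mutex_unlock(&lock);
	return tpbs;
}
```

The trade-off being argued in the thread is exactly this: the pattern is safe for sleepable callers, but a printk() path that cannot sleep rules out the blocking allocation.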
>
>
>
>
>>
>> > > > +#define trace_printx(args...) if (printx_on) trace_printk(TRUE,
>> > > > args)
>> > >                                                             ^FALSE?
>> > > We just want to print text, not a backtrace.  That was also my
>> > > intent for trace_printk (it's just the print, not a backtrace).
>> > >
>> >
>> > It seems pretty useful to have the option to get a BT, to understand
>> > which path you are coming from.
>> > A "grep" of the string simply tells you where the print is.
>>
>> I like the option for the backtrace.  I'm just saying that I'd like a
>> simple trace_printk/printx that doesn't trigger the backtrace.  We
>> already have TRACEME for the backtrace trace_printk.
>>
>> Anyway, if you don't want to fix it, that's fine.  If I get annoyed at
>> a long backtrace in the future when trace_printxing I'll change it.
>>
>
> trace_printx(), like the name suggests, traces a BT.
> printx() calls printk(), which does not trace.
>
>
>
>> > > This probably should be in a library or header (so we can reuse it)
>> > > and with a
>> > > little explanation.  It looks like we take 7 bits of n at a time
>> > > and push it
>> > > into *data, with bit 0x80 signifying we're done or not.
>> > >
>> >
>> > Yep, that's what it does.
>>
>> Cool, please note that in the code.  I have a bad memory.  =)
>>
>
> Done.
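The encoding described in the comment above — 7 bits of n per output byte with bit 0x80 as the continuation marker — is the classic variable-byte (varint) scheme. A minimal standalone sketch (not the in-tree code, which may use the opposite sense for the 0x80 bit):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Variable-byte encoding: 7 bits of n per output byte, with 0x80 set
 * meaning "more bytes follow" (the common little-endian-first
 * convention).  Returns the number of bytes written/consumed. */
static size_t vb_encode(uint64_t n, uint8_t *data)
{
	size_t i = 0;

	do {
		uint8_t b = n & 0x7f;

		n >>= 7;
		if (n)
			b |= 0x80;	/* more bytes follow */
		data[i++] = b;
	} while (n);
	return i;
}

static size_t vb_decode(const uint8_t *data, uint64_t *pn)
{
	uint64_t n = 0;
	size_t i = 0;
	int shift = 0;

	for (;;) {
		uint8_t b = data[i++];

		n |= (uint64_t)(b & 0x7f) << shift;
		shift += 7;
		if (!(b & 0x80))
			break;		/* top bit clear: last byte */
	}
	*pn = n;
	return i;
}
```

Small values cost one byte and a 64-bit value at most ten, which is why the scheme shows up in profilers and, as joked below, in search-engine posting lists.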
>
>
>
>> > I wasn't sure if variable byte encoding would be something a kernel
>> > would be doing elsewhere.
>>
>> I wasn't sure either.  It sounds pretty cool.  We can keep it here for
>> now, or move it elsewhere if you'd like.
>>
>
> Yeah, when we have a search engine in the kernel, we can pull it out for
> the posting lists 😀
>
>
>
>> As far as the age of this branch goes, it's a huge change that takes a
>> while to get through.  Maybe I'm just slow, but I've spent 2 days on
>> this and counting.
>>
>
> Well, they say that on average it takes 1/8 of the time spent writing
> the code to review it.
> So do your math ☺
>
>
>> Ah, I see.  In the future, you know you were given a block and you want
>> to return that block in a ready-to-go manner.
>>
>> This is another case where we can make the function you need and add it
>> to the block code.  That way you also don't need to worry about
>> checking the return value of this function (since iallocb can fail, but
>> if you return the block you got originally, that can't fail).
>>
>> Not to harp on this too much, but I would rather have had that block
>> code done as a support patch before this commit, and then used here,
>> than doing it the way it is.  That also would have been clearer to me,
>> would have improved the kernel, and would have saved some time.  =)
>>
>> But feel free to leave it as is now, and we can address that with a
>> future TODO.
>>
>
> IMO it is a simple performance optimization, which can be done as a
> follow-up (together with the other drop-tail-on-overflow change).
> Blocks are written at the rate of a handful per second.
>
>
>
>
>>
>> > > > +     qdropoverflow(profiler_queue, 1);
>> > > > +     qnonblock(profiler_queue, 1);
>> > >
>> > > As mentioned in a previous patch, you might not want qnonblock.
>> > >
>> >
>> > Not sure why not?
>>
>> Because they do similar things in different ways.
>>
>
> Removed nonblock.
>
>
>
>>
>> > > > +     profiler_percpu_ctx =
>> > > > +             kzmalloc(sizeof(*profiler_percpu_ctx) * num_cores,
>> > > KMALLOC_WAIT);
>> > > > +     if (!profiler_percpu_ctx)
>> > > > +             error(-ENOMEM, NULL);
>> > >
>> > > Since you did a KMALLOC_WAIT, you will never
>> > > have !profiler_percpu_ctx
>> >
>> > I am not sure waiting is a guarantee that you will get memory anyway.
>>
>> I think it is in other OSs.  In Akaros, the implementation isn't done
>> yet, but the design is that if you say KMALLOC_WAIT (or whatever we
>> call it later), that your thread will block until the memory allocation
>> succeeds.
>>
>
> Removed the test for NULL.
> IMO though, it is a bad idea to remove checks, as now you are bound to
> that design.
>
>
>
>
>>
>> > > -static inline void profiler_begin_trace(struct profiler_cpu_context
>> > > *cpu_buf)
>> > > > +int profiler_configure(struct cmdbuf *cb)
>> > > >  {
>> > > > -     cpu_buf->tracing = 1;
>> > > > +     if (!strcmp(cb->f[0], "prof_qlimit")) {
>> > > > +             if (cb->nf < 2)
>> > > > +                     error(EFAIL, "prof_qlimit KB");
>> > > > +             profiler_queue_limit = atoi(cb->f[1]) * 1024;
>> > >
>> > > Probably want a sanity check.  Also, what happens when this changes
>> > > after the
>> > > queue was already allocated?
>> > >
>> >
>> > The configuration must be done before the profiler is started.
>>
>> Ok, then we need a check to return an error for this case.  Otherwise a
>> buggy user could change that setting on the fly.
>>
>
> OK. I now fail the queue-limit change if the profiler is running, while
> the per-CPU buffer size can be changed on the fly.
>
>
>
>> > > But there's more to it than that.  The assumption here is that once
>> > > profiler_users == 0, then there is no code, (RKM, pending IRQ,
>> > > etc.) that has
>> > > access to profiler_percpu_ctx or profiler_queue.
>> > >
>> > > Just as an example, say someone is calling profiler_notify_mmap(),
>> > > which checks
>> > > profiler_percpu_ctx.  They get past the if () check, then
>> > > concurrently someone
>> > > calls profiler_cleanup and triggers free_cpu_buffers.  Then the
>> > > original thread
>> > > eventually calls profiler_push_pid_mmap, then qwrite, then page
>> > > faults.
>> > >
>> > > So a user could theoretically trigger a PF.  This is part of the
>> > > reason why I
>> > > was reluctant to have you try and free the buffers.
>> > >
>> >
>> > Well, we have no RCU ...
>> > In such a case we should swap the ptr with NULL, and RCU-free it.
>>
>> RCU wouldn't help as much here either, since the profiler_notify_mmap()
>> thread could have blocked at some point.  And sleeping RCU sounds like
>> a pain.
>>
>> And I imagine there are other places than this one example where it is
>> dangerous to free the buffers.
>>
>> So I'm not convinced that this assumption is true:
>>
>
> Sleeping RCU is not a problem. We did it in userspace (arguably more
> difficult than in the kernel) in Cassini, and the whole thing was less
> than 1000 lines of code 😉
>
>
>
>>
>> > > The assumption here is that once profiler_users == 0, then there
>> > > is no code, (RKM, pending IRQ, etc.) that has access to
>> > > profiler_percpu_ctx or profiler_queue.
>>
>> One simple "fix" would be to not free the buffers, and put a TODO on
>> it.  that would alleviate the risk of the page fault (possibly caused
>> by a malicious user) at the expense of a small amount of memory.
>>
>
> Here we have two paths.
> The one used to write normal traces, which does not block (it had better
> not, being called from timer IRQ).
> The one coming from mmap and process announcements, which ATM calls
> qwrite(), which might block.
> I changed those to qiwrite(), as well as dropping the KMALLOC_WAIT in the
> functions that post mmap and process announcements.
> This had better be so, as the VMR enumeration runs with a spinlock held.
>
> Using a kref coupled with a qlock, we should be able to get the necessary
> semantics.
> I added two commits (WIP:PROFILER and WIP:KPROF) at the end of the
> stream, which I did not squash into the proper places yet.
> Take a look. I really do not like the idea of leaving resources allocated
> when the functionality is not in use.
>
>
>
>
>
