Also, I am planning for the new perf tool, which is in the devarch_msr_cont branch, to also have options to activate and deactivate the profiler, hiding the /prof/ VFS internals. Like:
$ prof --start-profiler ..other_options..
$ prof --stop-profiler ..other_options..
...

Eventually I will move most of the kprof2perf file format converter into a library that both the new prof tool and kprof2perf can use.

On Sat, Nov 14, 2015 at 6:06 AM, Davide Libenzi <[email protected]> wrote:
> Here is a link to the top of the branch where the two WIPs are:
>
> https://github.com/dlibenzi/akaros/commits/profiler_dev_v2
>
> On Fri, Nov 13, 2015 at 7:54 PM, Davide Libenzi <[email protected]> wrote:
>
>> On Fri, Nov 13, 2015 at 9:19 AM, barret rhoden <[email protected]> wrote:
>>
>>> > > > */ qdropoverflow(profiler_queue, 1);
>>> > >
>>> > > Then later:
>>> > >
>>> > > > + qnonblock(profiler_queue, 1);
>>> > >
>>> > > This is fragile. qnonblock means "error with EAGAIN if we're full".
>>> > > Right now, you'll get by, since you still have qdropoverflow above.
>>> > > But this could break with an error() thrown from IRQ context
>>> > > somewhere down the line.
>>> >
>>> > Yes, they are paired. I am not sure which kind of bad condition can
>>> > happen.
>>>
>>> The bad condition is that you may unexpectedly get an error(). If you
>>> have dropoverflow, nonblock does nothing, so do not set it. Things may
>>> change in qio in the future.
>>
>> OK, removing nonblock.
>> Coming to error(), should I always waserror() in my code, even if I have
>> nothing to be undone?
>> Because that kind of defeats the exception-look-alike error model.
>>
>>> > > > void profiler_cleanup(void)
>>> > > > {
>>> > > > - free_cpu_buffers();
>>> > > > + sem_down(&mtx);
>>> > > > + profiler_users--;
>>> > > > + if (profiler_users == 0)
>>> > > > + free_cpu_buffers();
>>> > > > + sem_up(&mtx);
>>> > > > }
>>> > >
>>> > > This might be a candidate for a kref, depending on how cleanup ends
>>> > > up working. No need to change it now though.
>>> >
>>> > A kref doesn't make later users block until the 0->1 transitioner has
>>> > finished initializing, no?
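For reference, the init/cleanup pattern being discussed (first user allocates, last user frees, with a lock so that a second caller waits for the first one's initialization to finish, which a bare kref would not provide) can be sketched in userspace C. The names here (profiler_setup, cpu_buffers, the 4096 size) are illustrative, not the Akaros API:

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static int profiler_users;	/* hand-rolled refcount */
static void *cpu_buffers;	/* stand-in for the per-CPU buffers */

/* First caller in allocates; later callers block on the mutex until the
 * initialization is complete, then just bump the count. */
static int profiler_setup(void)
{
	pthread_mutex_lock(&mtx);
	if (profiler_users++ == 0) {
		cpu_buffers = calloc(1, 4096);
		if (!cpu_buffers) {
			profiler_users--;
			pthread_mutex_unlock(&mtx);
			return -1;
		}
	}
	pthread_mutex_unlock(&mtx);
	return 0;
}

/* Last caller out frees the buffers, mirroring profiler_cleanup(). */
static void profiler_cleanup(void)
{
	pthread_mutex_lock(&mtx);
	if (--profiler_users == 0) {
		free(cpu_buffers);
		cpu_buffers = NULL;
	}
	pthread_mutex_unlock(&mtx);
}
```

The mutex doubles as both the refcount lock and the barrier that makes a concurrent second caller wait out the first caller's allocation.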
>>>
>>> krefs never block, so I don't fully understand. My comment was that
>>> cleanup seems dangerous here. This code is hand-rolling the kref
>>> pattern, so in the end a kref might be the right tool here.
>>
>> It is not hand-rolling, because a kref will not make the immediately
>> following thread (the one going to up it) wait for the first one to do
>> the proper initialization of the resources.
>> But I think a blocking lock for init, coupled with a kref, might solve
>> the buffer free issue. See down below ...
>>
>>> > Locking is up to the reader.
>>> > In a circular buffer fashion, a write on a full buffer discards the
>>> > oldest block and writes the new one.
>>> > No, reading data does not remove data. The only way the read ptr is
>>> > moved forward is when the write ptr pushes it forward in order to
>>> > fit new data. Every block in the buffer is prepended with its size.
>>> > That is the type of the size.
>>>
>>> Great. Please put that in the source too.
>>
>> circular_buffer_write already has the comments in its body which
>> describe the operation.
>> Adding some more to the skip API.
>>
>>> > > It's a little surprising that clear also frees the allocation,
>>> > > instead of just resetting everything back to 0.
>>> >
>>> > This clears all the resources of the buffer.
>>> > I used "clear" as there is an "init".
>>> > I would have used "free" if there was an "alloc".
>>>
>>> This was originally unclear since it was not clear how the contents of
>>> the ring buffer get discarded. We have another ring buffer in the
>>> kernel that takes data, then stops taking *new* data until the old
>>> contents were 'cleared.'
>>>
>>> Either way, 'clear' doesn't sound like destroy, free, deinit, or
>>> shutdown, all of which have a more 'final' sound to it.
>>
>> Renamed clear->destroy, and had clear do the simple clear but not free
>> the resources.
>>
>>> > > What are these helpers doing?
>>> > >
>>> > The overlap check is whether a write operation is going to overwrite
>>> > the block at which rdptr resides.
>>> > The skip moves the rdptr forward until it no longer overlaps with
>>> > the incoming write.
>>>
>>> Thanks. Please put that in a comment right above the functions.
>>
>> Done.
>>
>>> Agreed, this is more of a "please try to avoid this in the future"
>>> comment
>>
>> This code was composed of three categories.
>> First, totally new code files, for which it makes little sense to add
>> single functions in single commits, especially when you, along the way,
>> totally refactor older ones because the new APIs require it (I had been
>> told to squash commits touching the same code, to avoid commenting on
>> code which is no longer there).
>> Second, totally changed files, which really had nothing more in common
>> with the previous ones.
>> Third, a lot of removed files which were no longer used.
>> Bottom line, you were looking at a multi-thousand-line CR, whose nature
>> and size would have benefited little from sugar coating in terms of
>> micro commits 😐
>>
>>> > As for the timers, for Ron's case, you can still turn them on aside
>>> > from the profiler.
>>> > From a profiler POV, I think the default behavior should be them on
>>> > everywhere, otherwise you lose the full picture of what the system
>>> > is doing. We could add a selective CPU timer and profiler turn on,
>>> > but IMO the default should be all on.
>>>
>>> One problem is I don't see how to turn on the profiler on a single
>>> core. If you start collecting, you start the timer on all cores. I
>>> don't see why it can't be the way it was before: "Pick the timers you
>>> want, then turn on the accounting."
>>>
>>> I think the fix is to have multiple commands. One is the timer (as
>>> is), one is to start just the collection, and one is the aggregate:
>>> all timers and start the collection. This is a combination of all the
>>> approaches.
>>> There's our existing style (timer, then collection) as well as your
>>> desired style (just turn the whole thing on, dammit).
>>
>> Let's please wait to see how this will be shaped once the new overflow
>> IRQ stuff comes in.
>> If you can please review the MSR stuff, I am almost done with the new
>> profiler tool which uses libpfm4 to list CPU event counters and resolve
>> them by name, allowing to start/stop/read/write the counters.
>> So the sequence is going to be:
>>
>> 1) this CR (code in place, CR under review)
>> 2) MSR (code in place, CR posted)
>> 3) New prof tool with libpfm4 (code at 75%, no CR yet)
>> 4) New overflow based sampling (no code yet)
>>
>>> > > > +size_t user_backtrace_list(uintptr_t pc, uintptr_t fp, uintptr_t *pcs,
>>> > > > + size_t nr_slots)
>>> > > > +{
>>> > > > + size_t nr_pcs = 0;
>>> > > > +
>>> > > > + for (;;) {
>>> > > > + if (unlikely(fp >= UMAPTOP) || unlikely(fp < BRK_END) ||
>>> > >
>>> > > Why not is_user_raddr()? And once you use copy_from, those checks
>>> > > aren't needed either. Also, I think it is possible for frame
>>> > > pointers to be below BRK_END, which could happen if a user stack
>>> > > was malloc'd instead of mmapped.
>>> > >
>>> > > > + unlikely(nr_pcs >= nr_slots))
>>> > > > + break;
>>> > > > +
>>> > > > + /* For now straight memory access, waiting for copy_from_user(). */
>>> > >
>>> > > copy_from is now available. =)
>>> >
>>> > I will do that in a separate CL.
>>>
>>> The beauty of Git is that you can make that small change (just use
>>> copy_from and remove the UMAPTOP and BRK_END checks) as a "WIP-commit",
>>> then do a rebase -i and meld that fixup into this existing commit.
>>
>> Yes, but this branch became older than Methuselah, which is why I
>> wanted to defer.
>> Deferring, for me, is not a month ahead; it is like 2 days (or even
>> before, if this gets merged).
>> Anyway, doing it in this CR.
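As an aside, the frame-pointer walk in user_backtrace_list() can be sketched in userspace, simplified, with plain memcpy standing in for the kernel's copy_from_user() and with a hand-built chain of fake frames; struct fake_frame, backtrace_list and the dropped pc parameter are illustrative choices, not the kernel code:

```c
#include <stdint.h>
#include <string.h>

/* A saved x86-style frame: [0] = caller's fp, [1] = return pc. */
struct fake_frame {
	uintptr_t fp;
	uintptr_t pc;
};

/* Walk the fp chain, copying each frame with a bounded copy (memcpy
 * here; in the kernel this would be copy_from_user(), which also makes
 * the UMAPTOP/BRK_END range checks unnecessary), until the chain ends
 * or nr_slots is exhausted.  Returns the number of PCs collected. */
static size_t backtrace_list(uintptr_t fp, uintptr_t *pcs, size_t nr_slots)
{
	size_t nr_pcs = 0;

	while (fp && nr_pcs < nr_slots) {
		struct fake_frame frame;

		memcpy(&frame, (void *)fp, sizeof(frame));
		pcs[nr_pcs++] = frame.pc;
		fp = frame.fp;
	}
	return nr_pcs;
}
```

A chain terminated by fp == 0 stops the walk cleanly, and nr_slots caps the output just like the kernel version.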
>>
>>> > > > +static struct trace_printk_buffer *kprof_get_printk_buffer(void)
>>> > > > +{
>>> > > > + static struct trace_printk_buffer boot_tpb;
>>> > > > + static struct trace_printk_buffer *cpu_tpbs;
>>> > > > +
>>> > > > + if (unlikely(booting))
>>> > > > + return &boot_tpb;
>>> > > > + if (unlikely(!cpu_tpbs)) {
>>> > > > + /* Poor man per-CPU data structure. I really do not like
>>> > > > + * littering global data structures with module specific data.
>>> > > > + */
>>> > > > + spin_lock_irqsave(&ktrace_lock);
>>> > > > + if (!cpu_tpbs)
>>> > > > + cpu_tpbs = kzmalloc(num_cores * sizeof(struct trace_printk_buffer), 0);
>>> > >
>>> > > If this alloc fails (flags == 0), then we'll return 0 (+coreid)
>>> > > and PF later.
>>> >
>>> > We could panic, or return boot tpb.
>>> > Today an OOM crashes the box, so I am not sure what to do here.
>>>
>>> Could do the alloc with KMALLOC_WAIT outside the lock. Then lock and
>>> try to do the assign. If someone else already assigned cpu_tpbs, then
>>> we free the one we just got.
>>
>> I think it is better not to KMALLOC_WAIT, because this might be called
>> from printk(), from places where we can't sleep.
>> Consider also the fact that this is called *very* early in the boot
>> process (due to linkage to printk), so if we do not have memory at
>> that point, we can pretty much panic() there.
>>
>>> > > > +#define trace_printx(args...) if (printx_on) trace_printk(TRUE, args)
>>> > > ^FALSE?
>>> > > We just want to print text, not a backtrace. That was also my
>>> > > intent for trace_printk (it's just the print, not a backtrace).
>>> >
>>> > Seems pretty useful to have the option to get a BT, to understand
>>> > which path you are coming from.
>>> > A "grep" of the string simply tells where the print is.
>>>
>>> I like the option for the backtrace.
>>> I'm just saying that I'd like a
>>> simple trace_printk/printx that doesn't trigger the backtrace. We
>>> already have TRACEME for the backtrace trace_printk.
>>>
>>> Anyway, if you don't want to fix it, that's fine. If I get annoyed at
>>> a long backtrace in the future when trace_printxing, I'll change it.
>>
>> trace_printx(), like the name suggests, traces a BT.
>> printx() calls printk(), which does not trace.
>>
>>> > > This probably should be in a library or header (so we can reuse
>>> > > it) and with a little explanation. It looks like we take 7 bits
>>> > > of n at a time and push it into *data, with bit 0x80 signifying
>>> > > we're done or not.
>>> >
>>> > Yep, that's what it does.
>>>
>>> Cool, please note that in the code. I have a bad memory. =)
>>
>> Done.
>>
>>> > I wasn't sure if variable byte encoding would be something a kernel
>>> > would be doing elsewhere.
>>>
>>> I wasn't sure either. It sounds pretty cool. We can keep it here for
>>> now, or move it elsewhere if you'd like.
>>
>> Yeah, when we have a search engine in the kernel, we can pull it out
>> for the posting lists 😀
>>
>>> As far as the age of this branch goes, it's a huge change that takes
>>> a while to get through. Maybe I'm just slow, but I've spent 2 days on
>>> this and counting.
>>
>> Well, they say on average it takes 1/8 of the time taken to write the
>> code to review it.
>> So do your math ☺
>>
>>> Ah, I see. In the future, you know you were given a block and you
>>> want to return that block in a ready-to-go manner.
>>>
>>> This is another case where we can make the function you need and add
>>> it to the block code. That way you also don't need to worry about
>>> checking the return value of this function (since iallocb can fail,
>>> but if you return the block you got originally, that can't fail).
>>>
>>> Not to harp on this too much, but I would rather have had that block
>>> code done as a support patch before this commit, and then used here,
>>> than doing it the way it is. That also would have been clearer to me,
>>> would have improved the kernel, and would have saved some time. =)
>>>
>>> But feel free to leave it as is now, and we can address that with a
>>> future TODO.
>>
>> IMO it is a simple performance optimization, which can be done as a
>> follow-up (together with the other drop-tail-if-overflow change).
>> Blocks are written at the rate of a handful per second.
>>
>>> > > > + qdropoverflow(profiler_queue, 1);
>>> > > > + qnonblock(profiler_queue, 1);
>>> > >
>>> > > As mentioned in a previous patch, you might not want qnonblock.
>>> >
>>> > Not sure why not?
>>>
>>> Because they do similar things in different ways.
>>
>> Removed nonblock.
>>
>>> > > > + profiler_percpu_ctx =
>>> > > > + kzmalloc(sizeof(*profiler_percpu_ctx) * num_cores, KMALLOC_WAIT);
>>> > > > + if (!profiler_percpu_ctx)
>>> > > > + error(-ENOMEM, NULL);
>>> > >
>>> > > Since you did a KMALLOC_WAIT, you will never have
>>> > > !profiler_percpu_ctx
>>> >
>>> > I am not sure waiting is a guarantee that you will get memory
>>> > anyway.
>>>
>>> I think it is in other OSs. In Akaros, the implementation isn't done
>>> yet, but the design is that if you say KMALLOC_WAIT (or whatever we
>>> call it later), your thread will block until the memory allocation
>>> succeeds.
>>
>> Removed the test for NULL.
>> IMO though, it is a bad idea to remove checks, as now you are bound to
>> a design.
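Going back to the variable byte encoding exchanged earlier: the scheme described (take 7 bits of n at a time and push them into *data, with bit 0x80 flagging whether more bytes follow) is essentially LEB128. A hedged sketch of an encoder/decoder pair, since the kernel code itself was not quoted and its exact flag polarity may differ:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode n seven bits at a time, least significant group first; the
 * 0x80 bit marks that more bytes follow (LEB128-style convention).
 * Returns the number of bytes written to data. */
static size_t vb_encode(uint64_t n, uint8_t *data)
{
	size_t i = 0;

	do {
		uint8_t byte = n & 0x7f;

		n >>= 7;
		if (n)
			byte |= 0x80;	/* more groups to come */
		data[i++] = byte;
	} while (n);
	return i;
}

/* Reassemble the 7-bit groups; returns the number of bytes consumed. */
static size_t vb_decode(const uint8_t *data, uint64_t *pn)
{
	uint64_t n = 0;
	size_t i = 0;
	int shift = 0;

	do {
		n |= (uint64_t)(data[i] & 0x7f) << shift;
		shift += 7;
	} while (data[i++] & 0x80);
	*pn = n;
	return i;
}
```

Small values cost one byte; a full 64-bit value costs ten, which is why the format converter can afford to size-prefix every block this way.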
>>
>>> > > -static inline void profiler_begin_trace(struct profiler_cpu_context *cpu_buf)
>>> > > > +int profiler_configure(struct cmdbuf *cb)
>>> > > > {
>>> > > > - cpu_buf->tracing = 1;
>>> > > > + if (!strcmp(cb->f[0], "prof_qlimit")) {
>>> > > > + if (cb->nf < 2)
>>> > > > + error(EFAIL, "prof_qlimit KB");
>>> > > > + profiler_queue_limit = atoi(cb->f[1]) * 1024;
>>> > >
>>> > > Probably want a sanity check. Also, what happens when this
>>> > > changes after the queue was already allocated?
>>> >
>>> > The configuration must be done before the profiler is started.
>>>
>>> Ok, then we need a check to return an error for this case. Otherwise
>>> a buggy user could change that setting on the fly.
>>
>> OK. I fail the queue limit change if running, while the percpu buffer
>> size can be changed on the fly.
>>
>>> > > But there's more to it than that. The assumption here is that
>>> > > once profiler_users == 0, then there is no code (RKM, pending
>>> > > IRQ, etc.) that has access to profiler_percpu_ctx or
>>> > > profiler_queue.
>>> > >
>>> > > Just as an example, say someone is calling profiler_notify_mmap(),
>>> > > which checks profiler_percpu_ctx. They get past the if () check,
>>> > > then concurrently someone calls profiler_cleanup and triggers
>>> > > free_cpu_buffers. Then the original thread eventually calls
>>> > > profiler_push_pid_mmap, then qwrite, then page faults.
>>> > >
>>> > > So a user could theoretically trigger a PF. This is part of the
>>> > > reason why I was reluctant to have you try and free the buffers.
>>> >
>>> > Well, we have no RCU ___
>>> > In such a case we should swap the ptr with NULL, and RCU-free it.
>>>
>>> RCU wouldn't help as much here either, since the
>>> profiler_notify_mmap() thread could have blocked at some point. And
>>> sleeping RCU sounds like a pain.
>>>
>>> And I imagine there are other places than this one example where it
>>> is dangerous to free the buffers.
>>>
>>> So I'm not convinced that this assumption is true:
>>
>> Sleeping RCU is not a problem. We did it in userspace (arguably more
>> difficult than the kernel) in Cassini, and the whole thing was less
>> than 1000 lines of code 😉
>>
>>> > > The assumption here is that once profiler_users == 0, then there
>>> > > is no code (RKM, pending IRQ, etc.) that has access to
>>> > > profiler_percpu_ctx or profiler_queue.
>>>
>>> One simple "fix" would be to not free the buffers, and put a TODO on
>>> it. That would alleviate the risk of the page fault (possibly caused
>>> by a malicious user) at the expense of a small amount of memory.
>>
>> Here we have two paths.
>> The one used to write normal traces, which does not block (it had
>> better not, being called from the timer IRQ).
>> The one coming from mmap and process announcements, which ATM calls
>> qwrite(), which might block.
>> I changed those to qiwrite(), as well as dropping the KMALLOC_WAIT in
>> the functions that post mmap and process announcements.
>> This had better be the case, as enumerating VMRs runs with a spinlock
>> held.
>>
>> Using a kref coupled with a qlock, we should be able to get the
>> necessary semantics.
>> I added two commits (WIP:PROFILER and WIP:KPROF) at the end of the
>> stream, which I did not squash into the proper place yet.
>> Take a look. I really do not like the idea of having resources left
>> allocated when the functionalities are not used.
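As a closing illustration of Barret's earlier suggestion for the cpu_tpbs allocation (allocate outside the lock, publish while holding it, and free the surplus copy if another CPU won the race), here is a userspace sketch that uses a C11 compare-and-swap in place of the spinlock; get_cpu_tpbs and the size parameter are illustrative names, not the kprof code:

```c
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic(void *) cpu_tpbs;

/* Allocate before publishing, so no allocation happens under a lock;
 * if another caller published first, free our copy and use theirs. */
static void *get_cpu_tpbs(size_t size)
{
	void *tpbs = atomic_load(&cpu_tpbs);
	void *fresh;

	if (tpbs)
		return tpbs;
	fresh = calloc(1, size);
	if (!fresh)
		return NULL;
	tpbs = NULL;
	if (atomic_compare_exchange_strong(&cpu_tpbs, &tpbs, fresh))
		return fresh;
	free(fresh);	/* lost the race; tpbs now holds the winner */
	return tpbs;
}
```

The lost-race free is what makes the allocation safe to do without holding any lock, at the cost of an occasionally wasted calloc.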
