On Fri, Nov 13, 2015 at 9:19 AM, barret rhoden <[email protected]> wrote:
> > > > + qdropoverflow(profiler_queue, 1);
> > >
> > > Then later:
> > >
> > > > + qnonblock(profiler_queue, 1);
> > >
> > > This is fragile. qnonblock means "error with EAGAIN if we're full".
> > > Right now, you'll get by, since you still have qdropoverflow
> > > above. But this could break with an error() thrown from IRQ
> > > context somewhere down the line.
> > >
> >
> > Yes, they are paired. I am not sure what kind of bad condition can
> > happen.
>
> The bad condition is that you may unexpectedly get an error(). If you
> have dropoverflow, nonblock does nothing, so do not set it. Things may
> change in qio in the future.
>
OK, removing nonblock.
Coming to error(): should I always waserror() in my code, even if I have
nothing to undo?
Because that kind of defeats the exception-look-alike error model.
> > > > void profiler_cleanup(void)
> > > > {
> > > > - free_cpu_buffers();
> > > > + sem_down(&mtx);
> > > > + profiler_users--;
> > > > + if (profiler_users == 0)
> > > > + free_cpu_buffers();
> > > > + sem_up(&mtx);
> > > > }
> > >
> > > This might be a candidate for a kref, depending on how cleanup ends
> > > up working. No need to change it now though.
> > >
> >
> > krefs don't block while the 0->1 transitioner is initializing, no?
>
> krefs never block, so i don't fully understand. my comment was that
> cleanup seems dangerous here. this code is hand-rolling the kref
> pattern, so in the end a kref might be the right tool here.
>
It is not hand-rolling, because a kref will not make a second thread,
which immediately follows and ups it, wait for the first one to finish
properly initializing the resources.
But I think a blocking lock for init, coupled with a kref, might solve the
buffer-free issue. See down below ...
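For reference, a minimal sketch of the blocking-init scheme being discussed, with pthread primitives and calloc standing in for Akaros's qlock/kref and kzmalloc (all names here are illustrative assumptions, not the kernel API): the 0->1 transitioner initializes the buffers while holding a sleepable lock, so a concurrent second caller blocks until init completes, and the last user frees.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t prof_lock = PTHREAD_MUTEX_INITIALIZER;
static int prof_users;
static void *prof_bufs;

/* First caller (the 0->1 transitioner) initializes while holding the
 * sleepable lock; concurrent callers block until init is complete. */
static int profiler_setup(void)
{
	pthread_mutex_lock(&prof_lock);
	if (prof_users == 0) {
		prof_bufs = calloc(1, 4096);
		if (!prof_bufs) {
			pthread_mutex_unlock(&prof_lock);
			return -1;
		}
	}
	prof_users++;
	pthread_mutex_unlock(&prof_lock);
	return 0;
}

/* Last user frees the buffers */
static void profiler_cleanup(void)
{
	pthread_mutex_lock(&prof_lock);
	if (--prof_users == 0) {
		free(prof_bufs);
		prof_bufs = NULL;
	}
	pthread_mutex_unlock(&prof_lock);
}
```

A plain kref alone would only give the 1->0 free semantics; the lock is what makes later callers wait out the initialization.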
> Locking is up to the reader.
> > In a circular buffer fashion, a write on a full buffer discards the
> > oldest block and writes the new one.
> > No, reading data does not remove data. The only way the read ptr is
> > moved forward is when the write ptr pushes it forward in order to
> > fit new data. Every block in the buffer is pre-pended with its size.
> > That is the type of the size.
>
> Great. Please put that in the source too.
>
circular_buffer_write already has comments in its body which describe
the operation.
Adding some more to the skip API.
>
> > > It's a little surprising that clear also frees the allocation,
> > > instead of just
> > > resetting everything back to 0.
> > >
> >
> > This clears all the resources of the buffer.
> > I used "clear" as there is an "init".
> > I would have used "free" if there was an "alloc".
>
> This was originally confusing since it was not clear how the contents
> of the ring buffer get discarded. We have another ring buffer in the
> kernel that takes data, then stops taking *new* data until the old
> contents are 'cleared.'
>
> Either way, 'clear' doesn't sound like destroy, free, deinit, or
> shutdown, all of which have a more 'final' sound to them.
>
Renamed clear->destroy, and made clear do the simple reset without
freeing the resources.
>
> > > What are these helpers doing?
> > >
> >
> > The overlap check tests whether a write operation is going to
> > overwrite the block at which rdptr resides.
> > The skip moves rdptr forward until it no longer overlaps with the
> > incoming write.
>
> Thanks. Please put that in a comment right above the functions.
>
Done.
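To make the semantics above concrete (a write on a full buffer pushes rdptr past whole size-prefixed blocks; reading never consumes data), here is a hedged, self-contained sketch. The 1-byte size prefix, the monotonic offsets, and all names are assumptions for illustration, not the actual circular_buffer_write layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CB_SIZE 16

/* rd and wr are monotonically increasing offsets; the actual slot is
 * offset % CB_SIZE.  Each block is stored as [1-byte len][payload]. */
struct cbuf {
	uint8_t data[CB_SIZE];
	size_t rd;
	size_t wr;
};

static void cb_copy_in(struct cbuf *cb, size_t off, const uint8_t *src,
		       size_t len)
{
	for (size_t i = 0; i < len; i++)
		cb->data[(off + i) % CB_SIZE] = src[i];
}

static void cb_copy_out(const struct cbuf *cb, size_t off, uint8_t *dst,
			size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] = cb->data[(off + i) % CB_SIZE];
}

/* The "skip": push rd past whole blocks until the incoming write of
 * 'need' bytes no longer overlaps the block rd resides in. */
static void cb_skip_overlap(struct cbuf *cb, size_t need)
{
	while (cb->wr + need - cb->rd > CB_SIZE)
		cb->rd += 1 + cb->data[cb->rd % CB_SIZE];
}

/* A write on a full buffer discards the oldest block(s) */
static void cb_write(struct cbuf *cb, const uint8_t *buf, uint8_t len)
{
	uint8_t hdr = len;

	cb_skip_overlap(cb, 1 + (size_t) len);
	cb_copy_in(cb, cb->wr, &hdr, 1);
	cb_copy_in(cb, cb->wr + 1, buf, len);
	cb->wr += 1 + (size_t) len;
}

/* Reading does not remove data: rd is left where it is */
static uint8_t cb_read_oldest(const struct cbuf *cb, uint8_t *buf)
{
	uint8_t len = cb->data[cb->rd % CB_SIZE];

	cb_copy_out(cb, cb->rd + 1, buf, len);
	return len;
}
```

Note that the reader is expected to synchronize externally, matching the "locking is up to the reader" comment above.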
> Agreed, this is more of a "please try to avoid this in the future"
> comment.
>
This code fell into three categories.
First, totally new files, for which it makes little sense to add single
functions in single commits, especially when, along the way, you totally
refactor older ones because the new APIs require it (I had been told to
squash commits touching the same code, to avoid comments on code which
is no longer there).
Second, totally changed files, which really had nothing left in common
with the previous ones.
Third, a lot of removed files which were no longer used.
Bottom line: you were looking at a multi-thousand-line CR whose nature
and size would have benefited little from the sugar coating of micro
commits 😐
>
> > As for the timers, for Ron's case, you can still turn them on apart
> > from the profiler.
> > From a profiler POV, I think the default behavior should be to have
> > them on everywhere; otherwise you lose the full picture of what the
> > system is doing. We could add a selective per-CPU timer and profiler
> > turn-on, but IMO the default should be all on.
>
> One problem is I don't see how to turn on the profiler on a single
> core. If you start collecting, you start the timer on all cores. I
> don't see why it can't be the way it was before. "Pick the timers you
> want, then turn on the accounting."
>
> I think the fix is to have multiple commands. One is the timer (as is),
> one is to start just the collection, and one is the aggregate: all
> timers and start the collection. This is a combination of all the
> approaches. There's our existing style (timer, then collection) as
> well as your desired style (just turn the whole thing on damnit).
>
Let's please wait to see how this shapes up once the new overflow IRQ
stuff comes in.
If you can, please review the MSR stuff; I am almost done with the new
profiler tool, which uses libpfm4 to list CPU event counters and resolve
them by name, allowing us to start/stop/read/write the counters.
So the sequence is going to be:
1) this CR (code in place, CR under review)
2) MSR (code in place, CR posted)
3) New prof tool with libpfm4 (code at 75%, no CR yet)
4) New overflow based sampling (no code yet)
>
> > > > +size_t user_backtrace_list(uintptr_t pc, uintptr_t fp,
> > > > +			    uintptr_t *pcs, size_t nr_slots)
> > > > +{
> > > > +	size_t nr_pcs = 0;
> > > > +
> > > > +	for (;;) {
> > > > +		if (unlikely(fp >= UMAPTOP) || unlikely(fp < BRK_END) ||
> > >
> > > Why not is_user_raddr()? And once you use copy_from, those checks
> > > aren't needed
> > > either. Also, I think it is possible for frame pointers to be below
> > > BRK_END,
> > > which could happen if a user stack was malloc'd instead of mmapped.
> > >
> > > > + unlikely(nr_pcs >= nr_slots))
> > > > + break;
> > > > +
> > > > +		/* For now straight memory access, waiting for
> > > > +		 * copy_from_user(). */
> > >
> > > copy_from is now available. =)
> > >
> >
> > I will do that in a separate CL.
>
> The beauty of Git is that you can make that small change (just use
> copy_from and remove the UMAPTOP and BRK_END checks) as a "WIP-commit",
> then do a rebase -i and meld that fixup into this existing commit.
>
Yes, but this branch has become older than Methuselah, which is why I
wanted to defer.
Defer, for me, is not a month ahead; it is more like 2 days (or even
sooner, if this gets merged).
Anyway, doing it in this CR.
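For illustration, a hedged sketch of the frame-pointer walk with the explicit bounds checks replaced by a fallible copy_from-style accessor, as suggested in the review. Here read_user_word stands in for copy_from_user and simply reads a simulated frame chain, and the frame layout ([fp] = caller's fp, [fp + word] = return pc) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for copy_from_user: returns nonzero on failure instead of
 * page faulting.  Here it just reads normal (simulated) memory. */
static int read_user_word(uintptr_t addr, uintptr_t *val)
{
	if (!addr)
		return -1;
	memcpy(val, (const void *) addr, sizeof(*val));
	return 0;
}

/* Walk the frame-pointer chain, collecting return pcs.  Assumed frame
 * layout: [fp] = caller's fp, [fp + sizeof(uintptr_t)] = return pc. */
static size_t backtrace_list(uintptr_t pc, uintptr_t fp, uintptr_t *pcs,
			     size_t nr_slots)
{
	size_t nr_pcs = 0;

	while (nr_pcs < nr_slots) {
		uintptr_t next_fp, next_pc;

		pcs[nr_pcs++] = pc;
		if (read_user_word(fp, &next_fp) ||
		    read_user_word(fp + sizeof(uintptr_t), &next_pc))
			break;
		if (!next_fp)
			break;	/* End of the frame chain */
		pc = next_pc;
		fp = next_fp;
	}
	return nr_pcs;
}
```

With a faulting accessor, the UMAPTOP/BRK_END checks become unnecessary: a bad fp simply terminates the walk instead of page faulting.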
> > > > +static struct trace_printk_buffer *kprof_get_printk_buffer(void)
> > > > +{
> > > > + static struct trace_printk_buffer boot_tpb;
> > > > + static struct trace_printk_buffer *cpu_tpbs;
> > > > +
> > > > + if (unlikely(booting))
> > > > + return &boot_tpb;
> > > > +	if (unlikely(!cpu_tpbs)) {
> > > > +		/* Poor man per-CPU data structure. I really do not
> > > > +		 * like littering global data structures with module
> > > > +		 * specific data.
> > > > +		 */
> > > > +		spin_lock_irqsave(&ktrace_lock);
> > > > +		if (!cpu_tpbs)
> > > > +			cpu_tpbs = kzmalloc(num_cores *
> > > > +				sizeof(struct trace_printk_buffer), 0);
> > >
> > > If this alloc fails (flags == 0), then we'll return 0 (+coreid)
> > > and PF later.
> > >
> >
> > We could panic, or return the boot tpb.
> > Today an OOM crashes the box, so I am not sure what to do here.
>
> Could do the alloc with KMALLOC_WAIT outside the lock. Then lock and
> try to do the assign. If someone else already assigned cpu_tpbs, then
> we free the one we just got.
>
I think it is better not to KMALLOC_WAIT, because this might be called
from printk(), from places where we can't sleep.
Considering also that this is called *very* early in the boot process
(due to the linkage to printk), if we do not have memory at that point,
we can pretty much panic() there.
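For reference, the allocate-outside-the-lock pattern suggested above can be sketched as follows, with pthread mutexes and calloc standing in for the kernel's spinlock and kzmalloc (names are illustrative): whoever loses the assignment race frees its own copy and uses the winner's.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct tpb {
	char buf[256];
};

static struct tpb *cpu_tpbs;
static pthread_mutex_t tpb_lock = PTHREAD_MUTEX_INITIALIZER;

static struct tpb *get_tpbs(int num_cores)
{
	struct tpb *tpbs;

	if (cpu_tpbs)
		return cpu_tpbs;
	/* Allocate outside the lock, so a sleeping allocation is never
	 * attempted while the lock is held. */
	tpbs = calloc(num_cores, sizeof(struct tpb));
	if (!tpbs)
		return NULL;
	pthread_mutex_lock(&tpb_lock);
	if (!cpu_tpbs)
		cpu_tpbs = tpbs;
	else
		free(tpbs);	/* Lost the race; use the winner's array */
	pthread_mutex_unlock(&tpb_lock);
	return cpu_tpbs;
}
```

The objection in the thread still applies: this only helps when the caller is allowed to sleep in the allocator, which a printk path may not be.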
>
> > > > +#define trace_printx(args...) if (printx_on) trace_printk(TRUE, args)
> > > ^FALSE?
> > > We just want to print text, not a backtrace. That was also my
> > > intent for trace_printk (it's just the print, not a backtrace).
> > >
> >
> > Seems pretty useful to have the option to get a BT, to understand
> > which path you are coming from.
> > A "grep" of the string simply tells you where the print is.
>
> I like the option for the backtrace. I'm just saying that i'd like a
> simple trace_printk/printx that doesn't trigger the backtrace. we
> already have TRACEME for the backtrace trace_printk.
>
> Anyway, if you don't want to fix it, that's fine. If I get annoyed at
> a long backtrace in the future when trace_printxing I'll change it.
>
trace_printx(), like the name suggests, traces a BT.
printx() calls printk(), which does not trace.
> > > This probably should be in a library or header (so we can reuse it)
> > > and with a little explanation. It looks like we take 7 bits of n at
> > > a time and push it into *data, with bit 0x80 signifying whether
> > > we're done or not.
> > >
> >
> > Yep, that's what it does.
>
> Cool, please note that in the code. I have a bad memory. =)
>
Done.
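For reference, a hedged sketch of the 7-bits-at-a-time variable byte encoding described above. The thread leaves the exact polarity of the 0x80 bit open, so this sketch uses it as a continuation bit ("more bytes follow"); the names are illustrative, not the Akaros API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode n as 7-bit groups, least significant first; 0x80 set means
 * more bytes follow.  Returns the number of bytes written. */
static size_t vb_encode(uint64_t n, uint8_t *data)
{
	size_t i = 0;

	do {
		uint8_t b = n & 0x7f;

		n >>= 7;
		if (n)
			b |= 0x80;	/* More bytes follow */
		data[i++] = b;
	} while (n);

	return i;
}

/* Decode one value; returns the number of bytes consumed. */
static size_t vb_decode(const uint8_t *data, uint64_t *pn)
{
	uint64_t n = 0;
	size_t i = 0;
	int shift = 0;

	do {
		n |= (uint64_t) (data[i] & 0x7f) << shift;
		shift += 7;
	} while (data[i++] & 0x80);

	*pn = n;
	return i;
}
```

This is the same scheme search engines use for compressing posting lists, as joked about below: small values cost one byte, and a 64-bit value at most ten.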
> > I wasn't sure if variable byte encoding would be something a kernel
> > would be doing elsewhere.
>
> I wasn't sure either. It sounds pretty cool. We can keep it here for
> now, or move it elsewhere if you'd like.
>
Yeah, when we have a search engine in the kernel, we can pull it out for
the posting lists 😀
> As far as the age of this branch goes, it's a huge change that takes a
> while to get through. Maybe I'm just slow, but I've spent 2 days on
> this and counting.
>
Well, they say that on average it takes 1/8 of the time spent writing
the code to review it.
So do the math ☺
> Ah, I see. In the future, you know you were given a block and you want
> to return that block in a ready-to-go manner.
>
> This is another case where we can make the function you need and add it
> to the block code. That way you also don't need to worry about
> checking the return value of this function (since iallocb can fail, but
> if you return the block you got originally, that can't fail).
>
> Not to harp on this too much, but I would rather have had that block
> code done as a support patch before this commit, and then used here,
> than doing it the way it is. That also would have been clearer to me,
> would have improved the kernel, and would have saved some time. =)
>
> But feel free to leave it as is now, and we can address that with a
> future TODO.
>
IMO it is a simple performance optimization, which can be done as a
follow-up (together with the other drop-tail-on-overflow change).
Blocks are written at the rate of a handful per second.
>
> > > > + qdropoverflow(profiler_queue, 1);
> > > > + qnonblock(profiler_queue, 1);
> > >
> > > As mentioned in a previous patch, you might not want qnonblock.
> > >
> >
> > Not sure why not?
>
> Because they do similar things in different ways.
>
Removed nonblock.
>
> > > > +	profiler_percpu_ctx = kzmalloc(sizeof(*profiler_percpu_ctx) *
> > > > +				       num_cores, KMALLOC_WAIT);
> > > > + if (!profiler_percpu_ctx)
> > > > + error(-ENOMEM, NULL);
> > >
> > > Since you did a KMALLOC_WAIT, you will never
> > > have !profiler_percpu_ctx
> >
> > I am not sure waiting is a guarantee that you will get memory anyway.
>
> I think it is in other OSs. In Akaros, the implementation isn't done
> yet, but the design is that if you say KMALLOC_WAIT (or whatever we
> call it later), that your thread will block until the memory allocation
> succeeds.
>
Removed the test for NULL.
IMO though, it is a bad idea to remove checks, as we are now bound to
that design.
>
> > > > -static inline void profiler_begin_trace(struct profiler_cpu_context
> > > > *cpu_buf)
> > > > +int profiler_configure(struct cmdbuf *cb)
> > > > {
> > > > - cpu_buf->tracing = 1;
> > > > + if (!strcmp(cb->f[0], "prof_qlimit")) {
> > > > + if (cb->nf < 2)
> > > > + error(EFAIL, "prof_qlimit KB");
> > > > + profiler_queue_limit = atoi(cb->f[1]) * 1024;
> > >
> > > Probably want a sanity check. Also, what happens when this changes
> > > after the
> > > queue was already allocated?
> > >
> >
> > The configuration must be done before the profiler is started.
>
> Ok, then we need a check to return an error for this case. Otherwise a
> buggy user could change that setting on the fly.
>
OK. I now fail the queue limit change if the profiler is running, while
the per-CPU buffer size can be changed on the fly.
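A hedged sketch of that agreed behavior: reject a queue limit change while the profiler is running, but allow the per-CPU buffer size to change on the fly. The prof_qlimit name follows the snippet in the review; the per-CPU buffer command name, the running flag, and the bodies are illustrative assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int profiler_running;
static size_t profiler_queue_limit = 64 * 1024;
static size_t profiler_cpu_buffer_size = 8192;

/* Returns 0 on success, -1 on error (standing in for error()) */
static int profiler_configure(const char *cmd, const char *arg)
{
	if (!strcmp(cmd, "prof_qlimit")) {
		/* The queue limit sizes the queue at start time, so it
		 * cannot change while the profiler is running. */
		if (profiler_running)
			return -1;
		profiler_queue_limit = (size_t) atoi(arg) * 1024;
		return 0;
	}
	if (!strcmp(cmd, "prof_cpubufsz")) {
		/* Picked up when the next per-CPU buffer is allocated,
		 * so it is safe to change on the fly. */
		profiler_cpu_buffer_size = (size_t) atoi(arg);
		return 0;
	}
	return -1;
}
```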
> > > But there's more to it than that. The assumption here is that once
> > > profiler_users == 0, then there is no code, (RKM, pending IRQ,
> > > etc.) that has
> > > access to profiler_percpu_ctx or profiler_queue.
> > >
> > > Just as an example, say someone is calling profiler_notify_mmap(),
> > > which checks
> > > profiler_percpu_ctx. They get past the if () check, then
> > > concurrently someone
> > > calls profiler_cleanup and triggers free_cpu_buffers. Then the
> > > original thread
> > > eventually calls profiler_push_pid_mmap, then qwrite, then page
> > > faults.
> > >
> > > So a user could theoretically trigger a PF. This is part of the
> > > reason why I
> > > was reluctant to have you try and free the buffers.
> > >
> >
> > Well, we have no RCU ___
> > In that case we should swap the ptr with NULL, and RCU-free it.
>
> RCU wouldn't help as much here either, since the profiler_notify_mmap()
> thread could have blocked at some point. And sleeping RCU sounds like
> a pain.
>
> And I imagine there are other places than this one example where it is
> dangerous to free the buffers.
>
> So I'm not convinced that this assumption is true:
>
Sleeping RCU is not a problem. We did it in userspace (arguably more
difficult than in the kernel) in Cassini, and the whole thing was less
than 1000 lines of code 😉
>
> > > The assumption here is that once profiler_users == 0, then there
> > > is no code, (RKM, pending IRQ, etc.) that has access to
> > > profiler_percpu_ctx or profiler_queue.
>
> One simple "fix" would be to not free the buffers, and put a TODO on
> it. that would alleviate the risk of the page fault (possibly caused
> by a malicious user) at the expense of a small amount of memory.
>
Here we have two paths.
The one used to write normal traces does not block (it had better not,
being called from the timer IRQ).
The one coming from mmap and process announcements ATM calls qwrite(),
which might block.
I changed those to qiwrite(), as well as dropping the KMALLOC_WAIT in
the functions that post mmap and process announcements.
This had better be so, since enumerating the VMRs runs with a spinlock
held.
Using a kref coupled with a qlock, we should be able to get the
necessary semantics.
I added two commits (WIP:PROFILER and WIP:KPROF) at the end of the
stream, which I have not squashed into the proper place yet.
Take a look. I really do not like the idea of leaving resources
allocated when the functionality is not in use.
--
You received this message because you are subscribed to the Google Groups
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.