Here is a link to the top of the branch where the two WIPs are: https://github.com/dlibenzi/akaros/commits/profiler_dev_v2
On Fri, Nov 13, 2015 at 7:54 PM, Davide Libenzi <[email protected]> wrote:
> On Fri, Nov 13, 2015 at 9:19 AM, barret rhoden <[email protected]> wrote:
>> > > > 	 */
>> > > > 	qdropoverflow(profiler_queue, 1);
>> > >
>> > > Then later:
>> > >
>> > > > +	qnonblock(profiler_queue, 1);
>> > >
>> > > This is fragile. qnonblock means "error with EAGAIN if we're full".
>> > > Right now, you'll get by, since you still have qdropoverflow above.
>> > > But this could break with an error() thrown from IRQ context
>> > > somewhere down the line.
>> >
>> > Yes, they are paired. I am not sure which kind of bad condition can
>> > happen.
>>
>> The bad condition is that you may unexpectedly get an error(). If you
>> have dropoverflow, nonblock does nothing, so do not set it. Things may
>> change in qio in the future.
>
> OK, removing nonblock.
> Coming to error(), should I always waserror() in my code, even if I have
> nothing to be undone?
> Because that kind of defeats the exception-look-alike error model.
>
>> > > > void profiler_cleanup(void)
>> > > > {
>> > > > -	free_cpu_buffers();
>> > > > +	sem_down(&mtx);
>> > > > +	profiler_users--;
>> > > > +	if (profiler_users == 0)
>> > > > +		free_cpu_buffers();
>> > > > +	sem_up(&mtx);
>> > > > }
>> > >
>> > > This might be a candidate for a kref, depending on how cleanup ends
>> > > up working. No need to change it now though.
>> >
>> > krefs don't block while the 0->1 transitioner is initializing, no?
>>
>> krefs never block, so I don't fully understand. My comment was that
>> cleanup seems dangerous here. This code is hand-rolling the kref
>> pattern, so in the end a kref might be the right tool here.
>
> It is not hand-rolling, because a kref will not make the immediately
> following thread (the one going to up it) wait for the first one to do
> the proper initialization of the resources.
> But I think a blocking lock for init, coupled with a kref, might solve
> the buffers free issue. See down below ...
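The first-user-initializes / last-user-frees pattern being debated above can be sketched in plain C. This is a minimal model, not the Akaros code: the names are hypothetical, a pthread mutex stands in for the kernel sem/qlock, and a plain allocation stands in for the per-CPU buffers.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects discussed above. */
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static int profiler_users;   /* the hand-rolled refcount */
static int *cpu_buffers;     /* resource guarded by the count */

/* First user performs the init; later callers block on the lock until
 * the buffers are fully set up, which is the semantic a bare kref
 * cannot provide on its own. */
static void profiler_setup(void)
{
	pthread_mutex_lock(&init_lock);
	if (profiler_users++ == 0)
		cpu_buffers = calloc(4, sizeof(int));
	pthread_mutex_unlock(&init_lock);
}

/* Last user out frees the buffers, mirroring the profiler_cleanup()
 * diff in the quoted review. */
static void profiler_cleanup(void)
{
	pthread_mutex_lock(&init_lock);
	if (--profiler_users == 0) {
		free(cpu_buffers);
		cpu_buffers = NULL;
	}
	pthread_mutex_unlock(&init_lock);
}
```

With two concurrent users, the buffers survive the first cleanup and are released only by the second; a kref alone would count correctly but would not make the second setup caller wait for the first one's initialization to finish.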
>> > Locking is up to the reader.
>> > In circular buffer fashion, a write on a full buffer discards the
>> > oldest block and writes the new one.
>> > No, reading data does not remove data. The only way the read ptr is
>> > moved forward is when the write ptr pushes it forward in order to
>> > fit new data. Every block in the buffer is pre-pended with its size.
>> > That is the type of the size.
>>
>> Great. Please put that in the source too.
>
> circular_buffer_write already has the comments in its body, which
> describe the operation.
> Adding some more to the skip API.
>
>> > > It's a little surprising that clear also frees the allocation,
>> > > instead of just resetting everything back to 0.
>> >
>> > This clears all the resources of the buffer.
>> > I used "clear" as there is an "init".
>> > I would have used "free" if there was an "alloc".
>>
>> This was originally unclear since it was not clear how the contents of
>> the ring buffer get discarded. We have another ring buffer in the
>> kernel that takes data, then stops taking *new* data until the old
>> contents were 'cleared'.
>>
>> Either way, 'clear' doesn't sound like destroy, free, deinit, or
>> shutdown, all of which have a more 'final' sound to them.
>
> Renamed clear->destroy, and had clear do the simple clear but not free
> the resources.
>
>> > > What are these helpers doing?
>> >
>> > The overlap check tests whether a write operation is going to
>> > overwrite the block at which rdptr resides.
>> > The skip moves the rdptr forward until it no longer overlaps with
>> > the incoming write.
>>
>> Thanks. Please put that in a comment right above the functions.
>
> Done.
>
>> Agreed, this is more of a "please try to avoid this in the future"
>> comment.
>
> This code was composed of three categories.
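The semantics described above (size-prepended blocks; a full buffer pushes the read pointer past whole stale blocks to fit new data; reading never removes data on its own) can be modeled in a few lines of C. The names, the one-byte size header, and the tiny capacity are illustrative only, not the actual Akaros circular_buffer API.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CB_CAP 32  /* tiny capacity so the overwrite path triggers */

struct cbuf {
	uint8_t data[CB_CAP];
	size_t rd, wr;  /* monotonically increasing offsets */
};

static void cb_put(struct cbuf *cb, uint8_t b)
{
	cb->data[cb->wr++ % CB_CAP] = b;
}

static uint8_t cb_get(struct cbuf *cb)
{
	return cb->data[cb->rd++ % CB_CAP];
}

/* The "skip" helper: drop whole oldest blocks until 'need' bytes fit,
 * so rdptr always lands on a block header (the overlap/skip logic from
 * the review). */
static void cb_make_room(struct cbuf *cb, size_t need)
{
	while (CB_CAP - (cb->wr - cb->rd) < need) {
		uint8_t blk_size = cb->data[cb->rd % CB_CAP];
		cb->rd += 1 + blk_size;
	}
}

/* Write one block, pre-pended with its size. */
static void cb_write(struct cbuf *cb, const uint8_t *buf, uint8_t len)
{
	cb_make_room(cb, 1 + (size_t)len);
	cb_put(cb, len);
	for (uint8_t i = 0; i < len; i++)
		cb_put(cb, buf[i]);
}

/* Read the oldest surviving block; returns its length. */
static uint8_t cb_read(struct cbuf *cb, uint8_t *out)
{
	uint8_t len = cb_get(cb);
	for (uint8_t i = 0; i < len; i++)
		out[i] = cb_get(cb);
	return len;
}
```

Writing three 12-byte blocks (13 bytes each with header) into the 32-byte buffer silently drops the first block; a subsequent read returns the second block intact, showing that only the writer ever advances the read pointer.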
> First, totally new code files, for which it makes little sense to add
> single functions in single commits, especially when you, along the way,
> totally refactor older ones because the new APIs require it (I have
> been told to squash commits touching the same code, to avoid commenting
> on code which is no longer there).
> Second, totally changed files, which really had nothing more in common
> with the previous ones.
> Third, a lot of removed files which were no longer used.
> Bottom line, you were looking at a multi-thousand-line CR, whose nature
> and size would have been helped little by some sugar coating in terms
> of micro commits 😐
>
>> > As for the timers, for Ron's case, you can still turn them on aside
>> > from the profiler.
>> > From a profiler POV, I think the default behavior should be them on
>> > everywhere, otherwise you lose the full picture of what the system
>> > is doing. We could add a selective CPU timer and profiler turn-on,
>> > but IMO the default should be all on.
>>
>> One problem is I don't see how to turn on the profiler on a single
>> core. If you start collecting, you start the timer on all cores. I
>> don't see why it can't be the way it was before: "Pick the timers you
>> want, then turn on the accounting."
>>
>> I think the fix is to have multiple commands. One is the timer (as
>> is), one is to start just the collection, and one is the aggregate:
>> all timers and start the collection. This is a combination of all the
>> approaches. There's our existing style (timer, then collection) as
>> well as your desired style (just turn the whole thing on damnit).
>
> Let's please wait to see how this will be shaped once the new overflow
> IRQ stuff comes in.
> If you can please review the MSR stuff: I am almost done with the new
> profiler tool, which uses libpfm4 to list CPU event counters and
> resolve them by name, allowing to start/stop/read/write the counters.
> So the sequence is going to be:
>
> 1) This CR (code in place, CR under review)
> 2) MSR (code in place, CR posted)
> 3) New prof tool with libpfm4 (code at 75%, no CR yet)
> 4) New overflow based sampling (no code yet)
>
>> > > > +size_t user_backtrace_list(uintptr_t pc, uintptr_t fp,
>> > > > +			   uintptr_t *pcs, size_t nr_slots)
>> > > > +{
>> > > > +	size_t nr_pcs = 0;
>> > > > +
>> > > > +	for (;;) {
>> > > > +		if (unlikely(fp >= UMAPTOP) ||
>> > > > +		    unlikely(fp < BRK_END) ||
>> > >
>> > > Why not is_user_raddr()? And once you use copy_from, those checks
>> > > aren't needed either. Also, I think it is possible for frame
>> > > pointers to be below BRK_END, which could happen if a user stack
>> > > was malloc'd instead of mmapped.
>> > >
>> > > > +		    unlikely(nr_pcs >= nr_slots))
>> > > > +			break;
>> > > > +
>> > > > +		/* For now straight memory access, waiting for
>> > > > +		 * copy_from_user(). */
>> > >
>> > > copy_from is now available. =)
>> >
>> > I will do that in a separate CL.
>>
>> The beauty of Git is that you can make that small change (just use
>> copy_from and remove the UMAPTOP and BRK_END checks) as a
>> "WIP-commit", then do a rebase -i and meld that fixup into this
>> existing commit.
>
> Yes, but this branch became older than Methuselah; this is why I wanted
> to defer.
> Defer for me is not a month ahead, it is more like 2 days (or even
> sooner, if this gets merged).
> Anyway, doing it in this CR.
>
>> > > > +static struct trace_printk_buffer *kprof_get_printk_buffer(void)
>> > > > +{
>> > > > +	static struct trace_printk_buffer boot_tpb;
>> > > > +	static struct trace_printk_buffer *cpu_tpbs;
>> > > > +
>> > > > +	if (unlikely(booting))
>> > > > +		return &boot_tpb;
>> > > > +	if (unlikely(!cpu_tpbs)) {
>> > > > +		/* Poor man's per-CPU data structure. I really do
>> > > > +		 * not like littering global data structures with
>> > > > +		 * module specific data.
>> > > > +		 */
>> > > > +		spin_lock_irqsave(&ktrace_lock);
>> > > > +		if (!cpu_tpbs)
>> > > > +			cpu_tpbs = kzmalloc(num_cores *
>> > > > +				sizeof(struct trace_printk_buffer),
>> > > > +				0);
>> > >
>> > > If this alloc fails (flags == 0), then we'll return 0 (+coreid)
>> > > and PF later.
>> >
>> > We could panic, or return the boot tpb.
>> > Today an OOM crashes the box, so I am not sure what to do here.
>>
>> Could do the alloc with KMALLOC_WAIT outside the lock. Then lock and
>> try to do the assign. If someone else already assigned cpu_tpbs, then
>> we free the one we just got.
>
> I think it is better not to KMALLOC_WAIT, because this might be called
> from printk(), from places where we can't sleep.
> Consider also the fact that this is called *very* early in the boot
> process (due to linkage to printk), so if we do not have memory at
> that point, we can pretty much panic() there.
>
>> > > > +#define trace_printx(args...) if (printx_on) trace_printk(TRUE, args)
>> > >                                                              ^FALSE?
>> > > We just want to print text, not a backtrace. That was also my
>> > > intent for trace_printk (it's just the print, not a backtrace).
>> >
>> > Seems pretty useful to have the option to get a BT, to understand
>> > which path you are coming from.
>> > A "grep" of the string simply tells where the print is.
>>
>> I like the option for the backtrace. I'm just saying that I'd like a
>> simple trace_printk/printx that doesn't trigger the backtrace. We
>> already have TRACEME for the backtrace trace_printk.
>>
>> Anyway, if you don't want to fix it, that's fine. If I get annoyed at
>> a long backtrace in the future when trace_printxing, I'll change it.
>
> trace_printx(), like the name suggests, traces a BT.
> printx() calls printk(), which does not trace.
>
>> > > This probably should be in a library or header (so we can reuse
>> > > it) and with a little explanation.
>> > > It looks like we take 7 bits of n at a time and push it into
>> > > *data, with bit 0x80 signifying we're done or not.
>> >
>> > Yep, that's what it does.
>>
>> Cool, please note that in the code. I have a bad memory. =)
>
> Done.
>
>> > I wasn't sure if variable byte encoding would be something a kernel
>> > would be doing elsewhere.
>>
>> I wasn't sure either. It sounds pretty cool. We can keep it here for
>> now, or move it elsewhere if you'd like.
>
> Yeah, when we have a search engine in the kernel, we can pull it out
> for the posting lists 😀
>
>> As far as the age of this branch goes, it's a huge change that takes
>> a while to get through. Maybe I'm just slow, but I've spent 2 days on
>> this and counting.
>
> Well, they say on average it takes 1/8 of the time taken to write the
> code to review it.
> So do your math ☺
>
>> Ah, I see. In the future, you know you were given a block and you
>> want to return that block in a ready-to-go manner.
>>
>> This is another case where we can make the function you need and add
>> it to the block code. That way you also don't need to worry about
>> checking the return value of this function (since iallocb can fail,
>> but if you return the block you got originally, that can't fail).
>>
>> Not to harp on this too much, but I would rather have had that block
>> code done as a support patch before this commit, and then used here,
>> than doing it the way it is. That also would have been clearer to me,
>> would have improved the kernel, and would have saved some time. =)
>>
>> But feel free to leave it as is for now, and we can address that with
>> a future TODO.
>
> IMO it is a simple performance optimization, which can be done as a
> follow-up (together with the other drop-tail-if-overflow change).
> Blocks are written at the rate of a handful per second.
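The 7-bits-at-a-time encoding described earlier in this message can be sketched as below. This uses the common LEB128-style convention where a set 0x80 bit means "more bytes follow"; the actual Akaros code may use the opposite polarity, and the function names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Encode n into *data, 7 bits at a time, low groups first; bit 0x80 on
 * a byte means the value continues in the next byte. Returns the
 * number of bytes written. */
static size_t vb_encode(uint64_t n, uint8_t *data)
{
	size_t i = 0;

	do {
		uint8_t b = n & 0x7f;

		n >>= 7;
		if (n)
			b |= 0x80;  /* continuation: not done yet */
		data[i++] = b;
	} while (n);
	return i;
}

/* Inverse operation; stores the number of bytes consumed in *len. */
static uint64_t vb_decode(const uint8_t *data, size_t *len)
{
	uint64_t n = 0;
	size_t i = 0;
	int shift = 0;

	for (;;) {
		uint8_t b = data[i++];

		n |= (uint64_t)(b & 0x7f) << shift;
		shift += 7;
		if (!(b & 0x80))
			break;
	}
	*len = i;
	return n;
}
```

Small values (the common case for block sizes) cost a single byte, e.g. 5 encodes to one byte while 300 needs two, which is why the profiler uses it for the size prefixes.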
>> > > > +	qdropoverflow(profiler_queue, 1);
>> > > > +	qnonblock(profiler_queue, 1);
>> > >
>> > > As mentioned in a previous patch, you might not want qnonblock.
>> >
>> > Not sure why not?
>>
>> Because they do similar things in different ways.
>
> Removed nonblock.
>
>> > > > +	profiler_percpu_ctx =
>> > > > +		kzmalloc(sizeof(*profiler_percpu_ctx) * num_cores,
>> > > > +			 KMALLOC_WAIT);
>> > > > +	if (!profiler_percpu_ctx)
>> > > > +		error(-ENOMEM, NULL);
>> > >
>> > > Since you did a KMALLOC_WAIT, you will never have
>> > > !profiler_percpu_ctx.
>> >
>> > I am not sure waiting is a guarantee that you will get memory
>> > anyway.
>>
>> I think it is in other OSs. In Akaros, the implementation isn't done
>> yet, but the design is that if you say KMALLOC_WAIT (or whatever we
>> call it later), your thread will block until the memory allocation
>> succeeds.
>
> Removed the test for NULL.
> IMO though, it is a bad idea to remove checks, as now you are bound to
> a design.
>
>> > > > -static inline void profiler_begin_trace(struct profiler_cpu_context *cpu_buf)
>> > > > +int profiler_configure(struct cmdbuf *cb)
>> > > > {
>> > > > -	cpu_buf->tracing = 1;
>> > > > +	if (!strcmp(cb->f[0], "prof_qlimit")) {
>> > > > +		if (cb->nf < 2)
>> > > > +			error(EFAIL, "prof_qlimit KB");
>> > > > +		profiler_queue_limit = atoi(cb->f[1]) * 1024;
>> > >
>> > > Probably want a sanity check. Also, what happens when this
>> > > changes after the queue was already allocated?
>> >
>> > The configuration must be done before the profiler is started.
>>
>> OK, then we need a check to return an error for this case. Otherwise
>> a buggy user could change that setting on the fly.
>
> OK. I now fail the queue limit change if the profiler is running,
> while the per-CPU buffer size can be changed on the fly.
>
>> > > But there's more to it than that. The assumption here is that
>> > > once profiler_users == 0, then there is no code (RKM, pending
>> > > IRQ, etc.)
>> > > that has access to profiler_percpu_ctx or profiler_queue.
>> > >
>> > > Just as an example, say someone is calling
>> > > profiler_notify_mmap(), which checks profiler_percpu_ctx. They
>> > > get past the if () check, then concurrently someone calls
>> > > profiler_cleanup and triggers free_cpu_buffers. Then the original
>> > > thread eventually calls profiler_push_pid_mmap, then qwrite, then
>> > > page faults.
>> > >
>> > > So a user could theoretically trigger a PF. This is part of the
>> > > reason why I was reluctant to have you try and free the buffers.
>> >
>> > Well, we have no RCU ...
>> > In such a case we should swap the ptr with NULL, and RCU-free it.
>>
>> RCU wouldn't help as much here either, since the
>> profiler_notify_mmap() thread could have blocked at some point. And
>> sleeping RCU sounds like a pain.
>>
>> And I imagine there are other places than this one example where it
>> is dangerous to free the buffers.
>>
>> So I'm not convinced that this assumption is true:
>
> Sleeping RCU is not a problem. We did it in userspace (arguably more
> difficult than in the kernel) in Cassini, and the whole thing was less
> than 1000 lines of code 😉
>
>> > > The assumption here is that once profiler_users == 0, then there
>> > > is no code (RKM, pending IRQ, etc.) that has access to
>> > > profiler_percpu_ctx or profiler_queue.
>>
>> One simple "fix" would be to not free the buffers, and put a TODO on
>> it. That would alleviate the risk of the page fault (possibly caused
>> by a malicious user) at the expense of a small amount of memory.
>
> Here we have two paths.
> The one used to write normal traces, which does not block (it had
> better not, being called from the timer IRQ).
> The one coming from mmap and process announcements, which ATM calls
> qwrite(), which might block.
> I changed those to qiwrite(), as well as dropping the KMALLOC_WAIT in
> the functions that post mmap and process announcements.
> This had better be the case, as enumerating the VMRs runs with a
> spinlock held.
>
> Using a kref coupled with a qlock, we should be able to get the
> necessary semantics.
> I added two commits (WIP:PROFILER and WIP:KPROF) at the end of the
> stream, which I did not squash into the proper place yet.
> Take a look. I really do not like the idea of having resources left
> allocated when the functionality is not used.

--
You received this message because you are subscribed to the Google Groups "Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.
