Also, I am planning for the new perf tool, which is in the devarch_msr_cont branch, to also have options to activate and deactivate the profiler, hiding the /prof/ VFS internals. Like:
$ prof --start-profiler ..other_options..
$ prof --stop-profiler ..other_options..
...

Eventually I will move most of the kprof2perf file format converter into a library that both the new prof tool and kprof2perf can use.

On Sat, Nov 14, 2015 at 6:06 AM, Davide Libenzi <[email protected]> wrote:
> Here is a link to the top of the branch where the two WIPs are:
>
> https://github.com/dlibenzi/akaros/commits/profiler_dev_v2
>
> On Fri, Nov 13, 2015 at 7:54 PM, Davide Libenzi <[email protected]> wrote:
>
>> On Fri, Nov 13, 2015 at 9:19 AM, barret rhoden <[email protected]> wrote:
>>
>>> > > > */ qdropoverflow(profiler_queue, 1);
>>> > >
>>> > > Then later:
>>> > >
>>> > > > + qnonblock(profiler_queue, 1);
>>> > >
>>> > > This is fragile. qnonblock means "error with EAGAIN if we're full".
>>> > > Right now, you'll get by, since you still have qdropoverflow above.
>>> > > But this could break with an error() thrown from IRQ context
>>> > > somewhere down the line.
>>> >
>>> > Yes, they are paired. I am not sure which kind of bad condition can
>>> > happen.
>>>
>>> The bad condition is that you may unexpectedly get an error(). If you
>>> have dropoverflow, nonblock does nothing, so do not set it. Things may
>>> change in qio in the future.
>>
>> OK, removing nonblock.
>> Coming to error(), should I always waserror() in my code, even if I have
>> nothing to be undone?
>> Because that kind of defeats the exception-look-alike error model.
>>
>>> > > > void profiler_cleanup(void)
>>> > > > {
>>> > > > - free_cpu_buffers();
>>> > > > + sem_down(&mtx);
>>> > > > + profiler_users--;
>>> > > > + if (profiler_users == 0)
>>> > > > + free_cpu_buffers();
>>> > > > + sem_up(&mtx);
>>> > > > }
>>> > >
>>> > > This might be a candidate for a kref, depending on how cleanup ends
>>> > > up working. No need to change it now though.
>>> >
>>> > A kref doesn't make later users block until the 0->1 transitioner has
>>> > finished initializing, no?
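For reference, the init/cleanup pattern being discussed (first user allocates, last user frees, with a lock so that a second caller waits for the first one's initialization to finish, which a bare kref would not provide) can be sketched in userspace C. The names here (profiler_setup, cpu_buffers, the 4096 size) are illustrative, not the Akaros API:

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static int profiler_users;	/* hand-rolled refcount */
static void *cpu_buffers;	/* stand-in for the per-CPU buffers */

/* First caller in allocates; later callers block on the mutex until the
 * initialization is complete, then just bump the count. */
static int profiler_setup(void)
{
	pthread_mutex_lock(&mtx);
	if (profiler_users++ == 0) {
		cpu_buffers = calloc(1, 4096);
		if (!cpu_buffers) {
			profiler_users--;
			pthread_mutex_unlock(&mtx);
			return -1;
		}
	}
	pthread_mutex_unlock(&mtx);
	return 0;
}

/* Last caller out frees the buffers, mirroring profiler_cleanup(). */
static void profiler_cleanup(void)
{
	pthread_mutex_lock(&mtx);
	if (--profiler_users == 0) {
		free(cpu_buffers);
		cpu_buffers = NULL;
	}
	pthread_mutex_unlock(&mtx);
}
```

The mutex doubles as both the refcount lock and the barrier that makes a concurrent second caller wait out the first caller's allocation.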
>>>
>>> krefs never block, so I don't fully understand. My comment was that
>>> cleanup seems dangerous here. This code is hand-rolling the kref
>>> pattern, so in the end a kref might be the right tool here.
>>
>> It is not hand-rolling, because a kref will not make the immediately
>> following thread (the one going to up it) wait for the first one to do
>> the proper initialization of the resources.
>> But I think a blocking lock for init, coupled with a kref, might solve
>> the buffer free issue. See down below ...
>>
>>> > Locking is up to the reader.
>>> > In a circular buffer fashion, a write on a full buffer discards the
>>> > oldest block and writes the new one.
>>> > No, reading data does not remove data. The only way the read ptr is
>>> > moved forward is when the write ptr pushes it forward in order to
>>> > fit new data. Every block in the buffer is prepended with its size.
>>> > That is the type of the size.
>>>
>>> Great. Please put that in the source too.
>>
>> circular_buffer_write already has the comments in its body which
>> describe the operation.
>> Adding some more to the skip API.
>>
>>> > > It's a little surprising that clear also frees the allocation,
>>> > > instead of just resetting everything back to 0.
>>> >
>>> > This clears all the resources of the buffer.
>>> > I used "clear" as there is an "init".
>>> > I would have used "free" if there was an "alloc".
>>>
>>> This was originally unclear since it was not clear how the contents of
>>> the ring buffer get discarded. We have another ring buffer in the
>>> kernel that takes data, then stops taking *new* data until the old
>>> contents were 'cleared.'
>>>
>>> Either way, 'clear' doesn't sound like destroy, free, deinit, or
>>> shutdown, all of which have a more 'final' sound to it.
>>
>> Renamed clear->destroy, and had clear do the simple clear but not free
>> the resources.
>>
>>> > > What are these helpers doing?
>>> > >
>>> > The overlap check is whether a write operation is going to overwrite
>>> > the block at which rdptr resides.
>>> > The skip moves the rdptr forward until it no longer overlaps with
>>> > the incoming write.
>>>
>>> Thanks. Please put that in a comment right above the functions.
>>
>> Done.
>>
>>> Agreed, this is more of a "please try to avoid this in the future"
>>> comment
>>
>> This code was composed of three categories.
>> First, totally new code files, for which it makes little sense to add
>> single functions in single commits, especially when you, along the way,
>> totally refactor older ones because the new APIs require it (I had been
>> told to squash commits touching the same code, to avoid commenting on
>> code which is no longer there).
>> Second, totally changed files, which really had nothing more in common
>> with the previous ones.
>> Third, a lot of removed files which were no longer used.
>> Bottom line, you were looking at a multi-thousand-line CR, whose nature
>> and size would have benefited little from sugar coating in terms of
>> micro commits 😐
>>
>>> > As for the timers, for Ron's case, you can still turn them on aside
>>> > from the profiler.
>>> > From a profiler POV, I think the default behavior should be them on
>>> > everywhere, otherwise you lose the full picture of what the system
>>> > is doing. We could add a selective CPU timer and profiler turn on,
>>> > but IMO the default should be all on.
>>>
>>> One problem is I don't see how to turn on the profiler on a single
>>> core. If you start collecting, you start the timer on all cores. I
>>> don't see why it can't be the way it was before: "Pick the timers you
>>> want, then turn on the accounting."
>>>
>>> I think the fix is to have multiple commands. One is the timer (as
>>> is), one is to start just the collection, and one is the aggregate:
>>> all timers and start the collection. This is a combination of all the
>>> approaches.
>>> There's our existing style (timer, then collection) as well as your
>>> desired style (just turn the whole thing on, dammit).
>>
>> Let's please wait to see how this will be shaped once the new overflow
>> IRQ stuff comes in.
>> If you can please review the MSR stuff, I am almost done with the new
>> profiler tool which uses libpfm4 to list CPU event counters and resolve
>> them by name, allowing to start/stop/read/write the counters.
>> So the sequence is going to be:
>>
>> 1) this CR (code in place, CR under review)
>> 2) MSR (code in place, CR posted)
>> 3) New prof tool with libpfm4 (code at 75%, no CR yet)
>> 4) New overflow based sampling (no code yet)
>>
>>> > > > +size_t user_backtrace_list(uintptr_t pc, uintptr_t fp, uintptr_t *pcs,
>>> > > > + size_t nr_slots)
>>> > > > +{
>>> > > > + size_t nr_pcs = 0;
>>> > > > +
>>> > > > + for (;;) {
>>> > > > + if (unlikely(fp >= UMAPTOP) || unlikely(fp < BRK_END) ||
>>> > >
>>> > > Why not is_user_raddr()? And once you use copy_from, those checks
>>> > > aren't needed either. Also, I think it is possible for frame
>>> > > pointers to be below BRK_END, which could happen if a user stack
>>> > > was malloc'd instead of mmapped.
>>> > >
>>> > > > + unlikely(nr_pcs >= nr_slots))
>>> > > > + break;
>>> > > > +
>>> > > > + /* For now straight memory access, waiting for copy_from_user(). */
>>> > >
>>> > > copy_from is now available. =)
>>> >
>>> > I will do that in a separate CL.
>>>
>>> The beauty of Git is that you can make that small change (just use
>>> copy_from and remove the UMAPTOP and BRK_END checks) as a "WIP-commit",
>>> then do a rebase -i and meld that fixup into this existing commit.
>>
>> Yes, but this branch became older than Methuselah, which is why I
>> wanted to defer.
>> Deferring, for me, is not a month ahead; it is like 2 days (or even
>> before, if this gets merged).
>> Anyway, doing it in this CR.
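As an aside, the frame-pointer walk in user_backtrace_list() can be sketched in userspace, simplified, with plain memcpy standing in for the kernel's copy_from_user() and with a hand-built chain of fake frames; struct fake_frame, backtrace_list and the dropped pc parameter are illustrative choices, not the kernel code:

```c
#include <stdint.h>
#include <string.h>

/* A saved x86-style frame: [0] = caller's fp, [1] = return pc. */
struct fake_frame {
	uintptr_t fp;
	uintptr_t pc;
};

/* Walk the fp chain, copying each frame with a bounded copy (memcpy
 * here; in the kernel this would be copy_from_user(), which also makes
 * the UMAPTOP/BRK_END range checks unnecessary), until the chain ends
 * or nr_slots is exhausted.  Returns the number of PCs collected. */
static size_t backtrace_list(uintptr_t fp, uintptr_t *pcs, size_t nr_slots)
{
	size_t nr_pcs = 0;

	while (fp && nr_pcs < nr_slots) {
		struct fake_frame frame;

		memcpy(&frame, (void *)fp, sizeof(frame));
		pcs[nr_pcs++] = frame.pc;
		fp = frame.fp;
	}
	return nr_pcs;
}
```

A chain terminated by fp == 0 stops the walk cleanly, and nr_slots caps the output just like the kernel version.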
>>
>>> > > > +static struct trace_printk_buffer *kprof_get_printk_buffer(void)
>>> > > > +{
>>> > > > + static struct trace_printk_buffer boot_tpb;
>>> > > > + static struct trace_printk_buffer *cpu_tpbs;
>>> > > > +
>>> > > > + if (unlikely(booting))
>>> > > > + return &boot_tpb;
>>> > > > + if (unlikely(!cpu_tpbs)) {
>>> > > > + /* Poor man per-CPU data structure. I really do not like
>>> > > > + * littering global data structures with module specific data.
>>> > > > + */
>>> > > > + spin_lock_irqsave(&ktrace_lock);
>>> > > > + if (!cpu_tpbs)
>>> > > > + cpu_tpbs = kzmalloc(num_cores * sizeof(struct trace_printk_buffer), 0);
>>> > >
>>> > > If this alloc fails (flags == 0), then we'll return 0 (+coreid)
>>> > > and PF later.
>>> >
>>> > We could panic, or return boot tpb.
>>> > Today an OOM crashes the box, so I am not sure what to do here.
>>>
>>> Could do the alloc with KMALLOC_WAIT outside the lock. Then lock and
>>> try to do the assign. If someone else already assigned cpu_tpbs, then
>>> we free the one we just got.
>>
>> I think it is better not to KMALLOC_WAIT, because this might be called
>> from printk(), from places where we can't sleep.
>> Consider also the fact that this is called *very* early in the boot
>> process (due to linkage to printk), so if we do not have memory at
>> that point, we can pretty much panic() there.
>>
>>> > > > +#define trace_printx(args...) if (printx_on) trace_printk(TRUE, args)
>>> > > ^FALSE?
>>> > > We just want to print text, not a backtrace. That was also my
>>> > > intent for trace_printk (it's just the print, not a backtrace).
>>> >
>>> > Seems pretty useful to have the option to get a BT, to understand
>>> > which path you are coming from.
>>> > A "grep" of the string simply tells where the print is.
>>>
>>> I like the option for the backtrace.
>>> I'm just saying that I'd like a
>>> simple trace_printk/printx that doesn't trigger the backtrace. We
>>> already have TRACEME for the backtrace trace_printk.
>>>
>>> Anyway, if you don't want to fix it, that's fine. If I get annoyed at
>>> a long backtrace in the future when trace_printxing, I'll change it.
>>
>> trace_printx(), like the name suggests, traces a BT.
>> printx() calls printk(), which does not trace.
>>
>>> > > This probably should be in a library or header (so we can reuse
>>> > > it) and with a little explanation. It looks like we take 7 bits
>>> > > of n at a time and push it into *data, with bit 0x80 signifying
>>> > > we're done or not.
>>> >
>>> > Yep, that's what it does.
>>>
>>> Cool, please note that in the code. I have a bad memory. =)
>>
>> Done.
>>
>>> > I wasn't sure if variable byte encoding would be something a kernel
>>> > would be doing elsewhere.
>>>
>>> I wasn't sure either. It sounds pretty cool. We can keep it here for
>>> now, or move it elsewhere if you'd like.
>>
>> Yeah, when we have a search engine in the kernel, we can pull it out
>> for the posting lists 😀
>>
>>> As far as the age of this branch goes, it's a huge change that takes
>>> a while to get through. Maybe I'm just slow, but I've spent 2 days on
>>> this and counting.
>>
>> Well, they say on average it takes 1/8 of the time taken to write the
>> code to review it.
>> So do your math ☺
>>
>>> Ah, I see. In the future, you know you were given a block and you
>>> want to return that block in a ready-to-go manner.
>>>
>>> This is another case where we can make the function you need and add
>>> it to the block code. That way you also don't need to worry about
>>> checking the return value of this function (since iallocb can fail,
>>> but if you return the block you got originally, that can't fail).
>>>
>>> Not to harp on this too much, but I would rather have had that block
>>> code done as a support patch before this commit, and then used here,
>>> than doing it the way it is. That also would have been clearer to me,
>>> would have improved the kernel, and would have saved some time. =)
>>>
>>> But feel free to leave it as is now, and we can address that with a
>>> future TODO.
>>
>> IMO it is a simple performance optimization, which can be done as a
>> follow-up (together with the other drop-tail-if-overflow change).
>> Blocks are written at the rate of a handful per second.
>>
>>> > > > + qdropoverflow(profiler_queue, 1);
>>> > > > + qnonblock(profiler_queue, 1);
>>> > >
>>> > > As mentioned in a previous patch, you might not want qnonblock.
>>> >
>>> > Not sure why not?
>>>
>>> Because they do similar things in different ways.
>>
>> Removed nonblock.
>>
>>> > > > + profiler_percpu_ctx =
>>> > > > + kzmalloc(sizeof(*profiler_percpu_ctx) * num_cores, KMALLOC_WAIT);
>>> > > > + if (!profiler_percpu_ctx)
>>> > > > + error(-ENOMEM, NULL);
>>> > >
>>> > > Since you did a KMALLOC_WAIT, you will never have
>>> > > !profiler_percpu_ctx
>>> >
>>> > I am not sure waiting is a guarantee that you will get memory
>>> > anyway.
>>>
>>> I think it is in other OSs. In Akaros, the implementation isn't done
>>> yet, but the design is that if you say KMALLOC_WAIT (or whatever we
>>> call it later), your thread will block until the memory allocation
>>> succeeds.
>>
>> Removed the test for NULL.
>> IMO though, it is a bad idea to remove checks, as now you are bound to
>> a design.
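Going back to the variable byte encoding exchanged earlier: the scheme described (take 7 bits of n at a time and push them into *data, with bit 0x80 flagging whether more bytes follow) is essentially LEB128. A hedged sketch of an encoder/decoder pair, since the kernel code itself was not quoted and its exact flag polarity may differ:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode n seven bits at a time, least significant group first; the
 * 0x80 bit marks that more bytes follow (LEB128-style convention).
 * Returns the number of bytes written to data. */
static size_t vb_encode(uint64_t n, uint8_t *data)
{
	size_t i = 0;

	do {
		uint8_t byte = n & 0x7f;

		n >>= 7;
		if (n)
			byte |= 0x80;	/* more groups to come */
		data[i++] = byte;
	} while (n);
	return i;
}

/* Reassemble the 7-bit groups; returns the number of bytes consumed. */
static size_t vb_decode(const uint8_t *data, uint64_t *pn)
{
	uint64_t n = 0;
	size_t i = 0;
	int shift = 0;

	do {
		n |= (uint64_t)(data[i] & 0x7f) << shift;
		shift += 7;
	} while (data[i++] & 0x80);
	*pn = n;
	return i;
}
```

Small values cost one byte; a full 64-bit value costs ten, which is why the format converter can afford to size-prefix every block this way.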
>>
>>> > > -static inline void profiler_begin_trace(struct profiler_cpu_context *cpu_buf)
>>> > > > +int profiler_configure(struct cmdbuf *cb)
>>> > > > {
>>> > > > - cpu_buf->tracing = 1;
>>> > > > + if (!strcmp(cb->f[0], "prof_qlimit")) {
>>> > > > + if (cb->nf < 2)
>>> > > > + error(EFAIL, "prof_qlimit KB");
>>> > > > + profiler_queue_limit = atoi(cb->f[1]) * 1024;
>>> > >
>>> > > Probably want a sanity check. Also, what happens when this
>>> > > changes after the queue was already allocated?
>>> >
>>> > The configuration must be done before the profiler is started.
>>>
>>> Ok, then we need a check to return an error for this case. Otherwise
>>> a buggy user could change that setting on the fly.
>>
>> OK. I fail the queue limit change if running, while the percpu buffer
>> size can be changed on the fly.
>>
>>> > > But there's more to it than that. The assumption here is that
>>> > > once profiler_users == 0, then there is no code (RKM, pending
>>> > > IRQ, etc.) that has access to profiler_percpu_ctx or
>>> > > profiler_queue.
>>> > >
>>> > > Just as an example, say someone is calling profiler_notify_mmap(),
>>> > > which checks profiler_percpu_ctx. They get past the if () check,
>>> > > then concurrently someone calls profiler_cleanup and triggers
>>> > > free_cpu_buffers. Then the original thread eventually calls
>>> > > profiler_push_pid_mmap, then qwrite, then page faults.
>>> > >
>>> > > So a user could theoretically trigger a PF. This is part of the
>>> > > reason why I was reluctant to have you try and free the buffers.
>>> >
>>> > Well, we have no RCU ___
>>> > In such a case we should swap the ptr with NULL, and RCU-free it.
>>>
>>> RCU wouldn't help as much here either, since the
>>> profiler_notify_mmap() thread could have blocked at some point. And
>>> sleeping RCU sounds like a pain.
>>>
>>> And I imagine there are other places than this one example where it
>>> is dangerous to free the buffers.
>>>
>>> So I'm not convinced that this assumption is true:
>>
>> Sleeping RCU is not a problem. We did it in userspace (arguably more
>> difficult than the kernel) in Cassini, and the whole thing was less
>> than 1000 lines of code 😉
>>
>>> > > The assumption here is that once profiler_users == 0, then there
>>> > > is no code (RKM, pending IRQ, etc.) that has access to
>>> > > profiler_percpu_ctx or profiler_queue.
>>>
>>> One simple "fix" would be to not free the buffers, and put a TODO on
>>> it. That would alleviate the risk of the page fault (possibly caused
>>> by a malicious user) at the expense of a small amount of memory.
>>
>> Here we have two paths.
>> The one used to write normal traces, which does not block (it had
>> better not, being called from the timer IRQ).
>> The one coming from mmap and process announcements, which ATM calls
>> qwrite(), which might block.
>> I changed those to qiwrite(), as well as dropping the KMALLOC_WAIT in
>> the functions that post mmap and process announcements.
>> This had better be the case, as enumerating VMRs runs with a spinlock
>> held.
>>
>> Using a kref coupled with a qlock, we should be able to get the
>> necessary semantics.
>> I added two commits (WIP:PROFILER and WIP:KPROF) at the end of the
>> stream, which I did not squash into the proper place yet.
>> Take a look. I really do not like the idea of having resources left
>> allocated when the functionalities are not used.
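As a closing illustration of Barret's earlier suggestion for the cpu_tpbs allocation (allocate outside the lock, publish while holding it, and free the surplus copy if another CPU won the race), here is a userspace sketch that uses a C11 compare-and-swap in place of the spinlock; get_cpu_tpbs and the size parameter are illustrative names, not the kprof code:

```c
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic(void *) cpu_tpbs;

/* Allocate before publishing, so no allocation happens under a lock;
 * if another caller published first, free our copy and use theirs. */
static void *get_cpu_tpbs(size_t size)
{
	void *tpbs = atomic_load(&cpu_tpbs);
	void *fresh;

	if (tpbs)
		return tpbs;
	fresh = calloc(1, size);
	if (!fresh)
		return NULL;
	tpbs = NULL;
	if (atomic_compare_exchange_strong(&cpu_tpbs, &tpbs, fresh))
		return fresh;
	free(fresh);	/* lost the race; tpbs now holds the winner */
	return tpbs;
}
```

The lost-race free is what makes the allocation safe to do without holding any lock, at the cost of an occasionally wasted calloc.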
