Hi,

I worked on two optimizations to libunwind:

- I replaced the global cache with a thread-local cache. This removes the need for some locking.
- Instead of restoring all registers for every stack frame, I restore only EBP, EIP and ESP.
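The thread-local cache idea can be sketched roughly like this (a minimal illustration, not the actual libunwind code; `cache_entry_t`, `lookup_slow()` and the cache size are made up for the example):

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_SIZE 256  /* hypothetical; the real cache is sized differently */

typedef struct {
    uintptr_t ip;       /* key: instruction pointer being unwound */
    uintptr_t cfa_off;  /* cached value (stand-in for real unwind info) */
} cache_entry_t;

/* One cache per thread: no mutex and no sigprocmask() on the fast path. */
static __thread cache_entry_t tls_cache[CACHE_SIZE];

static uintptr_t lookup_slow(uintptr_t ip)
{
    /* Stand-in for the expensive DWARF/unwind-table lookup. */
    return ip * 2 + 1;
}

uintptr_t cached_lookup(uintptr_t ip)
{
    cache_entry_t *e = &tls_cache[ip % CACHE_SIZE];
    if (e->ip != ip) {          /* miss: fill the slot from the slow path */
        e->ip = ip;
        e->cfa_off = lookup_slow(ip);
    }
    return e->cfa_off;          /* hit: no locking of any kind needed */
}
```

Because each thread owns its cache, no other thread can observe a half-updated entry, which is what lets the global lock (and the signal blocking around it) disappear.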
The code is here: https://github.com/fdoray/libunwind/tree/minimal_regs

Warning: This version of the library is no longer signal-safe and has undefined behavior if more registers than EBP/EIP/ESP are required to unwind a stack frame (this never happened in the few tests I have run so far). These two limitations could easily be overcome in the future.

Performance results, on x86_64:

unw_backtrace(), original libunwind:
  Mean time per backtrace: 6130 ns / 80% of samples between 1479 and 13837 ns.
unw_backtrace(), modified libunwind:
  Mean time per backtrace: 4255 ns / 80% of samples between 1526 and 5252 ns.
unw_step()/unw_get_reg() [1], original libunwind:
  Mean time per backtrace: 43520 ns / 80% of samples between 13705 and 58782 ns.
unw_step()/unw_get_reg(), modified libunwind:
  Mean time per backtrace: 5804 ns / 80% of samples between 2844 and 11325 ns.

Francois

[1] As in the example presented here: http://www.nongnu.org/libunwind/man/libunwind(3).html

On Mon, Mar 16, 2015 at 8:35 PM, Brian Robbins <[email protected]> wrote:
> Hi All,
>
> This is great. Thank you very much for the information.
>
> -Brian
>
> *From:* Francis Giraldeau [mailto:[email protected]]
> *Sent:* Thursday, March 12, 2015 11:34 AM
> *To:* Mathieu Desnoyers
> *Cc:* Brian Robbins; [email protected]
> *Subject:* Re: [lttng-dev] Userspace Tracing and Backtraces
>
> 2015-03-10 21:47 GMT-04:00 Mathieu Desnoyers <[email protected]>:
> > Francis: Did you define UNW_LOCAL_ONLY before including
> > the libunwind header in your benchmarks? (see
> > http://www.nongnu.org/libunwind/man/libunwind%283%29.html)
> >
> > This seems to change performance dramatically, according to the
> > documentation.
>
> Yes, this is the case. The time to unwind is higher at the beginning
> (probably related to building the internal cache), and it also varies
> according to call-stack depth.
>
> > Agreed on having the backtrace as a context.
> > The main question left is to figure out if we want to call libunwind
> > from within the traced application execution context.
> >
> > Unfortunately, libunwind is not reentrant wrt signals. This is already
> > a good argument for not calling it from within a tracepoint. I wonder
> > if the authors of libunwind would be open to making it signal-reentrant
> > in the future (not by disabling signals, but rather by keeping a TLS
> > nesting counter, and returning an error if nested, for performance
> > considerations).
>
> The functions unw_init_local() and unw_step() are signal-safe [1]. The
> critical sections are protected using lock_acquire(), which blocks all
> signals before taking the mutex, which prevents recursion:
>
> #define lock_acquire(l,m)                               \
>   do {                                                  \
>     SIGPROCMASK (SIG_SETMASK, &unwi_full_mask, &(m));   \
>     mutex_lock (l);                                     \
>   } while (0)
>
> #define lock_release(l,m)                               \
>   do {                                                  \
>     mutex_unlock (l);                                   \
>     SIGPROCMASK (SIG_SETMASK, &(m), NULL);              \
>   } while (0)
>
> To understand the implications, I wrote a small program to study nested
> signals [2], where a signal is sent from within a signal handler, or a
> segmentation fault occurs inside a signal handler. Blocking a signal
> defers it until it is unblocked, while ignored signals are discarded.
> Blocked signals that can't be ignored get their default behaviour. This
> prevents a possible deadlock, say if lock_acquire() nested with a custom
> SIGSEGV handler trying to take the same lock.
>
> So, let's say that instead of blocking signals, we had a per-thread
> mutex and returned when try_lock() failed. It would be faster, but from
> the user's point of view, backtraces would be dropped randomly. I would
> prefer it a bit slower, but reliable.
>
> In addition, could it be that TLS is not signal-safe [3]?
>
> > or using the perf capture mechanism that you describe below?
>
> Perf is peeking at userspace from kernel space; that's another story.
> I guess that libunwind was not ported to the kernel because it is a large
> chunk of complicated code that performs a lot of I/O and computation, while
> copying a portion of the stack is really about KISS and low runtime
> overhead.
>
> > If using libunwind does not work out, another alternative I would
> > consider would be to copy the stack like perf is doing from the kernel.
> > However, in the spirit of compacting trace data, I would be tempted to
> > do the following if we go down that route: check each pointer-aligned
> > address for its content. If it looks like a pointer to an executable
> > memory area (library, executable, or JIT'd code), we keep it. Else, we
> > zero this information (not needed). We can then do an RLE-like
> > compression on the zeroes, so we can keep the layout of the stack after
> > uncompression.
>
> Interesting! For comparison, here is a perf event [4] that shows there is
> a lot of room for reducing the event size. We should check whether
> discarding the other saved register values on the stack impacts restoring
> the instruction pointer register. Doing the unwind offline also solves
> signal safety, and it should be fast and scalable.
>
> Francis
>
> [1] http://www.nongnu.org/libunwind/man/unw_init_local(3).html
> [2] https://gist.github.com/giraldeau/98f08161e83a7ab800ea
> [3] https://sourceware.org/glibc/wiki/TLSandSignals
> [4] http://pastebin.com/sByfXXAQ
>
> _______________________________________________
> lttng-dev mailing list
> [email protected]
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
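The TLS nesting counter suggested in the quoted discussion above could look roughly like this (a sketch of the idea only, not libunwind code; `do_unwind_locked()` and the `-EBUSY` convention are made up for illustration):

```c
#include <errno.h>
#include <signal.h>

/* Per-thread, so a signal handler interrupting an unwind on the same
 * thread sees a non-zero count.  sig_atomic_t keeps the individual
 * loads and stores atomic with respect to signal delivery. */
static __thread volatile sig_atomic_t unw_nesting;

static int do_unwind_locked(void)
{
    /* Placeholder for the real cache/unwind critical section. */
    return 0;
}

/* Returns 0 on success, or -EBUSY when called re-entrantly from a
 * signal handler that interrupted an unwind already in progress on
 * this thread, instead of blocking all signals around a mutex. */
int unw_reentrant_step(void)
{
    int ret;

    if (unw_nesting++ > 0) {   /* nested: bail out instead of deadlocking */
        unw_nesting--;
        return -EBUSY;
    }
    ret = do_unwind_locked();
    unw_nesting--;
    return ret;
}
```

This matches the trade-off discussed above: the nested call fails fast rather than waiting, so some backtraces taken from signal handlers would be dropped instead of delayed.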
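The stack-compaction scheme described in the quote above (keep words that look like code pointers, zero the rest, run-length encode the zeroes) could be prototyped along these lines. This is only a sketch: the executable-area check is reduced to a single hypothetical [lo, hi) text range, and `compress_stack()`/`decompress_stack()` and the (zero-run, kept-word) pair encoding are inventions for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* A word "looks like a code pointer" if it falls inside an executable
 * mapping; reduced here to one [lo, hi) range for the sketch. */
static int looks_like_text(uintptr_t w, uintptr_t lo, uintptr_t hi)
{
    return w >= lo && w < hi;
}

/* Encode the stack as (zero_run, kept_word) pairs: each pair means
 * "zero_run dropped words, then one kept word".  Returns the number
 * of pairs written; out must hold up to 2*n words. */
size_t compress_stack(const uintptr_t *stack, size_t n,
                      uintptr_t lo, uintptr_t hi, uintptr_t *out)
{
    size_t pairs = 0, zeroes = 0;
    for (size_t i = 0; i < n; i++) {
        if (looks_like_text(stack[i], lo, hi)) {
            out[2 * pairs]     = zeroes;    /* run of zeroed words */
            out[2 * pairs + 1] = stack[i];  /* the kept pointer */
            pairs++;
            zeroes = 0;
        } else {
            zeroes++;   /* non-pointer data: drop it, keep only the count */
        }
    }
    return pairs;
}

/* Rebuild the stack layout, with the dropped words restored as zeroes,
 * so offsets within the stack are preserved for offline unwinding. */
void decompress_stack(const uintptr_t *pairs, size_t npairs,
                      uintptr_t *stack, size_t n)
{
    size_t i = 0;
    for (size_t p = 0; p < npairs && i < n; p++) {
        for (uintptr_t z = 0; z < pairs[2 * p] && i < n; z++)
            stack[i++] = 0;
        stack[i++] = pairs[2 * p + 1];
    }
    while (i < n)
        stack[i++] = 0;   /* trailing run of dropped words */
}
```

A real version would test candidate words against the process's actual executable mappings (e.g. from /proc/self/maps) rather than a single range, and a trailing zero run would need an explicit terminator if the original stack size were not recorded alongside the event.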
_______________________________________________
lttng-dev mailing list
[email protected]
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
