Hi Lassi,

I'm on vacation until mid-August with only sporadic access to email. Your use case seems very similar to http://code.google.com/p/google-perftools/, so I'm not surprised that you ran into these specific problems.
Thanks for sharing the patches. I'll take a look once I return from vacation. Two comments:

* It should be possible to build a lock-free, async-signal-safe dl_iterate_phdr.
* Did you have libatomic_ops installed? It makes a significant difference to libunwind performance.

-Arun

On Mon, Jul 6, 2009 at 4:24 AM, Lassi Tuura <[email protected]> wrote:
> Hi,
>
> I am one of the maintainers of a performance and memory-use profiling
> package (it's called igprof). We've worked hard at providing a
> low-overhead profiler for big software, and think we have done a fairly
> decent job on IA-32 Linux.
>
> In order to support x86-64, I've looked at using libunwind 0.99.
>
> It mostly works for us, which is a relief, but I have a few concerns and
> patches and would really welcome your feedback.
>
> #1) libunwind seems to be reliable but not 100% async-signal-safe. In
> particular, if called from a signal handler (SIGPROF) at an inopportune
> time, it may deadlock. Specifically, if we get a profiling signal exactly
> when the dynamic linker is inside pthread_mutex_* or is already holding a
> lock, and libunwind calls into dl_iterate_phdr() (NB: from the same
> thread that already holds or is acquiring the lock), bad things happen,
> usually a deadlock.
>
> I'm currently entertaining the theory that either walking the ELF headers
> in memory directly (without dl_iterate_phdr() and its locks) is less
> likely to crash than deadlocking inside the dynamic linker, or that I
> should try to discard profiling signals while inside the dynamic linker.
>
> Thoughts?
>
> #2) libunwind appears to make heavy use of sigprocmask(), e.g. around
> every mutex lock/unlock operation. This causes a colossal slowdown at two
> syscalls per trip. Removing those syscalls makes a massive performance
> difference, but I assume they were there for a reason? (Included in
> patch 2.)
>
> #3) libunwind resorts to scanning /proc/self/maps if it can't find an
> executable ELF image (find_binary_for_address). On a platform we use a
> lot (RHEL4-based) this always happens for the main executable, causing a
> scan of /proc/self/maps for every stack level in the main program, which
> is, ahem, slow :-) I made a hack-ish (not thread-safe) fix for this: for
> one thing, determine the value just once, and for another, use
> readlink(/proc/self/exe) instead of scanning /proc/self/maps. I've never
> seen a null pointer for anything other than the main program. Can that
> really happen in other circumstances? Do you do it the present way
> because of some JIT situation? (Included in patch 2.)
>
> #4) We appear to be blessed with a lot of libraries which have
> insufficient DWARF unwind info, and by the looks of it, the RHEL4 GLIBC
> in particular. It looks like dwarf_find_save_locs() caches recent uses,
> but falls back on a full search if not found in the cache. It turned out
> that in our case the cache was hardly used, because it didn't cache
> negative results. I added some code to rs_new() to remember "bogus"
> (negative) results, and code in dwarf_find_save_locs() to cache negative
> replies from fetch_proc_info() and create_state_record_for(), and it
> helped performance a lot. However, I am really unsure whether I did it
> correctly and would appreciate another pair of eyes over this. (Included
> in patch 2, at the bottom.)
>
> With all these changes libunwind is better, but we still pay a very heavy
> penalty for unwinding; it's about 150 times slower than our IA-32
> unwinding code, and about 5-10 times slower in real-world apps (tracking
> every memory allocation at ~1M allocations per second). I do realise
> x86-64 is much harder to unwind, but I am looking for ways to optimise it
> further; some ideas below.
>
> I realise our situation is pushing it a little, what with multiple
> threads, ~200 MB of code and a VSIZE of 1-3 GB.
>
> Patches:
>
> 1) C++ compatibility clean-up
>
> http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-cleanup.patch?revision=1.1&view=markup
>
> 2) Various performance optimisations
>
> http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-optimise.patch?revision=1.5&view=markup
>
> Ideas:
>
> 1) Use bigger caches (DWARF_LOG_UNW_CACHE_SIZE).
>
> 2) Try to determine which frames are "varying", i.e. use VLAs or
> alloca(). If there are none in the call stack, just cache the incoming
> vs. outgoing CFA difference for every call site, and unwind that way.
> Otherwise revert to the slow unwind, at least until you get past the
> varying frames. Specifically, walk from the top, probing a cache for the
> CFA delta plus a "varying" marker. If you make it all the way to the top
> with the cache, return the call stack. If not, switch back to the normal
> slow unwind, update the cache, and go all the way to the top. Alas, I
> currently have no idea how to identify alloca/VLA-using frames. Any
> ideas?
>
> Lassi
>
>
> _______________________________________________
> Libunwind-devel mailing list
> [email protected]
> http://lists.nongnu.org/mailman/listinfo/libunwind-devel
>
