Hi Lassi,
I'm on vacation until mid-August with only sporadic access to email. Your
use case seems very similar to http://code.google.com/p/google-perftools/,
so I'm not surprised that you ran into these specific problems.

Thanks for sharing the patches. I'll take a look once I return from
vacation. Two comments:

* It should be possible to build a lock-free, async-signal-safe
dl_iterate_phdr.
* Did you have libatomic_ops installed? It makes a significant difference
to libunwind's performance.

 -Arun

On Mon, Jul 6, 2009 at 4:24 AM, Lassi Tuura <[email protected]> wrote:

> Hi,
>
> I am one of the maintainers of a certain performance and memory use
> profiling package. We've worked hard at providing a low-overhead profiler
> for big software, and think we have done a fairly decent job on IA32 Linux.
> (It's called igprof.)
>
> In order to support x86-64, I've looked at using libunwind 0.99.
>
> It seems to work mostly for us, which is a relief, but I have a few
> concerns and patches and would really welcome your feedback.
>
> #1) libunwind seems to be reliable but not 100% async-signal-safe. In
> particular, if called from a signal handler (SIGPROF) at an inopportune
> time, it may deadlock. Specifically, if we get a profiling signal exactly
> when the dynamic linker is inside pthread_mutex_* or is already holding a
> lock, and libunwind calls into dl_iterate_phdr() (NB: from the same thread
> that already holds the lock or is trying to take it), bad things happen,
> usually a deadlock.
>
> I'm currently entertaining two theories: either walking the ELF headers in
> memory directly (without dl_iterate_phdr() and its locks) is less likely to
> go wrong than deadlocking inside the dynamic linker, or I should try to
> discard profiling signals while inside the dynamic linker.
>
> Thoughts?
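A minimal sketch of the second option, deferring profiling signals around the dynamic-linker call (names here are illustrative, not libunwind's actual API):

```c
/* Sketch: block SIGPROF for the duration of dl_iterate_phdr() so the
   profiling handler can never interrupt the dynamic linker while it
   holds its internal lock.  Hypothetical helper, for illustration. */
#define _GNU_SOURCE
#include <link.h>
#include <signal.h>

static int count_object (struct dl_phdr_info *info, size_t size, void *data)
{
  (void) info; (void) size;
  ++*(int *) data;
  return 0;                                  /* 0 = keep iterating */
}

/* Count loaded ELF objects with SIGPROF deferred for the duration. */
int count_objects_deferred (void)
{
  sigset_t block, saved;
  int n = 0;

  sigemptyset (&block);
  sigaddset (&block, SIGPROF);
  sigprocmask (SIG_BLOCK, &block, &saved);   /* defer SIGPROF */
  dl_iterate_phdr (count_object, &n);
  sigprocmask (SIG_SETMASK, &saved, NULL);   /* pending signal delivers now */
  return n;
}
```

A blocked SIGPROF is delivered as soon as the saved mask is restored, so at most one sample is delayed (coalesced) rather than lost.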
>
> #2) libunwind appears to make heavy use of sigprocmask(), e.g. around every
> mutex lock/unlock operation. This causes a colossal slow-down: two extra
> syscalls per trip. Removing those syscalls makes a massive performance
> difference, but I assume they were there for a reason? (Included in patch
> 2.)
>
> #3) libunwind resorts to scanning /proc/self/maps if it can't find an
> executable ELF image (find_binary_for_address). On a platform we use a lot
> (RHEL4-based) this always happens for the main executable, causing a scan
> of /proc/self/maps for every stack level in the main program, which is,
> ahem, slow :-) I made a hack-ish (not thread-safe) fix for this: first,
> determine the value just once; second, use readlink("/proc/self/exe")
> instead of scanning /proc/self/maps. I've never seen a null pointer for
> anything other than the main program. Can that really happen in other
> circumstances? Do you do it the present way because of some JIT situation?
> (Included in patch 2.)
>
> #4) We appear to be blessed with a lot of libraries that have insufficient
> DWARF unwind info, the RHEL4 GLIBC in particular, by the looks of it. It
> looks like dwarf_find_save_locs() caches recent uses but falls back on a
> full search on a cache miss. It turned out that in our case the cache was
> hardly used, because it didn't cache negative results. I added some code to
> rs_new() to remember "bogus" (negative) results, and code in
> dwarf_find_save_locs() to cache negative replies from fetch_proc_info() and
> create_state_record_for(); it helped performance a lot. However, I am
> really unsure whether I did it correctly and would appreciate another pair
> of eyes over this. (Included in patch 2, at the bottom.)
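The general idea of negative caching, stripped of the rs_new()/dwarf_find_save_locs() specifics, can be sketched like this (all names hypothetical):

```c
#include <stdint.h>

#define CACHE_SIZE 64
#define RS_BOGUS   ((void *) -1)   /* sentinel: "known to have no info" */

struct cache_entry { uintptr_t key; void *val; };
static struct cache_entry cache[CACHE_SIZE];

static int slow_calls;             /* counts full (slow) searches */

/* Stand-in for a lookup with no unwind info: always fails. */
static void *failing_lookup (uintptr_t ip)
{
  (void) ip;
  ++slow_calls;
  return (void *) 0;
}

/* Cache both positive and negative results, so repeated lookups of an
   address without unwind info skip the full search entirely. */
void *cached_lookup (uintptr_t ip, void *(*slow_lookup) (uintptr_t))
{
  struct cache_entry *e = &cache[ip % CACHE_SIZE];
  if (e->key != ip)
    {
      e->key = ip;
      e->val = slow_lookup (ip);   /* may return NULL on failure */
      if (e->val == (void *) 0)
        e->val = RS_BOGUS;         /* remember the failure, too */
    }
  return e->val == RS_BOGUS ? (void *) 0 : e->val;
}
```

Without the RS_BOGUS sentinel, every frame in a library lacking unwind info would repeat the full search on every sample, which matches the behaviour described above.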
>
> With all these changes libunwind is better, but we still pay a very heavy
> penalty for unwinding: it's about 150 times slower than our IA-32 unwinding
> code, and about 5-10 times slower in real-world apps (tracking every memory
> allocation at ~1M allocations per second). I do realise x86-64 is much
> harder to unwind, but I am looking for ways to optimise it further; some
> ideas below.
>
> I realise our situation is pushing it a little, what with multiple threads,
> ~200 MB of code, and a VSIZE of 1-3 GB.
>
> Patches:
>
> 1) C++ compatibility clean-up
>
> http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-cleanup.patch?revision=1.1&view=markup
>
> 2) Various performance optimisations
>
> http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-optimise.patch?revision=1.5&view=markup
>
> Ideas:
>
> 1) Use bigger caches (DWARF_LOG_UNW_CACHE_SIZE).
>
> 2) Try to determine which frames are "varying", i.e. use VLAs or alloca().
> If there are none in the call stack, just cache the difference between
> incoming and outgoing CFA for every call site, and unwind using those
> deltas. Otherwise revert to the slow unwind, at least until you get past
> the varying frames. Specifically, walk from the top, probing a cache for
> the CFA delta plus a varying marker. If you make it all the way with the
> cache, return the call stack. If not, switch back to the normal slow
> unwind, update the cache, and go all the way to the top. Alas, I currently
> have no idea how to identify alloca/VLA-using frames. Any ideas?
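The fast path of the two-tier scheme sketched above could look roughly like this (all names hypothetical; the cached walk replaces the full DWARF unwind only when every frame hits):

```c
#include <stddef.h>
#include <stdint.h>

struct cfa_cache_entry {
  uintptr_t ret_addr;    /* call-site identity (return address) */
  intptr_t  cfa_delta;   /* outgoing CFA minus incoming CFA */
  int       varying;     /* frame uses alloca()/VLAs: delta unusable */
};

/* Probe the cache for every frame; on any miss or varying frame,
   return -1 so the caller falls back to the full (slow) DWARF unwind
   and refreshes the cache.  On success, return the frame count. */
int fast_unwind (const uintptr_t *ret_addrs, size_t nframes,
                 const struct cfa_cache_entry *cache, size_t cache_len,
                 uintptr_t cfa, uintptr_t *cfa_out)
{
  for (size_t i = 0; i < nframes; i++)
    {
      const struct cfa_cache_entry *hit = 0;
      for (size_t j = 0; j < cache_len; j++)
        if (cache[j].ret_addr == ret_addrs[i])
          { hit = &cache[j]; break; }
      if (hit == 0 || hit->varying)
        return -1;                 /* revert to the slow unwinder */
      cfa += hit->cfa_delta;       /* cheap: one add per frame */
      cfa_out[i] = cfa;
    }
  return (int) nframes;
}
```

The hard part, as noted, is populating the `varying` flag; the walk itself reduces to one cache probe and one addition per frame.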
>
> Lassi
>
>
> _______________________________________________
> Libunwind-devel mailing list
> [email protected]
> http://lists.nongnu.org/mailman/listinfo/libunwind-devel
>
