For a while now, Fennec and Linux-desktop nightlies have had the
built-in SPS-based time profiling using native stack unwinding.  The
unwinding is done using our in-tree Breakpad copy, which includes a
bunch of speedup patches that are slowly making their way upstream.

What we have works, but is less than ideal in a number of ways:

(1) It's slow.  CFI/EXIDX unwinding with Breakpad costs about 6600
    instructions/frame, or around 120+K/instructions per stack trace.
    That makes it infeasible for high-frequency, low overhead
    profiling.  By comparison, Valgrind's CFI based unwinder costs
    around 220 instructions per frame -- that's 30 times fewer.

    This posting

https://blog.mozilla.org/jseward/2013/08/29/how-fast-can-cfiexidx-based-stack-unwinding-be/
    looks at the reasons behind the discrepancy.  In summary, Breakpad
    is architected to be general, using inheritance (to handle
    multiple architectures) and std::map (to handle arbitrary register
    sets), both of which slow it down a lot.  It also does a bunch of
    dynamic memory allocation which is difficult to get rid of in some
    corner cases (obscure Win32 unwinding hacks)

(2) It has a prohibitively large footprint for Fennec and in
    particular for FirefoxOS.  My rough measurements show it taking
    about 100MB of memory to load the CFI unwind rules for libxul.so
    on ARM.

(3) Breakpad is not multithreaded.  Currently SPS can sample multiple
    threads, but those samples are processed (unwound) by a single
    unwinder thread.  It would be clearly a big throughput advantage
    to be able to throw multiple cores at the unwinding workload.

(4) (not directly related to Breakpad, but..) Current SPS samples
    threads at some constant rate regardless of whether they are
    running or blocked in syscalls.  The latter constitutes a big
    waste of compute resources, since most threads are blocked in
    syscalls most of the time.  It would be nice to devise an adaptive
    strategy that samples blocked threads less often.

My feeling is that re-architecting Breakpad to fix (1), (2) and (3) is
going to be a difficult and messy exercise, which will essentially
throwing away the core of it and starting over.  Getting changes on
such a scale back upstream could be problematic too.

I am very tempted to create a new custom unwind library designed
specifically to support SPS.  It needs to be fast, lower-footprint,
and multithreaded.  Unlike Breakpad, it will -- at least initially --
avoid supporting all Tier 1 targets, and concentrate on the most
critical cases: ARM/FirefoxOS, ARM/Android (via EXIDX) and (as a side
effect) {x86,x86_64}/Linux (via CFI).

The unwinding core loop and rule representation would be an all-new
implementation.  Those are the parts that are time and space critical,
but they are also relatively simple.  The CFI and EXIDX parsers are
something that could be imported from Breakpad, since they work, are
difficult to get right, are not performance critical, and there's no
point in reimplementing them.

I am not sure how to deal with (4).  Really it's completely unrelated
to the unwind mechanism -- it has to do with SPS' sampling policy at
the level above.  There have been some ideas floated around to detect
when a thread is in a syscall by looking at the instruction at the
program counter, since the "enter the kernel" instructions are
different from all other ones.  That doesn't give a way to avoid on
average half a long-interval delay of missed samples when a thread
leaves a syscall.

Comments, opinions?

J
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform

Reply via email to