Re: Should we build a new in-process unwind library?

2014-01-05 Thread Benoit Girard
My goal is to make SPS's stack easier to grab. SPS provides native stack
(if possible) plus pseudo stack. So it generally has more data then just a
native stack and is much more portable.

That being said making the unwind library independent is a win-win.


On Thu, Jan 2, 2014 at 11:03 AM, Jim Chen nc...@mozilla.com wrote:

 This is great! Thanks for working on it! Can the new library be used
 independently outside of SPS? For hang detection (bug 909974) we'd
 like to have the ability to unwind the hung thread's stack, and this
 library would be perfect for that.

 Cheers,
 Jim


 On 12/19/13 2:04 PM, Julian Seward wrote:
 
  Here's a status update.
 
  Recap: I proposed to create a new library to do fast in process
  CFI and EXIDX based stack unwinding, so as to be able to profile
  on tablets and phones with less overhead than using Breakpad.
 
  The library now exists.  Tracking bug is 938157.  It does
  CFI unwinding on x86_64-linux and arm-android, EXIDX on arm-android,
  and stack scanning (as a last resort) on both.  Initial
  integration with SPS has been completed and it produces plausible
  profiles at least on x86_64-linux.
 
  Compared with the best Breakpad based schemes, this library gives
  easily a factor of 10 cost reduction for CFI unwinding.  My best
  attempts with Breakpad achieved a cost of about 6600 insns/frame
  on x86_64-linux.  The new library comes in at around 470 insns/frame,
  without much attempt at optimisation.
 
  It also facilitates implementation of the the kind of space-saving
  techniques documented in
 
 https://blog.mozilla.org/jseward/2013/09/03/how-compactly-can-cfiexidx-stack-unwinding-info-be-represented/
 
  J
 
  ___
  dev-platform mailing list
  dev-platform@lists.mozilla.org
  https://lists.mozilla.org/listinfo/dev-platform
 
 ___
 dev-platform mailing list
 dev-platform@lists.mozilla.org
 https://lists.mozilla.org/listinfo/dev-platform

___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Should we build a new in-process unwind library?

2014-01-02 Thread Jim Chen
This is great! Thanks for working on it! Can the new library be used
independently outside of SPS? For hang detection (bug 909974) we'd
like to have the ability to unwind the hung thread's stack, and this
library would be perfect for that.

Cheers,
Jim


On 12/19/13 2:04 PM, Julian Seward wrote:
 
 Here's a status update.
 
 Recap: I proposed to create a new library to do fast in process
 CFI and EXIDX based stack unwinding, so as to be able to profile
 on tablets and phones with less overhead than using Breakpad.
 
 The library now exists.  Tracking bug is 938157.  It does
 CFI unwinding on x86_64-linux and arm-android, EXIDX on arm-android,
 and stack scanning (as a last resort) on both.  Initial
 integration with SPS has been completed and it produces plausible
 profiles at least on x86_64-linux.
 
 Compared with the best Breakpad based schemes, this library gives
 easily a factor of 10 cost reduction for CFI unwinding.  My best
 attempts with Breakpad achieved a cost of about 6600 insns/frame
 on x86_64-linux.  The new library comes in at around 470 insns/frame,
 without much attempt at optimisation.
 
 It also facilitates implementation of the the kind of space-saving
 techniques documented in
 https://blog.mozilla.org/jseward/2013/09/03/how-compactly-can-cfiexidx-stack-unwinding-info-be-represented/
 
 J
 
 ___
 dev-platform mailing list
 dev-platform@lists.mozilla.org
 https://lists.mozilla.org/listinfo/dev-platform
 
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Should we build a new in-process unwind library?

2013-08-30 Thread Julian Seward

For a while now, Fennec and Linux-desktop nightlies have had the
built-in SPS-based time profiling using native stack unwinding.  The
unwinding is done using our in-tree Breakpad copy, which includes a
bunch of speedup patches that are slowly making their way upstream.

What we have works, but is less than ideal in a number of ways:

(1) It's slow.  CFI/EXIDX unwinding with Breakpad costs about 6600
instructions/frame, or around 120+K/instructions per stack trace.
That makes it infeasible for high-frequency, low overhead
profiling.  By comparison, Valgrind's CFI based unwinder costs
around 220 instructions per frame -- that's 30 times fewer.

This posting

https://blog.mozilla.org/jseward/2013/08/29/how-fast-can-cfiexidx-based-stack-unwinding-be/
looks at the reasons behind the discrepancy.  In summary, Breakpad
is architected to be general, using inheritance (to handle
multiple architectures) and std::map (to handle arbitrary register
sets), both of which slow it down a lot.  It also does a bunch of
dynamic memory allocation which is difficult to get rid of in some
corner cases (obscure Win32 unwinding hacks)

(2) It has a prohibitively large footprint for Fennec and in
particular for FirefoxOS.  My rough measurements show it taking
about 100MB of memory to load the CFI unwind rules for libxul.so
on ARM.

(3) Breakpad is not multithreaded.  Currently SPS can sample multiple
threads, but those samples are processed (unwound) by a single
unwinder thread.  It would be clearly a big throughput advantage
to be able to throw multiple cores at the unwinding workload.

(4) (not directly related to Breakpad, but..) Current SPS samples
threads at some constant rate regardless of whether they are
running or blocked in syscalls.  The latter constitutes a big
waste of compute resources, since most threads are blocked in
syscalls most of the time.  It would be nice to devise an adaptive
strategy that samples blocked threads less often.

My feeling is that re-architecting Breakpad to fix (1), (2) and (3) is
going to be a difficult and messy exercise, which will essentially
throwing away the core of it and starting over.  Getting changes on
such a scale back upstream could be problematic too.

I am very tempted to create a new custom unwind library designed
specifically to support SPS.  It needs to be fast, lower-footprint,
and multithreaded.  Unlike Breakpad, it will -- at least initially --
avoid supporting all Tier 1 targets, and concentrate on the most
critical cases: ARM/FirefoxOS, ARM/Android (via EXIDX) and (as a side
effect) {x86,x86_64}/Linux (via CFI).

The unwinding core loop and rule representation would be an all-new
implementation.  Those are the parts that are time and space critical,
but they are also relatively simple.  The CFI and EXIDX parsers are
something that could be imported from Breakpad, since they work, are
difficult to get right, are not performance critical, and there's no
point in reimplementing them.

I am not sure how to deal with (4).  Really it's completely unrelated
to the unwind mechanism -- it has to do with SPS' sampling policy at
the level above.  There have been some ideas floated around to detect
when a thread is in a syscall by looking at the instruction at the
program counter, since the enter the kernel instructions are
different from all other ones.  That doesn't give a way to avoid on
average half a long-interval delay of missed samples when a thread
leaves a syscall.

Comments, opinions?

J
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Should we build a new in-process unwind library?

2013-08-30 Thread Robert O'Callahan
On Fri, Aug 30, 2013 at 8:58 PM, Julian Seward jsew...@acm.org wrote:

 I am not sure how to deal with (4).  Really it's completely unrelated
 to the unwind mechanism -- it has to do with SPS' sampling policy at
 the level above.  There have been some ideas floated around to detect
 when a thread is in a syscall by looking at the instruction at the
 program counter, since the enter the kernel instructions are
 different from all other ones.  That doesn't give a way to avoid on
 average half a long-interval delay of missed samples when a thread
 leaves a syscall.


Being able to sample uniformly whether or not the thread is in a syscall is
important sometimes, so let's not lose that.

To sample only non-blocked states can't we count CPU clocks in user space,
instead of the real-time clock, or something like that?

Rob
-- 
Jtehsauts  tshaei dS,o n Wohfy  Mdaon  yhoaus  eanuttehrotraiitny  eovni
le atrhtohu gthot sf oirng iyvoeu rs ihnesa.rt sS?o  Whhei csha iids  teoa
stiheer :p atroa lsyazye,d  'mYaonu,r  sGients  uapr,e  tfaokreg iyvoeunr,
'm aotr  atnod  sgaoy ,h o'mGee.t  uTph eann dt hwea lmka'n?  gBoutt  uIp
waanndt  wyeonut  thoo mken.o w  *
*
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Should we build a new in-process unwind library?

2013-08-30 Thread Robert O'Callahan
On Fri, Aug 30, 2013 at 9:41 PM, Julian Seward jsew...@acm.org wrote:

 That said, AFAICS the core problem is that once a thread blocks and we drop
 the sampling rate, we have no obvious (direct, notification-based) way to
 know when it has resumed, and hence are in danger of losing high-frequency
 samples in gap between when it really resumed and when we (later) notice
 that it has.


There's ptrace, but that sucks in various ways :-).

Rob
-- 
Jtehsauts  tshaei dS,o n Wohfy  Mdaon  yhoaus  eanuttehrotraiitny  eovni
le atrhtohu gthot sf oirng iyvoeu rs ihnesa.rt sS?o  Whhei csha iids  teoa
stiheer :p atroa lsyazye,d  'mYaonu,r  sGients  uapr,e  tfaokreg iyvoeunr,
'm aotr  atnod  sgaoy ,h o'mGee.t  uTph eann dt hwea lmka'n?  gBoutt  uIp
waanndt  wyeonut  thoo mken.o w  *
*
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Should we build a new in-process unwind library?

2013-08-30 Thread Jeff Muizelaar

On 2013-08-30, at 4:58 AM, Julian Seward wrote:

 I am very tempted to create a new custom unwind library designed
 specifically to support SPS.  It needs to be fast, lower-footprint,
 and multithreaded.  Unlike Breakpad, it will -- at least initially --
 avoid supporting all Tier 1 targets, and concentrate on the most
 critical cases: ARM/FirefoxOS, ARM/Android (via EXIDX) and (as a side
 effect) {x86,x86_64}/Linux (via CFI).

Have you looked at libbacktrace? 
(https://github.com/mirrors/gcc/tree/master/libbacktrace)

Is it a more useful starting point?

-Jeff

___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Should we build a new in-process unwind library?

2013-08-30 Thread Nathan Froyd
- Original Message -
 On 08/30/2013 11:07 AM, Robert O'Callahan wrote:
  Being able to sample uniformly whether or not the thread is in a syscall is
  important sometimes, so let's not lose that.
 
 I am not proposing to lose that facility.  Am only proposing (really,
 merely parroting Taras' suggestion) that it would be nice to have the option
 of some kind of adaptive backoff for threads blocked in syscalls.

One way of handling this is the following: When you take a sample, replace the 
return address for the current stack frame with the address of a known function 
inside your profile (call this a trampoline function).  Now the call stack 
looks something like:

  A -- B -- C -- D --Trampoline-- E

where the sample was taken in E.  The trampoline function enables us to keep 
track of what some portion of the stack looks like without having to unwind it. 
 Unwinding on subsequent samples proceeds until we encounter a frame whose 
return address is the trampoline function.  At that point, we know everything 
else above that frame is the same as whatever it was the last time we took a 
sample.

There are now three interesting scenarios of what can happen before we take the 
next sample.

1) E does not return.  It may call other functions, it may do some computation, 
or it may just be blocked on a syscall.  Whatever the case, we take a sample, 
find that E's return address is our trampoline function, and stop.

2) E calls F calls G, and then a sample is taken.  Just prior to the sample, we 
have:

  A -- B -- C -- D --Trampoline-- E -- F -- G

The profiler fixes things up to do:

  A -- B -- C -- D -- E -- F --Trampoline-- G

and we go about our merry way.

3) E returns.  Instead of returning to D, it returns to our trampoline 
function.  Said function then replaces the return address of D's stack frame 
with the address of the trampoline function, and returns to the appropriate PC 
in D:

  A -- B -- C --Trampoline-- D

This can happen several times (D returns to C, etc.).

Depending on how deep the call stack is and how frequently we push and pop 
stack frames, being able to not unwind some number of frames can be a 
significant savings.

-Nathan
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Should we build a new in-process unwind library?

2013-08-30 Thread Ehsan Akhgari
On Fri, Aug 30, 2013 at 9:27 AM, Nathan Froyd froy...@mozilla.com wrote:

 - Original Message -
  On 08/30/2013 11:07 AM, Robert O'Callahan wrote:
   Being able to sample uniformly whether or not the thread is in a
 syscall is
   important sometimes, so let's not lose that.
 
  I am not proposing to lose that facility.  Am only proposing (really,
  merely parroting Taras' suggestion) that it would be nice to have the
 option
  of some kind of adaptive backoff for threads blocked in syscalls.

 One way of handling this is the following: When you take a sample, replace
 the return address for the current stack frame with the address of a known
 function inside your profile (call this a trampoline function).  Now the
 call stack looks something like:

   A -- B -- C -- D --Trampoline-- E

 where the sample was taken in E.  The trampoline function enables us to
 keep track of what some portion of the stack looks like without having to
 unwind it.  Unwinding on subsequent samples proceeds until we encounter a
 frame whose return address is the trampoline function.  At that point, we
 know everything else above that frame is the same as whatever it was the
 last time we took a sample.

 There are now three interesting scenarios of what can happen before we
 take the next sample.

 1) E does not return.  It may call other functions, it may do some
 computation, or it may just be blocked on a syscall.  Whatever the case, we
 take a sample, find that E's return address is our trampoline function, and
 stop.

 2) E calls F calls G, and then a sample is taken.  Just prior to the
 sample, we have:

   A -- B -- C -- D --Trampoline-- E -- F -- G

 The profiler fixes things up to do:

   A -- B -- C -- D -- E -- F --Trampoline-- G

 and we go about our merry way.

 3) E returns.  Instead of returning to D, it returns to our trampoline
 function.  Said function then replaces the return address of D's stack
 frame with the address of the trampoline function, and returns to the
 appropriate PC in D:

   A -- B -- C --Trampoline-- D

 This can happen several times (D returns to C, etc.).

 Depending on how deep the call stack is and how frequently we push and pop
 stack frames, being able to not unwind some number of frames can be a
 significant savings.


Isn't this assuming that the stack frames are unique?  How would you handle
recursion under this scheme?

Cheers,

--
Ehsan
http://ehsanakhgari.org/
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Should we build a new in-process unwind library?

2013-08-30 Thread Nathan Froyd
- Original Message - 
  Depending on how deep the call stack is and how frequently we push and pop
  stack frames, being able to not unwind some number of frames can be a
  significant savings.
 
 Isn't this assuming that the stack frames are unique? How would you handle
 recursion under this scheme?

I do not understand the question.  Recursive calls get new stack frames.  Stack 
frames from recursive calls and stack frames from non-recursive calls are 
handled identically.

Maybe you meant tail calls?  Tail calls require no special treatment.

-Nathan
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Should we build a new in-process unwind library?

2013-08-30 Thread Ehsan Akhgari

On 2013-08-30 10:16 AM, Nathan Froyd wrote:

- Original Message -

Depending on how deep the call stack is and how frequently we push and pop
stack frames, being able to not unwind some number of frames can be a
significant savings.


Isn't this assuming that the stack frames are unique? How would you handle
recursion under this scheme?


I do not understand the question.  Recursive calls get new stack frames.  Stack 
frames from recursive calls and stack frames from non-recursive calls are 
handled identically.


Nevermind, I think your solution is talking about function addresses, 
not stack frames.


Ehsan

___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Should we build a new in-process unwind library?

2013-08-30 Thread Ehsan Akhgari

On 2013-08-30 11:07 AM, Nathan Froyd wrote:

- Original Message -

Nevermind, I think your solution is talking about function addresses,
not stack frames.


If you think my solution is talking about function addresses, then I have 
explained something poorly.  Why do you think it is talking about function 
addresses?


I misread it, it was my fault!

Ehsan

___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Should we build a new in-process unwind library?

2013-08-30 Thread Steve Fink
On 08/30/2013 06:27 AM, Nathan Froyd wrote:
 - Original Message -
 On 08/30/2013 11:07 AM, Robert O'Callahan wrote:
 Being able to sample uniformly whether or not the thread is in a syscall is
 important sometimes, so let's not lose that.
 I am not proposing to lose that facility.  Am only proposing (really,
 merely parroting Taras' suggestion) that it would be nice to have the option
 of some kind of adaptive backoff for threads blocked in syscalls.
 One way of handling this is the following: When you take a sample, replace 
 the return address for the current stack frame with the address of a known 
 function inside your profile (call this a trampoline function).  Now the 
 call stack looks something like:

Another approach with similar benefits is what I was thinking of in my
comment to jseward's blog post. I implemented something very similar as
an optimization for our dynamic rooting analysis, which has similar
stackwalking requirements.

In my scheme, you periodically push markers onto the stack. The
existing SPS markers would be good for this, though we might want to
insert more for this purpose. They will end up scattered throughout the
real stack. Chain the markers together in a simple linked list, with a
global or thread-local list header updated separately on every push/pop.
(I think SPS markers already do this.) Keep a seen status bit in each
marker, initialized to 0 when the marker is pushed.

When sampling, scan up the seen bits, flipping them from 0 to 1 until
you hit the first that is already 1. That marks the deepest point you
need to search in the stack, since you've already seen the other frames.

Note that if you take a sample, pop a couple of frames, push those same
frames back, and take another sample, this scheme will force you to
re-scan those new frames again. On the bright side, this allows you to
distinguish this case from the more usual case of those frames never
having been popped in between samples. But only at a marker granularity;
I don't know if this information can be usefully surfaced in a UI or not.

This scheme would reduce the cost of scanning sleeping threads, since
you'd only need to scan up to the first marker.

The major drawback of this scheme is that it relies on manually placing
markers in the code so that they occur frequently enough in the stack to
be useful. That has side benefits too, though, since you can use
semantic labeling, a la SPS markers, to better categorize the work being
done.

Still, if you want to lift this restriction, you could combine the
Finkish and Froydulent schemes: in addition to the linked list of
markers, you also tag native frames during a sample by replacing their
return addresses with jumps to trampolines. (The trampoline in this case
is much simpler: it's just a direct jump to the original return
address.) You can now derive a per-frame seen bit by checking whether
it's within your trampoline buffer's address range. It's automatically
and implicitly 0 when pushing a frame normally, set to 1 during sampling
by swapping in the trampoline address, and deleted when the frame is
popped. No special logic to massage the stack is required, as in the
Froydulent trampolines.

One drawback of any trampoline approach is that it would confuse
debuggers. I'm not sure if there's a way with CFI to label the whole
trampoline region with the rule pull the return address out of the
operand to this jump instruction, and leave all registers as-is.
Similarly, Breakpad and any other stack unwinders would need to be
taught to deal with them.

A potential advantage of manual markers is that they can serve as
synchronization points in your stackwalk -- if your native frame walk
gets lost somewhere, you can store enough info in the markers to be able
to pick things up again at the next marker's frame. (I'm skeptical that
this would be worth the effort, though -- the unwind state you'd need to
save is platform-specific and hairy. You might be better off turning on
exception unwind info and relying on that.)

With this in place, a sleeping thread sample would be walked once, and
then on the next sample only the first return address would be considered.

Finally, one minor wrinkle with manual markers that I hit when
implementing this for the rooting analysis -- if you have multiple
markers in one frame, you can't predict which one the compiler will put
on the stack first. So when scanning in linked list order, you're ok --
all seen markers will be visited before all unseen markers. But that
order may not be in monotonic address order. And you may get multiple
markers in one frame from inlining, so it's not always easy to predict
or avoid. Maybe that doesn't matter for this application?


   A -- B -- C -- D --Trampoline-- E

 where the sample was taken in E.  The trampoline function enables us to keep 
 track of what some portion of the stack looks like without having to unwind 
 it.  Unwinding on subsequent samples proceeds until we encounter a frame