Re: Should we build a new in-process unwind library?
My goal is to make SPS's stack easier to grab. SPS provides native stack (if possible) plus pseudo stack. So it generally has more data then just a native stack and is much more portable. That being said making the unwind library independent is a win-win. On Thu, Jan 2, 2014 at 11:03 AM, Jim Chen nc...@mozilla.com wrote: This is great! Thanks for working on it! Can the new library be used independently outside of SPS? For hang detection (bug 909974) we'd like to have the ability to unwind the hung thread's stack, and this library would be perfect for that. Cheers, Jim On 12/19/13 2:04 PM, Julian Seward wrote: Here's a status update. Recap: I proposed to create a new library to do fast in process CFI and EXIDX based stack unwinding, so as to be able to profile on tablets and phones with less overhead than using Breakpad. The library now exists. Tracking bug is 938157. It does CFI unwinding on x86_64-linux and arm-android, EXIDX on arm-android, and stack scanning (as a last resort) on both. Initial integration with SPS has been completed and it produces plausible profiles at least on x86_64-linux. Compared with the best Breakpad based schemes, this library gives easily a factor of 10 cost reduction for CFI unwinding. My best attempts with Breakpad achieved a cost of about 6600 insns/frame on x86_64-linux. The new library comes in at around 470 insns/frame, without much attempt at optimisation. It also facilitates implementation of the the kind of space-saving techniques documented in https://blog.mozilla.org/jseward/2013/09/03/how-compactly-can-cfiexidx-stack-unwinding-info-be-represented/ J ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Re: Should we build a new in-process unwind library?
This is great! Thanks for working on it! Can the new library be used independently outside of SPS? For hang detection (bug 909974) we'd like to have the ability to unwind the hung thread's stack, and this library would be perfect for that. Cheers, Jim On 12/19/13 2:04 PM, Julian Seward wrote: Here's a status update. Recap: I proposed to create a new library to do fast in process CFI and EXIDX based stack unwinding, so as to be able to profile on tablets and phones with less overhead than using Breakpad. The library now exists. Tracking bug is 938157. It does CFI unwinding on x86_64-linux and arm-android, EXIDX on arm-android, and stack scanning (as a last resort) on both. Initial integration with SPS has been completed and it produces plausible profiles at least on x86_64-linux. Compared with the best Breakpad based schemes, this library gives easily a factor of 10 cost reduction for CFI unwinding. My best attempts with Breakpad achieved a cost of about 6600 insns/frame on x86_64-linux. The new library comes in at around 470 insns/frame, without much attempt at optimisation. It also facilitates implementation of the the kind of space-saving techniques documented in https://blog.mozilla.org/jseward/2013/09/03/how-compactly-can-cfiexidx-stack-unwinding-info-be-represented/ J ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Should we build a new in-process unwind library?
For a while now, Fennec and Linux-desktop nightlies have had the built-in SPS-based time profiling using native stack unwinding. The unwinding is done using our in-tree Breakpad copy, which includes a bunch of speedup patches that are slowly making their way upstream. What we have works, but is less than ideal in a number of ways: (1) It's slow. CFI/EXIDX unwinding with Breakpad costs about 6600 instructions/frame, or around 120+K/instructions per stack trace. That makes it infeasible for high-frequency, low overhead profiling. By comparison, Valgrind's CFI based unwinder costs around 220 instructions per frame -- that's 30 times fewer. This posting https://blog.mozilla.org/jseward/2013/08/29/how-fast-can-cfiexidx-based-stack-unwinding-be/ looks at the reasons behind the discrepancy. In summary, Breakpad is architected to be general, using inheritance (to handle multiple architectures) and std::map (to handle arbitrary register sets), both of which slow it down a lot. It also does a bunch of dynamic memory allocation which is difficult to get rid of in some corner cases (obscure Win32 unwinding hacks) (2) It has a prohibitively large footprint for Fennec and in particular for FirefoxOS. My rough measurements show it taking about 100MB of memory to load the CFI unwind rules for libxul.so on ARM. (3) Breakpad is not multithreaded. Currently SPS can sample multiple threads, but those samples are processed (unwound) by a single unwinder thread. It would be clearly a big throughput advantage to be able to throw multiple cores at the unwinding workload. (4) (not directly related to Breakpad, but..) Current SPS samples threads at some constant rate regardless of whether they are running or blocked in syscalls. The latter constitutes a big waste of compute resources, since most threads are blocked in syscalls most of the time. It would be nice to devise an adaptive strategy that samples blocked threads less often. My feeling is that re-architecting Breakpad to fix (1), (2) and (3) is going to be a difficult and messy exercise, which will essentially throwing away the core of it and starting over. Getting changes on such a scale back upstream could be problematic too. I am very tempted to create a new custom unwind library designed specifically to support SPS. It needs to be fast, lower-footprint, and multithreaded. Unlike Breakpad, it will -- at least initially -- avoid supporting all Tier 1 targets, and concentrate on the most critical cases: ARM/FirefoxOS, ARM/Android (via EXIDX) and (as a side effect) {x86,x86_64}/Linux (via CFI). The unwinding core loop and rule representation would be an all-new implementation. Those are the parts that are time and space critical, but they are also relatively simple. The CFI and EXIDX parsers are something that could be imported from Breakpad, since they work, are difficult to get right, are not performance critical, and there's no point in reimplementing them. I am not sure how to deal with (4). Really it's completely unrelated to the unwind mechanism -- it has to do with SPS' sampling policy at the level above. There have been some ideas floated around to detect when a thread is in a syscall by looking at the instruction at the program counter, since the enter the kernel instructions are different from all other ones. That doesn't give a way to avoid on average half a long-interval delay of missed samples when a thread leaves a syscall. Comments, opinions? J ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Re: Should we build a new in-process unwind library?
On Fri, Aug 30, 2013 at 8:58 PM, Julian Seward jsew...@acm.org wrote: I am not sure how to deal with (4). Really it's completely unrelated to the unwind mechanism -- it has to do with SPS' sampling policy at the level above. There have been some ideas floated around to detect when a thread is in a syscall by looking at the instruction at the program counter, since the enter the kernel instructions are different from all other ones. That doesn't give a way to avoid on average half a long-interval delay of missed samples when a thread leaves a syscall. Being able to sample uniformly whether or not the thread is in a syscall is important sometimes, so let's not lose that. To sample only non-blocked states can't we count CPU clocks in user space, instead of the real-time clock, or something like that? Rob -- Jtehsauts tshaei dS,o n Wohfy Mdaon yhoaus eanuttehrotraiitny eovni le atrhtohu gthot sf oirng iyvoeu rs ihnesa.rt sS?o Whhei csha iids teoa stiheer :p atroa lsyazye,d 'mYaonu,r sGients uapr,e tfaokreg iyvoeunr, 'm aotr atnod sgaoy ,h o'mGee.t uTph eann dt hwea lmka'n? gBoutt uIp waanndt wyeonut thoo mken.o w * * ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Re: Should we build a new in-process unwind library?
On Fri, Aug 30, 2013 at 9:41 PM, Julian Seward jsew...@acm.org wrote: That said, AFAICS the core problem is that once a thread blocks and we drop the sampling rate, we have no obvious (direct, notification-based) way to know when it has resumed, and hence are in danger of losing high-frequency samples in gap between when it really resumed and when we (later) notice that it has. There's ptrace, but that sucks in various ways :-). Rob -- Jtehsauts tshaei dS,o n Wohfy Mdaon yhoaus eanuttehrotraiitny eovni le atrhtohu gthot sf oirng iyvoeu rs ihnesa.rt sS?o Whhei csha iids teoa stiheer :p atroa lsyazye,d 'mYaonu,r sGients uapr,e tfaokreg iyvoeunr, 'm aotr atnod sgaoy ,h o'mGee.t uTph eann dt hwea lmka'n? gBoutt uIp waanndt wyeonut thoo mken.o w * * ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Re: Should we build a new in-process unwind library?
On 2013-08-30, at 4:58 AM, Julian Seward wrote: I am very tempted to create a new custom unwind library designed specifically to support SPS. It needs to be fast, lower-footprint, and multithreaded. Unlike Breakpad, it will -- at least initially -- avoid supporting all Tier 1 targets, and concentrate on the most critical cases: ARM/FirefoxOS, ARM/Android (via EXIDX) and (as a side effect) {x86,x86_64}/Linux (via CFI). Have you looked at libbacktrace? (https://github.com/mirrors/gcc/tree/master/libbacktrace) Is it a more useful starting point? -Jeff ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Re: Should we build a new in-process unwind library?
- Original Message - On 08/30/2013 11:07 AM, Robert O'Callahan wrote: Being able to sample uniformly whether or not the thread is in a syscall is important sometimes, so let's not lose that. I am not proposing to lose that facility. Am only proposing (really, merely parroting Taras' suggestion) that it would be nice to have the option of some kind of adaptive backoff for threads blocked in syscalls. One way of handling this is the following: When you take a sample, replace the return address for the current stack frame with the address of a known function inside your profile (call this a trampoline function). Now the call stack looks something like: A -- B -- C -- D --Trampoline-- E where the sample was taken in E. The trampoline function enables us to keep track of what some portion of the stack looks like without having to unwind it. Unwinding on subsequent samples proceeds until we encounter a frame whose return address is the trampoline function. At that point, we know everything else above that frame is the same as whatever it was the last time we took a sample. There are now three interesting scenarios of what can happen before we take the next sample. 1) E does not return. It may call other functions, it may do some computation, or it may just be blocked on a syscall. Whatever the case, we take a sample, find that E's return address is our trampoline function, and stop. 2) E calls F calls G, and then a sample is taken. Just prior to the sample, we have: A -- B -- C -- D --Trampoline-- E -- F -- G The profiler fixes things up to do: A -- B -- C -- D -- E -- F --Trampoline-- G and we go about our merry way. 3) E returns. Instead of returning to D, it returns to our trampoline function. Said function then replaces the return address of D's stack frame with the address of the trampoline function, and returns to the appropriate PC in D: A -- B -- C --Trampoline-- D This can happen several times (D returns to C, etc.). Depending on how deep the call stack is and how frequently we push and pop stack frames, being able to not unwind some number of frames can be a significant savings. -Nathan ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Re: Should we build a new in-process unwind library?
On Fri, Aug 30, 2013 at 9:27 AM, Nathan Froyd froy...@mozilla.com wrote: - Original Message - On 08/30/2013 11:07 AM, Robert O'Callahan wrote: Being able to sample uniformly whether or not the thread is in a syscall is important sometimes, so let's not lose that. I am not proposing to lose that facility. Am only proposing (really, merely parroting Taras' suggestion) that it would be nice to have the option of some kind of adaptive backoff for threads blocked in syscalls. One way of handling this is the following: When you take a sample, replace the return address for the current stack frame with the address of a known function inside your profile (call this a trampoline function). Now the call stack looks something like: A -- B -- C -- D --Trampoline-- E where the sample was taken in E. The trampoline function enables us to keep track of what some portion of the stack looks like without having to unwind it. Unwinding on subsequent samples proceeds until we encounter a frame whose return address is the trampoline function. At that point, we know everything else above that frame is the same as whatever it was the last time we took a sample. There are now three interesting scenarios of what can happen before we take the next sample. 1) E does not return. It may call other functions, it may do some computation, or it may just be blocked on a syscall. Whatever the case, we take a sample, find that E's return address is our trampoline function, and stop. 2) E calls F calls G, and then a sample is taken. Just prior to the sample, we have: A -- B -- C -- D --Trampoline-- E -- F -- G The profiler fixes things up to do: A -- B -- C -- D -- E -- F --Trampoline-- G and we go about our merry way. 3) E returns. Instead of returning to D, it returns to our trampoline function. Said function then replaces the return address of D's stack frame with the address of the trampoline function, and returns to the appropriate PC in D: A -- B -- C --Trampoline-- D This can happen several times (D returns to C, etc.). Depending on how deep the call stack is and how frequently we push and pop stack frames, being able to not unwind some number of frames can be a significant savings. Isn't this assuming that the stack frames are unique? How would you handle recursion under this scheme? Cheers, -- Ehsan http://ehsanakhgari.org/ ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Re: Should we build a new in-process unwind library?
- Original Message - Depending on how deep the call stack is and how frequently we push and pop stack frames, being able to not unwind some number of frames can be a significant savings. Isn't this assuming that the stack frames are unique? How would you handle recursion under this scheme? I do not understand the question. Recursive calls get new stack frames. Stack frames from recursive calls and stack frames from non-recursive calls are handled identically. Maybe you meant tail calls? Tail calls require no special treatment. -Nathan ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Re: Should we build a new in-process unwind library?
On 2013-08-30 10:16 AM, Nathan Froyd wrote: - Original Message - Depending on how deep the call stack is and how frequently we push and pop stack frames, being able to not unwind some number of frames can be a significant savings. Isn't this assuming that the stack frames are unique? How would you handle recursion under this scheme? I do not understand the question. Recursive calls get new stack frames. Stack frames from recursive calls and stack frames from non-recursive calls are handled identically. Nevermind, I think your solution is talking about function addresses, not stack frames. Ehsan ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Re: Should we build a new in-process unwind library?
On 2013-08-30 11:07 AM, Nathan Froyd wrote: - Original Message - Nevermind, I think your solution is talking about function addresses, not stack frames. If you think my solution is talking about function addresses, then I have explained something poorly. Why do you think it is talking about function addresses? I misread it, it was my fault! Ehsan ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
Re: Should we build a new in-process unwind library?
On 08/30/2013 06:27 AM, Nathan Froyd wrote: - Original Message - On 08/30/2013 11:07 AM, Robert O'Callahan wrote: Being able to sample uniformly whether or not the thread is in a syscall is important sometimes, so let's not lose that. I am not proposing to lose that facility. Am only proposing (really, merely parroting Taras' suggestion) that it would be nice to have the option of some kind of adaptive backoff for threads blocked in syscalls. One way of handling this is the following: When you take a sample, replace the return address for the current stack frame with the address of a known function inside your profile (call this a trampoline function). Now the call stack looks something like: Another approach with similar benefits is what I was thinking of in my comment to jseward's blog post. I implemented something very similar as an optimization for our dynamic rooting analysis, which has similar stackwalking requirements. In my scheme, you periodically push markers onto the stack. The existing SPS markers would be good for this, though we might want to insert more for this purpose. They will end up scattered throughout the real stack. Chain the markers together in a simple linked list, with a global or thread-local list header updated separately on every push/pop. (I think SPS markers already do this.) Keep a seen status bit in each marker, initialized to 0 when the marker is pushed. When sampling, scan up the seen bits, flipping them from 0 to 1 until you hit the first that is already 1. That marks the deepest point you need to search in the stack, since you've already seen the other frames. Note that if you take a sample, pop a couple of frames, push those same frames back, and take another sample, this scheme will force you to re-scan those new frames again. On the bright side, this allows you to distinguish this case from the more usual case of those frames never having been popped in between samples. But only at a marker granularity; I don't know if this information can be usefully surfaced in a UI or not. This scheme would reduce the cost of scanning sleeping threads, since you'd only need to scan up to the first marker. The major drawback of this scheme is that it relies on manually placing markers in the code so that they occur frequently enough in the stack to be useful. That has side benefits too, though, since you can use semantic labeling, a la SPS markers, to better categorize the work being done. Still, if you want to lift this restriction, you could combine the Finkish and Froydulent schemes: in addition to the linked list of markers, you also tag native frames during a sample by replacing their return addresses with jumps to trampolines. (The trampoline in this case is much simpler: it's just a direct jump to the original return address.) You can now derive a per-frame seen bit by checking whether it's within your trampoline buffer's address range. It's automatically and implicitly 0 when pushing a frame normally, set to 1 during sampling by swapping in the trampoline address, and deleted when the frame is popped. No special logic to massage the stack is required, as in the Froydulent trampolines. One drawback of any trampoline approach is that it would confuse debuggers. I'm not sure if there's a way with CFI to label the whole trampoline region with the rule pull the return address out of the operand to this jump instruction, and leave all registers as-is. Similarly, Breakpad and any other stack unwinders would need to be taught to deal with them. A potential advantage of manual markers is that they can serve as synchronization points in your stackwalk -- if your native frame walk gets lost somewhere, you can store enough info in the markers to be able to pick things up again at the next marker's frame. (I'm skeptical that this would be worth the effort, though -- the unwind state you'd need to save is platform-specific and hairy. You might be better off turning on exception unwind info and relying on that.) With this in place, a sleeping thread sample would be walked once, and then on the next sample only the first return address would be considered. Finally, one minor wrinkle with manual markers that I hit when implementing this for the rooting analysis -- if you have multiple markers in one frame, you can't predict which one the compiler will put on the stack first. So when scanning in linked list order, you're ok -- all seen markers will be visited before all unseen markers. But that order may not be in monotonic address order. And you may get multiple markers in one frame from inlining, so it's not always easy to predict or avoid. Maybe that doesn't matter for this application? A -- B -- C -- D --Trampoline-- E where the sample was taken in E. The trampoline function enables us to keep track of what some portion of the stack looks like without having to unwind it. Unwinding on subsequent samples proceeds until we encounter a frame