Hi all,
(please CC me in replies; I am not a list member)
I have a large C++ app that throws exceptions to unwind anywhere from
5-20 stack frames when an error prevents the request from being served
(which happens rather frequently). This works fine single-threaded, but
performance is terrible with 24 threads on a 48-thread Ubuntu 10 machine.
Profiling points to a global mutex acquire in __GI___dl_iterate_phdr as
the culprit, with _Unwind_Find_FDE as its caller.
Tracing the attached test case with the attached gdb script shows that
the -DDTOR case executes ~20k instructions during unwind and calls
iterate_phdr 12 times. The -DTRY case executes ~33k instructions and
calls iterate_phdr 18 times. The exception in this test case only
affects three stack frames, with minimal cleanup required, and the trace
is taken on the second call to the function that swallows the error, to
warm up libgcc's internal caches [1].
The instruction counts aren't terribly surprising---I know unwinding is
complex---but might it be possible to throw and catch a previously-seen
exception through a previously-seen stack trace with fewer than 4-6
global mutex acquires for each frame unwound? As it stands, the
deeper the stack trace (= the more desirable to throw rather than return
an error), the more of a scalability bottleneck unwinding becomes. My
actual app would apparently suffer anywhere from 25 to 80 global mutex
acquires for each exception thrown, which probably explains why the
bottleneck arises...
I'm bringing the issue up here, rather than filing a bug, because I'm
not sure whether this is an oversight, a known problem that's hard to
fix, or a feature (e.g. somehow required for reliable unwinding). I
suspect the first, because _Unwind_Find_FDE tries a call to
_Unwind_Find_registered_FDE before falling back to dl_iterate_phdr, but
the registered-FDE lookup never succeeds in my trace (iterate_phdr is
always called).
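To be concrete, the lookup order as I read it in
libgcc/unwind-dw2-fde-dip.c is roughly the following (pseudocode, not
the literal source):

```
const fde *_Unwind_Find_FDE (void *pc, struct dwarf_eh_bases *bases)
{
  /* Objects registered via __register_frame*; empty for ordinary
     .eh_frame_hdr-based binaries, so this lookup misses.  */
  ret = _Unwind_Find_registered_FDE (pc, bases);
  if (ret != NULL)
    return ret;

  /* Fallback: walk every loaded object, serialized on glibc's
     global load lock.  The per-object FDE cache is only consulted
     inside the callback, i.e. after the lock is taken.  */
  if (dl_iterate_phdr (_Unwind_IteratePhdrCallback, &data) < 0)
    return NULL;
  return data.ret;
}
```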
FWIW, I've tested both gcc-4.6 and 4.8 but see no meaningful difference
between them.
[1] The caches can be seen in libgcc/unwind-dw2-fde-dip.c, though they
do little to prevent mutex bottlenecks because they are accessed from
the iterate_phdr callback, i.e. behind the mutex acquire.
Thoughts?
Ryan
#include <cstdio>

void ding() { fputs("Ding!\n", stderr); }

struct ding_unless {
    bool commit;
    ding_unless() : commit(false) { }
    ~ding_unless() { if (not commit) ding(); }
};

void __attribute__((noinline)) sentinel() { printf("Done\n"); }

int __attribute__((noinline)) foo() { throw 42; }

int __attribute__((noinline)) bar() {
#ifdef TRY
    try { return 1+foo(); }
    catch (...) { ding(); throw; }
#elif defined(DTOR)
    ding_unless x;
    int ans = 1+foo();
    x.commit = true;
    return ans;
#else
#error "compile with -DTRY or -DDTOR"
#endif
}
int __attribute__((noinline)) baz() {
    int ans;
    try { ans = 1+bar(); }
    catch (...) { ans = -1; }
    sentinel();
    return ans;
}

int main() {
    baz();
    baz();
    return 0;
}
b baz
r
c
display/i $pc
si
while ($pc != sentinel)
    si
end
k
q