On Fri, Aug 23, 2019 at 1:25 PM, Mark Wielaard <m...@klomp.org> wrote:
Hi,
On Wed, 2019-08-21 at 09:08 -0500, Jonathon Anderson wrote:
On Wed, Aug 21, 2019 at 6:16 AM, Mark Wielaard <m...@klomp.org> wrote:
On Fri, 2019-08-16 at 14:24 -0500, Jonathon Anderson wrote:
> > For parallel applications that need the information in the DIEs, the
> > Dwarf_Abbrev hash table et al. become a massive data race. This fixes
> > that by:
> >
> > 1. Adding atomics & locks to the hash table to manage concurrency
> >    (lib/dynamicsizehash_concurrent.{c,h})
> > 2. Adding a lock & array structure to the memory manager (pseudo-TLS)
> >    (libdwP.h, libdw_alloc.c)
> > 3. Adding extra configure options for Helgrind/DRD annotations
> >    (configure.ac)
> > 4. Including "stdatomic.h" from FreeBSD, to support C11-style atomics.
> >    (lib/stdatomic.h)
>
> This looks like really nice work. Thanks!
>
> I am splitting review in some smaller parts if you don't mind.
> Simply because it is large and I cannot keep everything in my head at
> once :)
BTW. I would prefer to handle this as 4 separate additions, probably in
this order:
1) configure stuff for valgrind annotations.
2) add support for stdatomic.h functions.
3) thread-safe obstack memory handling
4) concurrent dynamic hash table.
Sure thing, I can split the patch into bits over the weekend. I may
take your advice and just use git request-pull though.
> If the compiler provides stdatomic.h then I think it would be good to
> use that instead of our own implementation. The copyright isn't a
> problem. But do you have a reference/URL to the upstream version? I
> like to add that somewhere, so we can sync with it in the future. I
> see various commented-out parts. Was that already upstream? Should we
> just remove those parts?
It would definitely be preferable to use the compiler's implementation
if possible; we included this copy in case GCC 4.7 and 4.8 (RHEL 7)
compatibility was needed. If those versions no longer need to be
supported, the file can be removed entirely.
The upstream is at
https://github.com/freebsd/freebsd/blob/master/sys/sys/stdatomic.h,
although the version here appears to be slightly modified (we used the
version that HPCToolkit ships). The components we use don't seem
affected, so a resync shouldn't make a difference.
OK, then we should come up with some kind of configure test to see if
we can use the standard stdatomic.h and otherwise use our own. I am
surprised I cannot find other projects doing this. Would be nice to
"steal" something standard for this.
At least OpenSSL does it:
https://github.com/openssl/openssl/blob/master/include/internal/refcount.h,
the interesting note being that it has a series of fallbacks (various
compiler builtins and then locks). The other projects I skimmed just
have the fallbacks and don't check for C11; given that Elfutils only
supports GCC, that might be a valid (and more compact) approach.
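
For illustration, a minimal sketch of such a fallback chain (the wrapper
header name and the bundled-header path are hypothetical; the
__STDC_NO_ATOMICS__ test is the standard C11 feature check):

    /* Hypothetical wrapper header, a sketch of the fallback approach. */
    #ifndef ATOMICS_WRAPPER_H
    #define ATOMICS_WRAPPER_H

    #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L \
        && !defined(__STDC_NO_ATOMICS__)
    /* C11 compiler with real atomics: use the system header.  */
    # include <stdatomic.h>
    #elif defined(__GNUC__)
    /* Older GCC (e.g. 4.7/4.8): fall back to the bundled FreeBSD header,
       which maps the C11 API onto __atomic_* or __sync_* builtins.  */
    # include "lib/stdatomic.h"
    #else
    # error "no C11 atomics and no known fallback"
    #endif

    #endif /* ATOMICS_WRAPPER_H */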
> > - Currently the concurrent hash table is always enabled;
> >   performance-wise there is no known difference between it and the
> >   non-concurrent version. This can be changed to toggle with
> >   --enable-thread-safety if preferred.
>
> I would prefer it always enabled, unless there is a massive slowdown
> of the single-threaded case. The problem with --enable-thread-safety
> is that it is a) known broken (sigh) and b) it basically introduces
> two separate libraries that behave subtly differently. I would very
> much like to get rid of --enable-thread-safety by fixing the broken
> locking and simply making it the default.
I haven't noticed any slowdown in the single-threaded case, although I
haven't stressed it hard enough to find out for certain. From a
theoretical standpoint it shouldn't: atomics (with the proper memory
orders) are usually (on x86 at least) as cheap as normal accesses when
used by a single thread, and the algorithm is otherwise effectively the
same as the original hash table.
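
As a concrete illustration of that point (hypothetical names; this is
not code from the patch), relaxed atomic accesses like the ones below
compile to the same plain mov instructions as non-atomic accesses on
x86-64:

    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical counter, standing in for a hash-table field. */
    static _Atomic size_t table_filled;

    size_t read_filled (void)
    {
      /* Relaxed load: no fence on x86; an ordinary load instruction. */
      return atomic_load_explicit (&table_filled, memory_order_relaxed);
    }

    void set_filled (size_t n)
    {
      /* Relaxed store: likewise an ordinary store on x86. */
      atomic_store_explicit (&table_filled, n, memory_order_relaxed);
    }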
How difficult would it be to fix the locking (or rather, what's
"broken")? We would definitely benefit from having thread-safety at
least for getters, which would only need locks around the internal
management.
To be honest I don't know how badly it is broken.
It is only implemented for libelf.
If you configure --enable-thread-safety and make check you will see
several tests fail because they abort with Unexpected error: Resource
deadlock avoided.
I think it is mainly that nobody maintained the locks and now some are
just wrongly placed. Ideally we switch --enable-thread-safety on by
default, identify which locks are wrongly placed, run all tests with
valgrind/helgrind and fix any issues found.
It really has not been a priority. Sorry.
No worries, it's not a priority on our end either. Elfutils' codebase is
significantly simpler (IMHO) than ours, so if it ever comes up we'll
just submit another patch.
> > - Another implementation of #2 above might use dynamic TLS
> >   (pthread_key_*); we chose this implementation to reduce the
> >   overall complexity.
>
> Are there any other trade-offs?
If the application spawns N threads that all use a Dwarf object (same
or different) enough to cause an allocation, and then joins them all,
any Dwarf objects allocated afterwards will allocate N unusable slots
in the mem_tails array. Thus long-running applications (that don't use
thread pools) would start experiencing effects similar to a memory
leak, of one pointer's worth (8 bytes on 64-bit) per dead thread.
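
A minimal sketch of that id scheme (hypothetical names, not the patch's
actual code) makes the leak mechanism visible: ids only ever grow, so
slots belonging to joined threads are never reused:

    #include <stdatomic.h>

    /* Hypothetical pseudo-TLS id scheme as described above. */
    static atomic_size_t next_id;          /* only ever increases */
    static _Thread_local size_t my_id;     /* this thread's slot index */
    static _Thread_local int have_id;

    static size_t get_thread_id (void)
    {
      if (!have_id)
        {
          /* Each new thread permanently consumes one id... */
          my_id = atomic_fetch_add (&next_id, 1);
          have_id = 1;
        }
      return my_id;
    }

    /* ...so a Dwarf opened after N threads have exited must still size
       its per-thread array past their ids, e.g.:
         dwarf->mem_tails = calloc (atomic_load (&next_id),
                                    sizeof (void *));
       leaving N slots (one pointer each) that will never be used.  */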
The alternative is to use dynamic TLS so that only threads that are
currently active use the extra memory, assuming libpthread is
sufficiently proactive about reclaiming unused key values. I think if
we assume `dwarf_end` happens-after any memory management (which would
make sense for a destructor), there should be a simple atomic pattern
to handle the freeing, but I would need to sit down for a while to
work out a sufficient proof.
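
For reference, the dynamic-TLS alternative would look roughly like this
(a hedged sketch; how this destructor interacts with dwarf_end freeing
the same blocks is exactly the open question above):

    #include <pthread.h>
    #include <stdlib.h>

    /* Hypothetical per-Dwarf key, created at dwarf_begin time with:
         pthread_key_create (&mem_key, free_thread_block);  */
    static pthread_key_t mem_key;

    static void free_thread_block (void *block)
    {
      /* Runs at thread exit for every thread that touched this Dwarf,
         so only live threads hold per-thread memory.  */
      free (block);
    }

    static void *get_thread_block (size_t size)
    {
      void *block = pthread_getspecific (mem_key);
      if (block == NULL)
        {
          block = malloc (size);
          pthread_setspecific (mem_key, block);
        }
      return block;
    }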
I was also under the impression that dynamic TLS was particularly
expensive performance-wise, although I haven't experimented with it
myself enough to know. The compiler can be a little smarter about
static TLS, and the result is easier to reason about, which is why we
chose this implementation for the initial patch. If the alternative
would be preferable we can change it.
I thought a bit about this one and although I am slightly worried
about the possible indefinite growth of the thread_ids, I don't think
the "memory leak" is an issue. Concurrent usage of the Dwarf object
already costs a bit more memory (since each thread gets its own memory
block if they allocate at the same time), which is probably larger
than any extra created by reserving space for all possible thread_ids.
This is only really a problem for a program that doesn't use thread
pools and keeps opening and concurrently accessing new Dwarf objects
(because at a certain point the whole Dwarf will have been read and no
new allocations happen). Although it would be nice if we could somehow
reset the next_id to zero in dwarf_end (), when this is the last
thread or Dwarf object.
The memory overhead is a little worse than that (each thread allocates
its own memory blocks, period), but that would be present in both
implementations. I can't think of a simple way to increase the memory
efficiency past that (although I can think of some ridiculously complex
ways).
I suppose it would be possible to use a sort of free-list for the IDs,
although that requires a hook at thread exit (doable with PThreads, not
with C11) and cleaning up would be a bit of a nightmare. At some point
dynamic TLS is more robust against weird situations (and, if my
thought-proof is correct, simpler).
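
To make that trade-off concrete, an id free-list could look roughly
like this (purely a hypothetical sketch: it leans on a pthread key
destructor as the thread-exit hook, and it ignores the ABA problem and
the teardown ordering that make cleanup a nightmare):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    /* Hypothetical Treiber-stack free-list of thread ids.  Assumes
       id_key was created once with:
         pthread_key_create (&id_key, release_id);  */
    struct free_id { size_t id; struct free_id *next; };

    static struct free_id *_Atomic free_ids;
    static atomic_size_t next_id;
    static pthread_key_t id_key;

    static void release_id (void *node_)
    {
      /* Thread-exit hook: push this thread's id back on the list. */
      struct free_id *node = node_;
      node->next = atomic_load (&free_ids);
      while (!atomic_compare_exchange_weak (&free_ids, &node->next, node))
        ;
    }

    static size_t acquire_id (void)
    {
      /* Pop a recycled id if one exists (ignoring ABA for brevity)... */
      struct free_id *node = atomic_load (&free_ids);
      while (node != NULL
             && !atomic_compare_exchange_weak (&free_ids, &node,
                                               node->next))
        ;
      if (node == NULL)
        {
          /* ...otherwise mint a fresh one, as in the current scheme. */
          node = malloc (sizeof *node);
          node->id = atomic_fetch_add (&next_id, 1);
        }
      pthread_setspecific (id_key, node);   /* arms release_id at exit */
      return node->id;
    }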
Cheers,
Mark