Summary: We should stipulate to users that elfutils library functions updating
a library data structure are not thread safe with respect to other calls acting
on that data structure or related ones. This will facilitate a more performant
thread safety implementation. Benchmarking data demonstrating this is included
below.
Thread safety in elfutils libraries is currently in an experimental state but
we are working towards official support. The implicit contract we have been
designing thread safety around has aimed for maximum guarantees: elfutils
library functions should be safe to concurrently call alongside most other
library functions. These very strong guarantees aren't necessary for typical
elfutils consumers and they require heavy use of internal synchronization
within the libraries that hurts performance and limits multithreading
scalability. I want to propose that we relax elfutils library thread safety
guarantees to facilitate performance improvements while still accommodating
typical consumer use cases expected when thread safety is officially available.
We should not guarantee thread safety when simultaneously reading data from a
handle (gelf_get*, elf_getscn, elf_nextscn, elf_strptr, elf_getdata, etc) and
updating the handle or associated handles (elf_newdata, elf_update,
gelf_update_*, elf_flag*, etc). We should clearly document for users that they
are responsible for serializing such calls. This will allow us to reduce
internal rwlock usage and in some cases replace it with atomic flags that track
whether internal lazy init has happened. We've discussed replacing an rwlock
with an atomic flag last year [1] and demonstrated that it's more performant
specifically for __libdw_dieabbrev.
Below I detail additional benchmarking data demonstrating that this approach
should be adopted more generally. While much of the existing rwlock usage in
libelf and libdw is acceptable, in the worst case certain rwlocks impose an
unacceptable performance hit and limit multithreading scalability. My
conclusion is that we should relax our thread safety guarantees as I've
described here and, before we start to officially support thread safety,
clearly document the thread safety contract for users (ex. which functions they
are responsible for serializing and when they should serialize, which are
thread safe accessors, etc) and remove/replace hotpath rwlocks that cause
performance problems. The good news is that from the benchmarking I've done so
far, I can only find 3 libelf functions that need to be addressed.
I'll now describe benchmark results that support the above proposal. Using
claude code I created a benchmarking script [2] as well as a proof-of-concept
patch [3] that removes 3 hotpath rwlocks in libelf and replaces some of them
with atomic flags that track whether internal lazy init has occurred. The
patch is not intended for upstream submission due to being LLM-generated. The
benchmarking script compares elfutils consumer tool runtimes across 3 different
elfutils build configs: main branch with --disable-thread-safety, main branch
with --enable-thread-safety, and --enable-thread-safety with the rwlock removal
patch applied to the main branch. The consumer tools tested are eu-readelf,
eu-nm, eu-addr2line, abidiff, abidw, pahole, dwz, debugedit and perf script.
With the proof of concept patch applied, the testsuite passes under memcheck
and helgrind and the benchmarking script verifies that eu-readelf output is the
same under the different builds. Here is the benchmarking data where the
different consumer tools were run on a 42MB unstripped stap binary on Fedora 44
x86_64. I also benchmarked on aarch64 and the results were similar. "rw cost"
column lists the ratio of main branch --enable-thread-safety runtime ("rwlock")
to main branch --disable-thread-safety runtime ("no-locks"). "atomic" lists the
runtime with --enable-thread-safety and the atomic proof of concept patch.
Except for "rw cost", the units are seconds of runtime.
| Workload | no-locks | rwlock | atomic | rw cost |
|---------------|----------|---------|--------|---------|
| eu-readelf -w | 344.58 | 1702.65 | 204.34 | 4.94x |
| eu-readelf -a | 4.80 | 12.26 | 4.97 | 2.55x |
| eu-nm | 0.02 | 0.02 | 0.02 | 1.00x |
| eu-addr2line | 0.02 | 0.03 | 0.02 | 1.50x |
| abidiff | 27.57 | 27.86 | 27.51 | 1.01x |
| abidw | 19.58 | 19.52 | 19.54 | 1.00x |
| pahole | 0.52 | 0.52 | 0.52 | 1.00x |
| dwz | 0.46 | 0.45 | 0.45 | 0.98x |
| debugedit | 0.19 | 0.18 | 0.18 | 0.95x |
| perf-script | 0.26 | 0.26 | 0.26 | 1.00x |
The results show that most consumers do not experience a performance hit under
--enable-thread-safety. But in the worst case (eu-readelf) there can be a
significant hit. However this disappears when rwlocks are removed from
gelf_getsymshndx, gelf_getshdr (plus elf{32,64}_getshdr) and elf_getscn and
atomic flags instead track internal initialization. The performance boost for
eu-readelf -w (which uses multithreading when built with
--enable-thread-safety) demonstrates multithreading scaling with rwlock removal.
This relaxed thread safety policy I'm proposing helps facilitate rwlock
removal. For example, gelf_getsymshndx locks/unlocks the rwlock of the Elf *
associated with the function's Elf_Data arguments. This is done to ensure that
concurrent calls to elf_update, gelf_update_sym, etc, do not invalidate any
data. Under the new thread safety contract, the concurrent calls that may
invalidate the arguments would be the user's responsibility to serialize.
gelf_getsymshndx doesn't perform any internal initialization so under this
contract it can be entirely lock free, saving potentially billions of
lock/unlock calls during an eu-readelf run. For library functions that require
internal lazy initialization, an atomic flag can be used to track the
initialization state. The benchmark data demonstrates that the atomic flags in
elf_getscn, gelf_getshdr (plus elf{32,64}_getshdr) are significantly more
performant.
Aaron
[1] https://sourceware.org/pipermail/elfutils-devel/2025q3/008512.html
[2]
https://gitlab.com/amerey1/elfutils-ai-lab/-/commit/e1f5c515423c2c546bb0805f2cdc2132282cd4d0
[3]
https://gitlab.com/amerey1/elfutils-ai-lab/-/commit/8d4de2868f5d3e1eb4e2e5e0aeba6f971df0d401