Summary: We should stipulate to users that elfutils library functions updating 
a library data structure are not thread safe with respect to other calls acting 
on that data structure or related ones.  This will facilitate a more performant 
thread safety implementation.  Benchmarking data demonstrating this is included 
below.

Thread safety in elfutils libraries is currently in an experimental state but 
we are working towards official support. The implicit contract we have been 
designing thread safety around has aimed for maximum guarantees: elfutils 
library functions should be safe to concurrently call alongside most other 
library functions. These very strong guarantees aren't necessary for typical 
elfutils consumers and they require heavy use of internal synchronization 
within the libraries that hurts performance and limits multithreading 
scalability. I want to propose that we relax elfutils library thread safety 
guarantees to facilitate performance improvements while still accommodating 
typical consumer use cases expected when thread safety is officially available.

We should not guarantee thread safety when simultaneously reading data from a 
handle (gelf_get*, elf_getscn, elf_nextscn, elf_strptr, elf_getdata, etc.) and 
updating the handle or associated handles (elf_newdata, elf_update, 
gelf_update_*, elf_flag*, etc.). We should clearly document for users that they 
are responsible for serializing such calls. This will allow us to reduce 
internal rwlock usage and in some cases replace it with atomic flags that track 
whether internal lazy init has happened.  We discussed replacing an rwlock 
with an atomic flag last year [1] and demonstrated that it's more performant 
specifically for __libdw_dieabbrev.

Below I detail additional benchmarking data demonstrating that this approach 
should be adopted more generally. While much of the existing rwlock usage in 
libelf and libdw is acceptable, in the worst case certain rwlocks impose an 
unacceptable performance hit and limit multithreading scalability. My 
conclusion is that we should relax our thread safety guarantees as I've 
described here and, before we start to officially support thread safety, 
clearly document the thread safety contract for users (e.g. which functions they 
are responsible for serializing and when they should serialize, which are 
thread safe accessors, etc) and remove/replace hotpath rwlocks that cause 
performance problems. The good news is that from the benchmarking I've done so 
far, I can only find 3 libelf functions that need to be addressed.

I'll now describe benchmark results that support the above proposal. Using 
Claude Code I created a benchmarking script [2] as well as a proof-of-concept 
patch [3] that removes 3 hotpath rwlocks in libelf and replaces some of them 
with atomic flags that track whether internal lazy init has occurred.  The 
patch is not intended for upstream submission due to being LLM-generated. The 
benchmarking script compares elfutils consumer tool runtimes across 3 different 
elfutils build configs: main branch with --disable-thread-safety, main branch 
with --enable-thread-safety, and --enable-thread-safety with the rwlock removal 
patch applied to the main branch. The consumer tools tested are eu-readelf, 
eu-nm, eu-addr2line, abidiff, abidw, pahole, dwz, debugedit and perf script.  
With the proof of concept patch applied, the testsuite passes under memcheck 
and helgrind and the benchmarking script verifies that eu-readelf output is the 
same under the different builds. Here is the benchmarking data where the 
different consumer tools were run on a 42MB unstripped stap binary on Fedora 44 
x86_64. I also benchmarked on aarch64 and the results were similar. The "rw 
cost" column lists the ratio of main branch --enable-thread-safety runtime 
("rwlock") to main branch --disable-thread-safety runtime ("no-locks"). The 
"atomic" column lists the runtime with --enable-thread-safety and the atomic 
proof of concept patch. Except for "rw cost", the units are seconds of runtime.

  | Workload      | no-locks |  rwlock | atomic | rw cost |
  |---------------|----------|---------|--------|---------|
  | eu-readelf -w |   344.58 | 1702.65 | 204.34 |   4.94x |
  | eu-readelf -a |     4.80 |   12.26 |   4.97 |   2.55x |
  | eu-nm         |     0.02 |    0.02 |   0.02 |   1.00x |
  | eu-addr2line  |     0.02 |    0.03 |   0.02 |   1.50x |
  | abidiff       |    27.57 |   27.86 |  27.51 |   1.01x |
  | abidw         |    19.58 |   19.52 |  19.54 |   1.00x |
  | pahole        |     0.52 |    0.52 |   0.52 |   1.00x |
  | dwz           |     0.46 |    0.45 |   0.45 |   0.98x |
  | debugedit     |     0.19 |    0.18 |   0.18 |   0.95x |
  | perf-script   |     0.26 |    0.26 |   0.26 |   1.00x |

The results show that most consumers do not experience a performance hit under 
--enable-thread-safety. But in the worst case (eu-readelf) there can be a 
significant hit.  However, this hit disappears when rwlocks are removed from 
gelf_getsymshndx, gelf_getshdr (plus elf{32,64}_getshdr) and elf_getscn and 
atomic flags instead track internal initialization. The performance boost for 
eu-readelf -w (which uses multithreading when built with 
--enable-thread-safety) demonstrates that multithreading scales once the 
rwlocks are removed.
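
For reference, the ratios quoted from the table can be recomputed with a 
trivial helper. This is only a sketch; runtime_ratio is a hypothetical name 
and is not part of the benchmarking script.

```c
#include <assert.h>

/* Ratio of two runtimes in seconds.  A result above 1.0 means the first
   configuration took longer than the second.  */
double
runtime_ratio (double slower_secs, double faster_secs)
{
  assert (faster_secs > 0.0);
  return slower_secs / faster_secs;
}
```

With the eu-readelf -w row, runtime_ratio (1702.65, 344.58) is about 4.94, 
which is the "rw cost" entry, and runtime_ratio (1702.65, 204.34) is about 
8.33, the speedup of the atomic build over the rwlock build.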

The relaxed thread safety policy I'm proposing facilitates rwlock removal.  
For example, gelf_getsymshndx locks/unlocks the rwlock of the Elf * 
associated with the function's Elf_Data arguments.  This is done to ensure that 
concurrent calls to elf_update, gelf_update_sym, etc, do not invalidate any 
data.  Under the new thread safety contract, the concurrent calls that may 
invalidate the arguments would be the user's responsibility to serialize. 
gelf_getsymshndx doesn't perform any internal initialization so under this 
contract it can be entirely lock free, saving potentially billions of 
lock/unlock calls during an eu-readelf run.  For library functions that require 
internal lazy initialization, an atomic flag can be used to track the 
initialization state.  The benchmark data demonstrates that the atomic flags in 
elf_getscn and gelf_getshdr (plus elf{32,64}_getshdr) are significantly more 
performant than the rwlocks they replace.

Aaron

[1] https://sourceware.org/pipermail/elfutils-devel/2025q3/008512.html
[2] https://gitlab.com/amerey1/elfutils-ai-lab/-/commit/e1f5c515423c2c546bb0805f2cdc2132282cd4d0
[3] https://gitlab.com/amerey1/elfutils-ai-lab/-/commit/8d4de2868f5d3e1eb4e2e5e0aeba6f971df0d401
