My email client didn't properly wrap the lines of the proposal. Here
it is again with wrapping:

Summary: We should stipulate to users that elfutils library functions
updating a library data structure are not thread safe with respect to
other calls acting on that data structure or related ones.  This will
facilitate a more performant thread safety implementation.
Benchmarking data demonstrating this is included below.

Thread safety in elfutils libraries is currently in an experimental
state but we are working towards official support. The implicit
contract we have been designing thread safety around has aimed for
maximum guarantees: elfutils library functions should be safe to
concurrently call alongside most other library functions. These very
strong guarantees aren't necessary for typical elfutils consumers and
they require heavy use of internal synchronization within the
libraries that hurts performance and limits multithreading
scalability. I want to propose that we relax elfutils library thread
safety guarantees to facilitate performance improvements while still
accommodating typical consumer use cases expected when thread safety
is officially available.

We should not guarantee thread safety when simultaneously reading data
from a handle (gelf_get*, elf_getscn, elf_nextscn, elf_strptr,
elf_getdata, etc) and updating the handle or associated handles
(elf_newdata, elf_update, gelf_update_*, elf_flag*, etc). We should
clearly document for users that they are responsible for serializing
such calls. This will allow us to reduce internal rwlock usage and in
some cases replace it with atomic flags that track whether internal
lazy init has happened.  We've discussed replacing an rwlock with an
atomic flag last year [1] and demonstrated that it's more performant
specifically for __libdw_dieabbrev.

Below I detail additional benchmarking data demonstrating that this
approach should be adopted more generally. While much of the existing
rwlock usage in libelf and libdw is acceptable, in the worst case
certain rwlocks impose an unacceptable performance hit and limit
multithreading scalability. My conclusion is that we should relax our
thread safety guarantees as I've described here and, before we start
to officially support thread safety, clearly document the thread
safety contract for users (ex. which functions they are responsible
for serializing and when they should serialize, which are thread safe
accessors, etc) and remove/replace hotpath rwlocks that cause
performance problems. The good news is that from the benchmarking I've
done so far, I can only find 3 libelf functions that need to be
addressed.

I'll now describe benchmark results that support the above proposal.
Using claude code I created a benchmarking script [2] as well as a
proof-of-concept patch [3] that removes 3 hotpath rwlocks in libelf
and replaces some of them with atomic flags that track whether
internal lazy init has occurred.  The patch is not intended for
upstream submission due to being LLM-generated. The benchmarking
script compares elfutils consumer tool runtimes across 3 different
elfutils build configs: main branch with --disable-thread-safety, main
branch with --enable-thread-safety, and --enable-thread-safety with
the rwlock removal patch applied to the main branch. The consumer
tools tested are eu-readelf, eu-nm, eu-addr2line, abidiff, abidw,
pahole, dwz, debugedit and perf script.  With the proof of concept
patch applied, the testsuite passes under memcheck and helgrind and
the benchmarking script verifies that eu-readelf output is the same
under the different builds. Here is the benchmarking data where the
different consumer tools were run on a 42MB unstripped stap binary on
Fedora 44 x86_64. I also benchmarked on aarch64 and the results were
similar. "rw cost" column lists the ratio of main branch
--enable-thread-safety runtime ("rwlock") to main branch
--disable-thread-safety runtime ("no-locks"). "atomic" lists the
runtime with --enable-thread-safety and the atomic proof of concept
patch. Except for "rw cost", the units are seconds of runtime.

  | Workload      | no-locks |  rwlock | atomic | rw cost |
  |---------------|----------|---------|--------|---------|
  | eu-readelf -w |   344.58 | 1702.65 | 204.34 |   4.94x |
  | eu-readelf -a |     4.80 |   12.26 |   4.97 |   2.55x |
  | eu-nm         |     0.02 |    0.02 |   0.02 |   1.00x |
  | eu-addr2line  |     0.02 |    0.03 |   0.02 |   1.50x |
  | abidiff       |    27.57 |   27.86 |  27.51 |   1.01x |
  | abidw         |    19.58 |   19.52 |  19.54 |   1.00x |
  | pahole        |     0.52 |    0.52 |   0.52 |   1.00x |
  | dwz           |     0.46 |    0.45 |   0.45 |   0.98x |
  | debugedit     |     0.19 |    0.18 |   0.18 |   0.95x |
  | perf-script   |     0.26 |    0.26 |   0.26 |   1.00x |

The results show that most consumers do not experience a performance
hit under --enable-thread-safety. But in the worst case (eu-readelf)
there can be a significant hit.  However this disappears when rwlocks
are removed from gelf_getsymshndx, gelf_getshdr (plus
elf{32,64}_getshdr) and elf_getscn and atomic flags instead track
internal initialization. The performance boost for eu-readelf -w
(which uses multithreading when built with --enable-thread-safety)
demonstrates multithreading scaling with rwlock removal.

This relaxed thread safety policy I'm proposing helps facilitate
rwlock removal.  For example, gelf_getsymshndx locks/unlocks the
rwlock of the Elf * associated with the function's Elf_Data arguments.
This is done to ensure that concurrent calls to elf_update,
gelf_update_sym, etc, do not invalidate any data.  Under the new
thread safety contract, the concurrent calls that may invalidate the
arguments would be the user's responsibility to serialize.
gelf_getsymshndx doesn't perform any internal initialization so under
this contract it can be entirely lock free, saving potentially
billions of lock/unlock calls during an eu-readelf run.  For library
functions that require internal lazy initialization, an atomic flag
can be used to track the initialization state.  The benchmark data
demonstrates that the atomic flags in elf_getscn, gelf_getshdr (plus
elf{32,64}_getshdr) are significantly more performant.

Aaron

Reply via email to