Here is some additional data showing how the performance scales on larger binaries. I used the same setup described in my previous email, only replacing the 64M stap binary with Firefox's 1G libxul.so.debug:
$ hyperfine --runs 5 -i --warmup 2 'eu-readelf -C1 -N -w libxul.so.debug'

thread safety enabled, this patch applied (__atomic builtins, lazy loading)
  Time (mean ± σ):     24.117 s ±  0.074 s    [User: 23.836 s, System: 0.234 s]
  Range (min … max):   24.041 s … 24.210 s    5 runs

thread safety enabled, v1 patch applied (eager loading)
  Time (mean ± σ):     24.436 s ±  0.185 s    [User: 24.143 s, System: 0.245 s]
  Range (min … max):   24.207 s … 24.632 s    5 runs

thread safety enabled, main branch (rwlock, lazy loading)
  Time (mean ± σ):     25.179 s ±  0.154 s    [User: 24.904 s, System: 0.226 s]
  Range (min … max):   25.020 s … 25.384 s    5 runs

thread safety disabled, main branch (lazy loading)
  Time (mean ± σ):     23.957 s ±  0.124 s    [User: 23.681 s, System: 0.230 s]
  Range (min … max):   23.769 s … 24.095 s    5 runs

The results with libxul.so.debug are consistent with what I reported for the stap binary. With this patch applied, `eu-readelf -N -w` with thread safety enabled is just 0.7% slower than with thread safety disabled. Patch v1, with eager abbrev loading, is 1.3% slower than this patch, and the existing thread-safe implementation on the main branch is 4.4% slower than this patch.

Aaron
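For readers unfamiliar with the "__atomic builtins, lazy loading" configuration: the idea is double-checked lazy initialization where the fast path is a single atomic load, so no lock is taken once the data is cached. Below is a minimal, self-contained sketch of that pattern. All names here (`get_cached`, `expensive_load`) are hypothetical stand-ins; this is an illustration of the technique, not the actual elfutils code:

```c
#include <stdlib.h>

/* NULL until the first successful initialization. */
static void *cache;

/* Stand-in for the expensive work (e.g. parsing an abbrev table). */
static void *expensive_load(void)
{
  int *v = malloc(sizeof (int));
  *v = 42;
  return v;
}

void *get_cached(void)
{
  /* Fast path: one acquire load, no lock, once initialized.  */
  void *p = __atomic_load_n(&cache, __ATOMIC_ACQUIRE);
  if (p != NULL)
    return p;

  /* Slow path: compute the value, then try to publish it.  If another
     thread won the race, discard ours and use the winner's result,
     which the failed CAS leaves in 'expected'.  */
  void *fresh = expensive_load();
  void *expected = NULL;
  if (__atomic_compare_exchange_n(&cache, &expected, fresh,
                                  0 /* strong */,
                                  __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
    return fresh;

  free(fresh);
  return expected;
}
```

Compared with the rwlock variant on main, every reader here pays only an acquire load instead of a read-lock/unlock pair, which is consistent with the gap measured above.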