On Sun, 22 Feb 2026 14:54:57 GMT, Peter Levart <[email protected]> wrote:

>> Hi,
>> 
>> When administering my mailing lists, my attention was drawn to this pull 
>> request: https://github.com/openjdk/jdk/pull/28575, which tries to tackle 
>> this scaling problem. Although it was dismissed, I remembered that I was 
>> dealing with a similar problem in the past, so I looked closely...
>> 
>> Here's an alternative take on the problem. It reuses a maintained public 
>> component of the JDK, LongAdder, so in this respect it does not add 
>> complexity or maintenance burden. It also does not change the internal API 
>> of MemorySessionImpl. The patch is smaller, too.
>> 
>> For experimenting and benchmarking, I created a separate implementation of 
>> just the acquire/release/close logic, with both the existing "simple" and the 
>> new "striped" variants, here:
>> 
>> https://github.com/plevart/acquire-release-close
>> 
>> Running it on my 8-core (16-thread) Linux PC, it gives promising results 
>> with no regression for single-threaded use:
>> 
>> 
>> ** Simple, measure run #1...
>> concurrency: 1, nanos: 39909697 (x 1.0)
>> concurrency: 2, nanos: 164735444 (x 4.127704702944751)
>> concurrency: 4, nanos: 394283724 (x 9.87939657873123)
>> concurrency: 8, nanos: 672278915 (x 16.84500172978011)
>> concurrency: 16, nanos: 2169282886 (x 54.3547821473062)
>> ** Simple, measure run #2...
>> concurrency: 1, nanos: 40318379 (x 1.0)
>> concurrency: 2, nanos: 163438657 (x 4.053701092496799)
>> concurrency: 4, nanos: 399382210 (x 9.905710991009832)
>> concurrency: 8, nanos: 694862623 (x 17.23438888750959)
>> concurrency: 16, nanos: 2182386494 (x 54.12882531810121)
>> ** Simple, measure run #3...
>> concurrency: 1, nanos: 39871197 (x 1.0)
>> concurrency: 2, nanos: 168843686 (x 4.234728292707139)
>> concurrency: 4, nanos: 375489497 (x 9.417562683156966)
>> concurrency: 8, nanos: 675885694 (x 16.951728186138983)
>> concurrency: 16, nanos: 2083500812 (x 52.255787856080666)
>> ** end.
>> 
>> ** Striped, measure run #1...
>> concurrency: 1, nanos: 36698350 (x 1.0)
>> concurrency: 2, nanos: 47349695 (x 1.290240433152989)
>> concurrency: 4, nanos: 58622304 (x 1.5974098018030782)
>> concurrency: 8, nanos: 60548173 (x 1.6498881557345222)
>> concurrency: 16, nanos: 70607406 (x 1.9239940215295783)
>> ** Striped, measure run #2...
>> concurrency: 1, nanos: 37217044 (x 1.0)
>> concurrency: 2, nanos: 38610020 (x 1.0374284427317764)
>> concurrency: 4, nanos: 39166893 (x 1.0523912914738742)
>> concurrency: 8, nanos: 51778829 (x 1.3912665659314587)
>> concurrency: 16, nanos: 70277394 (x 1.8883120862581133)
>> ** Striped, measu...
>
> Peter Levart has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8371260: Prevent two theoretical reorderings of volatile write beyond 
> volatile read

Once JMM reasoning was applied, some faults were discovered: two volatile writes could theoretically be reordered past a subsequent volatile read. I mitigated them with two explicit fullFence() calls. This does introduce some overhead for single-threaded usage, though. Here's the updated comparison report:


** Simple, measure run #1...
concurrency: 1, nanos: 39909697 (x 1.0)
concurrency: 2, nanos: 164735444 (x 4.127704702944751)
concurrency: 4, nanos: 394283724 (x 9.87939657873123)
concurrency: 8, nanos: 672278915 (x 16.84500172978011)
concurrency: 16, nanos: 2169282886 (x 54.3547821473062)
** Simple, measure run #2...
concurrency: 1, nanos: 40318379 (x 1.0)
concurrency: 2, nanos: 163438657 (x 4.053701092496799)
concurrency: 4, nanos: 399382210 (x 9.905710991009832)
concurrency: 8, nanos: 694862623 (x 17.23438888750959)
concurrency: 16, nanos: 2182386494 (x 54.12882531810121)
** Simple, measure run #3...
concurrency: 1, nanos: 39871197 (x 1.0)
concurrency: 2, nanos: 168843686 (x 4.234728292707139)
concurrency: 4, nanos: 375489497 (x 9.417562683156966)
concurrency: 8, nanos: 675885694 (x 16.951728186138983)
concurrency: 16, nanos: 2083500812 (x 52.255787856080666)
** end.

** Striped, measure run #1...
concurrency: 1, nanos: 58248553 (x 1.0)
concurrency: 2, nanos: 77375592 (x 1.3283693416384095)
concurrency: 4, nanos: 70015083 (x 1.2020055330816544)
concurrency: 8, nanos: 60701425 (x 1.0421104366317906)
concurrency: 16, nanos: 65387340 (x 1.1225573277331027)
** Striped, measure run #2...
concurrency: 1, nanos: 58836025 (x 1.0)
concurrency: 2, nanos: 78600629 (x 1.3359269087264138)
concurrency: 4, nanos: 63892822 (x 1.085947291646572)
concurrency: 8, nanos: 62594145 (x 1.063874471465399)
concurrency: 16, nanos: 89972108 (x 1.5292009954785355)
** Striped, measure run #3...
concurrency: 1, nanos: 59242988 (x 1.0)
concurrency: 2, nanos: 63316159 (x 1.0687536388272652)
concurrency: 4, nanos: 60279613 (x 1.0174978513912905)
concurrency: 8, nanos: 66596046 (x 1.1241169334672991)
concurrency: 16, nanos: 107654519 (x 1.8171689618356184)
** end.


There is a roughly 50% increase in latency for single-threaded usage, but this is paid off whenever there is contention, where the Simple implementation degrades badly. I wonder what the results would look like on other hardware.
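To make the idea concrete, here is a minimal sketch of the striped acquire/release/close scheme described above, using LongAdder for the counters and VarHandle.fullFence() at the two points where reordering must be prevented. The class and method names are illustrative only; this is not the actual MemorySessionImpl code, and the real patch handles close failure differently.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: names and the exact close-failure handling
// are assumptions, not the actual MemorySessionImpl implementation.
class StripedSession {
    // Acquires and releases are counted on separate striped adders,
    // so concurrent threads rarely contend on a single cache line.
    private final LongAdder acquires = new LongAdder();
    private final LongAdder releases = new LongAdder();
    private volatile boolean closed;

    boolean tryAcquire() {
        acquires.increment();      // optimistically record the acquire first
        VarHandle.fullFence();     // order the increment before reading 'closed'
        if (closed) {
            releases.increment();  // closing thread won: roll back our acquire
            return false;
        }
        return true;
    }

    void release() {
        releases.increment();
    }

    boolean tryClose() {
        closed = true;             // publish the closed state
        VarHandle.fullFence();     // order the write before summing the counters
        // Sum releases before acquires: a racing acquire/release pair can
        // then only make the session look busy, never wrongly idle.
        long r = releases.sum();
        long a = acquires.sum();
        if (a != r) {
            closed = false;        // outstanding acquires: back out (sketch only)
            return false;
        }
        return true;
    }
}
```

The two fences mirror the two mentioned in the commit: without them, the acquire-count increment could be reordered past the read of `closed`, or the `closed` write past the counter sums, letting an acquire slip through a concurrent close.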

-------------

PR Comment: https://git.openjdk.org/jdk/pull/29866#issuecomment-3941129435
