On Thu, 15 May 2025 16:03:44 GMT, Andrew Haley <a...@openjdk.org> wrote:

>> This intrinsic is generally faster than the current implementation for 
>> Panama segment operations for all writes larger than about 8 bytes in size, 
>> increasing to more than 2x the performance on larger memory blocks on 
>> Graviton 2, between "panama" (C2 generated, what we use now) and "unsafe" 
>> (this intrinsic).
>> 
>> 
>> Benchmark                       (aligned)  (size)  Mode  Cnt     Score    Error  Units
>> MemorySegmentFillUnsafe.panama       true  262143  avgt   10  7295.638 ±  0.422  ns/op
>> MemorySegmentFillUnsafe.panama      false  262143  avgt   10  8345.300 ± 80.161  ns/op
>> MemorySegmentFillUnsafe.unsafe       true  262143  avgt   10  2930.594 ±  0.180  ns/op
>> MemorySegmentFillUnsafe.unsafe      false  262143  avgt   10  3136.828 ±  0.232  ns/op
>
> Andrew Haley has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Copyright format correction

Nice!

There's a nicely written loop tail that handles power-of-two chunks from 32 
bytes (stpq) down to a single byte.

Like many such tails, it is O(lg N), N being the max tail size, and that can be 
annoying when the loop tail is most or all of the work.
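
For readers who have not seen the pattern, here is a minimal C++ sketch of such 
a log-size tail. It is an illustration only, not the code in the PR: the name 
fill_tail_cascade is hypothetical, and memset with a fixed size stands in for a 
single store instruction of that width.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the PR): fill the final n bytes, n < 64,
// by testing one bit of n per power-of-two chunk, largest chunk first.
inline void fill_tail_cascade(uint8_t* p, size_t n, uint8_t v) {
    // Six tests for a 63-byte maximum tail: 32, 16, 8, 4, 2, 1.
    for (size_t chunk = 32; chunk != 0; chunk >>= 1) {
        if (n & chunk) {
            std::memset(p, v, chunk);  // stands in for one store of `chunk` bytes
            p += chunk;
        }
    }
}
```

Every call pays for all six tests, even when n is 1, which is the cost being 
described here.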

One thing that sometimes helps for very small inputs is a count-leading-zeroes 
(clz) followed by a multiway switch, placed at the start or just before the 
tail, so that execution enters the tail (its log-size cascade) at the right 
place.
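
A rough sketch of that idea, under the same assumptions as before (hypothetical 
name, tail shorter than 64 bytes, memset standing in for a fixed-width store). 
It uses __builtin_clz, which is a GCC/Clang builtin; other compilers would need 
a different spelling.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: same result as a plain cascade, but clz picks the
// entry point so that tests for chunks larger than n are skipped.
inline void fill_tail_clz(uint8_t* p, size_t n, uint8_t v) {
    if (n == 0) return;  // __builtin_clz(0) is undefined
    // Index of the highest set bit of n: 0..5 for n in 1..63.
    int top = 31 - __builtin_clz(static_cast<unsigned>(n));
    switch (top) {
    case 5: if (n & 32) { std::memset(p, v, 32); p += 32; }
            // fall through
    case 4: if (n & 16) { std::memset(p, v, 16); p += 16; }
            // fall through
    case 3: if (n & 8)  { std::memset(p, v, 8);  p += 8; }
            // fall through
    case 2: if (n & 4)  { std::memset(p, v, 4);  p += 4; }
            // fall through
    case 1: if (n & 2)  { std::memset(p, v, 2);  p += 2; }
            // fall through
    case 0: if (n & 1)  { *p = v; }
    }
}
```

A 1-byte fill now runs one test instead of six; a 63-byte fill still walks the 
whole cascade.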

This PR https://github.com/openjdk/jdk/pull/25383 uses clz in that way.

It also uses an overlapping-store technique to reduce an O(lg N) tail to an 
O(1) tail, which also depends on the clz step.

When atomicity is not an issue, the overlapping-store technique is faster on my 
MacBook M1.  It lets you (say) store 7 bytes in two cycles with no extra 
branches.  The downside is that some bytes get stored twice (in the overlap), 
so it only works on unshared memory.
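
To make the overlap concrete, here is a small C++ sketch for fills of 1 to 16 
bytes. It is not taken from either PR: the names store_pair and 
fill_small_overlapping are hypothetical, __builtin_clz assumes GCC or Clang, 
and each fixed-size memcpy is assumed to compile to a single unaligned store.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: two stores of sizeof(T) bytes, one at the start of the
// range and one ending exactly at its end. They overlap when n < 2*sizeof(T).
template <typename T>
inline void store_pair(uint8_t* p, size_t n, uint64_t pattern) {
    T w = static_cast<T>(pattern);
    std::memcpy(p, &w, sizeof w);                 // bytes [0, sizeof(T))
    std::memcpy(p + n - sizeof w, &w, sizeof w);  // bytes [n - sizeof(T), n)
}

// Hypothetical helper: fill n bytes, 0 <= n <= 16, with at most two stores.
inline void fill_small_overlapping(uint8_t* p, size_t n, uint8_t v) {
    if (n == 0) return;  // __builtin_clz(0) is undefined
    uint64_t pattern = 0x0101010101010101ULL * v;  // v replicated in every byte
    // Index of the highest set bit of n: 0..4 for n in 1..16.
    int top = 31 - __builtin_clz(static_cast<unsigned>(n));
    switch (top) {
    case 0:  *p = v;                              break;  // n == 1
    case 1:  store_pair<uint16_t>(p, n, pattern); break;  // n in 2..3
    case 2:  store_pair<uint32_t>(p, n, pattern); break;  // n in 4..7
    default: store_pair<uint64_t>(p, n, pattern); break;  // n in 8..16
    }
}
```

For n = 7 this issues two 4-byte stores at offsets 0 and 3, so the byte at 
offset 3 is written twice; that double write is why the trick is limited to 
memory no other thread can observe.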

My rough notes on the relative performance of overlapping loads and stores are 
here FWIW:
https://cr.openjdk.org/~jrose/jvm/PartialMemoryWord.cpp

BTW, overlapping loads (properly bit-masked) are just as atomic as loads of 
individual bytes, and much faster.  But that's not the topic here.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25147#issuecomment-2902463076
