On Thu, 15 May 2025 16:03:44 GMT, Andrew Haley <a...@openjdk.org> wrote:

>> This intrinsic is generally faster than the current implementation for
>> Panama segment operations for all writes larger than about 8 bytes,
>> rising to more than 2x the performance on larger memory blocks on
>> Graviton 2, comparing "panama" (the C2-generated code we use now)
>> against "unsafe" (this intrinsic).
>> 
>> 
>> Benchmark                       (aligned)  (size)  Mode  Cnt     Score    Error  Units
>> MemorySegmentFillUnsafe.panama       true  262143  avgt   10  7295.638 ±  0.422  ns/op
>> MemorySegmentFillUnsafe.panama      false  262143  avgt   10  8345.300 ± 80.161  ns/op
>> MemorySegmentFillUnsafe.unsafe       true  262143  avgt   10  2930.594 ±  0.180  ns/op
>> MemorySegmentFillUnsafe.unsafe      false  262143  avgt   10  3136.828 ±  0.232  ns/op
>
> Andrew Haley has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Copyright format correction

> One thing that sometimes helps is a count leading zeroes followed by a 
> multiway switch at the start, or just before the tail, to get started at the 
> right place in the tail (its log-size cascade), for very small inputs.
> 
> This PR #25383 uses clz in that way.
> 
> It also uses an overlapping-store technique to reduce an O(lg N) tail to an 
> O(1) tail, which also depends on the clz step.

> My rough notes on the relative performance of overlapping loads and stores 
> are here FWIW: https://cr.openjdk.org/~jrose/jvm/PartialMemoryWord.cpp

Mmm, interesting. I had a look at the M1 timings to see what might be going
on, and I think it's because the processor can execute 8 instructions per
clock but only one taken branch; it can, however, execute two not-taken
branches per clock. At present, if our block is 1 (mod 64) bytes long, we
have a string of 5 taken branches and 1 not-taken branch in the tail.
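
To make that concrete, here's a C sketch of the sort of log-size cascade
tail I'm describing. This is an illustration, not the actual stub code
(which is aarch64 assembly); the tbz mapping in the comments is how I read
it.

    #include <stdint.h>
    #include <string.h>

    // Illustrative C model of the log-size cascade tail that handles the
    // last 0..63 bytes of a fill; not the actual stub code.  In the stub,
    // each test is a tbz (test-bit-and-branch-if-zero): when a length bit
    // is clear the branch is TAKEN and the store is skipped.  So for a
    // length of 1 (mod 64) the 32/16/8/4/2 tests are five taken branches,
    // and only the final 1-byte test falls through.
    static void fill_tail(unsigned char *p, size_t rem, uint64_t v8) {
        // v8 is the fill byte replicated into all 8 byte lanes.
        if (rem & 32) { memcpy(p,      &v8, 8); memcpy(p + 8,  &v8, 8);
                        memcpy(p + 16, &v8, 8); memcpy(p + 24, &v8, 8); p += 32; }
        if (rem & 16) { memcpy(p, &v8, 8); memcpy(p + 8, &v8, 8); p += 16; }
        if (rem & 8)  { memcpy(p, &v8, 8); p += 8; }
        if (rem & 4)  { uint32_t v4 = (uint32_t)v8; memcpy(p, &v4, 4); p += 4; }
        if (rem & 2)  { uint16_t v2 = (uint16_t)v8; memcpy(p, &v2, 2); p += 2; }
        if (rem & 1)  { *p = (unsigned char)v8; }
    }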

However, I couldn't see any of this in the JMH results. On closer
inspection I realized that performance was very much limited by the
C2-generated caller, which does far more work than the intrinsic itself. I
eventually tweaked the benchmark to call the intrinsic 1000 times per
iteration, which trivially converts the reported µs/op to ns per call.

I'm not keen on the overlapping-store technique in this case, partly
because the code gets (IMHO) unjustifiably complex, but also because we
would get different timing behaviour for differently aligned fill
operations. That seems to me a bit much for something that should be
fairly simple.
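
For reference, the overlapping-store idea, as I understand it from
PartialMemoryWord.cpp, looks roughly like this in C (a sketch, not the
code in this PR):

    #include <stdint.h>
    #include <string.h>

    // Sketch of the overlapping-store tail: any length from 8 to 16 bytes
    // is covered by two 8-byte stores that overlap in the middle, so there
    // are no branches at all in the tail.  The catch is that the address
    // of the second store depends on len, which is why differently aligned
    // fills end up with different timing behaviour.
    static void fill_8_to_16(unsigned char *p, size_t len, uint64_t v8) {
        memcpy(p, &v8, 8);            // first 8 bytes
        memcpy(p + len - 8, &v8, 8);  // last 8 bytes, overlapping the first
    }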

I did, however, implement the clz-optimized tail. It's great for very
short fills (mod 64) but worse for the range 32...63 (mod 64). It's also
missing the early exit from the log-size cascade, which short-circuits
fills that are a whole number of words long.
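
In C, the shape of that experiment was roughly this (again a sketch, not
the actual code):

    #include <stdint.h>
    #include <string.h>

    // Sketch of the clz-dispatched tail: the position of the highest set
    // bit of the remaining length tells us where to enter the cascade, so
    // the taken branches for all the clear high bits disappear.  Note the
    // missing early exit: a whole-word-sized fill still walks the 4/2/1
    // rungs below.
    static void fill_tail_clz(unsigned char *p, size_t rem, uint64_t v8) {
        if (rem == 0) return;
        switch (63 - __builtin_clzll(rem)) {  // index of highest set bit, 0..5
        case 5: memcpy(p,      &v8, 8); memcpy(p + 8,  &v8, 8);
                memcpy(p + 16, &v8, 8); memcpy(p + 24, &v8, 8); p += 32;
                /* fall through */
        case 4: if (rem & 16) { memcpy(p, &v8, 8); memcpy(p + 8, &v8, 8); p += 16; }
                /* fall through */
        case 3: if (rem & 8) { memcpy(p, &v8, 8); p += 8; }
                /* fall through */
        case 2: if (rem & 4) { uint32_t v4 = (uint32_t)v8; memcpy(p, &v4, 4); p += 4; }
                /* fall through */
        case 1: if (rem & 2) { uint16_t v2 = (uint16_t)v8; memcpy(p, &v2, 2); p += 2; }
                /* fall through */
        case 0: if (rem & 1) *p = (unsigned char)v8;
        }
    }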

I tried another thing: having _two_ cascades, one of whole-word-sized
stores and one for the final 0 to 7 bytes. This was better for fills that
are a whole number of words long, and for some other cases, but it had its
own timing spikes in a few places (e.g. 36 bytes).
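
Something like this, in C sketch form:

    #include <stdint.h>
    #include <string.h>

    // Sketch of the two-cascade experiment: a first cascade of whole-word
    // stores driven by rem / 8 (0..7 words), then a second cascade for the
    // final 0..7 bytes.  Whole-word-sized fills take the early exit and
    // never touch the byte cascade.
    static void fill_tail_two(unsigned char *p, size_t rem, uint64_t v8) {
        size_t words = rem >> 3;           // 0..7 whole words remaining
        if (words & 4) { memcpy(p,      &v8, 8); memcpy(p + 8,  &v8, 8);
                         memcpy(p + 16, &v8, 8); memcpy(p + 24, &v8, 8); p += 32; }
        if (words & 2) { memcpy(p, &v8, 8); memcpy(p + 8, &v8, 8); p += 16; }
        if (words & 1) { memcpy(p, &v8, 8); p += 8; }
        rem &= 7;
        if (rem == 0) return;              // early exit for whole-word fills
        if (rem & 4) { uint32_t v4 = (uint32_t)v8; memcpy(p, &v4, 4); p += 4; }
        if (rem & 2) { uint16_t v2 = (uint16_t)v8; memcpy(p, &v2, 2); p += 2; }
        if (rem & 1) *p = (unsigned char)v8;
    }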

I measured the total time for arrays of size 1-128 bytes, and took the average 
throughput.

A: this PR, as checked in.   6.5 cycles/op, 1.8ns.
B: one clz-optimized tail.   6.9 cycles/op, 1.9ns.
C: two clz-optimized tails.  7.2 cycles/op, 2.0ns.

In conclusion: there isn't much in it. We could do better by keeping this code 
as short as possible, which would allow us to inline the whole thing into its 
caller rather than the palaver of a C-ABI call to the stub.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25147#issuecomment-2913030626
