Re: RFR: 8372153: AArch64: Performance regression in long reduction microbenchmarks after JDK-8340093

Fei Gao Thu, 11 Jun 2026 05:27:19 -0700

On Mon, 8 Jun 2026 08:13:18 GMT, Fei Gao <[email protected]> wrote:

>>> @fg1417 Nice progress, I had some responses and new comments above. Main 
>>> new idea: what about Vector API vectors that create these patterns, do they 
>>> also get optimized by your changes now?
>> 
>> Hi @eme64,
>> Thanks for your review!
>> I've already created the Vector API benchmark locally, but I'm currently 
>> waiting for access to testing resources. Sorry for the delay, and thanks for 
>> your patience.
>
>> @fg1417 Nice progress, I had some responses and new comments above. Main new 
>> idea: what about Vector API vectors that create these patterns, do they also 
>> get optimized by your changes now?
> 
> Hi @eme64, thanks for your patience.
> 
> I’ve pushed the Vector API microbenchmarks in 
> `test/micro/org/openjdk/bench/jdk/incubator/vector/LongVectorReduction.java` 
> that mirror the auto-vectorization patterns, along with the corresponding IR 
> test cases. The change also benefits these Vector API microbenchmarks.
> 
> On an `Arm Neoverse V2` platform, I observed the following results:
> 
> 
> Benchmark                                 (size)   Mode    Cnt   Units     Uplift
> LongVectorReduction.addBig                   512   thrpt     5   ops/ms     2.97%
> LongVectorReduction.addBig                  2048   thrpt     5   ops/ms     0.37%
> LongVectorReduction.addDotProduct            512   thrpt     5   ops/ms    50.99%
> LongVectorReduction.addDotProduct           2048   thrpt     5   ops/ms    49.95%
> LongVectorReduction.addDotProductShared      512   thrpt     5   ops/ms     0.29%
> LongVectorReduction.addDotProductShared     2048   thrpt     5   ops/ms    -0.01%
> LongVectorReduction.ifElsePhiAdd             512   thrpt     5   ops/ms     8.50%
> LongVectorReduction.ifElsePhiAdd            2048   thrpt     5   ops/ms    16.04%
> LongVectorReduction.ifElsePhiSub             512   thrpt     5   ops/ms    10.55%
> LongVectorReduction.ifElsePhiSub            2048   thrpt     5   ops/ms    11.78%
> LongVectorReduction.subDotProduct            512   thrpt     5   ops/ms    50.74%
> LongVectorReduction.subDotProduct           2048   thrpt     5   ops/ms    50.49%
> 
> 
> Thanks!


> @fg1417 Thanks for the updates and benchmarks! I think the code is 
> reasonable. I gave the PR another scan :)

@eme64 Thanks for the review!

I’ve now extended the patch to cover masked operations as well, and added the 
corresponding IR test cases and microbenchmarks in the latest commit.
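
For readers unfamiliar with the pattern, here is a minimal scalar sketch of what a masked long dot-product reduction computes. It is an illustration only: the class name, method signatures, and the `boolean[]` mask are assumptions chosen for readability and are not copied from `LongVectorReduction.java`, which uses the Vector API rather than scalar loops.

```java
public class MaskedDotProductSketch {
    // Hypothetical scalar equivalent of a masked add-dot-product reduction:
    // elements whose mask entry is false contribute nothing to the sum.
    public static long addDotProductMasked(long[] a, long[] b, boolean[] mask) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            if (mask[i]) {
                sum += a[i] * b[i];
            }
        }
        return sum;
    }

    // Same shape, but the selected products are subtracted from the accumulator.
    public static long subDotProductMasked(long[] a, long[] b, boolean[] mask) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            if (mask[i]) {
                sum -= a[i] * b[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3, 4};
        long[] b = {5, 6, 7, 8};
        boolean[] mask = {true, false, true, false};
        System.out.println(addDotProductMasked(a, b, mask)); // prints 26
        System.out.println(subDotProductMasked(a, b, mask)); // prints -26
    }
}
```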

On an `Arm Neoverse V2` system, I observed the following improvements:


Benchmark                                 (size)   Mode    Cnt   Units     Uplift
LongVectorReduction.addDotProductMasked      512   thrpt     5   ops/ms    49.65%
LongVectorReduction.addDotProductMasked     2048   thrpt     5   ops/ms    50.11%
LongVectorReduction.subDotProductMasked      512   thrpt     5   ops/ms    50.47%
LongVectorReduction.subDotProductMasked     2048   thrpt     5   ops/ms    49.67%


Please let me know if you have any comments or further suggestions. Thanks!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30237#issuecomment-4680462400
