Re: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v2]

Jatin Bhateja Sun, 27 Jul 2025 22:56:41 -0700

On Fri, 25 Jul 2025 20:09:40 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>> This patch optimizes the Vector.slice operation with a constant index using
>> the x86 ALIGNR instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification 
>> or else perform procedural inlining to prevent call overhead and boxing 
>> penalties in case the fallback implementation expects to operate over 
>> vectors. The existing vector API-based slice implementation is now the 
>> fallback code that gets inlined in case intrinsification fails.
>> 
>> The idea here is to add infrastructure support to enable intrinsification of
>> the fast path for selected vector APIs, or else enable inlining of the
>> fallback implementation if it is based on vector APIs. Existing call
>> generators like PredictedCallGenerator, used to handle bimorphic inlining,
>> already make use of multiple call generators to handle hit/miss scenarios for
>> a particular receiver type. The newly added hybrid call generator is lazy and
>> is called during incremental inlining optimization. It also relieves the
>> inline expander of handling slow paths, which can easily be implemented on
>> the library side (Java).
>> 
>> Vector API jtreg tests pass at AVX level 2; remaining validation is in
>> progress.
>> 
>> Performance numbers:
>> 
>> 
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>> 
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt       Score   Error   Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894          ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Updating predicate checks

Performance on an AVX512 machine:


Baseline:
Benchmark                                                (size)   Mode  Cnt       Score       Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    4   35741.780 ±  1561.065  ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    4   35011.929 ±  5886.902  ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    4   32366.844 ±  1489.449  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    4   10636.281 ±   608.705  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    4   10750.833 ±   328.997  ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    4   10257.338 ±  2027.422  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    4    5362.330 ±  4199.651  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    4    4992.399 ±  6053.641  ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    4    4941.258 ±   478.193  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    4   40432.828 ± 26672.673  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    4   41300.811 ± 34342.482  ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    4   36958.309 ±  1899.676  ops/ms

With optimization:
Benchmark                                                (size)   Mode  Cnt       Score       Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt   10   67936.711 ±   389.783  ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt   10   70086.731 ±  5972.968  ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt   10   31879.187 ±   148.213  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt   10   17676.883 ±   217.238  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt   10   16983.007 ±  3988.548  ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt   10    9851.266 ±    31.773  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt   10    9194.216 ±    42.772  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt   10    8411.738 ±    33.209  ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt   10    5244.850 ±    12.214  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt   10   61233.526 ± 20472.895  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt   10   61545.276 ± 20722.066  ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt   10   41208.718 ±  5374.829  ops/ms
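
For readers unfamiliar with the operation being benchmarked, here is a minimal
scalar sketch of why a slice with a constant origin maps naturally onto an
ALIGNR-style instruction. It is not code from the patch: the class and method
names (SliceSketch, slice, alignr) are hypothetical, it uses plain byte arrays
rather than the Vector API, and alignr models only a single 128-bit lane (the
wider VPALIGNR forms work per 128-bit lane, which this sketch does not model).

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the patch or of the Vector API.
public class SliceSketch {

    // Scalar reference for a two-vector slice: lanes [origin, n) of a,
    // followed by lanes [0, origin) of b.
    static byte[] slice(byte[] a, byte[] b, int origin) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        if (origin < 0 || origin > a.length) {
            throw new IndexOutOfBoundsException("origin out of range: " + origin);
        }
        int n = a.length;
        byte[] r = new byte[n];
        System.arraycopy(a, origin, r, 0, n - origin);
        System.arraycopy(b, 0, r, n - origin, origin);
        return r;
    }

    // ALIGNR-style model for one lane: concatenate hi:lo (lo in the low bytes),
    // shift right by imm bytes, and keep the low n bytes.
    // Assumption: imm is restricted to [0, n] in this sketch.
    static byte[] alignr(byte[] hi, byte[] lo, int imm) {
        int n = lo.length;
        if (hi.length != n || imm < 0 || imm > n) {
            throw new IllegalArgumentException("bad operands");
        }
        byte[] concat = new byte[2 * n];
        System.arraycopy(lo, 0, concat, 0, n);
        System.arraycopy(hi, 0, concat, n, n);
        return Arrays.copyOfRange(concat, imm, imm + n);
    }

    public static void main(String[] args) {
        byte[] a = {0, 1, 2, 3};
        byte[] b = {10, 11, 12, 13};
        // Both calls compute the same lanes for a constant origin of 1.
        System.out.println(Arrays.toString(slice(a, b, 1)));  // [1, 2, 3, 10]
        System.out.println(Arrays.toString(alignr(b, a, 1))); // [1, 2, 3, 10]
    }
}
```

Under these assumptions, slice(a, b, origin) and alignr(b, a, origin) agree for
every origin in [0, n], which is the equivalence the constant-index fast path
relies on; a variable index has no immediate to encode, which is consistent
with the VariableIndex rows above staying close to baseline.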

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3125629912
