Re: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v2]

Jatin Bhateja Thu, 07 Aug 2025 05:38:34 -0700

On Fri, 25 Jul 2025 20:09:40 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>> Patch optimizes Vector.slice operation with constant index using x86 ALIGNR 
>> instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification 
>> or else perform procedural inlining to prevent call overhead and boxing 
>> penalties in case the fallback implementation expects to operate over 
>> vectors. The existing vector API-based slice implementation is now the 
>> fallback code that gets inlined in case intrinsification fails.
>> 
>> The idea here is to add infrastructure support to enable intrinsification of 
>> the fast path for selected vector APIs, or else enable inlining of the 
>> fallback implementation if it's based on vector APIs. Existing call 
>> generators like PredictedCallGenerator, used to handle bimorphic inlining, 
>> already make use of multiple call generators to handle hit/miss scenarios 
>> for a particular receiver type. The newly added hybrid call generator is 
>> lazy and is called during incremental inlining optimization. It also 
>> relieves the inline expander of handling slow paths, which can easily be 
>> implemented on the library side (Java).
>> 
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>> 
>> Performance numbers:
>> 
>> 
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>> 
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894          ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Updating predicate checks

Adding additional notes on implementation:

  A) Slice:-
1.   New inline expander and C2 IR node VectorSlice for the leaf-level 
     intrinsic corresponding to Vector.slice(int).
2.   Other interfaces of the slice API.
     - Vector.slice(int, Vector)
          - The second vector argument is the background vector, which 
            replaces the zero-broadcast vector of the base version of the API.
          - The API internally calls the same intrinsic entry point as the 
            base version.
     - Vector.slice(int, Vector, VectorMask)
          - This version of the API internally calls the above slice API with 
            index and vector arguments, followed by an explicit blend with a 
            broadcasted zero vector.

 Thus, the current support implicitly covers all three variants of the slice 
 API.
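
As an illustration of how the three variants relate, here is a minimal scalar 
sketch. It is not the patch's code: int[] stands in for a vector, boolean[] 
for a VectorMask, and the class name SliceModel is hypothetical. It only 
models the lane movement described above, assuming the masked variant zeroes 
the lanes that are unset in the mask.

```java
import java.util.Arrays;

// Scalar model of the three slice variants; int[] stands in for a vector.
final class SliceModel {
    // slice(origin, w): lanes origin..n-1 of v, then lanes 0..origin-1 of w.
    static int[] slice(int[] v, int origin, int[] w) {
        int n = v.length;
        int[] r = new int[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;
            r[i] = (j < n) ? v[j] : w[j - n];
        }
        return r;
    }

    // slice(origin): the same operation with an all-zero background vector.
    static int[] slice(int[] v, int origin) {
        return slice(v, origin, new int[v.length]);
    }

    // slice(origin, w, m): two-vector slice, then zero the lanes unset in m.
    static int[] slice(int[] v, int origin, int[] w, boolean[] m) {
        int[] r = slice(v, origin, w);
        for (int i = 0; i < r.length; i++) {
            if (!m[i]) {
                r[i] = 0;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] w = {10, 20, 30, 40, 50, 60, 70, 80};
        boolean[] m = {true, true, true, true, false, false, false, false};
        System.out.println(Arrays.toString(slice(v, 6)));       // [7, 8, 0, 0, 0, 0, 0, 0]
        System.out.println(Arrays.toString(slice(v, 6, w)));    // [7, 8, 10, 20, 30, 40, 50, 60]
        System.out.println(Arrays.toString(slice(v, 6, w, m))); // [7, 8, 10, 20, 0, 0, 0, 0]
    }
}
```

In this model the one-argument and masked forms are thin wrappers around the 
two-vector form, which is why a single intrinsic entry point can serve all 
three.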


 B) Similar extensions to optimize Unslice with constant index:-

1.    Similar to slice, unslice also has three interfaces.
2.    Leaf-level interface only accepts an index argument. 
3.    Other variants of unslice accept unslice index, background vector, and 
      part number.
4.    We can assume the receiver vector to be sliding over two contiguously 
      placed background vectors.
5.    It's possible to implement all three variants of unslice using slice 
      operations as follows.



jshell> // Input
jshell> vec
vec ==> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
jshell> vec2
vec2 ==> [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
jshell> bzvec
bzvec ==> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

jshell> // Case 1:
jshell> vec.unslice(4)
$79 ==> [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
jshell> bzvec.slice(vec.length() - 4, vec)
$80 ==> [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

jshell> // Case 2:
jshell> vec.unslice(4, vec2, 0)
$81 ==> [10, 20, 30, 40, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
jshell> vec2.blend(vec2.slice(vec2.length() - 4, vec), VectorMask.fromLong(IntVector.SPECIES_512, ((1L << (vec.length() - 4)) - 1) << 4))
$82 ==> [10, 20, 30, 40, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

jshell> // Case 3:
jshell> vec.unslice(4, vec2, 1)
$83 ==> [13, 14, 15, 16, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
jshell> vec2.blend(vec.slice(vec.length() - 4, vec2), VectorMask.fromLong(IntVector.SPECIES_512, ((1L << 4) - 1)))
$84 ==> [13, 14, 15, 16, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]

jshell> // Case 4:
jshell> vec.unslice(4, vec2, 0, VectorMask.fromLong(IntVector.SPECIES_512, 0xFF))
$85 ==> [10, 20, 30, 40, 1, 2, 3, 4, 5, 6, 7, 8, 130, 140, 150, 160]
jshell> // Current Java fallback implementation for this version is based on slice and unslice operations.
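
The identities in cases 1 to 3 can also be checked for every index with a 
small scalar model. This is only a sketch under stated assumptions: int[] 
stands in for a vector, a long bit mask for a VectorMask, and UnsliceModel, 
unsliceRef and unsliceViaSlice are hypothetical names, not part of the patch. 
unsliceRef follows point 4 above (the receiver slides over two contiguous 
copies of the background vector); unsliceViaSlice uses the slice-plus-blend 
forms from cases 2 and 3. The model assumes fewer than 64 lanes so that the 
mask fits in a long.

```java
import java.util.Arrays;

// Scalar model: unslice expressed through slice and blend.
final class UnsliceModel {
    // Two-vector slice: lanes origin..n-1 of v, then lanes 0..origin-1 of w.
    static int[] slice(int[] v, int origin, int[] w) {
        int n = v.length;
        int[] r = new int[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;
            r[i] = (j < n) ? v[j] : w[j - n];
        }
        return r;
    }

    // blend: lane i comes from b where bit i of mask is set, else from a.
    static int[] blend(int[] a, int[] b, long mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = ((mask >>> i) & 1L) != 0 ? b[i] : a[i];
        }
        return r;
    }

    // Reference: overlay v at offset origin on two copies of w, return one half.
    static int[] unsliceRef(int[] v, int origin, int[] w, int part) {
        int n = v.length;
        int[] both = new int[2 * n];
        for (int i = 0; i < 2 * n; i++) {
            both[i] = w[i % n];
        }
        for (int i = 0; i < n; i++) {
            both[origin + i] = v[i];
        }
        return Arrays.copyOfRange(both, part * n, (part + 1) * n);
    }

    // The slice-based forms from cases 2 (part 0) and 3 (part 1).
    static int[] unsliceViaSlice(int[] v, int origin, int[] w, int part) {
        int n = v.length;
        if (part == 0) {
            long mask = ((1L << (n - origin)) - 1) << origin;
            return blend(w, slice(w, n - origin, v), mask);
        }
        long mask = (1L << origin) - 1;
        return blend(w, slice(v, n - origin, w), mask);
    }

    public static void main(String[] args) {
        int n = 16;
        int[] vec = new int[n];
        int[] vec2 = new int[n];
        for (int i = 0; i < n; i++) {
            vec[i] = i + 1;
            vec2[i] = 10 * (i + 1);
        }
        boolean allMatch = true;
        for (int origin = 0; origin <= n; origin++) {
            for (int part = 0; part < 2; part++) {
                allMatch &= Arrays.equals(unsliceRef(vec, origin, vec2, part),
                                          unsliceViaSlice(vec, origin, vec2, part));
            }
        }
        System.out.println("slice-based unslice matches reference: " + allMatch);
    }
}
```

Case 1 falls out of the same model by passing an all-zero background vector 
with part 0.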


To ease the review process, I plan to optimize the unslice API with a constant 
index by extending the newly added expander in a follow-up patch.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3163994472
