Re: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v2]

Xiaohong Gong Sun, 10 Aug 2025 20:11:47 -0700

On Fri, 25 Jul 2025 20:09:40 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>> Patch optimizes Vector. slice operation with constant index using x86 ALIGNR 
>> instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification 
>> or else perform procedural inlining to prevent call overhead and boxing 
>> penalties in case the fallback implementation expects to operate over 
>> vectors. The existing vector API-based slice implementation is now the 
>> fallback code that gets inlined in case intrinsification fails.
>> 
>>  Idea here is to add infrastructure support to enable intrinsification of 
>> fast path for selected vector APIs, else enable inlining of fall-back 
>> implementation if it's based on vector APIs. Existing call generators like 
>> PredictedCallGenerator, used to handle bi-morphic inlining, already make use 
>> of multiple call generators to handle hit/miss scenarios for a particular 
>> receiver type. The newly added hybrid call generator is lazy and called 
>> during incremental inlining optimization. It also relieves the inline 
>> expander to handle slow paths, which can easily be implemented library side 
>> (Java).
>> 
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>> 
>> Performance numbers:
>> 
>> 
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>> 
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt  
>>     Score   Error   Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2  
>>  9444.444          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  
>> 10009.319          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2  
>>  9081.926          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2  
>>  6085.825          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2  
>>  6505.378          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2  
>>  6204.489          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2  
>>  1651.334          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2  
>>  1642.784          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2  
>>  1474.808          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  
>> 10399.394          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  
>> 10502.894          ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Updating predicate checks

Thanks for your work @jatin-bhateja! This PR also provides help on AArch64 that 
we also have plan to do the same intrinsifaction in our side.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1667:

> 1665: bool LibraryCallKit::inline_vector_slice() {
> 1666:   const TypeInt*     origin         = 
> gvn().type(argument(0))->isa_int();
> 1667:   const TypeInstPtr* vector_klass = 
> gvn().type(argument(1))->isa_instptr();

Code style:
Suggestion:

  const TypeInt*     origin       = gvn().type(argument(0))->isa_int();
  const TypeInstPtr* vector_klass = gvn().type(argument(1))->isa_instptr();

src/hotspot/share/opto/vectorIntrinsics.cpp line 1700:

> 1698: 
> 1699:   if (!arch_supports_vector(Op_VectorSlice, num_elem, elem_bt, 
> VecMaskNotUsed)) {
> 1700:     log_if_needed("  ** not supported: arity=2 op=slice vlen=%d 
> etype=%s ismask=useload/none",

`ismask=useload/none` is not necessary here?

src/hotspot/share/opto/vectorIntrinsics.cpp line 1714:

> 1712:   }
> 1713: 
> 1714:   Node* origin_node = gvn().intcon(origin->get_con() * 
> type2aelembytes(elem_bt));

Q1: Is it possible that just passing `origin->get_con()` to `VectorSliceNode` 
in case there are architectures that need it directly? Or, maybe we'd better 
add comment telling that the origin passed to `VectorSliceNode` is adjust to 
bytes.

Q2: If `origin` is not a constant, and there is an architecture that support 
the index as a variable, will the code crash here? Can we just limit the 
`origin` to a constant for this intrinsifaction in this PR? We can consider to 
extend it to variable in case any architecture has such requirement. WDYT?

src/hotspot/share/opto/vectornode.hpp line 1719:

> 1717: class VectorSliceNode : public VectorNode {
> 1718:  public:
> 1719:   VectorSliceNode(Node* vec1, Node* vec2, Node* origin, const TypeVect* 
> vt)

Do we have specific value for `origin` like zero or vlen? If so, maybe simply 
Identity is better to be added as well.

-------------

PR Review: https://git.openjdk.org/jdk/pull/24104#pullrequestreview-3103877319
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2265568519
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2265573060
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2265579342
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2265580768

Re: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v2]

Reply via email to