On Fri, 25 Jul 2025 20:09:40 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Patch optimizes Vector. slice operation with constant index using x86 ALIGNR >> instruction. >> It also adds a new hybrid call generator to facilitate lazy intrinsification >> or else perform procedural inlining to prevent call overhead and boxing >> penalties in case the fallback implementation expects to operate over >> vectors. The existing vector API-based slice implementation is now the >> fallback code that gets inlined in case intrinsification fails. >> >> Idea here is to add infrastructure support to enable intrinsification of >> fast path for selected vector APIs, else enable inlining of fall-back >> implementation if it's based on vector APIs. Existing call generators like >> PredictedCallGenerator, used to handle bi-morphic inlining, already make use >> of multiple call generators to handle hit/miss scenarios for a particular >> receiver type. The newly added hybrid call generator is lazy and called >> during incremental inlining optimization. It also relieves the inline >> expander to handle slow paths, which can easily be implemented library side >> (Java). >> >> Vector API jtreg tests pass at AVX level 2, remaining validation in progress. >> >> Performance numbers: >> >> >> System : 13th Gen Intel(R) Core(TM) i3-1315U >> >> Baseline: >> Benchmark (size) Mode Cnt >> Score Error Units >> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 >> 9444.444 ops/ms >> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 >> 10009.319 ops/ms >> VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 >> 9081.926 ops/ms >> VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 >> 6085.825 ops/ms >> VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 >> 6505.378 ops/ms >> VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 >> 6204.489 ops/ms >> VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 >> 1651.334 ops/ms >> VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 >> 1642.784 ops/ms >> VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 >> 1474.808 ops/ms >> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 >> 10399.394 ops/ms >> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 >> 10502.894 ops/ms >> VectorSliceB... > > Jatin Bhateja has updated the pull request incrementally with one additional > commit since the last revision: > > Updating predicate checks Thanks for your work @jatin-bhateja! This PR also provides help on AArch64 that we also have plan to do the same intrinsifaction in our side. src/hotspot/share/opto/vectorIntrinsics.cpp line 1667: > 1665: bool LibraryCallKit::inline_vector_slice() { > 1666: const TypeInt* origin = > gvn().type(argument(0))->isa_int(); > 1667: const TypeInstPtr* vector_klass = > gvn().type(argument(1))->isa_instptr(); Code style: Suggestion: const TypeInt* origin = gvn().type(argument(0))->isa_int(); const TypeInstPtr* vector_klass = gvn().type(argument(1))->isa_instptr(); src/hotspot/share/opto/vectorIntrinsics.cpp line 1700: > 1698: > 1699: if (!arch_supports_vector(Op_VectorSlice, num_elem, elem_bt, > VecMaskNotUsed)) { > 1700: log_if_needed(" ** not supported: arity=2 op=slice vlen=%d > etype=%s ismask=useload/none", `ismask=useload/none` is not necessary here? src/hotspot/share/opto/vectorIntrinsics.cpp line 1714: > 1712: } > 1713: > 1714: Node* origin_node = gvn().intcon(origin->get_con() * > type2aelembytes(elem_bt)); Q1: Is it possible that just passing `origin->get_con()` to `VectorSliceNode` in case there are architectures that need it directly? Or, maybe we'd better add comment telling that the origin passed to `VectorSliceNode` is adjust to bytes. Q2: If `origin` is not a constant, and there is an architecture that support the index as a variable, will the code crash here? Can we just limit the `origin` to a constant for this intrinsifaction in this PR? We can consider to extend it to variable in case any architecture has such requirement. WDYT? src/hotspot/share/opto/vectornode.hpp line 1719: > 1717: class VectorSliceNode : public VectorNode { > 1718: public: > 1719: VectorSliceNode(Node* vec1, Node* vec2, Node* origin, const TypeVect* > vt) Do we have specific value for `origin` like zero or vlen? If so, maybe simply Identity is better to be added as well. ------------- PR Review: https://git.openjdk.org/jdk/pull/24104#pullrequestreview-3103877319 PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2265568519 PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2265573060 PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2265579342 PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2265580768