On Fri, 25 Jul 2025 20:09:40 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Patch optimizes the Vector.slice operation with a constant index using the
>> x86 ALIGNR instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification
>> or else perform procedural inlining to prevent call overhead and boxing
>> penalties in case the fallback implementation expects to operate over
>> vectors. The existing vector API-based slice implementation is now the
>> fallback code that gets inlined in case intrinsification fails.
>>
>> The idea here is to add infrastructure support to enable intrinsification
>> of the fast path for selected vector APIs, else enable inlining of the
>> fall-back implementation if it's based on vector APIs. Existing call
>> generators like PredictedCallGenerator, used to handle bi-morphic inlining,
>> already make use of multiple call generators to handle hit/miss scenarios
>> for a particular receiver type. The newly added hybrid call generator is
>> lazy and called during incremental inlining optimization. It also relieves
>> the inline expander of handling slow paths, which can easily be implemented
>> library side (Java).
>>
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>>
>> Performance numbers:
>>
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>>
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt       Score       Error  Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2    9444.444              ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2   10009.319              ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2    9081.926              ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2    6085.825              ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2    6505.378              ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2    6204.489              ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2    1651.334              ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2    1642.784              ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2    1474.808              ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2   10399.394              ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2   10502.894              ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Updating predicate checks

Performance on AVX512 machine

Baseline:
Benchmark                                                (size)   Mode  Cnt       Score       Error  Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    4   35741.780 ±  1561.065  ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    4   35011.929 ±  5886.902  ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    4   32366.844 ±  1489.449  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    4   10636.281 ±   608.705  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    4   10750.833 ±   328.997  ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    4   10257.338 ±  2027.422  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    4    5362.330 ±  4199.651  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    4    4992.399 ±  6053.641  ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    4    4941.258 ±   478.193  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    4   40432.828 ± 26672.673  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    4   41300.811 ± 34342.482  ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    4   36958.309 ±  1899.676  ops/ms

Withopt:
Benchmark                                                (size)   Mode  Cnt       Score       Error  Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt   10   67936.711 ±   389.783  ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt   10   70086.731 ±  5972.968  ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt   10   31879.187 ±   148.213  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt   10   17676.883 ±   217.238  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt   10   16983.007 ±  3988.548  ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt   10    9851.266 ±    31.773  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt   10    9194.216 ±    42.772  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt   10    8411.738 ±    33.209  ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt   10    5244.850 ±    12.214  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt   10   61233.526 ± 20472.895  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt   10   61545.276 ± 20722.066  ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt   10   41208.718 ± 5374.829  ops/ms

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3125629912
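
For readers unfamiliar with the operation being optimized, here is a minimal
scalar sketch of the lane semantics of a two-vector slice: the result holds
VLEN adjacent lanes of the concatenation of the two inputs, starting at the
origin lane of the first. The class and method names (SliceModel, slice) are
hypothetical and are not taken from the patch; plain byte arrays stand in for
vectors so the sketch runs without the incubator module. For a 128-bit byte
vector this is roughly what a single ALIGNR-style instruction computes; wider
vectors need additional handling because the instruction works within
128-bit lanes.

```java
import java.util.Arrays;

// Scalar model of the lane semantics of a two-vector slice.
// SliceModel and slice() are hypothetical names used for illustration only.
final class SliceModel {

    // Returns lanes [origin, origin + VLEN) of the concatenation a ++ b,
    // where VLEN is the common lane count of a and b.
    static byte[] slice(byte[] a, byte[] b, int origin) {
        int vlen = a.length;
        if (b.length != vlen) {
            throw new IllegalArgumentException("lane counts differ");
        }
        if (origin < 0 || origin > vlen) {
            throw new IndexOutOfBoundsException("origin " + origin);
        }
        byte[] result = new byte[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = i + origin;
            // Lanes past the end of a continue into b.
            result[i] = (j < vlen) ? a[j] : b[j - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] a = {0, 1, 2, 3, 4, 5, 6, 7};
        byte[] b = {10, 11, 12, 13, 14, 15, 16, 17};
        // prints [3, 4, 5, 6, 7, 10, 11, 12]
        System.out.println(Arrays.toString(slice(a, b, 3)));
    }
}
```

With a constant origin the lane mapping is fixed at compile time, which is
what makes the constant-index case a candidate for a single instruction; a
variable origin is the case that stays on the fallback path in the numbers
above.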