On Thu, 27 Feb 2025 06:43:19 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
> Method `checkMaskFromIndexSize` is called by some vector masked APIs like
> `fromArray/intoArray/fromMemorySegment/...`. It checks whether the index of
> any active lane in a mask falls outside the bounds of the given
> Array/MemorySegment. This function should be force inlined; otherwise a
> VectorMask object is generated whenever the C2 compiler does not inline the
> call, which hurts the API performance a lot.
>
> This patch calls the `VectorMask.checkFromIndexSize` method directly inside
> these APIs instead of `checkMaskFromIndexSize`. Since that method already
> has the `@ForceInline` annotation, it will be inlined and intrinsified by
> C2, and the expected vector instructions can then be generated. With this
> change, the now-unused `checkMaskFromIndexSize` can be removed.
>
> Performance of some JMH benchmarks improves by up to 14x on an NVIDIA Grace
> CPU (AArch64 SVE2, 128-bit vectors). A similar performance improvement can
> also be observed on an Intel CPU that supports AVX512.
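To make the check described above concrete, here is a minimal scalar sketch of the idea in plain Java. It is not the JDK implementation: the class `MaskBoundsSketch`, its method names, and the `boolean[]` stand-in for a `VectorMask` are all hypothetical, chosen only to illustrate "does any active lane reach outside the array".

```java
// Hypothetical scalar model of a masked bounds check; not the JDK code.
final class MaskBoundsSketch {

    /**
     * Returns the first active lane whose element index (offset + lane)
     * falls outside [0, length), or -1 if every active lane is in bounds.
     * The boolean[] is an assumed stand-in for a VectorMask.
     */
    static int firstOutOfBoundsLane(boolean[] mask, int offset, int length) {
        for (int lane = 0; lane < mask.length; lane++) {
            long index = (long) offset + lane; // widen to avoid int overflow
            if (mask[lane] && (index < 0 || index >= length)) {
                return lane;
            }
        }
        return -1;
    }

    /** Throws if any active lane would access an index outside [0, length). */
    static void checkMaskedAccess(boolean[] mask, int offset, int length) {
        int lane = firstOutOfBoundsLane(mask, offset, length);
        if (lane >= 0) {
            throw new IndexOutOfBoundsException(
                "Active lane " + lane + " maps to index " + ((long) offset + lane)
                    + ", outside [0, " + length + ")");
        }
    }

    public static void main(String[] args) {
        // Tail of an 8-element array with 4 lanes: only lanes 0 and 1 are active,
        // so indices 6 and 7 are touched and the check passes.
        checkMaskedAccess(new boolean[] {true, true, false, false}, 6, 8);

        // Lane 2 is active here and maps to index 8, which is out of bounds.
        try {
            checkMaskedAccess(new boolean[] {true, true, true, false}, 6, 8);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

In the real API this logic sits on the hot path of every masked load and store, which is why the quoted description cares so much about the check being inlined rather than executed as an out-of-line call.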
>
> Following is the performance data on Grace:
>
> Benchmark                                            Mode   Cnt  Units      Before      After  Gain
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE      thrpt   30  ops/ms  31544.304  31610.598  1.002
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE    thrpt   30  ops/ms   3896.202   3903.249  1.001
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE     thrpt   30  ops/ms    570.415   7174.320  12.57
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE       thrpt   30  ops/ms    566.694   7193.520  12.69
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE      thrpt   30  ops/ms   3899.269   3878.258  0.994
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE     thrpt   30  ops/ms   1134.301  16053.847  14.15
> StoreMaskedIOOBEBenchmark.byteStoreArrayMaskIOOBE    thrpt   30  ops/ms  26449.558  28699.480  1.085
> StoreMaskedIOOBEBenchmark.doubleStoreArrayMaskIOOBE  thrpt   30  ops/ms   1922.167   5781.077  3.007
> StoreMaskedIOOBEBenchmark.floatStoreArrayMaskIOOBE   thrpt   30  ops/ms   3784.190  11789.276  3.115
> StoreMaskedIOOBEBenchmark.intStoreArrayMaskIOOBE     thrpt   30  ops/ms   3694.082  15633.547  4.232
> StoreMaskedIOOBEBenchmark.longStoreArrayMaskIOOBE    thrpt   30  ops/ms   1966.956   6049.790  3.075
> StoreMaskedIOOBEBenchmark.shortStoreArrayMaskIOOBE   thrpt   30  ops/ms   7647.309  27412.387  3.584

This pull request has now been integrated.

Changeset: d48ddfe4
Author:    Xiaohong Gong <xg...@openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/d48ddfe49a4e0b07949912d3c91d6f4737658b3e
Stats:     213 lines in 7 files changed: 36 ins; 140 del; 37 mod

8350748: VectorAPI: Method "checkMaskFromIndexSize" should be force inlined

Reviewed-by: psandoz

-------------

PR: https://git.openjdk.org/jdk/pull/23817