On Thu, 27 Feb 2025 23:30:29 GMT, Paul Sandoz <psan...@openjdk.org> wrote:
>> The method `checkMaskFromIndexSize` is called by some masked vector APIs such as
>> `fromArray/intoArray/fromMemorySegment/...`. It checks whether the index of any
>> active lane in a mask falls outside the bounds of the given
>> Array/MemorySegment. This function should be force-inlined; otherwise, a
>> VectorMask object is generated whenever the C2 compiler does not inline the
>> call, which hurts API performance significantly.
>>
>> This patch calls the `VectorMask.checkFromIndexSize` method directly inside
>> these APIs instead of `checkMaskFromIndexSize`. Since that method already has
>> the `@ForceInline` annotation, it will be inlined and intrinsified by C2, and
>> the expected vector instructions can then be generated. With this change, the
>> unused `checkMaskFromIndexSize` can be removed.
>>
>> Performance of some JMH benchmarks improves by up to 14x on an NVIDIA Grace
>> CPU (AArch64 SVE2, 128-bit vectors). A similar performance improvement can
>> also be observed on an Intel CPU that supports AVX512.
>>
>> Following is the performance data on Grace:
>>
>>
>> Benchmark                                            Mode   Cnt  Units      Before      After   Gain
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE      thrpt   30  ops/ms  31544.304  31610.598  1.002
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE    thrpt   30  ops/ms   3896.202   3903.249  1.001
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE     thrpt   30  ops/ms    570.415   7174.320  12.57
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE       thrpt   30  ops/ms    566.694   7193.520  12.69
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE      thrpt   30  ops/ms   3899.269   3878.258  0.994
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE     thrpt   30  ops/ms   1134.301  16053.847  14.15
>> StoreMaskedIOOBEBenchmark.byteStoreArrayMaskIOOBE    thrpt   30  ops/ms  26449.558  28699.480  1.085
>> StoreMaskedIOOBEBenchmark.doubleStoreArrayMaskIOOBE  thrpt   30  ops/ms   1922.167   5781.077  3.007
>> StoreMaskedIOOBEBenchmark.floatStoreArrayMaskIOOBE   thrpt   30  ops/ms   3784.190  11789.276  3.115
>> StoreMaskedIOOBEBenchmark.intStoreArrayMaskIOOBE     thrpt   30  ops/ms   3694.082  15633.547  4.232
>> StoreMaskedIOOBEBenchmark.longStoreArrayMaskIOOBE    thrpt   30  ops/ms   1966.956   6049.790  3.075
>> StoreMaskedIOOBEBenchmark.shortStoreArrayMaskIOOBE   thrpt   30  ops/ms   7647.309  27412.387  3.584
>
> Marked as reviewed by psandoz (Reviewer).

Thanks a lot for your review @PaulSandoz !

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23817#issuecomment-2693122599
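
For readers unfamiliar with the check discussed in the quoted description, here is a minimal scalar sketch of the rule it describes: an access is rejected only if some active lane maps to an index outside the bounds of the array. This is not the JDK implementation. The class `MaskBoundsSketch` and its methods are hypothetical names used for illustration, and the mask is modeled as a plain `boolean[]` rather than a `VectorMask`.

```java
// Illustrative only: a scalar model of the per-lane bounds rule, not JDK code.
final class MaskBoundsSketch {

    // Hypothetical helper: throws if any ACTIVE lane would touch an index
    // outside [0, length). Inactive lanes are ignored, even if out of range.
    static void checkActiveLanes(boolean[] mask, int offset, int length) {
        for (int lane = 0; lane < mask.length; lane++) {
            long index = (long) offset + lane; // widen to avoid int overflow
            if (mask[lane] && (index < 0 || index >= length)) {
                throw new IndexOutOfBoundsException(
                        "Active lane " + lane + " maps to index " + index
                                + ", outside [0, " + length + ")");
            }
        }
    }

    // Convenience wrapper (also hypothetical) that reports the result
    // as a boolean instead of throwing.
    static boolean inBounds(boolean[] mask, int offset, int length) {
        try {
            checkActiveLanes(mask, offset, length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }
}
```

With a 4-lane mask at offset 6 into an array of length 8, the access passes when only the first two lanes are active and fails once the third lane is set. The benchmarks quoted above exercise the vectorized form of this kind of check, where the cost depends on whether C2 inlines it.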