On Thu, 27 Feb 2025 06:43:19 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

> The method `checkMaskFromIndexSize` is called by masked vector APIs such as
> `fromArray/intoArray/fromMemorySegment/...`. It checks whether the index of
> any active lane in a mask falls outside the bounds of the given
> Array/MemorySegment. This method should be force-inlined; otherwise,
> whenever the C2 compiler does not inline the call, a VectorMask object is
> generated, which hurts the performance of these APIs considerably.
> 
> This patch calls the `VectorMask.checkFromIndexSize` method directly inside
> these APIs instead of `checkMaskFromIndexSize`. Because that method already
> has the `@ForceInline` annotation, it is inlined and intrinsified by C2, and
> the expected vector instructions are generated. With this change, the
> now-unused `checkMaskFromIndexSize` can be removed.
> 
> Performance of some JMH benchmarks improves by up to 14x on an NVIDIA Grace
> CPU (AArch64 SVE2, 128-bit vectors). A similar performance improvement is
> also observed on an Intel CPU that supports AVX512.
> 
> The following is the performance data on Grace:
> 
> 
> Benchmark                                              Mode  Cnt  Units      Before      After    Gain
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE       thrpt   30  ops/ms  31544.304  31610.598   1.002
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE     thrpt   30  ops/ms   3896.202   3903.249   1.001
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE      thrpt   30  ops/ms    570.415   7174.320   12.57
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE        thrpt   30  ops/ms    566.694   7193.520   12.69
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE       thrpt   30  ops/ms   3899.269   3878.258   0.994
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE      thrpt   30  ops/ms   1134.301  16053.847   14.15
> StoreMaskedIOOBEBenchmark.byteStoreArrayMaskIOOBE     thrpt   30  ops/ms  26449.558  28699.480   1.085
> StoreMaskedIOOBEBenchmark.doubleStoreArrayMaskIOOBE   thrpt   30  ops/ms   1922.167   5781.077   3.007
> StoreMaskedIOOBEBenchmark.floatStoreArrayMaskIOOBE    thrpt   30  ops/ms   3784.190  11789.276   3.115
> StoreMaskedIOOBEBenchmark.intStoreArrayMaskIOOBE      thrpt   30  ops/ms   3694.082  15633.547   4.232
> StoreMaskedIOOBEBenchmark.longStoreArrayMaskIOOBE     thrpt   30  ops/ms   1966.956   6049.790   3.075
> StoreMaskedIOOBEBenchmark.shortStoreArrayMaskIOOBE    thrpt   30  ops/ms   7647.309  27412.387   3.584
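
For readers unfamiliar with the check discussed in the quoted description, here is a minimal scalar sketch of what a masked bounds check has to decide: only lanes that are active in the mask may trigger an out-of-bounds failure. This is not the JDK implementation. The class and method names (`MaskBoundsSketch`, `firstOutOfBoundsLane`, `checkMasked`) are hypothetical, and a plain `boolean[]` stands in for a `VectorMask`.

```java
// Illustrative scalar model only; all names here are hypothetical and this
// is not how the JDK implements the check.
final class MaskBoundsSketch {

    // Returns the first active lane whose element index (offset + lane)
    // falls outside [0, length), or -1 if every active lane is in bounds.
    public static int firstOutOfBoundsLane(boolean[] mask, int offset, int length) {
        for (int lane = 0; lane < mask.length; lane++) {
            long index = (long) offset + lane; // widen to avoid int overflow
            if (mask[lane] && (index < 0 || index >= length)) {
                return lane;
            }
        }
        return -1;
    }

    // Throws if any active lane would touch an element outside the array.
    // Inactive lanes are ignored, even when their index is out of range.
    public static void checkMasked(boolean[] mask, int offset, int length) {
        int bad = firstOutOfBoundsLane(mask, offset, length);
        if (bad >= 0) {
            throw new IndexOutOfBoundsException(
                "active lane " + bad + " maps to index " + ((long) offset + bad)
                + " for length " + length);
        }
    }

    public static void main(String[] args) {
        // Four lanes starting at offset 6 in an array of length 8:
        // lanes 2 and 3 would be out of range, but they are inactive here.
        boolean[] tail = {true, true, false, false};
        System.out.println(firstOutOfBoundsLane(tail, 6, 8));

        // Activating lane 2 makes the access invalid.
        boolean[] tooWide = {true, true, true, false};
        System.out.println(firstOutOfBoundsLane(tooWide, 6, 8));
    }
}
```

In this sketch the first call reports no offending lane and the second reports lane 2. In the real API the equivalent check runs on every masked load or store, which is why it matters that C2 inlines it rather than materializing a mask object.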

This pull request has now been integrated.

Changeset: d48ddfe4
Author:    Xiaohong Gong <xg...@openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/d48ddfe49a4e0b07949912d3c91d6f4737658b3e
Stats:     213 lines in 7 files changed: 36 ins; 140 del; 37 mod

8350748: VectorAPI: Method "checkMaskFromIndexSize" should be force inlined

Reviewed-by: psandoz

-------------

PR: https://git.openjdk.org/jdk/pull/23817
