On Thu, 27 Feb 2025 23:30:29 GMT, Paul Sandoz <psan...@openjdk.org> wrote:

>> Method `checkMaskFromIndexSize` is called by some masked vector APIs such as 
>> `fromArray/intoArray/fromMemorySegment/...`. It checks whether the index of 
>> any active lane in a mask falls outside the bounds of the given 
>> Array/MemorySegment. This method should be force-inlined; otherwise, when 
>> the C2 compiler does not inline the call, a VectorMask object is 
>> materialized, which hurts the performance of these APIs significantly.
>> 
>> This patch calls the `VectorMask.checkFromIndexSize` method directly inside 
>> these APIs instead of `checkMaskFromIndexSize`. Since that method is already 
>> annotated with `@ForceInline`, it will be inlined and intrinsified by C2, 
>> and the expected vector instructions can then be generated. With this 
>> change, the now-unused `checkMaskFromIndexSize` can be removed.
>> 
>> Performance of some JMH benchmarks improves by up to 14x on an NVIDIA Grace 
>> CPU (AArch64 SVE2, 128-bit vectors). We also observe a similar performance 
>> improvement on an Intel CPU that supports AVX512.
>> 
>> The performance data on Grace follows:
>> 
>> 
>> Benchmark                                            Mode   Cnt   Units     Before      After   Gain
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE      thrpt   30  ops/ms  31544.304  31610.598  1.002
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE    thrpt   30  ops/ms   3896.202   3903.249  1.001
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE     thrpt   30  ops/ms    570.415   7174.320  12.57
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE       thrpt   30  ops/ms    566.694   7193.520  12.69
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE      thrpt   30  ops/ms   3899.269   3878.258  0.994
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE     thrpt   30  ops/ms   1134.301  16053.847  14.15
>> StoreMaskedIOOBEBenchmark.byteStoreArrayMaskIOOBE    thrpt   30  ops/ms  26449.558  28699.480  1.085
>> StoreMaskedIOOBEBenchmark.doubleStoreArrayMaskIOOBE  thrpt   30  ops/ms   1922.167   5781.077  3.007
>> StoreMaskedIOOBEBenchmark.floatStoreArrayMaskIOOBE   thrpt   30  ops/ms   3784.190  11789.276  3.115
>> StoreMaskedIOOBEBenchmark.intStoreArrayMaskIOOBE     thrpt   30  ops/ms   3694.082  15633.547  4.232
>> StoreMaskedIOOBEBenchmark.longStoreArrayMaskIOOBE    thrpt   30  ops/ms   1966.956   6049.790  3.075
>> StoreMaskedIOOBEBenchmark.shortStoreArrayMaskIOOBE   thrpt   30  ops/ms   7647.309  27412.387  3.584
>
> Marked as reviewed by psandoz (Reviewer).

Thanks a lot for your review, @PaulSandoz!
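
For readers following the quoted description, here is a scalar sketch of what a masked bounds check has to establish: every active lane, once added to the starting offset, must land inside the array. This is only an illustration in plain Java, not the JDK implementation. The class name `MaskedBoundsSketch`, the method name `checkMaskedRange`, and the use of a `boolean[]` in place of a `VectorMask` are all assumptions made for the example.

```java
import java.util.Objects;

// Hypothetical scalar model of a masked index check; not the JDK code.
final class MaskedBoundsSketch {

    // Throws if any active (true) lane maps to an index outside [0, length).
    // Inactive lanes are ignored, so a partially set mask may legally
    // "overhang" the end of the array.
    static void checkMaskedRange(int offset, boolean[] mask, int length) {
        Objects.requireNonNull(mask, "mask");
        for (int lane = 0; lane < mask.length; lane++) {
            // Widen to long so offset + lane cannot overflow.
            long index = (long) offset + lane;
            if (mask[lane] && (index < 0 || index >= length)) {
                throw new IndexOutOfBoundsException(
                    "active lane " + lane + " maps to index " + index
                    + " for length " + length);
            }
        }
    }
}
```

With a 4-lane mask `{true, true, false, false}`, offset 6 and length 8, the check passes because only indices 6 and 7 are active. Setting the third lane as well makes it throw, since index 8 is out of range.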

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23817#issuecomment-2693122599
