On Thu, 2 Feb 2023 18:07:59 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> Hi, I tried a modified version that uses the old implementation for the tail
>> loop instead of adding the new intrinsics. The code looks like:
>> 
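>> public VectorMask<E> indexInRange(int offset, int limit) {
>>     int vlength = length();
>>     if (offset >= 0 && limit - offset >= vlength) {
>>         // Every lane is in range, so the mask is unchanged.
>>         return this;
>>     } else if (offset >= limit) {
>>         // No lane is in range.
>>         return vectorSpecies().maskAll(false);
>>     }
>>
>>     // Tail case: build the lane indices [0, 1, 2, ...] and clear the
>>     // lanes whose adjusted index falls out of range.
>>     Vector<E> iota = vectorSpecies().zero().addIndex(1);
>>     VectorMask<E> badMask = checkIndex0(offset, limit, iota, vlength);
>>     return this.andNot(badMask);
>> }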
>> 
>> I tested the performance of the newly added benchmarks with different
>> vector sizes on NEON/SVE and x86 AVX2/AVX-512 architectures. The results
>> show that the modified version is no better than the current version when
>> the array size is not a multiple of the vector size, and it is clearly
>> slower for the double/long types at larger sizes.
>> 
>> Here are some raw data with NEON:
>> 
>> Benchmark                                 (size)   Mode  Cnt   current  modified   Units
>> IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5  2654.919  2584.423  ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5  1830.364  1802.876  ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5  1058.548  1073.742  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5   594.920   593.832  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5   308.678   149.279  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5   160.639    74.579  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5  1097.567  1104.008  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5   617.845   606.886  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5   315.978   152.046  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        131  thrpt    5  1165.279  1205.486  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        259  thrpt    5   633.648   631.672  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        515  thrpt    5   315.370   154.144  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       131  thrpt    5   639.840   633.623  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       259  thrpt    5   312.267   152.788  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       515  thrpt    5   163.028    78.150  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5  1834.318  1800.318  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5  1105.695  1094.347  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5   602.442   599.827  ops/ms
>> 
>> 
>> And the data with SVE 256-bit vector size:
>> 
>> Benchmark                                 (size)   Mode  Cnt    current   modified   Units
>> IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5  23772.370  22921.113  ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5  18930.388  17920.910  ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5  13528.610  13282.504  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5   7850.522   7975.720  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5   4281.749   4373.926  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5   2160.001    604.458  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5  13594.943  13306.904  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5   8163.134   7912.343  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5   4335.529   4198.555  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        131  thrpt    5  22106.880  20348.266  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        259  thrpt    5  11711.588  10958.299  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        515  thrpt    5   5501.034   5358.441  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       131  thrpt    5   9832.578   9829.398  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       259  thrpt    5   4979.326   4947.166  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       515  thrpt    5   2269.131    614.204  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5  19865.866  19297.628  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5  14005.214  13592.407  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5   8766.450   8531.675  ops/ms
>> 
>> In conclusion, I would prefer to keep the current version. WDYT?
>
> Yes, I agree. I collected performance data with your benchmarks at various
> AVX levels and see significant gains. The performance variation between the
> integer and floating-point cases is due to the limitation described here:
> https://github.com/openjdk/jdk/pull/12064#discussion_r1094101761
> That can be addressed in a separate PR.
> 
> FTR, here are the performance numbers at UseAVX=3:
> 
> 
> Benchmark                                 (size)   Mode  Cnt      Score   Error   Units
> IndexInRangeBenchmark.byteIndexInRange      1024  thrpt    2  74983.406           ops/ms
> IndexInRangeBenchmark.byteIndexInRange      1027  thrpt    2  19156.962           ops/ms
> IndexInRangeBenchmark.doubleIndexInRange    1024  thrpt    2  11368.179           ops/ms
> IndexInRangeBenchmark.doubleIndexInRange    1027  thrpt    2   2165.207           ops/ms
> IndexInRangeBenchmark.floatIndexInRange     1024  thrpt    2  18736.787           ops/ms
> IndexInRangeBenchmark.floatIndexInRange     1027  thrpt    2   3798.996           ops/ms
> IndexInRangeBenchmark.intIndexInRange       1024  thrpt    2  18797.863           ops/ms
> IndexInRangeBenchmark.intIndexInRange       1027  thrpt    2   5455.317           ops/ms
> IndexInRangeBenchmark.longIndexInRange      1024  thrpt    2  11866.493           ops/ms
> IndexInRangeBenchmark.longIndexInRange      1027  thrpt    2   2227.896           ops/ms
> IndexInRangeBenchmark.shortIndexInRange     1024  thrpt    2  46921.520           ops/ms
> IndexInRangeBenchmark.shortIndexInRange     1027  thrpt    2   8532.394           ops/ms

Thanks for the performance testing on x86 systems! I agree that a separate PR
is fine, and I will address it once this PR is merged.
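
For anyone reading this thread later, the lane semantics that the
indexInRange benchmarks exercise can be sketched in scalar Java roughly as
follows. This is only an illustrative model, not the JDK implementation: the
class and method names are hypothetical, and a plain boolean[] stands in for
VectorMask. The assumption is that lane i stays set only if it was already
set and 0 <= offset + i < limit.

```java
import java.util.Arrays;

public class IndexInRangeSketch {
    // Hypothetical scalar model of the mask computation (not the JDK code):
    // lane i survives only if it was set in the input mask and the adjusted
    // index offset + i lies in [0, limit).
    static boolean[] indexInRange(boolean[] mask, int offset, int limit) {
        boolean[] result = new boolean[mask.length];
        for (int i = 0; i < mask.length; i++) {
            long index = (long) offset + i; // widen to avoid int overflow
            result[i] = mask[i] && index >= 0 && index < limit;
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] all = new boolean[8];
        Arrays.fill(all, true);
        // Array size 131 with 8 lanes: the tail starts at offset 128,
        // so only the first 3 lanes remain set.
        System.out.println(Arrays.toString(indexInRange(all, 128, 131)));
        // prints [true, true, true, false, false, false, false, false]
    }
}
```

The benchmark sizes 131, 259, 515 and 1027 are each 3 past a power of two,
so they hit this partial-mask tail case, while 1024 is a whole multiple of
the lane counts and does not.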

-------------

PR: https://git.openjdk.org/jdk/pull/12064
