Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v34]

Jatin Bhateja Mon, 11 Sep 2023 11:29:41 -0700

On Fri, 8 Sep 2023 18:10:33 GMT, Srinivas Vamsi Parasa <[email protected]> wrote:


>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |    Arrays.sort benchmark   |       Array Size      |       Baseline 
>> (us/op)        |       AVX512 Sort (us/op)     |       Speedup |
>> |    ---     |       ---     |       ---     |       ---     |       ---     
>> |
>> |    ArraysSort.doubleSort   |       10      |       0.034   |       0.035   
>> |       1.0     |
>> |    ArraysSort.doubleSort   |       25      |       0.116   |       0.089   
>> |       1.3     |
>> |    ArraysSort.doubleSort   |       50      |       0.282   |       0.291   
>> |       1.0     |
>> |    ArraysSort.doubleSort   |       75      |       0.474   |       0.358   
>> |       1.3     |
>> |    ArraysSort.doubleSort   |       100     |       0.654   |       0.623   
>> |       1.0     |
>> |    ArraysSort.doubleSort   |       1000    |       9.274   |       6.331   
>> |       1.5     |
>> |    ArraysSort.doubleSort   |       10000   |       323.339 |       71.228  
>> |       **4.5** |
>> |    ArraysSort.doubleSort   |       100000  |       4471.871        |       
>> 1002.748        |       **4.5** |
>> |    ArraysSort.doubleSort   |       1000000 |       51660.742       |       
>> 12921.295       |       **4.0** |
>> |    ArraysSort.floatSort    |       10      |       0.045   |       0.046   
>> |       1.0     |
>> |    ArraysSort.floatSort    |       25      |       0.103   |       0.084   
>> |       1.2     |
>> |    ArraysSort.floatSort    |       50      |       0.285   |       0.33    
>> |       0.9     |
>> |    ArraysSort.floatSort    |       75      |       0.492   |       0.346   
>> |       1.4     |
>> |    ArraysSort.floatSort    |       100     |       0.597   |       0.326   
>> |       1.8     |
>> |    ArraysSort.floatSort    |       1000    |       9.811   |       5.294   
>> |       1.9     |
>> |    ArraysSort.floatSort    |       10000   |       323.955 |       50.547  
>> |       **6.4** |
>> |    ArraysSort.floatSort    |       100000  |       4326.38 |       731.152 
>> |       **5.9** |
>> |    ArraysSort.floatSort    |       1000000 |       52413.88        |       
>> 8409.193        |       **6.2** |
>> |    ArraysSort.intSort      |       10      |       0.033   |       0.033   
>> |       1.0     |
>> |    ArraysSort.intSort      |       25      |       0.086   |       0.051   
>> |       1.7     |
>> |    ArraysSort.intSort      |       50      |       0.236   |       0.151   
>> |       1.6     |
>> |    ArraysSort.intSort      |       75      |       0.416   |       0.332   
>> |       1.3     |
>> |    ArraysSort.intSort      |       100     |       0.63    |       0.521   
>> |       1.2     |
>> |    ArraysSort.intSort      |       1000    |       10.518  |       4.698   
>> |       2.2     |
>> |    ArraysSort.intSort      |       10000   |       309.659 |       42.518  
>> |       **7.3** |
>> |    ArraysSort.intSort      |       100000  |       4130.917        |       
>> 573.956 |       **7.2** |
>> |    ArraysSort.intSort      |       1000000 |       49876.307       |       
>> 6712.812        |       **7.4** |
>> |    ArraysSort.longSort     |       10      |       0.036   |       0.037   
>> |       1.0     |
>> |    ArraysSort.longSort     |       25      |       0.094   |       0.08    
>> |       1.2     |
>> |    ArraysSort.longSort     |       50      |       0.218   |       0.227   
>> |       1.0     |
>> |    ArraysSort.longSort     |       75      |       0.466   |       0.402   
>> |       1.2     |
>> |    ArraysSort.longSort     |       100     |       0.76    |       0.58    
>> |       1.3     |
>> |    ArraysSort.longSort     |       1000    |       10.449  |       6....
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Fix regression when intrinsics are disabled; enable insertion sort in 
> intrinsic, change library name to libsimdsort

Overall patch looks good to me, apart from a minor refactoring suggestion.

> > > 
> > 
> > 
> > Hi @vamsi-parasa ,
> > Please find below the perf data collected over “Linux” with following JMH 
> > options. Idea here is to measure the impact of Java side code 
> > re-structuring over non-x86 targets and windows machines which do not 
> > intrinsify newly created primitives.
> > java -jar target/benchmarks.jar -p builder=RANDOM -f 1 -wi 1 -i 10 -w 30 
> > -jvmArgs "-XX:+UnlockDiagnosticVMOptions 
> > -XX:DisableIntrinsic=_arraySort,_arrayPartition" ArraysSort.Long.testSort
> > Baseline numbers are with stock JDK.
> > ![image](https://user-images.githubusercontent.com/59989778/264077119-d3bf2591-38bb-4924-b77d-6889c5dbc3c0.png)
> > It will also be good to mention in PR description as to why are not 
> > accelerating sorting for short/char arrays when x86-simd-sort does have 
> > avx512 optimized versions for 16-bit arrays sorting.
> > Best Regards, Jatin
> 
> Hello Jatin,
> 
> In the table below, please see the performance data (on Linux x86 machine) 
> when the intrinsics are disabled w.r.t to the stock JDK as baseline. The 
> regression is fixed in the latest commit pushed.
> 
> This data was collected using the same way as yours: `java -jar 
> target/benchmarks.jar -p builder=RANDOM -f 1 -wi 1 -i 10 -w 30 -jvmArgs 
> "-XX:+UnlockDiagnosticVMOptions 
> -XX:DisableIntrinsic=_arraySortI,_arraySortMI,_arrayPartitionSP,_arrayPartitionDP"
>  ArraysSort.Long.testSort`
> 
> Please note the new names of the intrinsics.
> 
> Benchmark     (builder)       (size)  Baseline (Stock JDK) us/op      RSD 
> (baseline)  AVX512 sort (intrinsics disabled) us/op RSD (AVX512 sort)       
> Speedup
> ArraysSort.Long.testSort      RANDOM  800     6.4     0.1%    6.3     0.43%   
> 1.02
> ArraysSort.Long.testSort      RANDOM  7000    210.2   0.5%    205.5   0.24%   
> 1.02
> ArraysSort.Long.testSort      RANDOM  50000   2087.2  0.1%    2057.5  0.18%   
> 1.01
> ArraysSort.Long.testSort      RANDOM  300000  14076.0 0.3%    14245.2 0.13%   
> 0.99
> ArraysSort.Long.testSort      RANDOM  2000000 111576.6        0.4%    
> 110024.9        0.04%   1.01
> Thanks, Vamsi

Thanks @vamsi-parasa  for addressing this.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4205:

> 4203: 
> 4204:       snprintf(ebuf_, sizeof(ebuf_), "avx512_sort_double");
> 4205:       StubRoutines::_arraysort_double = 
> (address)os::dll_lookup(libsimdsort, ebuf_);

Hi @vamsi-parasa , If we align these C++ stub entry points with java intrinsic 
entry points which accept an element class argument, it will significantly 
cleanup the code referring to these stub entry points. We can pass an extra int 
type flag to stubs based on which C++ implementation can appropriately call 
type specific leaf level routines.

-------------

PR Review: https://git.openjdk.org/jdk/pull/14227#pullrequestreview-1620565021
PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1714374347
PR Review Comment: https://git.openjdk.org/jdk/pull/14227#discussion_r1321907992

Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v34]

Reply via email to