Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-27 Thread Jatin Bhateja
On Tue, 27 Jul 2021 02:52:13 GMT, Eric Liu  wrote:

>> @sviswa7, SLP flow will either have a constant 8bit shift value or a 
>> variable shift present in vector, this also include broadcasted non-constant 
>> shift value or a shift value beyond 8 bit.
>
> It would be better comment here, since the correctness relay on some others.

@theRealELiu , @sviswa7 , comment already exist in code,  I guess I mentioned 
incorrectly earlier on this thread, rectified my comments.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-26 Thread Eric Liu
On Mon, 26 Jul 2021 18:56:01 GMT, Jatin Bhateja  wrote:

>> @jatin-bhateja  This question is still pending.
>
> @sviswa7, SLP flow will either have a constant 8bit shift value or a variable 
> shift present in vector. So non constant scalar case will not be hit through 
> this route.

It would be better comment here, since the correctness relay on some others.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-26 Thread Jatin Bhateja
On Mon, 26 Jul 2021 17:19:07 GMT, Sandhya Viswanathan 
 wrote:

>> And'ing with shift_mask is already done on Java API side implementation 
>> before making a call to intrinsic rountine.
>
> @jatin-bhateja  This question is still pending.

@sviswa7, SLP flow will either have a constant 8bit shift value or a variable 
shift present in vector. So non constant scalar case will not be hit through 
this route.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-26 Thread Jatin Bhateja
On Mon, 26 Jul 2021 17:19:07 GMT, Sandhya Viswanathan 
 wrote:

>> And'ing with shift_mask is already done on Java API side implementation 
>> before making a call to intrinsic rountine.
>
> @jatin-bhateja  This question is still pending.

Other than VectorAPI , SLP also infers vector rotates where shift is either a 
8bit constant or variable shift present in vector. So this case of scalar 
non-constant shift will not be hit for non-vectorAPI case.
Also it will be illegal to perform any wrap around for shifts.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-26 Thread Sandhya Viswanathan
On Sun, 18 Jul 2021 20:22:18 GMT, Jatin Bhateja  wrote:

>> src/hotspot/share/opto/vectornode.cpp line 1180:
>> 
>>> 1178:   cnt = cnt->in(1);
>>> 1179: }
>>> 1180: shiftRCnt = cnt;
>> 
>> Why do we remove the And with mask here?
>
> And'ing with shift_mask is already done on Java API side implementation 
> before making a call to intrinsic rountine.

@jatin-bhateja  This question is still pending.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-18 Thread Jatin Bhateja
On Fri, 16 Jul 2021 00:52:21 GMT, Sandhya Viswanathan 
 wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 15 commits:
>> 
>>  - 8266054: Incorporating styling changes based on reviews.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge branch 'JDK-8266054' of http://github.com/jatin-bhateja/jdk into 
>> JDK-8266054
>>  - 8266054: Removing redundant teat templates.
>>  - 8266054: Code reorganization for efficient sharing of logic to check 
>> rotate operation support on a target platform.
>>  - 8266054: Removing redundant test templates.
>>  - 8266054: Review comments resolution.
>>  - ... and 5 more: 
>> https://git.openjdk.java.net/jdk/compare/07e90524...609c4143
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 84:
> 
>> 82: arch_supports_vector(Op_OrV, num_elem, elem_bt, VecMaskNotUsed)) 
>> {
>> 83:   is_supported = true;
>> 84: }
> 
> When check_bcast is set, is_supported could be false when replicate is not 
> supported. Is replicate not needed for shift+or sequence?

check_bcast is true only when shift value is a non-constant scalar value, in 
that case we need to check for broadcasting operation for shift, in all other 
cases broadcast is not needed.  Constant shift value is an optimizing case 
since AVX512 offers instructions which directly accept 8bit immediate shift 
value.

> src/hotspot/share/opto/vectorIntrinsics.cpp line 86:
> 
>> 84: }
>> 85: return is_supported;
>> 86: }
> 
> Please add comments here why the Left/Right shift and Or opcodes are being 
> checked here. Also add comments why for left shift we are only checking for 
> int and long left shift opcodes whereas for right shift sub word opcodes are 
> being checked.

Both left and right shifts opcodes are selected for all integral types 
(byte/short/int/long). VectorNode::opcode returns the granular left shift type 
based on the sub-type i.e. elem_bt in case of  LeftShiftI.  Re-organizing the 
code for better readability.

> src/hotspot/share/opto/vectorIntrinsics.cpp line 338:
> 
>> 336: // TODO When mask usage is supported, VecMaskNotUsed needs to be 
>> VecMaskUseLoad.
>> 337: if ((sopc != 0) &&
>> 338: !arch_supports_vector(sopc, num_elem, elem_bt, 
>> is_vector_mask(vbox_klass) ? VecMaskUseAll : VecMaskNotUsed)) {
> 
> Could we not call arch_supports_vector_rotate from arch_supports_vector?

DONE

> src/hotspot/share/opto/vectorIntrinsics.cpp line 1563:
> 
>> 1561:  -0x80 <= cnt_type->get_con() && 
>> cnt_type->get_con() < 0x80;
>> 1562:   if (is_rotate) {
>> 1563: if (!arch_supports_vector_rotate(sopc, num_elem, elem_bt, 
>> !is_const_rotate)) {
> 
> What is the relationship between check_bcast and !is_const_rotate? Some 
> comments here on this would help.

Constant shift value is an optimizing case since AVX512 offers instructions 
which directly accept constant shifts in the range (-256, 255). Similar 
handling is done in SLP.
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L2493

But I feel this is very X86 specific check in generic code, so moving decision 
to a new target specific matcher routine.

> src/hotspot/share/opto/vectorIntrinsics.cpp line 1590:
> 
>> 1588:   opd2 = gvn().transform(VectorNode::scalar2vector(cnt, num_elem, 
>> type_bt));
>> 1589: } else {
>> 1590:   // constant shift.
> 
> Did you mean constant rotate here?

Yes.

> src/hotspot/share/opto/vectornode.cpp line 1180:
> 
>> 1178:   cnt = cnt->in(1);
>> 1179: }
>> 1180: shiftRCnt = cnt;
> 
> Why do we remove the And with mask here?

And'ing with shift_mask is already done on Java API side implementation before 
making a call to intrinsic rountine.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-15 Thread Sandhya Viswanathan
On Thu, 15 Jul 2021 08:34:42 GMT, Jatin Bhateja  wrote:

>> Current VectorAPI Java side implementation expresses rotateLeft and 
>> rotateRight operation using following operations:-
>> 
>> vec1 = lanewise(VectorOperators.LSHL, n)
>> vec2 = lanewise(VectorOperators.LSHR, n)
>> res = lanewise(VectorOperations.OR, vec1 , vec2)
>> 
>> This patch moves above handling from Java side to C2 compiler which 
>> facilitates dismantling the rotate operation if target ISA does not support 
>> a direct rotate instruction.
>> 
>> AVX512 added vector rotate instructions vpro[rl][v][dq] which operate over 
>> long and integer type vectors. For other cases (i.e. sub-word type vectors 
>> or for targets which do not support direct rotate operations )   instruction 
>> sequence comprising of vector SHIFT (LEFT/RIGHT) and vector OR is emitted.
>> 
>> Please find below the performance data for included JMH benchmark.
>> Machine:  Cascade Lake Server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz)
>> 
>> 
>> Benchmark | (TESTSIZE) | Shift | Baseline AVX3 (ops/ms) | Withopt  AVX3 
>> (ops/ms) | Gain % | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain %
>> -- | -- | -- | -- | -- | -- | -- | -- | --
>>   |   |   |   |   |   |   |   |  
>> RotateBenchmark.testRotateLeftB | 128.00 | 7.00 | 17223.35 | 17094.69 | 
>> -0.75 | 17008.32 | 17488.06 | 2.82
>> RotateBenchmark.testRotateLeftB | 128.00 | 7.00 | 8944.98 | 8811.34 | -1.49 
>> | 8878.17 | 9218.68 | 3.84
>> RotateBenchmark.testRotateLeftB | 128.00 | 15.00 | 17195.75 | 17137.32 | 
>> -0.34 | 16789.01 | 17780.34 | 5.90
>> RotateBenchmark.testRotateLeftB | 128.00 | 15.00 | 9052.67 | 8838.60 | -2.36 
>> | 8814.62 | 9206.01 | 4.44
>> RotateBenchmark.testRotateLeftB | 128.00 | 31.00 | 17100.19 | 16950.64 | 
>> -0.87 | 16827.73 | 17720.37 | 5.30
>> RotateBenchmark.testRotateLeftB | 128.00 | 31.00 | 9079.95 | 8471.26 | -6.70 
>> | .44 | 9167.68 | 3.14
>> RotateBenchmark.testRotateLeftB | 256.00 | 7.00 | 21231.33 | 21513.08 | 1.33 
>> | 21824.51 | 21479.48 | -1.58
>> RotateBenchmark.testRotateLeftB | 256.00 | 7.00 | 11103.62 | 11180.16 | 0.69 
>> | 11173.67 | 11529.22 | 3.18
>> RotateBenchmark.testRotateLeftB | 256.00 | 15.00 | 21119.14 | 21552.04 | 
>> 2.05 | 21693.05 | 21915.37 | 1.02
>> RotateBenchmark.testRotateLeftB | 256.00 | 15.00 | 11048.68 | 11094.20 | 
>> 0.41 | 11049.90 | 11439.07 | 3.52
>> RotateBenchmark.testRotateLeftB | 256.00 | 31.00 | 21506.31 | 21391.41 | 
>> -0.53 | 21263.18 | 21986.29 | 3.40
>> RotateBenchmark.testRotateLeftB | 256.00 | 31.00 | 11056.12 | 11232.78 | 
>> 1.60 | 10941.59 | 11397.09 | 4.16
>> RotateBenchmark.testRotateLeftB | 512.00 | 7.00 | 17976.56 | 18180.85 | 1.14 
>> | 1212.26 | 2533.34 | 108.98
>> RotateBenchmark.testRotateLeftB | 512.00 | 15.00 | 17553.70 | 18219.07 | 
>> 3.79 | 1256.73 | 2537.41 | 101.91
>> RotateBenchmark.testRotateLeftB | 512.00 | 31.00 | 17618.03 | 17738.15 | 
>> 0.68 | 1214.69 | 2533.83 | 108.60
>> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 7258.87 | 7468.88 | 2.89 | 
>> 7115.12 | 7117.26 | 0.03
>> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 3586.65 | 3950.85 | 10.15 
>> | 3532.17 | 3595.80 | 1.80
>> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 1835.07 | 1999.68 | 8.97 | 
>> 1789.90 | 1819.93 | 1.68
>> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 7273.36 | 7410.91 | 1.89 
>> | 7198.60 | 6994.79 | -2.83
>> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 3674.98 | 3926.27 | 6.84 
>> | 3549.90 | 3755.09 | 5.78
>> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 1840.94 | 1882.25 | 2.24 
>> | 1801.56 | 1872.89 | 3.96
>> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 7457.11 | 7361.48 | -1.28 
>> | 6975.33 | 7385.94 | 5.89
>> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 3570.74 | 3929.30 | 10.04 
>> | 3635.37 | 3736.67 | 2.79
>> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 1902.32 | 1960.46 | 3.06 
>> | 1812.32 | 1813.88 | 0.09
>> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 11174.24 | 12044.52 | 7.79 
>> | 11509.87 | 11273.44 | -2.05
>> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 5981.47 | 6073.70 | 1.54 | 
>> 5593.66 | 5661.93 | 1.22
>> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 2932.49 | 3069.54 | 4.67 | 
>> 2950.86 | 2892.42 | -1.98
>> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 11764.11 | 12098.63 | 
>> 2.84 | 11069.52 | 11476.93 | 3.68
>> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 5855.20 | 6080.40 | 3.85 
>> | 5919.11 | 5607.04 | -5.27
>> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 2989.05 | 3048.56 | 1.99 
>> | 2902.63 | 2821.83 | -2.78
>> RotateBenchmark.testRotateLeftI | 256.00 | 31.00 | 11652.84 | 11965.40 | 
>> 2.68 | 11525.62 | 11459.83 | -0.57
>> RotateBenchmark.testRotateLeftI | 256.00 | 31.00 | 5851.82 | 6164.94 | 5.35 
>> | 5882.60 | 5842.30 | -0.69
>> RotateBenchmark.testRotateLeftI | 256.00 | 31.00 | 3015.99 | 3043.79 | 0.92 
>> | 2963.71 | 2947.97 | -0.53
>> RotateBenchmark.testRotateLeftI | 512.00 | 7.00 | 

Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-15 Thread Jatin Bhateja
> Current VectorAPI Java side implementation expresses rotateLeft and 
> rotateRight operation using following operations:-
> 
> vec1 = lanewise(VectorOperators.LSHL, n)
> vec2 = lanewise(VectorOperators.LSHR, n)
> res = lanewise(VectorOperations.OR, vec1 , vec2)
> 
> This patch moves above handling from Java side to C2 compiler which 
> facilitates dismantling the rotate operation if target ISA does not support a 
> direct rotate instruction.
> 
> AVX512 added vector rotate instructions vpro[rl][v][dq] which operate over 
> long and integer type vectors. For other cases (i.e. sub-word type vectors or 
> for targets which do not support direct rotate operations )   instruction 
> sequence comprising of vector SHIFT (LEFT/RIGHT) and vector OR is emitted.
> 
> Please find below the performance data for included JMH benchmark.
> Machine:  Cascade Lake Server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz)
> 
> 
> Benchmark | (TESTSIZE) | Shift | Baseline AVX3 (ops/ms) | Withopt  AVX3 
> (ops/ms) | Gain % | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain %
> -- | -- | -- | -- | -- | -- | -- | -- | --
>   |   |   |   |   |   |   |   |  
> RotateBenchmark.testRotateLeftB | 128.00 | 7.00 | 17223.35 | 17094.69 | -0.75 
> | 17008.32 | 17488.06 | 2.82
> RotateBenchmark.testRotateLeftB | 128.00 | 7.00 | 8944.98 | 8811.34 | -1.49 | 
> 8878.17 | 9218.68 | 3.84
> RotateBenchmark.testRotateLeftB | 128.00 | 15.00 | 17195.75 | 17137.32 | 
> -0.34 | 16789.01 | 17780.34 | 5.90
> RotateBenchmark.testRotateLeftB | 128.00 | 15.00 | 9052.67 | 8838.60 | -2.36 
> | 8814.62 | 9206.01 | 4.44
> RotateBenchmark.testRotateLeftB | 128.00 | 31.00 | 17100.19 | 16950.64 | 
> -0.87 | 16827.73 | 17720.37 | 5.30
> RotateBenchmark.testRotateLeftB | 128.00 | 31.00 | 9079.95 | 8471.26 | -6.70 
> | .44 | 9167.68 | 3.14
> RotateBenchmark.testRotateLeftB | 256.00 | 7.00 | 21231.33 | 21513.08 | 1.33 
> | 21824.51 | 21479.48 | -1.58
> RotateBenchmark.testRotateLeftB | 256.00 | 7.00 | 11103.62 | 11180.16 | 0.69 
> | 11173.67 | 11529.22 | 3.18
> RotateBenchmark.testRotateLeftB | 256.00 | 15.00 | 21119.14 | 21552.04 | 2.05 
> | 21693.05 | 21915.37 | 1.02
> RotateBenchmark.testRotateLeftB | 256.00 | 15.00 | 11048.68 | 11094.20 | 0.41 
> | 11049.90 | 11439.07 | 3.52
> RotateBenchmark.testRotateLeftB | 256.00 | 31.00 | 21506.31 | 21391.41 | 
> -0.53 | 21263.18 | 21986.29 | 3.40
> RotateBenchmark.testRotateLeftB | 256.00 | 31.00 | 11056.12 | 11232.78 | 1.60 
> | 10941.59 | 11397.09 | 4.16
> RotateBenchmark.testRotateLeftB | 512.00 | 7.00 | 17976.56 | 18180.85 | 1.14 
> | 1212.26 | 2533.34 | 108.98
> RotateBenchmark.testRotateLeftB | 512.00 | 15.00 | 17553.70 | 18219.07 | 3.79 
> | 1256.73 | 2537.41 | 101.91
> RotateBenchmark.testRotateLeftB | 512.00 | 31.00 | 17618.03 | 17738.15 | 0.68 
> | 1214.69 | 2533.83 | 108.60
> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 7258.87 | 7468.88 | 2.89 | 
> 7115.12 | 7117.26 | 0.03
> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 3586.65 | 3950.85 | 10.15 | 
> 3532.17 | 3595.80 | 1.80
> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 1835.07 | 1999.68 | 8.97 | 
> 1789.90 | 1819.93 | 1.68
> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 7273.36 | 7410.91 | 1.89 | 
> 7198.60 | 6994.79 | -2.83
> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 3674.98 | 3926.27 | 6.84 | 
> 3549.90 | 3755.09 | 5.78
> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 1840.94 | 1882.25 | 2.24 | 
> 1801.56 | 1872.89 | 3.96
> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 7457.11 | 7361.48 | -1.28 
> | 6975.33 | 7385.94 | 5.89
> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 3570.74 | 3929.30 | 10.04 
> | 3635.37 | 3736.67 | 2.79
> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 1902.32 | 1960.46 | 3.06 | 
> 1812.32 | 1813.88 | 0.09
> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 11174.24 | 12044.52 | 7.79 
> | 11509.87 | 11273.44 | -2.05
> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 5981.47 | 6073.70 | 1.54 | 
> 5593.66 | 5661.93 | 1.22
> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 2932.49 | 3069.54 | 4.67 | 
> 2950.86 | 2892.42 | -1.98
> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 11764.11 | 12098.63 | 2.84 
> | 11069.52 | 11476.93 | 3.68
> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 5855.20 | 6080.40 | 3.85 | 
> 5919.11 | 5607.04 | -5.27
> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 2989.05 | 3048.56 | 1.99 | 
> 2902.63 | 2821.83 | -2.78
> RotateBenchmark.testRotateLeftI | 256.00 | 31.00 | 11652.84 | 11965.40 | 2.68 
> | 11525.62 | 11459.83 | -0.57
> RotateBenchmark.testRotateLeftI | 256.00 | 31.00 | 5851.82 | 6164.94 | 5.35 | 
> 5882.60 | 5842.30 | -0.69
> RotateBenchmark.testRotateLeftI | 256.00 | 31.00 | 3015.99 | 3043.79 | 0.92 | 
> 2963.71 | 2947.97 | -0.53
> RotateBenchmark.testRotateLeftI | 512.00 | 7.00 | 16029.15 | 16189.79 | 1.00 
> | 860.43 | 2339.32 | 171.88
> RotateBenchmark.testRotateLeftI | 512.00 | 7.00 | 8078.25 | 8081.84 | 0.04 | 
> 427.39