Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v6]

2022-06-07 Thread Jatin Bhateja
On Tue, 7 Jun 2022 04:29:40 GMT, Xiaohong Gong  wrote:

>> Currently, a masked vector load whose offset falls outside the array 
>> boundary is implemented in pure Java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundsException). This is necessary for architectures that do 
>> not support the predicate feature, because there the masked load is 
>> implemented as a full vector load with a vector blend applied to it, and a 
>> full vector load would definitely cause an IOOBE, which is not valid. 
>> However, for architectures that support the predicate feature, such as 
>> SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction 
>> as long as the indexes of the masked lanes are within the bounds of the 
>> array. On these architectures, the unmasked lanes do not raise an exception.
>> 
>> This patch adds vectorization support for the IOOBE part of the masked 
>> load. See the original Java implementation (marked `FIXME: optimize`):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies<Byte> species,
>>                        byte[] a, int offset,
>>                        VectorMask<Byte> m) {
>>       ByteSpecies vsp = (ByteSpecies) species;
>>       if (offset >= 0 && offset <= (a.length - species.length())) {
>>           return vsp.dummyVector().fromArray0(a, offset, m);
>>       }
>> 
>>       // FIXME: optimize
>>       checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>       return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since this path can only be vectorized with the predicated load, HotSpot 
>> must check whether the current backend supports it and fall back to the 
>> Java scalar version if not. This differs from the normal masked vector 
>> load, for which the compiler generates a full vector load and a vector 
>> blend if the predicated load is not supported. To let the compiler take 
>> the expected action, an additional flag (i.e. `usePred`) is added to the 
>> existing "loadMasked" intrinsic, with the value "true" for the IOOBE part 
>> and "false" for the normal load. The compiler fails to intrinsify if the 
>> flag is "true" and the predicated load is not supported by the backend, 
>> in which case the normal Java path is executed.
>> 
>> The patch also adds the same vectorization support for the masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The newly added benchmarks improve by about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark                                          Before    After     Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE    737.542   1387.069  ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE  118.366    330.776  ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE   233.832   6125.026  ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE     233.816   7075.923  ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE    119.771    330.587  ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE   431.961    939.301  ops/ms
>> 
>> A similar performance gain can also be observed on a 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains seven commits:
> 
>  - Add constant OFFSET_IN_RANGE and OFFSET_OUT_OF_RANGE
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
> predicate feature

LGTM.

-

PR: https://git.openjdk.java.net/jdk/pull/8035
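
The masked-load semantics described in the quoted PR text can be modelled in plain scalar Java. This is only a sketch for readers of the thread, not code from the patch: the class `MaskedLoadSketch` and its methods are hypothetical names, a `boolean[]` stands in for `VectorMask`, and only the range check mirrors the quoted `fromArray` code.

```java
import java.util.Arrays;

final class MaskedLoadSketch {

    /** Same range test as the quoted fromArray: can a full vector be read at this offset? */
    static boolean offsetInRange(int arrayLength, int offset, int lanes) {
        return offset >= 0 && offset <= arrayLength - lanes;
    }

    /**
     * Scalar model of a masked load. Lanes whose mask bit is unset are never
     * read and yield 0; a set lane that points outside the array throws.
     */
    static byte[] maskedLoad(byte[] a, int offset, boolean[] mask) {
        byte[] result = new byte[mask.length];
        for (int lane = 0; lane < mask.length; lane++) {
            if (!mask[lane]) {
                continue; // unset lane: no memory access, no exception
            }
            int index = offset + lane;
            if (index < 0 || index >= a.length) {
                throw new IndexOutOfBoundsException(
                        "set lane " + lane + " reads index " + index);
            }
            result[lane] = a[index];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5, 6};
        boolean[] mask = {true, true, false, false};
        // A full 4-lane load at offset 4 would run past the end of the array...
        System.out.println(offsetInRange(a.length, 4, mask.length));
        // ...but the two set lanes are in bounds, so the masked load is legal.
        System.out.println(Arrays.toString(maskedLoad(a, 4, mask)));
    }
}
```

The second call is the case the patch targets: the offset fails the full-vector range check, yet every set lane is within bounds, so a predicated load instruction can serve it without touching memory past the array.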


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-07 Thread Jatin Bhateja
On Tue, 7 Jun 2022 02:22:53 GMT, Xiaohong Gong  wrote:

>> test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java
>>  line 97:
>> 
>>> 95:     public void byteLoadArrayMaskIOOBE() {
>>> 96:         for (int i = 0; i < inSize; i += bspecies.length()) {
>>> 97:             VectorMask<Byte> mask = VectorMask.fromArray(bspecies, m, i);
>> 
>> For the other case, "if (offset >= 0 && offset <= (a.length - species.length()))", 
>> we are intrinsifying anyway. Should we limit this micro to exercise only the 
>> newly optimized case?
>
> Yeah, thanks, it's a really good suggestion to limit this benchmark to only 
> the IOOBE cases. I locally modified the tests to make sure only the IOOBE 
> case happens, and the results look good as well. But do you think it's better 
> to keep it as it is, since we can then also see the performance of the common 
> cases and make sure no regressions happen? The current benchmarks also show 
> the performance gain from this PR.

It was just to remove the noise from a targeted micro benchmark. But we can 
keep it as it is.

-

PR: https://git.openjdk.java.net/jdk/pull/8035
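
To make the "noise" point concrete, the following sketch counts how many iterations of a loop shaped like the quoted benchmark loop take the already-intrinsified in-range path and how many take the newly optimized out-of-range path. The array length, loop bound and lane count are made-up values, and `IterationSplitSketch` is a hypothetical helper; the real benchmark's sizes are not shown in this thread.

```java
final class IterationSplitSketch {

    /**
     * Returns {inRangeIterations, outOfRangeIterations} for a loop of the form
     * for (int i = 0; i < inSize; i += lanes), using the same range test as
     * the quoted fromArray code.
     */
    static int[] split(int arrayLength, int inSize, int lanes) {
        int inRange = 0;
        int outOfRange = 0;
        for (int i = 0; i < inSize; i += lanes) {
            if (i >= 0 && i <= arrayLength - lanes) {
                inRange++;      // full vector fits: the existing fast path
            } else {
                outOfRange++;   // tail iteration: the path this PR vectorizes
            }
        }
        return new int[] {inRange, outOfRange};
    }

    public static void main(String[] args) {
        // Hypothetical sizes: 1030 elements processed 64 lanes at a time.
        int[] counts = split(1030, 1030, 64);
        System.out.println("in range: " + counts[0] + ", out of range: " + counts[1]);
    }
}
```

With these assumed sizes only the last iteration leaves the in-range path, which is why a loop over the whole array measures mostly the common case, and also why keeping it guards against regressions there.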


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Jatin Bhateja
On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong  wrote:

> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
> predicate feature

test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java 
line 97:

> 95:     public void byteLoadArrayMaskIOOBE() {
> 96:         for (int i = 0; i < inSize; i += bspecies.length()) {
> 97:             VectorMask<Byte> mask = VectorMask.fromArray(bspecies, m, i);

For the other case, "if (offset >= 0 && offset <= (a.length - species.length()))", 
we are intrinsifying anyway. Should we limit this micro to exercise only the 
newly optimized case?

-

PR: https://git.openjdk.java.net/jdk/pull/8035
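
The commit list above renames the flag to `offsetInRange` and passes it as an integer constant. The decision the PR description attributes to the compiler can be sketched roughly as below. This is an illustration only: the constant names come from the v6 commit list, but their values, the `Lowering` enum, the `choose` method, and the assumption that a predicated load is also used for the in-range case on capable backends are not taken from the patch.

```java
final class LoadMaskedDecisionSketch {

    // Names as in the v6 commit list; the numeric values are assumed here.
    static final int OFFSET_OUT_OF_RANGE = 0;
    static final int OFFSET_IN_RANGE = 1;

    /** Hypothetical labels for the three outcomes described in the PR text. */
    enum Lowering { PREDICATED_LOAD, FULL_LOAD_AND_BLEND, JAVA_FALLBACK }

    static Lowering choose(int offsetInRange, boolean backendHasPredicatedLoad) {
        if (backendHasPredicatedLoad) {
            // Assumption: predicate-capable backends use the predicated load
            // for both the in-range and the out-of-range case.
            return Lowering.PREDICATED_LOAD;
        }
        if (offsetInRange == OFFSET_IN_RANGE) {
            // Normal masked load without predicate support:
            // full vector load followed by a blend.
            return Lowering.FULL_LOAD_AND_BLEND;
        }
        // Out-of-range offset and no predicated load: intrinsification
        // fails and the Java scalar path runs.
        return Lowering.JAVA_FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(choose(OFFSET_OUT_OF_RANGE, true));
        System.out.println(choose(OFFSET_IN_RANGE, false));
        System.out.println(choose(OFFSET_OUT_OF_RANGE, false));
    }
}
```

The point of passing the flag into the intrinsic is the last branch: without it the compiler could not tell that a full load plus blend is unsafe for this call site.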


Integrated: 8284960: Integration of JEP 426: Vector API (Fourth Incubator)

2022-05-31 Thread Jatin Bhateja
On Wed, 27 Apr 2022 11:03:48 GMT, Jatin Bhateja  wrote:

> Hi All,
> 
> Patch adds the planned support for new vector operations and APIs targeted 
> for [JEP 426: Vector API (Fourth 
> Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173)
> 
> Following is a brief summary of the changes:
> 
> 1) Extends the scope of the existing lanewise API with the following new 
> vector operations.
>    - VectorOperators.BIT_COUNT: counts the number of one-bits
>    - VectorOperators.LEADING_ZEROS_COUNT: counts the number of leading zero 
>      bits
>    - VectorOperators.TRAILING_ZEROS_COUNT: counts the number of trailing 
>      zero bits
>    - VectorOperators.REVERSE: reverses the order of bits
>    - VectorOperators.REVERSE_BYTES: reverses the order of bytes
>    - compress and expand bits: semantics are based on Hacker's Delight 
>      section 7-4, "Compress, or Generalized Extract".
> 
> 2) Adds the following new APIs to perform cross-lane vector compress and 
> expand operations under the influence of a mask.
>    - Vector.compress
>    - Vector.expand
>    - VectorMask.compress
> 
> 3) Adds predicated and non-predicated versions of the following new APIs to 
> load and store the contents of a vector from and to foreign MemorySegments.
>    - Vector.fromMemorySegment
>    - Vector.intoMemorySegment
> 
> 4) C2 compiler IR enhancements and optimized X86 and AARCH64 backend support 
> for each newly added operation.
> 
> 
>  The patch has been regression-tested on AARCH64 and on X86 targets with 
>  different AVX levels.
> 
>  Kindly review and share your feedback.
> 
>  Best Regards,
>  Jatin

This pull request has now been integrated.

Changeset: 6f6486e9
Author:    Jatin Bhateja
URL:       https://git.openjdk.java.net/jdk/commit/6f6486e97743eadfb20b4175e1b4b2b05b59a17a
Stats:     38021 lines in 228 files changed: 16652 ins; 16924 del; 4445 mod

8284960: Integration of JEP 426: Vector API (Fourth Incubator)

Co-authored-by: Jatin Bhateja 
Co-authored-by: Paul Sandoz 
Co-authored-by: Sandhya Viswanathan 
Co-authored-by: Smita Kamath 
Co-authored-by: Joshua Zhu 
Co-authored-by: Xiaohong Gong 
Co-authored-by: John R Rose 
Co-authored-by: Eric Liu 
Co-authored-by: Ningsheng Jian 
Reviewed-by: ngasson, vlivanov, mcimadamore, jlahoda, kvn

-

PR: https://git.openjdk.java.net/jdk/pull/8425
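
For readers unfamiliar with the new lanewise operations in item 1 of the PR description, the per-lane behaviour on an `int` lane can be illustrated with long-standing `java.lang.Integer` methods. The bit compress and expand helpers below are a naive loop written for this note, following the Hacker's Delight definition the description cites; `LanewiseScalarSketch` and its method names are hypothetical and this is not how the patch implements them.

```java
final class LanewiseScalarSketch {

    /** Gathers the bits of x selected by mask into the low-order bits of the result. */
    static int compressBits(int x, int mask) {
        int result = 0;
        int outBit = 0;
        for (int bit = 0; bit < 32; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((x >>> bit) & 1) << outBit;
                outBit++;
            }
        }
        return result;
    }

    /** Inverse direction: scatters the low-order bits of x to the positions set in mask. */
    static int expandBits(int x, int mask) {
        int result = 0;
        int inBit = 0;
        for (int bit = 0; bit < 32; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((x >>> inBit) & 1) << bit;
                inBit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int x = 0b1011_0110;
        System.out.println(Integer.bitCount(x));                 // BIT_COUNT
        System.out.println(Integer.numberOfLeadingZeros(x));     // LEADING_ZEROS_COUNT
        System.out.println(Integer.numberOfTrailingZeros(x));    // TRAILING_ZEROS_COUNT
        System.out.println(Integer.toHexString(Integer.reverse(x)));               // REVERSE
        System.out.println(Integer.toHexString(Integer.reverseBytes(0x11223344))); // REVERSE_BYTES
        System.out.println(Integer.toBinaryString(compressBits(x, 0b1111_0000)));
        System.out.println(Integer.toBinaryString(expandBits(0b1011, 0b1111_0000)));
    }
}
```

A vectorized lanewise operation applies the same function to every lane at once; the scalar form is what a fallback path would compute lane by lane.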


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v10]

2022-05-31 Thread Jatin Bhateja
On Wed, 25 May 2022 06:29:23 GMT, Jatin Bhateja  wrote:

> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 20 commits:
> 
>  - 8284960: Post merge cleanups.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>  - 8284960: Review comments resolved.
>  - 8284960: Integrating incremental patches.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>  - 8284960: Changes to enable jdk.incubator.vector to be treated as preview 
> participant. Code re-organization related to Reverse/ReverseByte IR 
> transforms.
>  - 8284960: Adding --enable-preview in vectorAPI benchmarks.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>  - 8284960: Review comments resolution.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>  - ... and 10 more: 
> https://git.openjdk.java.net/jdk/compare/742644e2...0f6e1584

Thanks reviewers for your comments.

-

PR: https://git.openjdk.java.net/jdk/pull/8425
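
Item 2 of the PR description (cross-lane compress and expand under a mask) can likewise be modelled on plain arrays. In this sketch an `int[]` stands in for a vector and a `boolean[]` for a `VectorMask`; `CrossLaneSketch` and its methods are hypothetical names for illustration, not the API added by the patch.

```java
import java.util.Arrays;

final class CrossLaneSketch {

    /** Packs the lanes selected by mask into the lowest lanes; remaining lanes are zero. */
    static int[] compress(int[] v, boolean[] mask) {
        int[] out = new int[v.length];
        int next = 0;
        for (int lane = 0; lane < v.length; lane++) {
            if (mask[lane]) {
                out[next++] = v[lane];
            }
        }
        return out;
    }

    /** Spreads the lowest lanes of v into the lanes selected by mask; other lanes are zero. */
    static int[] expand(int[] v, boolean[] mask) {
        int[] out = new int[v.length];
        int next = 0;
        for (int lane = 0; lane < v.length; lane++) {
            if (mask[lane]) {
                out[lane] = v[next++];
            }
        }
        return out;
    }

    /** Mask with as many low lanes set as the input has set lanes in total. */
    static boolean[] compressMask(boolean[] mask) {
        boolean[] out = new boolean[mask.length];
        int set = 0;
        for (boolean b : mask) {
            if (b) {
                out[set++] = true;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] v = {10, 20, 30, 40};
        boolean[] mask = {false, true, false, true};
        System.out.println(Arrays.toString(compress(v, mask)));
        System.out.println(Arrays.toString(expand(v, mask)));
        System.out.println(Arrays.toString(compressMask(mask)));
    }
}
```

Compressing and then expanding with the same mask restores the selected lanes to their original positions and zeroes the rest, which is a convenient property to check when testing a backend implementation.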


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8]

2022-05-26 Thread Jatin Bhateja
On Wed, 25 May 2022 06:25:53 GMT, Jatin Bhateja  wrote:

>> src/hotspot/cpu/x86/assembler_x86.cpp line 8173:
>> 
>>> 8171: 
>>> 8172: void Assembler::vinsertf32x4(XMMRegister dst, XMMRegister nds, 
>>> XMMRegister src, uint8_t imm8) {
>>> 8173:   assert(VM_Version::supports_evex(), "");
>> 
>> Hmm, did we never trigger this wrong assert because the use was guarded by 
>> correct check?
>
> Yes.

> @jatin-bhateja something wrong with merge. `vpadd()` is removed. It was added 
> by #8778 and still is used in `x86.ad`.

Hi @vnkozlov, after integration of PR 8778 there were two copies of 
vpadd with the same signature, so I removed one of them.

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v9]

2022-05-25 Thread Jatin Bhateja
On Wed, 25 May 2022 05:50:23 GMT, Jatin Bhateja  wrote:

> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8284960: Review comments resolved.

Hi @vnkozlov , Your comments have been addressed.

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8]

2022-05-25 Thread Jatin Bhateja
On Mon, 23 May 2022 22:17:40 GMT, Vladimir Kozlov  wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8284960: Integrating incremental patches.
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 8173:
> 
>> 8171: 
>> 8172: void Assembler::vinsertf32x4(XMMRegister dst, XMMRegister nds, 
>> XMMRegister src, uint8_t imm8) {
>> 8173:   assert(VM_Version::supports_evex(), "");
> 
> Hmm, did we never trigger this wrong assert because the use was guarded by 
> correct check?

Yes.

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v10]

2022-05-25 Thread Jatin Bhateja
Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 20 commits:

 - 8284960: Post merge cleanups.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Review comments resolved.
 - 8284960: Integrating incremental patches.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Changes to enable jdk.incubator.vector to be treated as preview 
participant. Code re-organization related to Reverse/ReverseByte IR transforms.
 - 8284960: Adding --enable-preview in vectorAPI benchmarks.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Review comments resolution.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - ... and 10 more: https://git.openjdk.java.net/jdk/compare/742644e2...0f6e1584

-

Changes: https://git.openjdk.java.net/jdk/pull/8425/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=09
  Stats: 38021 lines in 228 files changed: 16652 ins; 16924 del; 4445 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v9]

2022-05-24 Thread Jatin Bhateja
Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8284960: Review comments resolved.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8425/files
  - new: https://git.openjdk.java.net/jdk/pull/8425/files/17a0e38c..a2c9673d

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=08
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=07-08

  Stats: 110 lines in 7 files changed: 42 ins; 31 del; 37 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3]

2022-05-23 Thread Jatin Bhateja
On Thu, 12 May 2022 23:56:49 GMT, Vladimir Ivanov  wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 11 commits:
>> 
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - 8284960: Correcting a typo.
>>  - 8284960: Integrating changes from panama-vector (Add @since 19 tags).
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - 8284960: AARCH64 backend changes.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - ... and 1 more: 
>> https://git.openjdk.java.net/jdk/compare/3fa1c404...b021e082
>
> Overall, looks good.
> 
> Some minor questions/suggestions follow.

Hi @iwanowww, your comments have been addressed. Kindly let me know if you 
have other comments on the x86 side changes.

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v7]

2022-05-20 Thread Jatin Bhateja
On Thu, 19 May 2022 21:19:49 GMT, Paul Sandoz  wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 16 commits:
>> 
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - 8284960: Changes to enable jdk.incubator.vector to be treated as preview 
>> participant. Code re-organization related to Reverse/ReverseByte IR 
>> transforms.
>>  - 8284960: Adding --enable-preview in vectorAPI benchmarks.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - 8284960: Review comments resolution.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - 8284960: Correcting a typo.
>>  - 8284960: Integrating changes from panama-vector (Add @since 19 tags).
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - ... and 6 more: 
>> https://git.openjdk.java.net/jdk/compare/9f562ef7...311f3233
>
> src/jdk.compiler/share/classes/com/sun/tools/javac/code/Preview.java line 50:
> 
>> 48: import java.util.Set;
>> 49: 
>> 50: import static com.sun.tools.javac.code.Flags.PREVIEW_API;
> 
> Suggestion:
> 
> 
> Redundant import (sorry i should have checked before i sent you updates to 
> this area)

Merged

> src/jdk.compiler/share/classes/com/sun/tools/javac/code/Preview.java line 132:
> 
>> 130:  * @return true if {@code s} is participating in the preview of 
>> {@code previewSymbol}
>> 131:  */
>> 132: public boolean isPreviewParticipating(Symbol s, Symbol 
>> previewSymbol) {
> 
> Some feedback from a colleague:
> Suggestion:
> 
> /**
>  * Returns true if {@code s} is deemed to participate in the preview of 
> {@code previewSymbol}, and
>  * therefore no warnings or errors will be produced.
>  *
>  * @param s the symbol depending on the preview symbol
>  * @param previewSymbol the preview symbol marked with @Preview
>  * @return true if {@code s} is participating in the preview of {@code 
> previewSymbol}
>  */
> public boolean participatesInPreview(Symbol s, Symbol previewSymbol) {

Merged.

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8]

2022-05-20 Thread Jatin Bhateja
Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8284960: Integrating incremental patches.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8425/files
  - new: https://git.openjdk.java.net/jdk/pull/8425/files/311f3233..17a0e38c

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=07
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=06-07

  Stats: 32 lines in 7 files changed: 0 ins; 26 del; 6 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3]

2022-05-19 Thread Jatin Bhateja
On Thu, 19 May 2022 15:33:49 GMT, Jatin Bhateja  wrote:

>> Do you mean it's important to apply the transformation at the right node 
>> (pick the right node as the root) and it is hard to make a decision during 
>> GVN?
>
> Yes, that's what I meant, but with the recently added 
> Node::Flag_is_predicated_using_blend it could be possible to move this 
> transformation ahead into the idealization routines of the reverse/reverse 
> bytes IR nodes.

Addressed this after internally discussing with Sandhya. Moved the transforms 
from final graph re-shaping back to vector intrinsic routines.

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v7]

2022-05-19 Thread Jatin Bhateja
Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 16 commits:

 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Changes to enable jdk.incubator.vector to be treated as preview 
participant. Code re-organization related to Reverse/ReverseByte IR transforms.
 - 8284960: Adding --enable-preview in vectorAPI benchmarks.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Review comments resolution.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Correcting a typo.
 - 8284960: Integrating changes from panama-vector (Add @since 19 tags).
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - ... and 6 more: https://git.openjdk.java.net/jdk/compare/9f562ef7...311f3233

-

Changes: https://git.openjdk.java.net/jdk/pull/8425/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=8425=06
  Stats: 38049 lines in 228 files changed: 16683 ins; 16923 del; 4443 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3]

2022-05-19 Thread Jatin Bhateja
On Wed, 18 May 2022 23:35:54 GMT, Vladimir Ivanov  wrote:

>> It was an attempt to facilitate inlining of these APIs on targets which 
>> do not intrinsify them. I agree it's not a generic fix: three APIs are 
>> piggybacking on the same entry point, and without knowledge of the opcode it 
>> is inappropriate to make any decision at this place. Lazy intrinsification 
>> gives some of the predications an opportunity to concretize, since 
>> compilation happens under closed-world assumptions.
>
> Still not clear why the code is shaped the way it is.
> 
> `Matcher::match_rule_supported_vector()` already checks that there are 
> relevant matching rules.
> 
> The checks require both `CompressM` and `CompressV` to be present to enable 
> the intrinsic. Is it important?
> 
> Also, it doesn't take `EnableVectorSupport` into account while all other 
> vector intrinsics respect it.

Yes, the code was modified to accommodate your comments. 
https://github.com/openjdk/jdk/pull/8425/files#diff-a9dd7e411772c1ee37b54c5ab868a01fe82af905758350f0ba1c370f422c3fe6R718

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3]

2022-05-19 Thread Jatin Bhateja
On Wed, 18 May 2022 23:28:22 GMT, Vladimir Ivanov  wrote:

>> It's more of a chicken-and-egg problem here. For a masked reverse operation, 
>> the Reverse IR node is followed by a Blend node, so an eager Identity 
>> transform in Reverse::Identity will not work. Deferring this to the blend 
>> may also not work, since it could be a non-masked reverse operation. We 
>> could have handled it as a special case in inline_vector_nary_operation, 
>> but handling such a special case in final graph reshaping looked more 
>> appropriate.
>> 
>> https://github.com/openjdk/panama-vector/pull/182#discussion_r845678080
>
> Do you mean it's important to apply the transformation at the right node 
> (pick the right node as the root) and it is hard to make a decision during 
> GVN?

Yes, that is what I meant.

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v6]

2022-05-17 Thread Jatin Bhateja
> Hi All,
> 
> Patch adds the planned support for new vector operations and APIs targeted 
> for [JEP 426: Vector API (Fourth 
> Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173)
> 
> Following is the brief summary of changes:-
> 
> 1)  Extends the scope of existing lanewise API for following new vector 
> operations.
>-  VectorOperations.BIT_COUNT: counts the number of one-bits
>- VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero 
> bits
>- VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing 
> zero bits
>- VectorOperations.REVERSE: reversing the order of bits
>- VectorOperations.REVERSE_BYTES: reversing the order of bytes
>- compress and expand bits: Semantics are based on Hacker's Delight 
> section 7-4 Compress, or Generalized Extract.
> 
> 2)  Adds following new APIs to perform cross lane vector compress and 
> expansion operations under the influence of a mask.
>- Vector.compress
>- Vector.expand 
>- VectorMask.compress
> 
> 3) Adds predicated and non-predicated versions of following new APIs to load 
> and store the contents of vector from foreign MemorySegments. 
>   - Vector.fromMemorySegment
>   - Vector.intoMemorySegment
> 
> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support 
> for each newly added operation.
> 
> 
>  Patch has been regressed over AARCH64 and X86 targets different AVX levels. 
> 
>  Kindly review and share your feedback.
> 
>  Best Regards,
>  Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8284960: Adding --enable-preview in vectorAPI benchmarks.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8425/files
  - new: https://git.openjdk.java.net/jdk/pull/8425/files/df7eb90e..0b7f84bb

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=8425=05
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=8425=04-05

  Stats: 21 lines in 10 files changed: 7 ins; 4 del; 10 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v5]

2022-05-17 Thread Jatin Bhateja
> Hi All,
> 
> Patch adds the planned support for new vector operations and APIs targeted 
> for [JEP 426: Vector API (Fourth 
> Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173)
> 
> Following is the brief summary of changes:-
> 
> 1)  Extends the scope of existing lanewise API for following new vector 
> operations.
>-  VectorOperations.BIT_COUNT: counts the number of one-bits
>- VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero 
> bits
>- VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing 
> zero bits
>- VectorOperations.REVERSE: reversing the order of bits
>- VectorOperations.REVERSE_BYTES: reversing the order of bytes
>- compress and expand bits: Semantics are based on Hacker's Delight 
> section 7-4 Compress, or Generalized Extract.
> 
> 2)  Adds following new APIs to perform cross lane vector compress and 
> expansion operations under the influence of a mask.
>- Vector.compress
>- Vector.expand 
>- VectorMask.compress
> 
> 3) Adds predicated and non-predicated versions of following new APIs to load 
> and store the contents of vector from foreign MemorySegments. 
>   - Vector.fromMemorySegment
>   - Vector.intoMemorySegment
> 
> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support 
> for each newly added operation.
> 
> 
>  Patch has been regressed over AARCH64 and X86 targets different AVX levels. 
> 
>  Kindly review and share your feedback.
> 
>  Best Regards,
>  Jatin

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 13 commits:

 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Review comments resolution.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Correcting a typo.
 - 8284960: Integrating changes from panama-vector (Add @since 19 tags).
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: AARCH64 backend changes.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - ... and 3 more: https://git.openjdk.java.net/jdk/compare/5e5500cb...df7eb90e

-

Changes: https://git.openjdk.java.net/jdk/pull/8425/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=8425=04
  Stats: 38068 lines in 254 files changed: 16705 ins; 16921 del; 4442 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v4]

2022-05-13 Thread Jatin Bhateja
On Thu, 12 May 2022 22:48:26 GMT, Vladimir Ivanov  wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8284960: Review comments resolution.
>
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 7953:
> 
>> 7951: StubRoutines::x86::_vector_iota_indices = 
>> generate_iota_indices("iota_indices");
>> 7952: 
>> 7953: if (UsePopCountInstruction && VM_Version::supports_avx2() && 
>> !VM_Version::supports_avx512_vpopcntdq()) {
> 
> Why is the LUT unconditionally generated? `UsePopCountInstruction` still 
> guides the usages.

The LUT should be generated only if UsePopCountInstruction is false and the 
target does not support the necessary features: AVX512_VPOPCNTDQ (for int/long 
vectors) and AVX512_BITALG (for sub-word vectors). Please refer to the following 
discussion, where it was suggested to restrict the scope of the flag to the 
scalar popcount operation only. 
https://github.com/openjdk/panama-vector/pull/185#discussion_r847758463
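
For readers unfamiliar with the LUT approach, here is a minimal scalar sketch 
of a nibble lookup-table popcount, the general technique a LUT-based vector 
popcount relies on. The class and method names are hypothetical, and the sketch 
assumes a 16-entry table indexed by each 4-bit half of a byte; it is not the 
stub code from the patch.

```java
final class NibblePopCount {
    // Bit counts for the 16 possible nibble values 0x0..0xF.
    private static final byte[] LUT = {
        0, 1, 1, 2, 1, 2, 2, 3,
        1, 2, 2, 3, 2, 3, 3, 4
    };

    // Population count of one byte: look up the low and high nibble and add.
    static int popCountByte(byte b) {
        int v = b & 0xFF;
        return LUT[v & 0x0F] + LUT[v >>> 4];
    }

    // Population count of an int, summed over its four bytes.
    static int popCountInt(int x) {
        int total = 0;
        for (int shift = 0; shift < Integer.SIZE; shift += 8) {
            total += popCountByte((byte) (x >>> shift));
        }
        return total;
    }
}
```

A vector implementation would perform the two table lookups for all lanes at 
once with a byte shuffle; the scalar loop above only shows the arithmetic.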

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3]

2022-05-13 Thread Jatin Bhateja
On Thu, 12 May 2022 22:40:50 GMT, Vladimir Ivanov  wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 11 commits:
>> 
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - 8284960: Correcting a typo.
>>  - 8284960: Integrating changes from panama-vector (Add @since 19 tags).
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - 8284960: AARCH64 backend changes.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>>  - ... and 1 more: 
>> https://git.openjdk.java.net/jdk/compare/3fa1c404...b021e082
>
> src/hotspot/cpu/x86/matcher_x86.hpp line 195:
> 
>> 193:   case Op_PopCountVI:
>> 194: return ((ety == T_INT && 
>> VM_Version::supports_avx512_vpopcntdq()) ||
>> 195:(is_subword_type(ety) && 
>> VM_Version::supports_avx512_bitalg())) ? 0 : 50;
> 
> Should be easier to read when the condition is split. E.g.:
> 
> if (is_subword_type(ety)) {
>   return VM_Version::supports_avx512_bitalg() ? 0 : 50;
> } else {
>   assert(ety == T_INT, "sanity"); // for documentation purposes
>   return VM_Version::supports_avx512_vpopcntdq() ? 0 : 50;
> }

DONE

> src/hotspot/cpu/x86/vm_version_x86.hpp line 375:
> 
>> 373: decl(RDTSCP,"rdtscp",48) /* RDTSCP 
>> instruction */ \
>> 374: decl(RDPID, "rdpid", 49) /* RDPID 
>> instruction */ \
>> 375: decl(FSRM,  "fsrm",  50) /* Fast Short REP 
>> MOV */ \
> 
> `test/lib-test/jdk/test/whitebox/CPUInfoTest.java` should be adjusted as 
> well, shouldn't it?

Yes, test updated appropriately.

> src/hotspot/share/classfile/vmIntrinsics.hpp line 1152:
> 
>> 1150:   
>> "Ljdk/internal/vm/vector/VectorSupport$ComExpOperation;)"
>> \
>> 1151:   
>> "Ljdk/internal/vm/vector/VectorSupport$VectorPayload;")  
>> \
>> 1152:do_name(vector_comexp_op_name, "comExpOp")  
>> \
> 
> I don't see much value in trying to shorten the name by abbreviating it. I 
> find it easier to read in an expanded form:
> ` compressExpandOp`, `vector_compress_expand_op_name`, 
> `_VectorCompressExpand`, etc.

DONE

> src/hotspot/share/opto/c2compiler.cpp line 521:
> 
>> 519: if (!Matcher::match_rule_supported(Op_SignumF)) return false;
>> 520: break;
>> 521:   case vmIntrinsics::_VectorComExp:
> 
> Why `_VectorComExp` intrinsic is special? Other vector intrinsics are handled 
> later and in a different manner.
> 
> What about `ExpandV` case?

It was an attempt to facilitate inlining of these APIs on targets which do 
not intrinsify them. I agree it's not a generic fix: three APIs are 
piggybacking on the same entry point, and without knowledge of the opcode it is 
inappropriate to make any decision at this place. Lazy intrinsification gives 
some of the predications an opportunity to concretize, since compilation happens 
under closed-world assumptions.

> src/hotspot/share/opto/compile.cpp line 3416:
> 
>> 3414: 
>> 3415:   case Op_ReverseBytesV:
>> 3416:   case Op_ReverseV: {
> 
> Can you elaborate, please, why it is performed so late in the optimization 
> phase (at the very end during graph reshaping) and not during GVN?

It's more of a chicken-and-egg problem here. For a masked reverse operation, the 
Reverse IR node is followed by a Blend node, so an eager Identity transform in 
Reverse::Identity will not work. Deferring this to the blend may also not work, 
since it could be a non-masked reverse operation. We could have handled it as a 
special case in inline_vector_nary_operation, but handling such a special case in 
final graph reshaping looked more appropriate.

https://github.com/openjdk/panama-vector/pull/182#discussion_r845678080
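
As a scalar illustration of the property such a transform can exploit: bit 
reversal is its own inverse, and the same holds for a masked reverse when both 
applications use the same mask. The sketch below uses hypothetical helper names 
and plain int arrays in place of vectors, and it assumes the masked form behaves 
like a reverse followed by a blend, as described above. It does not model the 
C2 IR itself.

```java
final class ReverseIdentity {
    // Unmasked lanewise reverse: every lane has its bits reversed.
    static int[] reverse(int[] v) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = Integer.reverse(v[i]);
        }
        return r;
    }

    // Masked reverse modelled as reverse + blend: set lanes take the
    // reversed value, unset lanes keep the original value.
    static int[] reverseMasked(int[] v, boolean[] m) {
        int[] rev = reverse(v);
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = m[i] ? rev[i] : v[i];
        }
        return r;
    }
}
```

Applying either helper twice returns the original lanes, which is why the 
pattern can be folded away once the compiler knows which of the two forms it is 
looking at.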

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v4]

2022-05-13 Thread Jatin Bhateja
> Hi All,
> 
> Patch adds the planned support for new vector operations and APIs targeted 
> for [JEP 426: Vector API (Fourth 
> Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173)
> 
> Following is the brief summary of changes:-
> 
> 1)  Extends the scope of existing lanewise API for following new vector 
> operations.
>-  VectorOperations.BIT_COUNT: counts the number of one-bits
>- VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero 
> bits
>- VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing 
> zero bits
>- VectorOperations.REVERSE: reversing the order of bits
>- VectorOperations.REVERSE_BYTES: reversing the order of bytes
>- compress and expand bits: Semantics are based on Hacker's Delight 
> section 7-4 Compress, or Generalized Extract.
> 
> 2)  Adds following new APIs to perform cross lane vector compress and 
> expansion operations under the influence of a mask.
>- Vector.compress
>- Vector.expand 
>- VectorMask.compress
> 
> 3) Adds predicated and non-predicated versions of following new APIs to load 
> and store the contents of vector from foreign MemorySegments. 
>   - Vector.fromMemorySegment
>   - Vector.intoMemorySegment
> 
> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support 
> for each newly added operation.
> 
> 
>  Patch has been regressed over AARCH64 and X86 targets different AVX levels. 
> 
>  Kindly review and share your feedback.
> 
>  Best Regards,
>  Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8284960: Review comments resolution.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8425/files
  - new: https://git.openjdk.java.net/jdk/pull/8425/files/b021e082..adf205f9

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=8425=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=8425=02-03

  Stats: 121 lines in 49 files changed: 8 ins; 5 del; 108 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3]

2022-05-10 Thread Jatin Bhateja
> Hi All,
> 
> Patch adds the planned support for new vector operations and APIs targeted 
> for [JEP 426: Vector API (Fourth 
> Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173)
> 
> Following is the brief summary of changes:-
> 
> 1)  Extends the scope of existing lanewise API for following new vector 
> operations.
>-  VectorOperations.BIT_COUNT: counts the number of one-bits
>- VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero 
> bits
>- VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing 
> zero bits
>- VectorOperations.REVERSE: reversing the order of bits
>- VectorOperations.REVERSE_BYTES: reversing the order of bytes
>- compress and expand bits: Semantics are based on Hacker's Delight 
> section 7-4 Compress, or Generalized Extract.
> 
> 2)  Adds following new APIs to perform cross lane vector compress and 
> expansion operations under the influence of a mask.
>- Vector.compress
>- Vector.expand 
>- VectorMask.compress
> 
> 3) Adds predicated and non-predicated versions of following new APIs to load 
> and store the contents of vector from foreign MemorySegments. 
>   - Vector.fromMemorySegment
>   - Vector.intoMemorySegment
> 
> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support 
> for each newly added operation.
> 
> 
>  Patch has been regressed over AARCH64 and X86 targets different AVX levels. 
> 
>  Kindly review and share your feedback.
> 
>  Best Regards,
>  Jatin

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 11 commits:

 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Correcting a typo.
 - 8284960: Integrating changes from panama-vector (Add @since 19 tags).
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: AARCH64 backend changes.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - ... and 1 more: https://git.openjdk.java.net/jdk/compare/3fa1c404...b021e082

-

Changes: https://git.openjdk.java.net/jdk/pull/8425/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=8425=02
  Stats: 37901 lines in 214 files changed: 16527 ins; 16924 del; 4450 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v2]

2022-05-05 Thread Jatin Bhateja
On Thu, 5 May 2022 05:47:47 GMT, Jatin Bhateja  wrote:

>> Hi All,
>> 
>> Patch adds the planned support for new vector operations and APIs targeted 
>> for [JEP 426: Vector API (Fourth 
>> Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173)
>> 
>> Following is the brief summary of changes:-
>> 
>> 1)  Extends the scope of existing lanewise API for following new vector 
>> operations.
>>-  VectorOperations.BIT_COUNT: counts the number of one-bits
>>- VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero 
>> bits
>>- VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing 
>> zero bits
>>- VectorOperations.REVERSE: reversing the order of bits
>>- VectorOperations.REVERSE_BYTES: reversing the order of bytes
>>- compress and expand bits: Semantics are based on Hacker's Delight 
>> section 7-4 Compress, or Generalized Extract.
>> 
>> 2)  Adds following new APIs to perform cross lane vector compress and 
>> expansion operations under the influence of a mask.
>>- Vector.compress
>>- Vector.expand 
>>- VectorMask.compress
>> 
>> 3) Adds predicated and non-predicated versions of following new APIs to load 
>> and store the contents of vector from foreign MemorySegments. 
>>   - Vector.fromMemorySegment
>>   - Vector.intoMemorySegment
>> 
>> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support 
>> for each newly added operation.
>> 
>> 
>>  Patch has been regressed over AARCH64 and X86 targets different AVX levels. 
>> 
>>  Kindly review and share your feedback.
>> 
>>  Best Regards,
>>  Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 10 commits:
> 
>  - 8284960: Correcting a typo.
>  - 8284960: Integrating changes from panama-vector (Add @since 19 tags).
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>  - 8284960: AARCH64 backend changes.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
>  - 8284960: Integration of JEP 426: Vector API (Fourth Incubator)

Hi @vnkozlov, it would be helpful if you could kindly review the changes.

-

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v2]

2022-05-04 Thread Jatin Bhateja
> Hi All,
> 
> Patch adds the planned support for new vector operations and APIs targeted 
> for [JEP 426: Vector API (Fourth 
> Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173)
> 
> Following is the brief summary of changes:-
> 
> 1)  Extends the scope of existing lanewise API for following new vector 
> operations.
>-  VectorOperations.BIT_COUNT: counts the number of one-bits
>- VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero 
> bits
>- VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing 
> zero bits
>- VectorOperations.REVERSE: reversing the order of bits
>- VectorOperations.REVERSE_BYTES: reversing the order of bytes
>- compress and expand bits: Semantics are based on Hacker's Delight 
> section 7-4 Compress, or Generalized Extract.
> 
> 2)  Adds following new APIs to perform cross lane vector compress and 
> expansion operations under the influence of a mask.
>- Vector.compress
>- Vector.expand 
>- VectorMask.compress
> 
> 3) Adds predicated and non-predicated versions of following new APIs to load 
> and store the contents of vector from foreign MemorySegments. 
>   - Vector.fromMemorySegment
>   - Vector.intoMemorySegment
> 
> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support 
> for each newly added operation.
> 
> 
>  Patch has been regressed over AARCH64 and X86 targets different AVX levels. 
> 
>  Kindly review and share your feedback.
> 
>  Best Regards,
>  Jatin

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 10 commits:

 - 8284960: Correcting a typo.
 - 8284960: Integrating changes from panama-vector (Add @since 19 tags).
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: AARCH64 backend changes.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Integration of JEP 426: Vector API (Fourth Incubator)

-

Changes: https://git.openjdk.java.net/jdk/pull/8425/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=8425=01
  Stats: 37900 lines in 214 files changed: 16527 ins; 16923 del; 4450 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]

2022-05-04 Thread Jatin Bhateja
On Thu, 5 May 2022 03:17:35 GMT, Xiaohong Gong  wrote:

>> src/hotspot/share/opto/vectorIntrinsics.cpp line 1363:
>> 
>>> 1361:   // Use the vector blend to implement the masked store. The 
>>> biased elements are the original
>>> 1362:   // values in the memory.
>>> 1363:   Node* mem_val = gvn().transform(LoadVectorNode::make(0, 
>>> control(), memory(addr), addr, addr_type, mem_num_elem, mem_elem_bt));
>> 
>> I'm sorry to say it, but I am pretty sure this is an invalid optimization.
>> See top-level comment for more details.
>
> Thanks for your comments! Yeah, this actually influences something due to the 
> Java Memory Model rules which I missed to consider more. I will try the 
> scatter ways instead. Thanks so much!

Yes, a phantom store can write back a stale, unintended value and may create 
problems in multithreaded applications, since blending is done with an older 
loaded value.
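
The hazard can be shown with a deterministic, single-threaded sketch that 
replays one problematic interleaving. The class and method names are 
hypothetical, and the "other thread" is modelled simply as a write that happens 
between the load and the store.

```java
final class PhantomStoreDemo {
    // Emulates a masked store as load + blend + full store, with another
    // writer updating an unmasked lane between the load and the store.
    static int[] blendStoreWithRace(int[] mem, int[] src, boolean[] m,
                                    int racedLane, int racedValue) {
        int[] loaded = mem.clone();             // 1) full vector load
        mem[racedLane] = racedValue;            // concurrent write by another thread
        for (int i = 0; i < mem.length; i++) {  // 2) blend and 3) full vector store
            mem[i] = m[i] ? src[i] : loaded[i];
        }
        return mem;
    }

    // A true masked store writes only the lanes selected by the mask.
    static int[] maskedStoreWithRace(int[] mem, int[] src, boolean[] m,
                                     int racedLane, int racedValue) {
        mem[racedLane] = racedValue;            // same concurrent write
        for (int i = 0; i < mem.length; i++) {
            if (m[i]) {
                mem[i] = src[i];
            }
        }
        return mem;
    }
}
```

With the blend-based version the other writer's update to the unmasked lane is 
overwritten by the stale loaded value; with the true masked store it survives.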

-

PR: https://git.openjdk.java.net/jdk/pull/8544


RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator)

2022-04-29 Thread Jatin Bhateja
Hi All,

Patch adds the planned support for new vector operations and APIs targeted for 
[JEP 426: Vector API (Fourth 
Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173)

Following is the brief summary of changes:-

1)  Extends the scope of existing lanewise API for following new vector 
operations.
   -  VectorOperations.BIT_COUNT: counts the number of one-bits
   - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero 
bits
   - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero 
bits
   - VectorOperations.REVERSE: reversing the order of bits
   - VectorOperations.REVERSE_BYTES: reversing the order of bytes
   - compress and expand bits: Semantics are based on Hacker's Delight section 
7-4 Compress, or Generalized Extract.

2)  Adds following new APIs to perform cross lane vector compress and expansion 
operations under the influence of a mask.
   - Vector.compress
   - Vector.expand 
   - VectorMask.compress

3) Adds predicated and non-predicated versions of following new APIs to load 
and store the contents of vector from foreign MemorySegments. 
  - Vector.fromMemorySegment
  - Vector.intoMemorySegment

4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support 
for each newly added operation.
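
To make the compress/expand semantics concrete, here is a minimal scalar sketch. 
All names are hypothetical, plain arrays stand in for vectors and masks, and the 
loops are reference models only, not the implementation in the patch.

```java
final class CompressExpandSketch {
    // Bit compress: gather the bits of x selected by mask into the
    // low-order bits of the result, preserving their relative order.
    static int compressBits(int x, int mask) {
        int result = 0;
        int k = 0; // next free bit position in the result
        for (int i = 0; i < Integer.SIZE; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((x >>> i) & 1) << k;
                k++;
            }
        }
        return result;
    }

    // Bit expand: scatter the low-order bits of x to the bit positions
    // selected by mask; all other bits of the result are zero.
    static int expandBits(int x, int mask) {
        int result = 0;
        int k = 0; // next source bit to consume
        for (int i = 0; i < Integer.SIZE; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((x >>> k) & 1) << i;
                k++;
            }
        }
        return result;
    }

    // Cross-lane compress: lanes selected by the mask are packed into the
    // lowest lanes; the remaining lanes are zero.
    static int[] compressLanes(int[] v, boolean[] m) {
        int[] r = new int[v.length];
        int k = 0;
        for (int i = 0; i < v.length; i++) {
            if (m[i]) {
                r[k++] = v[i];
            }
        }
        return r;
    }

    // Cross-lane expand: the lowest lanes are distributed, in order, to the
    // lanes selected by the mask; the remaining lanes are zero.
    static int[] expandLanes(int[] v, boolean[] m) {
        int[] r = new int[v.length];
        int k = 0;
        for (int i = 0; i < v.length; i++) {
            if (m[i]) {
                r[i] = v[k++];
            }
        }
        return r;
    }
}
```

For example, compressing lanes {1, 2, 3, 4} under mask {F, T, F, T} yields 
{2, 4, 0, 0}, and expanding that result under the same mask yields {0, 2, 0, 4}.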


 The patch has been regression-tested on AARCH64 and on X86 targets at different AVX levels. 

 Kindly review and share your feedback.

 Best Regards,
 Jatin

-

Commit messages:
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: AARCH64 backend changes.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Integration of JEP 426: Vector API (Fourth Incubator)

Changes: https://git.openjdk.java.net/jdk/pull/8425/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=8425=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8284960
  Stats: 37837 lines in 214 files changed: 16462 ins; 16923 del; 4452 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-28 Thread Jatin Bhateja
On Wed, 20 Apr 2022 02:44:39 GMT, Xiaohong Gong  wrote:

>>> The blend should be with the intended-to-store vector, so that masked lanes 
>>> contain the need-to-store elements and unmasked lanes contain the loaded 
>>> elements, which would be stored back, which results in unchanged values.
>> 
>> It may not work if the memory is beyond the legally accessible address 
>> space of the process; a corner case could be a page boundary. Thus 
>> recomposing an intermediate vector that only partially contains actual 
>> updates, but effectively performing a full vector write to the destination 
>> address, may not work in all scenarios.
>
> Thanks for the comment! So how about adding the check for the valid array 
> range like the masked vector load?
> Codes like:
> 
> public final
> void intoArray(byte[] a, int offset,
>VectorMask m) {
> if (m.allTrue()) {
> intoArray(a, offset);
> } else {
> ByteSpecies vsp = vspecies();
> if (offset >= 0 && offset <= (a.length - vsp.length())) { // 
> a full range check
> intoArray0(a, offset, m, /* usePred */ false);
>// can be vectorized by load+blend_store
> } else {
> checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
> intoArray0(a, offset, m, /* usePred */ true); 
>// only be vectorized by the predicated store
> }
> }
> }

Thanks, this looks OK, since the out-of-range case will not be intrinsified if 
the target does not support a predicated vector store.
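
The bounds rule being discussed can be stated as a scalar reference model: only 
lanes whose mask bit is set are written, so only those lanes need to be inside 
the array. The sketch below uses hypothetical names and boolean/byte arrays in 
place of VectorMask and ByteVector; it models the semantics only, not the 
intrinsic.

```java
final class MaskedStoreSketch {
    // Scalar model of a masked store into a byte array.
    static void intoArray(byte[] a, int offset, byte[] lanes, boolean[] m) {
        // Validate first, so nothing is written if any set lane is out of range.
        for (int i = 0; i < lanes.length; i++) {
            int idx = offset + i;
            if (m[i] && (idx < 0 || idx >= a.length)) {
                throw new IndexOutOfBoundsException(
                    "lane " + i + " maps to index " + idx);
            }
        }
        // Write only the lanes selected by the mask.
        for (int i = 0; i < lanes.length; i++) {
            if (m[i]) {
                a[offset + i] = lanes[i];
            }
        }
    }
}
```

A store of four lanes at offset 4 into a six-element array succeeds when only 
the first two lanes are set, and throws when the third lane is set as well.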

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements

2022-04-26 Thread Jatin Bhateja
On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu  wrote:

>> According to the Vector API doc, the LSHR operator computes 
>> a>>>(n&(ESIZE*8-1))

The documentation is correct if viewed strictly in the context of a subword 
vector lane. The JVM internally promotes (sign-extends) subword-type scalar 
variables to int, but vectors are loaded from contiguous memory holding 
subwords, so it is not correct for a developer to assume that individual 
subword lanes are upcast to int lanes before being operated upon.

Thus both the Java implementation and the compiler handling look correct.
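
The difference between the two views can be seen in a small scalar sketch. The 
class and method names are hypothetical; `laneLshr` models the documented lane 
formula for a byte lane (ESIZE = 1), and `scalarLshr` is the ordinary Java 
expression on a byte variable.

```java
final class SubwordLshr {
    // Lane view: a >>> (n & (ESIZE*8-1)) applied to the 8-bit lane itself,
    // so zero bits are shifted in at bit 7 of the byte.
    static byte laneLshr(byte a, int n) {
        return (byte) ((a & 0xFF) >>> (n & 7));
    }

    // Scalar Java view: the byte is sign-extended to int before the shift,
    // so the narrowed result still carries copies of the sign bit.
    static byte scalarLshr(byte a, int n) {
        return (byte) (a >>> n);
    }
}
```

For a = -1 and n = 1 the lane view gives 127, while the scalar expression gives 
-1 after narrowing back to byte.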

-

PR: https://git.openjdk.java.net/jdk/pull/8276


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-04-11 Thread Jatin Bhateja
On Thu, 31 Mar 2022 03:53:15 GMT, Xiaohong Gong  wrote:

>> Yeah, maybe I misunderstood what you mean. So maybe the masked store 
>> `(store(src, m))` could be implemented with:
>> 
>> 1) v1 = load
>> 2) v2 = blend(load, src, m)
>> 3) store(v2)
>> 
>> Let's record this a JBS and fix it with a followed-up patch. Thanks!
>
> The optimization for masked store is recorded to: 
> https://bugs.openjdk.java.net/browse/JDK-8284050

> The blend should be with the intended-to-store vector, so that masked lanes 
> contain the need-to-store elements and unmasked lanes contain the loaded 
> elements, which would be stored back, which results in unchanged values.

It may not work if the memory is beyond the legally accessible address space of 
the process; a corner case could be a page boundary. Thus recomposing an 
intermediate vector that only partially contains actual updates, but effectively 
performing a full vector write to the destination address, may not work in all 
scenarios.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]

2022-04-06 Thread Jatin Bhateja
On Mon, 4 Apr 2022 07:24:12 GMT, Vamsi Parasa  wrote:

>> Also need a jtreg test for this.
>
>> Also need a jtreg test for this.
> 
> Thanks Sandhya for the review. Made the suggested changes and added jtreg 
> tests as well.

Hi @vamsi-parasa, thanks for addressing my comments. It looks good to me, 
apart from the outstanding comments.

-

PR: https://git.openjdk.java.net/jdk/pull/7572


Re: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v9]

2022-04-06 Thread Jatin Bhateja
On Wed, 6 Apr 2022 06:02:07 GMT, Vamsi Parasa  wrote:

>> Optimizes the divideUnsigned() and remainderUnsigned() methods in 
>> java.lang.Integer and java.lang.Long classes using x86 intrinsics. This 
>> change shows 3x improvement for Integer methods and upto 25% improvement for 
>> Long. This change also implements the DivMod optimization which fuses 
>> division and modulus operations if needed. The DivMod optimization shows 3x 
>> improvement for Integer and ~65% improvement for Long.
>
> Vamsi Parasa has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 13 commits:
> 
>  - Merge branch 'openjdk:master' into udivmod
>  - add error msg for jtreg test
>  - update jtreg test to run on x86_64
>  - add bmi1 support check and jtreg tests
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into udivmod
>  - fix 32bit build issues
>  - Fix line at end of file
>  - Move intrinsic code to macro assembly routines; remove unused 
> transformations for div and mod nodes
>  - fix trailing white space errors
>  - fix whitespaces
>  - ... and 3 more: 
> https://git.openjdk.java.net/jdk/compare/741be461...acba7c19

Marked as reviewed by jbhateja (Committer).

-

PR: https://git.openjdk.java.net/jdk/pull/7572


Re: RFR: 8283726: x86 intrinsics for compare method in Integer and Long

2022-03-28 Thread Jatin Bhateja
On Sun, 27 Mar 2022 06:15:34 GMT, Vamsi Parasa  wrote:

> Implements x86 intrinsics for compare() method in java.lang.Integer and 
> java.lang.Long.

src/hotspot/cpu/x86/x86_64.ad line 12107:

> 12105: instruct compareSignedI_rReg(rRegI dst, rRegI op1, rRegI op2, rRegI 
> tmp, rFlagsReg cr)
> 12106: %{
> 12107:   match(Set dst (CompareSignedI op1 op2));

Please also include these patterns in x86_32.ad

src/hotspot/cpu/x86/x86_64.ad line 12125:

> 12123:  __ movl(tmp, 0);
> 12124: __ bind(done);
> 12125: __ movl(dest, tmp);

Please move this into a macro-assembly routine.

src/hotspot/cpu/x86/x86_64.ad line 12178:

> 12176:  __ movl(tmp, 0);
> 12177: __ bind(done);
> 12178: __ movl(dest, tmp);

Please move this into a macro-assembly routine.

src/hotspot/cpu/x86/x86_64.ad line 12204:

> 12202:  __ movl(tmp, 0);
> 12203: __ bind(done);
> 12204: __ movl(dest, tmp);

Please move this into a macro-assembly routine.

src/hotspot/share/classfile/vmIntrinsics.hpp line 239:

> 237:   do_intrinsic(_compareUnsigned_i,java_lang_Integer,  
> compare_unsigned_name,   int2_int_signature,   F_S)   \
> 238:do_name( compare_unsigned_name, 
> "compareUnsigned")   \
> 239:   do_intrinsic(_compareUnsigned_l,java_lang_Long, 
> compare_unsigned_name,   long2_int_signature,   F_S)  \

Making these methods intrinsics puts a box around the underlying comparison 
logic, which prevents the regular constant folding that could otherwise have 
optimized out certain control paths. I would suggest handling constant folding 
for the newly added nodes in their associated Value routines.
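
As a Java-level illustration only (this is not C2 code, and the class and method names below are hypothetical), the unsigned comparison reduces to a signed comparison of sign-flipped inputs. When both inputs are compile-time constants the whole call has a known result, and that is the case a Value routine for the new nodes would need to fold:

```java
final class CompareUnsignedSketch {
    // Reference formulation: flipping the sign bit maps unsigned order
    // onto signed order, so a plain signed compare gives the answer.
    static int compareUnsignedReference(int x, int y) {
        return Integer.compare(x + Integer.MIN_VALUE, y + Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        // -1 is 0xFFFFFFFF, the largest unsigned 32-bit value.
        System.out.println(Integer.signum(Integer.compareUnsigned(-1, 1)));
        System.out.println(Integer.signum(compareUnsignedReference(-1, 1)));
        // With two constants the result is known up front and a branch
        // such as "if (Integer.compareUnsigned(3, 5) < 0)" can be removed.
        System.out.println(Integer.compareUnsigned(3, 5) < 0);
    }
}
```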

src/hotspot/share/opto/comparenode.hpp line 67:

> 65: CompareUnsignedLNode(Node* in1, Node* in2) : CompareNode(in1, in2) {}
> 66: virtual int Opcode() const;
> 67: };

The intent here seems to be to enable further auto-vectorization of the newly 
created IR nodes.

test/micro/org/openjdk/bench/java/lang/CompareInteger.java line 78:

> 76: input2[i] = tmp;
> 77: }
> 78: }

Logic re-organization suggestion:

    for (int i = 0; i < BUFFER_SIZE; i++) {
        input1[i] = rng.nextLong();
    }

    if (mode.equals("equals")) {
        GRADIANT = 0;
    } else if (mode.equals("greaterThanEquals")) {
        GRADIANT = 1;
    } else {
        assert mode.equals("lessThanEqual");
        GRADIANT = -1;
    }

    for (int i = 0; i < BUFFER_SIZE; i++) {
        input2[i] = input1[i] + i * GRADIANT;
    }

test/micro/org/openjdk/bench/java/lang/CompareLong.java line 5:

> 3:  * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
> 4:  *
> 5:  * This code is free software; you can redistribute it and/or modify it

We can unify this benchmark with the integer compare micro.

-

PR: https://git.openjdk.java.net/jdk/pull/7975


Re: RFR: 8279508: Auto-vectorize Math.round API [v18]

2022-03-24 Thread Jatin Bhateja
On Wed, 23 Mar 2022 06:55:50 GMT, Tobias Hartmann  wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 22 commits:
>> 
>>  - 8279508: Using an explicit scratch register since rscratch1 is bound to 
>> r10 and its usage is transparent to compiler.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
>>  - 8279508: Windows build failure fix.
>>  - 8279508: Styling comments resolved.
>>  - 8279508: Creating separate test for round double under feature check.
>>  - 8279508: Reducing the invocation count and compile thresholds for 
>> RoundTests.java.
>>  - 8279508: Review comments resolution.
>>  - 8279508: Preventing domain switch-over penalty for Math.round(float) and 
>> constraining unrolling to prevent code bloating.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
>>  - 8279508: Removing +LogCompilation flag.
>>  - ... and 12 more: 
>> https://git.openjdk.java.net/jdk/compare/ff0b0927...c17440cf
>
> All tests passed.

Hi @TobiHartmann, thanks for confirming.
Hi @jddarcy, @theRealAph, kindly let me know if it's good to integrate this.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v15]

2022-03-21 Thread Jatin Bhateja
On Tue, 22 Mar 2022 01:55:38 GMT, Quan Anh Mai  wrote:

>> A read from constant table will incur minimum of L1I access penalty to 
>> access code blob or at worst even more if data is not present in first level 
>> cache. Change was done for replace vpbroadcastd with vbroadcastss because of 
>> two reasons.
>> 1) vbroadcastss works at AVX=1 level where as vpbroadcastd need AVX2 
>> feature. 
>> 2) We can avoid extra cycle penalty due to two domain switchovers (FP -> INT 
>> and then from INT-> FP).
>
>> A read from constant table will incur minimum of L1I access penalty to 
>> access code blob or at worst even more if data is not present in first level 
>> cache
> 
> But your approach comes at a cost of frontend bandwidth and port contention, 
> which imo are more important than latency in this case since a constant load 
> does not prolong dependency chains. A load has very good throughput so it is 
> often performant unless the load depends on its input (the memory location or 
> the registers used for address calculation). Thanks

Thanks for going into the details. A multi-cycle memory load will also defer 
dispatch of dependent instructions to the execution ports. Port congestion 
becomes a bottleneck when multiple ready instructions cannot be issued due to a 
lack of execution resources or throughput constraints imposed by the 
instruction, but a single-cycle dependency chain may still win over the latency 
of pending memory operations.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v15]

2022-03-21 Thread Jatin Bhateja
On Mon, 21 Mar 2022 17:56:22 GMT, Quan Anh Mai  wrote:

>> constant and register to register moves are never issued to execution ports, 
>>  rematerializing value rather than reading from memory will give better 
>> performance.
>
> I have come across this a little bit. While `movl r, i` may not consume 
> execution ports, `movq x, r` and `vbroadcastss x, x` surely do. This leads to 
> 3 retired and 2 executed uops. Furthermore, both `movq x, r` and 
> `vbroadcastss x, x` can only run on port 5, limit the throughput of the 
> operation. On the contrary, a `vbroadcastss x, m` only results in 1 retired 
> and 1 executed uop, reducing pressure on the decoder and the backend. A 
> `vbroadcastss x, m` can run on both port 2 and port 3, offering a much better 
> throughput. Latency is not much of a concern in this circumstance since the 
> operation does not have any input dependency.
> 
>> register to register moves are never issued to execution ports
> 
> I believe you misremembered this part, a register to register move is only 
> elided when the registers are of the same kind, `vmovq x, r` would result in 
> 1 uop being executed on port 5.
> 
> What do you think? Thank you very much.

A read from the constant table will incur at minimum an L1I access penalty to 
access the code blob, or worse if the data is not present in the first-level 
cache. The change replaced vpbroadcastd with vbroadcastss for two reasons:
1) vbroadcastss works at the AVX=1 level, whereas vpbroadcastd needs the AVX2 
feature.
2) We avoid the extra cycle penalty of two domain switchovers (FP -> INT 
and then INT -> FP).

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v17]

2022-03-18 Thread Jatin Bhateja
On Mon, 14 Mar 2022 10:35:58 GMT, Tobias Hartmann  wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8279508: Windows build failure fix.
>
> `compiler/c2/cr6340864/TestFloatVect.java` and `TestDoubleVect.java` fail on 
> Windows:
> 
> 
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  EXCEPTION_ACCESS_VIOLATION (0xc005) at pc=0x01971b940123, 
> pid=56524, tid=57368
> #
> # JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 
> 19-internal-2022-03-14-0834080.tobias.hartmann.jdk2)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 
> 19-internal-2022-03-14-0834080.tobias.hartmann.jdk2, mixed mode, sharing, 
> tiered, compressed oops, compressed class ptrs, g1 gc, windows-amd64)
> # Problematic frame:
> # J 205 c2 compiler.c2.cr6340864.TestFloatVect.test_round([I[F)V (24 bytes) @ 
> 0x01971b940123 [0x01971b93ffe0+0x0143]

Hi @TobiHartmann, can you kindly run the latest changes through your test 
infrastructure?
Hi @theRealAph, your suggestions have been incorporated.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v18]

2022-03-18 Thread Jatin Bhateja
> Summary of changes:
> - Intrinsify Math.round(float) and Math.round(double) APIs.
> - Extend auto-vectorizer to infer vector operations on encountering scalar IR 
> nodes for above intrinsics.
> - Test creation using new IR testing framework.
> 
> Following are the performance number of a JMH micro included with the patch 
> 
> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
> 
> 
> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
> -- | -- | -- | -- | -- | -- | -- | --
> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 510.36 | 548.39 | 1.07
> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 293.48 | 274.01 | 0.93
> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 751.83 | 2274.13 | 3.02
> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 388.52 | 1334.18 | 3.43
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 22 commits:

 - 8279508: Using an explicit scratch register since rscratch1 is bound to r10 
and its usage is transparent to compiler.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
 - 8279508: Windows build failure fix.
 - 8279508: Styling comments resolved.
 - 8279508: Creating separate test for round double under feature check.
 - 8279508: Reducing the invocation count and compile thresholds for 
RoundTests.java.
 - 8279508: Review comments resolution.
 - 8279508: Preventing domain switch-over penalty for Math.round(float) and 
constraining unrolling to prevent code bloating.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
 - 8279508: Removing +LogCompilation flag.
 - ... and 12 more: https://git.openjdk.java.net/jdk/compare/ff0b0927...c17440cf

-

Changes: https://git.openjdk.java.net/jdk/pull/7094/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=17
  Stats: 800 lines in 25 files changed: 707 ins; 30 del; 63 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v15]

2022-03-14 Thread Jatin Bhateja
On Mon, 14 Mar 2022 09:29:28 GMT, Andrew Haley  wrote:

>> Good suggestion, but as of now we are not using vector calling conventions 
>> for stubs.
>
> I don't understand this comment. If the stub is only to be used by you, then 
> you can determine your own calling convention.

We are passing a mixture of scalar, vector and opmask registers to the special 
handling function. The only way to pass them reliably to a callee stub without 
an elaborate mixed calling convention is by binding the machine operands.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v17]

2022-03-12 Thread Jatin Bhateja

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Windows build failure fix.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/c881d11c..b1323a82

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=16
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=15-16

  Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v16]

2022-03-12 Thread Jatin Bhateja

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Styling comments resolved.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/e4d4e29b..c881d11c

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=15
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=14-15

  Stats: 11 lines in 3 files changed: 3 ins; 3 del; 5 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v15]

2022-03-12 Thread Jatin Bhateja
On Sun, 13 Mar 2022 00:06:07 GMT, Quan Anh Mai  wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4161:
>> 
>>> 4159:   movl(scratch, 1056964608);
>>> 4160:   movq(xtmp1, scratch);
>>> 4161:   vbroadcastss(xtmp1, xtmp1, vec_enc);
>> 
>> An `evpbroadcastd` would reduce this by one instruction I guess?
>
> Anyway an `evpbroadcastd xmm, r` has around 5 latency on the gpr so I think 
> you could just put the constant in the constant table and use `vbroadcastsd`

It was done to avoid redundant floating-point-to-integer domain switchover 
penalties.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v15]

2022-03-12 Thread Jatin Bhateja
On Sat, 12 Mar 2022 23:20:58 GMT, Quan Anh Mai  wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8279508: Creating separate test for round double under feature check.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4024:
> 
>> 4022:  * the result is equal to the value of Integer.MAX_VALUE.
>> 4023:  */
>> 4024: void 
>> C2_MacroAssembler::vector_cast_float_special_cases_avx(XMMRegister dst, 
>> XMMRegister src, XMMRegister xtmp1,
> 
> This special handling is really large, could we use a stub routine for it?

Good suggestion, but as of now we are not using vector calling conventions for 
stubs.

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4178:
> 
>> 4176:   movl(scratch, 1056964608);
>> 4177:   movq(xtmp1, scratch);
>> 4178:   vbroadcastss(xtmp1, xtmp1, vec_enc);
> 
> You could put the constant in the constant table and use `vbroadcastss` here 
> also.
> 
> Thank you very much.

Constant moves and register-to-register moves are never issued to execution 
ports, so rematerializing the value rather than reading it from memory will 
give better performance.

> src/hotspot/cpu/x86/x86.ad line 7297:
> 
>> 7295:   ins_encode %{
>> 7296: int vlen_enc = vector_length_encoding(this);
>> 7297: InternalAddress new_mxcsr = $constantaddress(0x3F80L);
> 
> `ldmxcsr` takes a `m32` argument so this constant can be an `int` instead. 
> Also, I would suggest putting the `mxcst_std` in the constant table also.

Correct; if we do so, the emitted constant will occupy 4 bytes.

FTR, we can also improve the alignment padding for constants so that the start 
address of the next emitted constant is aligned to the constant's size. This 
will be beneficial for large vector constants (32/64 bytes), as it avoids the 
cache line split penalty during a vector load.
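
A small sketch of the alignment idea, in Java for illustration only. The helper names are hypothetical, a 64-byte cache line is assumed, and the constant size is assumed to be a power of two; this is not how the HotSpot constant table is actually laid out.

```java
final class ConstantAlignSketch {
    static final int CACHE_LINE = 64;   // assumed cache line size in bytes

    // Round offset up to the next multiple of size (size must be a power of two).
    static int alignUp(int offset, int size) {
        return (offset + size - 1) & -size;
    }

    // True if a constant of the given size placed at offset spans two cache lines.
    static boolean splitsCacheLine(int offset, int size) {
        return (offset / CACHE_LINE) != ((offset + size - 1) / CACHE_LINE);
    }

    public static void main(String[] args) {
        int offset = 48;                // next free byte after earlier constants
        int size = 32;                  // a 32-byte vector constant
        System.out.println(splitsCacheLine(offset, size));
        int aligned = alignUp(offset, size);
        System.out.println(aligned + " " + splitsCacheLine(aligned, size));
    }
}
```

The trade-off is padding: aligning each constant to its own size wastes up to size - 1 bytes in front of it, which matters little for a handful of 32- or 64-byte constants.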

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v15]

2022-03-12 Thread Jatin Bhateja

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Creating separate test for round double under feature check.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/2519a58c..e4d4e29b

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=14
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=13-14

  Stats: 239 lines in 3 files changed: 143 ins; 96 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v14]

2022-03-11 Thread Jatin Bhateja

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Reducing the invocation count and compile thresholds for 
RoundTests.java.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/fcb73212..2519a58c

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=13
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=12-13

  Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v9]

2022-03-11 Thread Jatin Bhateja
On Thu, 10 Mar 2022 14:29:36 GMT, Joe Darcy  wrote:

>> Hi @jddarcy , 
>> 
>> Test has been modified on the same lines using generic options which 
>> manipulate compilation thresholds and agnostic to target platforms.
>> 
>>  * @run main/othervm -XX:Tier3CompileThreshold=100 
>> -XX:CompileThresholdScaling=0.01 -XX:+TieredCompilation RoundTests
>> 
>> Verified that  RoundTests::test* methods gets compiled by c2.
>> Test execution time with and without change is almost same ~7.80sec over 
>> Skylake-server.
>> 
>> Regards
>
> To be more explicit, the existing RoundTests.java test runs in a fraction of 
> a second.  The updated test runs many times slower, even if now under 10 
> second, at least on some platforms.
> 
> Can something closer to the original performance be restored?
> 
> As a tier 1 library test, these tests are run quite frequently.

Hi @jddarcy,
Earlier, none of the test methods in RoundTests.java were compiled because of 
their low invocation count. A loop with 2000 iterations, together with the 
controlled compilation thresholds, now triggers tier 4 compilation of the test 
points. I did several runs on a Skylake machine with and without the patch and 
could see no perceptible difference in runtime due to the modification.

I have further reduced the invocation count and compile threshold.

Thanks

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v13]

2022-03-10 Thread Jatin Bhateja

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Review comments resolution.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/547f4e31..fcb73212

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=12
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=11-12

  Stats: 13 lines in 3 files changed: 6 ins; 3 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v12]

2022-03-09 Thread Jatin Bhateja

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 15 commits:

 - 8279508: Preventing domain switch-over penalty for Math.round(float) and 
constraining unrolling to prevent code bloating.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
 - 8279508: Removing +LogCompilation flag.
 - 8279508: Review comments resolved.`
 - 8279508: Adding descriptive comments.
 - 8279508: Review comments resolved.
 - 8279508: Review comments resolved.
 - 8279508: Fixing for windows failure.
 - 8279508: Adding few descriptive comments.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
 - ... and 5 more: https://git.openjdk.java.net/jdk/compare/d07f7c76...547f4e31

-

Changes: https://git.openjdk.java.net/jdk/pull/7094/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=11
  Stats: 752 lines in 24 files changed: 660 ins; 30 del; 62 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v11]

2022-03-06 Thread Jatin Bhateja
On Sun, 6 Mar 2022 09:31:27 GMT, Andrew Haley  wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8279508: Removing +LogCompilation flag.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4157:
> 
>> 4155:   ExternalAddress mxcsr_std(StubRoutines::x86::addr_mxcsr_std());
>> 4156:   ldmxcsr(new_mxcsr);
>> 4157:   movl(scratch, 1056964608);
> 
> What is 1056964608 ?

Raw bits corresponding to floating point value 0.5f.
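
For reference, the correspondence can be checked from Java with the standard `Float.floatToRawIntBits` and `Float.intBitsToFloat` methods; the class name below is only illustrative.

```java
final class HalfBitsSketch {
    // IEEE 754 single-precision encoding of 0.5f: sign 0, biased exponent 126, mantissa 0.
    static int halfBits() {
        return Float.floatToRawIntBits(0.5f);
    }

    public static void main(String[] args) {
        int bits = halfBits();
        System.out.println(bits + " = 0x" + Integer.toHexString(bits));
        System.out.println(Float.intBitsToFloat(1056964608));
    }
}
```

Writing the immediate in hex (0x3F000000) in the assembler source would make the intent easier to see than the decimal form.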

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v9]

2022-03-04 Thread Jatin Bhateja
On Fri, 4 Mar 2022 06:06:52 GMT, Joe Darcy  wrote:

>> test/jdk/java/lang/Math/RoundTests.java line 32:
>> 
>>> 30: public static void main(String... args) {
>>> 31: int failures = 0;
>>> 32: for (int i = 0; i < 10; i++) {
>> 
>> Is there an idiom to trigger the auto-vectorization, perhaps using command 
>> line arguments, that doesn't bloat the running time of this test?
>
> IMO RoundTests should have a explicit @run tag without any VM options as well.
> 
> Do the added VM options run on all platforms in question? What is the 
> approximate time to run the test run compared to before?

Hi @jddarcy , 

The test has been modified along the same lines, using generic options that 
manipulate compilation thresholds and are agnostic to the target platform.

 * @run main/othervm -XX:Tier3CompileThreshold=100 
-XX:CompileThresholdScaling=0.01 -XX:+TieredCompilation RoundTests

Verified that the RoundTests::test* methods get compiled by C2.
Test execution time with and without the change is almost the same, ~7.80 sec 
on a Skylake server.

Regards

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v2]

2022-03-02 Thread Jatin Bhateja
On Wed, 19 Jan 2022 22:09:26 GMT, Joe Darcy  wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8279508: Adding a test for scalar intrinsification.
>
> The testing for this PR doesn't look adequate to me. I don't see any testing 
> for the values where the behavior of round has been redefined at points in 
> the last decade. See JDK-8010430 and JDK-6430675, both of which have 
> regression tests in the core libs area. Thanks.

Hi @jddarcy, your feedback has been incorporated; can you kindly validate it?
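
For context, a short sketch of the kind of inputs such tests need to cover, using a scalar loop of the shape the auto-vectorizer targets. The class and method names are hypothetical and this is not the actual RoundTests.java; the listed values are edge cases of the documented `Math.round(float)` behavior (NaN, infinities, ties, the largest float below 0.5, and an odd integer at 2^23 + 1).

```java
final class RoundEdgeCasesSketch {
    // Scalar loop over Math.round(float); a vectorized version must produce
    // exactly the same results, including for the special inputs below.
    static int[] roundAll(float[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.round(in[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] in = {
            Float.NaN,                 // rounds to 0
            Float.POSITIVE_INFINITY,   // clamps to Integer.MAX_VALUE
            Float.NEGATIVE_INFINITY,   // clamps to Integer.MIN_VALUE
            2.5f,                      // tie rounds toward positive infinity
            -2.5f,                     // tie rounds toward positive infinity
            0.49999997f,               // largest float below 0.5
            8388609.0f                 // 2^23 + 1, already an integer
        };
        System.out.println(java.util.Arrays.toString(roundAll(in)));
    }
}
```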

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v11]

2022-03-01 Thread Jatin Bhateja

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Removing +LogCompilation flag.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/3b90ae53..57b1b13a

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=7094=10
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=7094=09-10

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v10]

2022-03-01 Thread Jatin Bhateja
> Summary of changes:
> - Intrinsify Math.round(float) and Math.round(double) APIs.
> - Extend auto-vectorizer to infer vector operations on encountering scalar IR 
> nodes for above intrinsics.
> - Test creation using new IR testing framework.
> 
> The following are the performance numbers of a JMH micro included with the patch:
> 
> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
> 
> 
> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
> -- | -- | -- | -- | -- | -- | -- | --
> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 510.36 | 548.39 | 1.07
> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 293.48 | 274.01 | 0.93
> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 751.83 | 2274.13 | 3.02
> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 388.52 | 1334.18 | 3.43
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Review comments resolved.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/54d4ea36..3b90ae53

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=7094=09
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=7094=08-09

  Stats: 12 lines in 2 files changed: 1 ins; 0 del; 11 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v9]

2022-02-25 Thread Jatin Bhateja
On Fri, 25 Feb 2022 06:22:42 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> The following are the performance numbers of a JMH micro included with the patch:
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 510.36 | 548.39 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 293.48 | 274.01 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 751.83 | 2274.13 | 3.02
>> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 388.52 | 1334.18 | 3.43
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Adding descriptive comments.

As per the SDM, if a floating-point value is not representable in the 
destination format after conversion (for example, the value 3.4028235E10 
overflows the value range of the int primitive type when converted to an 
integer), the conversion returns 0x80000000, i.e. the bit pattern of -0.0. 
The same value is returned for +/-NaN and +/-Inf inputs. All of these cases 
(non-representable results and NaN/Inf inputs) are handled specially: the 
algorithm first performs an unordered comparison on the original source value 
and returns 0 for NaN, which weeds out the NaN case. For the remaining special 
values it checks the MSB (sign bit) of the source and returns 
Integer.MAX_VALUE for positive inputs or Integer.MIN_VALUE for negative 
inputs, adhering to the semantics of the Math.round API.

Existing tests were enhanced to cover the various special cases: NaN, Inf, 
positive and negative values, values that may be inexact after adding 0.5, and 
values whose converted result overflows the integer value range.
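
A minimal Java sketch of the special-case handling described above may help reviewers. It assumes the raw hardware conversion yields 0x80000000 for every NaN, Inf, or out-of-range input. The names `RoundSpecialCases`, `rawConvert` and `fixUp` are hypothetical and only model the behaviour in scalar Java; they are not routines from the patch.

```java
final class RoundSpecialCases {
    // Assumed result of the raw conversion for NaN, Inf and out-of-range inputs.
    static final int INDEFINITE = 0x80000000;

    // Hypothetical model of the raw float-to-int conversion instruction.
    static int rawConvert(float f) {
        if (Float.isNaN(f) || f >= 0x1.0p31f || f < -0x1.0p31f) {
            return INDEFINITE;
        }
        return (int) f;
    }

    // Fix-up step: NaN -> 0, otherwise saturate according to the sign bit.
    static int fixUp(float src, int converted) {
        if (converted != INDEFINITE) {
            return converted;              // ordinary, representable result
        }
        if (src != src) {
            return 0;                      // only NaN compares unordered with itself
        }
        int bits = Float.floatToRawIntBits(src);
        return bits < 0 ? Integer.MIN_VALUE : Integer.MAX_VALUE;
    }

    static int convert(float f) {
        return fixUp(f, rawConvert(f));
    }

    public static void main(String[] args) {
        float[] inputs = {Float.NaN, Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY,
                          3.4028235E10f, -3.4028235E10f, 1.5f};
        for (float f : inputs) {
            System.out.println(f + " -> " + convert(f) + " (Java cast: " + (int) f + ")");
        }
    }
}
```

For every input, the modelled result should agree with Java's own narrowing cast `(int) f`, which is the saturating behaviour that Math.round relies on for these values.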

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v9]

2022-02-24 Thread Jatin Bhateja
> Summary of changes:
> - Intrinsify Math.round(float) and Math.round(double) APIs.
> - Extend auto-vectorizer to infer vector operations on encountering scalar IR 
> nodes for above intrinsics.
> - Test creation using new IR testing framework.
> 
> The following are the performance numbers of a JMH micro included with the patch:
> 
> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
> 
> 
> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
> -- | -- | -- | -- | -- | -- | -- | --
> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 510.36 | 548.39 | 1.07
> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 293.48 | 274.01 | 0.93
> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 751.83 | 2274.13 | 3.02
> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 388.52 | 1334.18 | 3.43
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Adding descriptive comments.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/f7dec3d9..54d4ea36

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=7094=08
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=7094=07-08

  Stats: 31 lines in 2 files changed: 14 ins; 0 del; 17 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v7]

2022-02-24 Thread Jatin Bhateja
On Thu, 24 Feb 2022 00:43:27 GMT, Sandhya Viswanathan 
 wrote:

> Also curious, how does the performance look with all these changes.

Updated new perf numbers.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]

2022-02-24 Thread Jatin Bhateja
On Thu, 24 Feb 2022 02:43:46 GMT, Vamsi Parasa  wrote:

>> Optimizes the divideUnsigned() and remainderUnsigned() methods in 
>> java.lang.Integer and java.lang.Long classes using x86 intrinsics. This 
>> change shows 3x improvement for Integer methods and upto 25% improvement for 
>> Long. This change also implements the DivMod optimization which fuses 
>> division and modulus operations if needed. The DivMod optimization shows 3x 
>> improvement for Integer and ~65% improvement for Long.
>
> Vamsi Parasa has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   fix 32bit build issues

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4408:

> 4406:   jmp(done);
> 4407:   bind(neg_divisor_fastpath);
> 4408:   // Fastpath for divisor < 0:

How about checking in the Identity routine whether the divisor is a positive 
or negative constant (with a non-constant dividend) and setting a flag in the 
IR node? A new instruction selection pattern could then use that flag to emit 
either the fast path or the slow path, which would avoid emitting redundant 
instructions.

src/hotspot/share/opto/divnode.cpp line 881:

> 879:   return (phase->type( in(2) )->higher_equal(TypeLong::ONE)) ? in(1) : 
> this;
> 880: }
> 881: 
> //--Value--

An Ideal transform that replaces the unsigned divide with a cheaper logical 
right shift instruction when the divisor is a power of two would be useful.
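
As a rough illustration of the suggested transform, written at the Java level rather than as a C2 Ideal routine: for a divisor that is a power of two, the unsigned quotient equals a logical right shift by the divisor's trailing-zero count. The class and method names below are hypothetical.

```java
final class UnsignedDivByPow2 {
    // True for 1, 2, 4, ..., and also for 0x80000000, which is 2^31 when read as unsigned.
    static boolean isPowerOfTwo(int d) {
        return d != 0 && (d & (d - 1)) == 0;
    }

    // Unsigned division that takes the shift shortcut when the divisor allows it.
    static int divideUnsigned(int dividend, int divisor) {
        if (isPowerOfTwo(divisor)) {
            return dividend >>> Integer.numberOfTrailingZeros(divisor);
        }
        return Integer.divideUnsigned(dividend, divisor);
    }

    public static void main(String[] args) {
        System.out.println(divideUnsigned(-1, 8));                      // shift path
        System.out.println(Integer.toUnsignedString(divideUnsigned(-1, 3))); // general path
    }
}
```

The shortcut only changes how the quotient is computed; a zero divisor still falls through to `Integer.divideUnsigned` and throws as before.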

src/hotspot/share/opto/divnode.cpp line 897:

> 895: 
> 896:   // Either input is BOTTOM ==> the result is the local BOTTOM
> 897:   const Type *bot = bottom_type();

Can we add constant folding for the case where both the dividend and the 
divisor are constants?

-

PR: https://git.openjdk.java.net/jdk/pull/7572


Re: RFR: 8279508: Auto-vectorize Math.round API [v7]

2022-02-24 Thread Jatin Bhateja
On Thu, 24 Feb 2022 01:43:27 GMT, Sandhya Viswanathan 
 wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8279508: Review comments resolved.
>
> src/hotspot/cpu/x86/macroAssembler_x86.cpp line 8984:
> 
>> 8982: }
>> 8983: 
>> 8984: void MacroAssembler::round_double(Register dst, XMMRegister src, 
>> Register rtmp, Register rcx) {
> 
> Is it possible to implement this using the similar mxcsr change? In any case 
> comments will help to review round_double and round_float code.

LDMXCSR has a multi-cycle latency, so it would degrade the performance of the 
scalar operation's fast path.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v8]

2022-02-24 Thread Jatin Bhateja
> Summary of changes:
> - Intrinsify Math.round(float) and Math.round(double) APIs.
> - Extend auto-vectorizer to infer vector operations on encountering scalar IR 
> nodes for above intrinsics.
> - Test creation using new IR testing framework.
> 
> The following are the performance numbers of a JMH micro included with the patch:
> 
> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
> 
> 
> TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
> -- | -- | -- | -- | -- | -- | --
> 1024.00 | 510.41 | 1811.66 | 3.55 | 510.40 | 502.65 | 0.98
> 2048.00 | 293.52 | 984.37 | 3.35 | 304.96 | 177.88 | 0.58
> 1024.00 | 825.94 | 3387.64 | 4.10 | 750.77 | 1925.15 | 2.56
> 2048.00 | 411.91 | 1942.87 | 4.72 | 412.22 | 1034.13 | 2.51
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Review comments resolved.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/6c869c76..f7dec3d9

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=7094=07
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=7094=06-07

  Stats: 35 lines in 5 files changed: 8 ins; 22 del; 5 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v7]

2022-02-23 Thread Jatin Bhateja
> Summary of changes:
> - Intrinsify Math.round(float) and Math.round(double) APIs.
> - Extend auto-vectorizer to infer vector operations on encountering scalar IR 
> nodes for above intrinsics.
> - Test creation using new IR testing framework.
> 
> The following are the performance numbers of a JMH micro included with the patch:
> 
> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
> 
> 
> TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
> -- | -- | -- | -- | -- | -- | --
> 1024.00 | 510.41 | 1811.66 | 3.55 | 510.40 | 502.65 | 0.98
> 2048.00 | 293.52 | 984.37 | 3.35 | 304.96 | 177.88 | 0.58
> 1024.00 | 825.94 | 3387.64 | 4.10 | 750.77 | 1925.15 | 2.56
> 2048.00 | 411.91 | 1942.87 | 4.72 | 412.22 | 1034.13 | 2.51
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Review comments resolved.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/f35ed9cf..6c869c76

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=7094=06
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=7094=05-06

  Stats: 7 lines in 2 files changed: 0 ins; 3 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v6]

2022-02-23 Thread Jatin Bhateja
On Wed, 23 Feb 2022 01:31:24 GMT, Sandhya Viswanathan 
 wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8279508: Fixing for windows failure.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4146:
> 
>> 4144:   vaddpd(xtmp1, src , xtmp1, vec_enc);
>> 4145:   vrndscalepd(dst, xtmp1, 0x4, vec_enc);
>> 4146:   evcvtpd2qq(dst, dst, vec_enc);
> 
> Why do we need vrndscalepd in between, could we not directly use cvtpd2qq 
> after vaddpd?

Thanks @sviswa7. When a conversion is inexact, the value returned is rounded 
according to the rounding control bits in the MXCSR register or the embedded 
rounding control bits. Done.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long

2022-02-22 Thread Jatin Bhateja
On Tue, 22 Feb 2022 09:24:47 GMT, Vamsi Parasa  wrote:

> Optimizes the divideUnsigned() and remainderUnsigned() methods in 
> java.lang.Integer and java.lang.Long classes using x86 intrinsics. This 
> change shows 3x improvement for Integer methods and upto 25% improvement for 
> Long. This change also implements the DivMod optimization which fuses 
> division and modulus operations if needed. The DivMod optimization shows 3x 
> improvement for Integer and ~65% improvement for Long.

src/hotspot/cpu/x86/x86_64.ad line 8602:

> 8600: __ jmp(done);
> 8601: __ bind(neg_divisor_fastpath); 
> 8602: // Fastpath for divisor < 0: 

Please move this into a macro assembly routine.

src/hotspot/cpu/x86/x86_64.ad line 8633:

> 8631: __ jmp(done);
> 8632: __ bind(neg_divisor_fastpath);
> 8633: // Fastpath for divisor < 0: 

Please move this into a macro assembly routine.

src/hotspot/cpu/x86/x86_64.ad line 8722:

> 8720: __ shrl(rax, 31); // quotient
> 8721: __ sarl(tmp, 31);
> 8722: __ andl(tmp, divisor);

Please move this into a macro assembly routine.

src/hotspot/cpu/x86/x86_64.ad line 8763:

> 8761: __ andnq(rax, rax, rdx);
> 8762: __ movq(tmp, rax);
> 8763: __ shrq(rax, 63); // quotient

Please move this into a macro assembly routine.

src/hotspot/cpu/x86/x86_64.ad line 8902:

> 8900: __ subl(tmp_rax, divisor);
> 8901: __ andnl(tmp_rax, tmp_rax, rdx);
> 8902: __ sarl(tmp_rax, 31);

Please move this into a macro assembly routine.

src/hotspot/cpu/x86/x86_64.ad line 8932:

> 8930: // Fastpath when divisor < 0: 
> 8931: // remainder = dividend - (((dividend & ~(dividend - divisor)) >> 
> (Long.SIZE - 1)) & divisor)
> 8932: // See Hacker's Delight (2nd ed), section 9.3 which is implemented 
> in java.lang.Long.remainderUnsigned()

Please move it into a macro assembly routine.
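
For readers following the fast-path comments quoted above, here is a small Java sketch of the "divisor < 0" identities. When the divisor has its sign bit set, the unsigned quotient can only be 0 or 1, so both quotient and remainder reduce to a few bitwise operations. The class name is hypothetical; the remainder formula is the one quoted in the comment, and the quotient form follows the shift-by-63 sequence shown in the quoted assembly.

```java
final class NegDivisorFastPath {
    // Unsigned quotient, valid only when divisor < 0 (sign bit set): result is 0 or 1.
    static long quotient(long dividend, long divisor) {
        return (dividend & ~(dividend - divisor)) >>> (Long.SIZE - 1);
    }

    // Unsigned remainder, valid only when divisor < 0: subtract the divisor at most once.
    static long remainder(long dividend, long divisor) {
        return dividend - (((dividend & ~(dividend - divisor)) >> (Long.SIZE - 1)) & divisor);
    }

    public static void main(String[] args) {
        long dividend = -1L;   // 2^64 - 1 when read as unsigned
        long divisor = -2L;    // 2^64 - 2 when read as unsigned
        System.out.println(quotient(dividend, divisor) + " rem " + remainder(dividend, divisor));
    }
}
```

Both helpers are only meaningful for negative (as signed) divisors; for other divisors the general division path is still required.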

src/hotspot/share/opto/compile.cpp line 3499:

> 3497:   Node* d = n->find_similar(Op_UDivI);
> 3498:   if (d) {
> 3499: // Replace them with a fused unsigned divmod if supported

Can you explain a bit here: why can't this transformation be handled earlier?

src/hotspot/share/opto/divnode.cpp line 1350:

> 1348: return NULL;
> 1349:   }
> 1350: 

Please remove Value and Ideal routines if no explicit transforms are being done.

src/hotspot/share/opto/divnode.cpp line 1362:

> 1360:   }
> 1361: 
> 1362: 
> //=

You can remove the Ideal routine if no transformation is being done.

test/micro/org/openjdk/bench/java/lang/IntegerDivMod.java line 76:

> 74: return quotients;
> 75: }
> 76: 

Return seems redundant here.

test/micro/org/openjdk/bench/java/lang/IntegerDivMod.java line 83:

> 81: }
> 82: return remainders;
> 83: }

Return seems redundant here.

test/micro/org/openjdk/bench/java/lang/LongDivMod.java line 75:

> 73: }
> 74: return quotients;
> 75: }

Do we need to return quotients, since it is a field that is being explicitly modified?

test/micro/org/openjdk/bench/java/lang/LongDivMod.java line 82:

> 80: remainders[i] = Long.remainderUnsigned(dividends[i], 
> divisors[i]);
> 81: }
> 82: return remainders;

Same as above

-

PR: https://git.openjdk.java.net/jdk/pull/7572


Re: RFR: 8279508: Auto-vectorize Math.round API [v6]

2022-02-17 Thread Jatin Bhateja
> Summary of changes:
> - Intrinsify Math.round(float) and Math.round(double) APIs.
> - Extend auto-vectorizer to infer vector operations on encountering scalar IR 
> nodes for above intrinsics.
> - Test creation using new IR testing framework.
> 
> The following are the performance numbers of a JMH micro included with the patch:
> 
> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
> 
> 
> TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
> -- | -- | -- | -- | -- | -- | --
> 1024.00 | 510.41 | 1811.66 | 3.55 | 510.40 | 502.65 | 0.98
> 2048.00 | 293.52 | 984.37 | 3.35 | 304.96 | 177.88 | 0.58
> 1024.00 | 825.94 | 3387.64 | 4.10 | 750.77 | 1925.15 | 2.56
> 2048.00 | 411.91 | 1942.87 | 4.72 | 412.22 | 1034.13 | 2.51
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Fixing for windows failure.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/73674fe4..f35ed9cf

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=7094=05
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=7094=04-05

  Stats: 6 lines in 1 file changed: 0 ins; 0 del; 6 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v5]

2022-02-16 Thread Jatin Bhateja
On Wed, 16 Feb 2022 12:30:27 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> The following are the performance numbers of a JMH micro included with the patch:
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 584.99 | 1870.70 | 3.20 | 510.35 | 548.60 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 257.17 | 965.33 | 3.75 | 293.60 | 273.15 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.69 | 3592.54 | 4.35 | 825.32 | 1836.42 | 2.23
>> FpRoundingBenchmark.test_round_float | 2048.00 | 388.55 | 1895.77 | 4.88 | 412.31 | 945.82 | 2.29
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains seven commits:
> 
>  - 8279508: Adding few descriptive comments.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
>  - 8279508: Replacing by efficient instruction sequence based on MXCSR.RC 
> mode.
>  - 8279508: Adding vectorized algorithms to match the semantics of rounding 
> operations.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
>  - 8279508: Adding a test for scalar intrinsification.
>  - 8279508: Auto-vectorize Math.round API

> _Mailing list message from [Joseph D. Darcy](mailto:joe.da...@oracle.com) on 
> [hotspot-dev](mailto:hotspot-...@mail.openjdk.java.net):_
> 
> On 2/12/2022 6:55 PM, Jatin Bhateja wrote:
> 
> > On Fri, 21 Jan 2022 00:49:04 GMT, Sandhya Viswanathan  > openjdk.org> wrote:
> > > The JVM currently initializes the x86 mxcsr to round to nearest even, see 
> > > below in stubGenerator_x86_64.cpp: // Round to nearest (even), 64-bit 
> > > mode, exceptions masked StubRoutines::x86::_mxcsr_std = 0x1F80; The above 
> > > works for Math.rint which is specified to be round to nearest even. 
> > > Please see: 
> > > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
> > >  : section 4.8.4
> > > The rounding mode needed for Math.round is round to positive infinity 
> > > which needs a different x86 mxcsr initialization(0x5F80).
> > > Hi @sviswa7 ,
> > > As per JLS 17 section 15.4 Java follows round to nearest rounding policy 
> > > for all floating point operations except conversion to integer and 
> > > remainder where it uses round toward zero.
> 
> That is a true background condition, but I will note that the Math.round 
> method does independently define the semantics of its operation and rounding 
> behavior, which has changed (slightly) over the lifetime of the platform.
> 
> -Joe

Hi @jddarcy, thanks for your comments. The patch has been updated to follow 
the prescribed semantics of the Math.round API.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v3]

2022-02-16 Thread Jatin Bhateja
On Wed, 16 Feb 2022 12:26:45 GMT, Jatin Bhateja  wrote:

>>> > Hi, IIRC for evex encoding you can embed the RC control bit directly in 
>>> > the evex prefix, removing the need to rely on global MXCSR register. 
>>> > Thanks.
>>> 
>>> Hi @merykitty , You are correct, we can embed RC mode in instruction 
>>> encoding of round instruction (towards -inf,+inf, zero). But to match the 
>>> semantics of Math.round API one needs to add 0.5[f] to input value and then 
>>> perform rounding over resultant value, which is why @sviswa7 suggested to 
>>> use a global rounding mode driven by MXCSR.RC so that intermediate floating 
>>> inexact values are resolved as desired, but OOO execution may misplace 
>>> LDMXCSR and hence may have undesired side effects.
>> 
>> **Just want to correct above statement, LDMXCSR will not be 
>> re-ordered/re-scheduled early OOO backend.**
>
>> That pseudocode would make a very useful comment too. This whole patch is 
>> very thinly commented.
> 
> I have replaced earlier bulky sequence, new sequence is having similar 
> performance but reduction in code may improve inlining behavior.  Added 
> descriptive comments around the special cases.

> There are already `RoundFloat`, `RoundDouble`, and `RoundDoubleMode` nodes 
> defined.
> 
> Though `RoundFloat` and `RoundDouble` are legacy nodes used only on x86-32, 
> `RoundDoubleMode` supports multiple rounding modes and is amenable to 
> auto-vectorization.
> 
> What do you think about the following alternative?
> 
> Reuse `RoundDoubleMode` (with a new rounding mode) and introduce 
> `RoundFloatMode`.
> 
> Special rounding rules is not the only peculiarity of `Math.round()`. It also 
> converts the result to an integral type. It can be represented as `ConvF2I 
> (RoundFloatMode f #rmode)` / `ConvD2L (RoundDoubleMode d #rmode)`. In scalar 
> case, it can be matched as a single AD instruction.
> 
> Auto-vectorizer can then convert it to `VectorCastF2X (RoundFloatModeV vf 
> #rmode)` / `VectorCastD2X (RoundDoubleModeV vd #rmode)` and match it in a 
> similar manner.

Adding a new rounding mode to RoundDoubleMode may disturb other targets: the 
match_rule_supported routine operates on opcodes, and currently any target 
that supports RoundDoubleMode generates code for all of its rounding modes. 
Your solution is in any case based on creating new scalar and vector IR nodes 
for the floating-point rounding operation, which is what the patch currently 
does.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v5]

2022-02-16 Thread Jatin Bhateja
> Summary of changes:
> - Intrinsify Math.round(float) and Math.round(double) APIs.
> - Extend auto-vectorizer to infer vector operations on encountering scalar IR 
> nodes for above intrinsics.
> - Test creation using new IR testing framework.
> 
> The following are the performance numbers of a JMH micro included with the patch:
> 
> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
> 
> 
> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
> -- | -- | -- | -- | -- | -- | -- | --
> FpRoundingBenchmark.test_round_double | 1024.00 | 584.99 | 1870.70 | 3.20 | 510.35 | 548.60 | 1.07
> FpRoundingBenchmark.test_round_double | 2048.00 | 257.17 | 965.33 | 3.75 | 293.60 | 273.15 | 0.93
> FpRoundingBenchmark.test_round_float | 1024.00 | 825.69 | 3592.54 | 4.35 | 825.32 | 1836.42 | 2.23
> FpRoundingBenchmark.test_round_float | 2048.00 | 388.55 | 1895.77 | 4.88 | 412.31 | 945.82 | 2.29
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains seven commits:

 - 8279508: Adding few descriptive comments.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
 - 8279508: Replacing by efficient instruction sequence based on MXCSR.RC mode.
 - 8279508: Adding vectorized algorithms to match the semantics of rounding 
operations.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
 - 8279508: Adding a test for scalar intrinsification.
 - 8279508: Auto-vectorize Math.round API

-

Changes: https://git.openjdk.java.net/jdk/pull/7094/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=7094=04
  Stats: 739 lines in 23 files changed: 648 ins; 29 del; 62 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v3]

2022-02-16 Thread Jatin Bhateja
On Mon, 14 Feb 2022 17:14:10 GMT, Jatin Bhateja  wrote:

>> That pseudocode would make a very useful comment too. This whole patch is 
>> very thinly commented.
>
>> > Hi, IIRC for evex encoding you can embed the RC control bit directly in 
>> > the evex prefix, removing the need to rely on global MXCSR register. 
>> > Thanks.
>> 
>> Hi @merykitty , You are correct, we can embed RC mode in instruction 
>> encoding of round instruction (towards -inf,+inf, zero). But to match the 
>> semantics of Math.round API one needs to add 0.5[f] to input value and then 
>> perform rounding over resultant value, which is why @sviswa7 suggested to 
>> use a global rounding mode driven by MXCSR.RC so that intermediate floating 
>> inexact values are resolved as desired, but OOO execution may misplace 
>> LDMXCSR and hence may have undesired side effects.
> 
> **Just want to correct above statement, LDMXCSR will not be 
> re-ordered/re-scheduled early OOO backend.**

> That pseudocode would make a very useful comment too. This whole patch is 
> very thinly commented.

I have replaced the earlier bulky sequence. The new sequence has similar 
performance, but the reduction in code size may improve inlining behavior. I 
have also added descriptive comments around the special cases.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v4]

2022-02-16 Thread Jatin Bhateja
> Summary of changes:
> - Intrinsify Math.round(float) and Math.round(double) APIs.
> - Extend auto-vectorizer to infer vector operations on encountering scalar IR 
> nodes for above intrinsics.
> - Test creation using new IR testing framework.
> 
> The following are the performance numbers of a JMH micro included with the patch:
> 
> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
> 
> 
> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
> -- | -- | -- | -- | -- | -- | -- | --
> FpRoundingBenchmark.test_round_double | 1024.00 | 584.99 | 1870.70 | 3.20 | 510.35 | 548.60 | 1.07
> FpRoundingBenchmark.test_round_double | 2048.00 | 257.17 | 965.33 | 3.75 | 293.60 | 273.15 | 0.93
> FpRoundingBenchmark.test_round_float | 1024.00 | 825.69 | 3592.54 | 4.35 | 825.32 | 1836.42 | 2.23
> FpRoundingBenchmark.test_round_float | 2048.00 | 388.55 | 1895.77 | 4.88 | 412.31 | 945.82 | 2.29
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Replacing by efficient instruction sequence based on MXCSR.RC mode.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/2dc364fa..1c9ff777

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=7094=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=7094=02-03

  Stats: 143 lines in 4 files changed: 4 ins; 82 del; 57 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v3]

2022-02-14 Thread Jatin Bhateja
On Mon, 14 Feb 2022 09:12:54 GMT, Andrew Haley  wrote:

>>> What does this do? Comment, even pseudo code, would be nice.
>> 
>> Thanks @theRealAph , I shall append the comments over the routine.
>> BTW, entire rounding algorithm can also be implemented using  Vector API 
>> which can perform if-conversion using masked operations.
>> 
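>> import jdk.incubator.vector.*;
>> 
>> class roundf {
>>    public static final VectorSpecies<Integer> ISPECIES = IntVector.SPECIES_512;
>>    public static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_512;
>> 
>>    // Vectorized Math.round(float). Assumes a.length is a multiple of
>>    // SPECIES.length() and that r is at least as long as a.
>>    public static int round_vector(float[] a, int[] r, int ctr) {
>>       // 24 - 2 + 127 = SIGNIFICAND_WIDTH - 2 + EXP_BIAS
>>       IntVector shiftVBC = (IntVector) ISPECIES.broadcast(24 - 2 + 127);
>>       for (int i = 0; i < a.length; i += SPECIES.length()) {
>>          FloatVector fv = FloatVector.fromArray(SPECIES, a, i);
>>          IntVector iv = fv.reinterpretAsInts();
>>          // Extract the biased exponent (exponent bit mask, then shift).
>>          IntVector biasedExpV = iv.lanewise(VectorOperators.AND, 0x7F800000);
>>          biasedExpV = biasedExpV.lanewise(VectorOperators.ASHR, 23);
>>          IntVector shiftV = shiftVBC.lanewise(VectorOperators.SUB, 
>> biasedExpV);
>>          // cond: 0 <= shift < 32, i.e. the lane takes the bit-twiddling path.
>>          VectorMask<Integer> cond = shiftV.lanewise(VectorOperators.AND, -32)
>>            .compare(VectorOperators.EQ, 0);
>>          // Significand with the implicit leading one restored.
>>          IntVector res = iv.lanewise(VectorOperators.AND, 0x007FFFFF)
>>            .lanewise(VectorOperators.OR, 0x007FFFFF + 1);
>>          VectorMask<Integer> cond1 = iv.compare(VectorOperators.LT, 0);
>>          VectorMask<Integer> cond2 = cond1.and(cond);
>>          // Negate the lanes whose source value is negative.
>>          res = res.lanewise(VectorOperators.NEG, cond2);
>>          // ((r >> shift) + 1) >> 1 evaluates to floor(a + 1/2).
>>          res = res.lanewise(VectorOperators.ASHR, shiftV)
>>            .lanewise(VectorOperators.ADD, 1)
>>            .lanewise(VectorOperators.ASHR, 1);
>>          // All other lanes use the plain float-to-int conversion.
>>          res = fv.convert(VectorOperators.F2I, 0)
>>            .reinterpretAsInts()
>>            .blend(res, cond);
>>          res.intoArray(r, i);
>>       }
>>       return r[ctr];
>>    }
>> }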
>
> That pseudocode would make a very useful comment too. This whole patch is 
> very thinly commented.
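
For comparison, here is a scalar Java sketch of the bit manipulation that the quoted Vector API routine appears to if-convert. It is written from my understanding of how java.lang.Math.round(float) is specified to behave; the class name `ScalarRound` and the local constant names are hypothetical, and the constant values are the standard IEEE 754 binary32 masks.

```java
final class ScalarRound {
    static final int EXP_BIT_MASK = 0x7F800000;      // binary32 exponent field
    static final int SIGNIF_BIT_MASK = 0x007FFFFF;   // binary32 significand field
    static final int SIGNIFICAND_WIDTH = 24;         // including the implicit bit
    static final int EXP_BIAS = 127;

    static int round(float a) {
        int bits = Float.floatToRawIntBits(a);
        int biasedExp = (bits & EXP_BIT_MASK) >> (SIGNIFICAND_WIDTH - 1);
        int shift = (SIGNIFICAND_WIDTH - 2 + EXP_BIAS) - biasedExp;
        if ((shift & -32) == 0) {                     // 0 <= shift < 32
            // Significand with the implicit leading one, signed like the input.
            int r = (bits & SIGNIF_BIT_MASK) | (SIGNIF_BIT_MASK + 1);
            if (bits < 0) {
                r = -r;
            }
            // (r >> shift) is floor(a * 2); adding 1 and halving gives floor(a + 1/2).
            return ((r >> shift) + 1) >> 1;
        }
        // Tiny values, values that are already integers, NaN and infinities:
        // the plain narrowing conversion already gives the required result.
        return (int) a;
    }

    public static void main(String[] args) {
        float[] samples = {0.5f, -0.5f, 2.5f, -2.5f, 1.0e10f, Float.NaN};
        for (float f : samples) {
            System.out.println(f + " -> " + round(f) + " (Math.round: " + Math.round(f) + ")");
        }
    }
}
```

The two branches of the `if` correspond to the `cond` mask and the final `blend` in the vector version: lanes with a shift in [0, 32) take the integer path and all other lanes take the direct conversion.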

> > Hi, IIRC for evex encoding you can embed the RC control bit directly in the 
> > evex prefix, removing the need to rely on global MXCSR register. Thanks.
> 
> Hi @merykitty , You are correct, we can embed RC mode in instruction encoding 
> of round instruction (towards -inf,+inf, zero). But to match the semantics of 
> Math.round API one needs to add 0.5[f] to input value and then perform 
> rounding over resultant value, which is why @sviswa7 suggested to use a 
> global rounding mode driven by MXCSR.RC so that intermediate floating inexact 
> values are resolved as desired, but OOO execution may misplace LDMXCSR and 
> hence may have undesired side effects.

**Just want to correct the above statement: LDMXCSR will not be re-ordered or 
re-scheduled early by the OOO backend.**

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v3]

2022-02-13 Thread Jatin Bhateja
On Sun, 13 Feb 2022 13:08:41 GMT, Jatin Bhateja  wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4066:
>> 
>>> 4064: }
>>> 4065: 
>>> 4066: void 
>>> C2_MacroAssembler::vector_cast_double_special_cases_evex(XMMRegister dst, 
>>> XMMRegister src, XMMRegister xtmp1,
>> 
>> What does this do? Comment, even pseudo code, would be nice.
>
>> Hi, IIRC for evex encoding you can embed the RC control bit directly in the 
>> evex prefix, removing the need to rely on global MXCSR register. Thanks.
> 
> Hi @merykitty ,  You are correct, we can embed RC mode in instruction 
> encoding of round instruction (towards -inf,+inf, zero). But to match the 
> semantics of Math.round API one needs to add 0.5[f] to input value and then 
> perform rounding over resultant value, which is why @sviswa7 suggested to use 
> a global rounding mode driven by MXCSR.RC so that intermediate floating 
> inexact values also are resolved as desired, but OOO execution may misplace 
> LDMXCSR and hence may have undesired side effects.

> What does this do? Comment, even pseudo code, would be nice.

Thanks @theRealAph , I shall add comments above the routine.
BTW, the entire rounding algorithm can also be implemented using the Vector API, 
which can perform if-conversion using masked operations.

import jdk.incubator.vector.*;

class roundf {
   public static final VectorSpecies<Integer> ISPECIES = IntVector.SPECIES_512;
   public static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_512;

   // Assumes a.length is a multiple of SPECIES.length() and r.length >= a.length.
   public static int round_vector(float[] a, int[] r, int ctr) {
      // SIGNIFICAND_WIDTH - 2 + EXP_BIAS
      IntVector shiftVBC = (IntVector) ISPECIES.broadcast(24 - 2 + 127);
      for (int i = 0; i < a.length; i += SPECIES.length()) {
         FloatVector fv = FloatVector.fromArray(SPECIES, a, i);
         IntVector iv = fv.reinterpretAsInts();
         // Biased exponent of each lane.
         IntVector biasedExpV = iv.lanewise(VectorOperators.AND, 0x7F800000);
         biasedExpV = biasedExpV.lanewise(VectorOperators.ASHR, 23);
         IntVector shiftV = shiftVBC.lanewise(VectorOperators.SUB, biasedExpV);
         // cond: 0 <= shift < 32, i.e. the bit-manipulation path applies.
         VectorMask<Integer> cond = shiftV.lanewise(VectorOperators.AND, -32)
                                          .compare(VectorOperators.EQ, 0);
         // Significand with the implicit leading one restored.
         IntVector res = iv.lanewise(VectorOperators.AND, 0x007FFFFF)
                           .lanewise(VectorOperators.OR, 0x007FFFFF + 1);
         // Negate the lanes whose input is negative.
         VectorMask<Integer> cond1 = iv.compare(VectorOperators.LT, 0);
         VectorMask<Integer> cond2 = cond1.and(cond);
         res = res.lanewise(VectorOperators.NEG, cond2);
         // ((r >> shift) + 1) >> 1 evaluates to floor(a + 1/2).
         res = res.lanewise(VectorOperators.ASHR, shiftV)
                  .lanewise(VectorOperators.ADD, 1)
                  .lanewise(VectorOperators.ASHR, 1);
         // The remaining lanes fall back to a plain float-to-int conversion.
         res = fv.convert(VectorOperators.F2I, 0)
                 .reinterpretAsInts()
                 .blend(res, cond);
         res.intoArray(r, i);
      }
      return r[ctr];
   }
}
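
For comparison, the scalar logic that the vector code above applies lane by lane can be sketched as follows. This is only a minimal sketch modeled on the bit-manipulation approach of `Math.round(float)`; the class name `ScalarRoundSketch` is made up for illustration, and the constants are the IEEE 754 single-precision exponent mask, significand mask, significand width (24) and exponent bias (127).

```java
final class ScalarRoundSketch {
    // Scalar counterpart of a single lane of round_vector (illustrative only).
    static int round(float a) {
        int bits = Float.floatToRawIntBits(a);
        int biasedExp = (bits & 0x7F800000) >> 23;      // extract the biased exponent
        int shift = (24 - 2 + 127) - biasedExp;
        if ((shift & -32) == 0) {                       // 0 <= shift < 32
            // Significand with the implicit leading one restored.
            int r = (bits & 0x007FFFFF) | (0x007FFFFF + 1);
            if (bits < 0) {
                r = -r;
            }
            // ((r >> shift) + 1) >> 1 evaluates to floor(a + 1/2).
            return ((r >> shift) + 1) >> 1;
        }
        // Tiny values, values that are already integral, infinities and NaN.
        return (int) a;
    }

    public static void main(String[] args) {
        float[] samples = {0.5f, -0.5f, 2.5f, -2.5f, 0.49999997f, 1.0e10f, Float.NaN};
        for (float f : samples) {
            System.out.println(f + " -> " + round(f) + " (Math.round: " + Math.round(f) + ")");
        }
    }
}
```

The `if` branch corresponds to the lanes selected by `cond` in the vector version, and the final cast corresponds to the `F2I` conversion that is blended in for the other lanes.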

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v3]

2022-02-13 Thread Jatin Bhateja
On Sun, 13 Feb 2022 10:58:19 GMT, Andrew Haley  wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a 
>> merge or a rebase. The incremental webrev excludes the unrelated changes 
>> brought in by the merge/rebase. The pull request contains four additional 
>> commits since the last revision:
>> 
>>  - 8279508: Adding vectorized algorithms to match the semantics of rounding 
>> operations.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
>>  - 8279508: Adding a test for scalar intrinsification.
>>  - 8279508: Auto-vectorize Math.round API
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4066:
> 
>> 4064: }
>> 4065: 
>> 4066: void 
>> C2_MacroAssembler::vector_cast_double_special_cases_evex(XMMRegister dst, 
>> XMMRegister src, XMMRegister xtmp1,
> 
> What does this do? Comment, even pseudo code, would be nice.

> Hi, IIRC for evex encoding you can embed the RC control bit directly in the 
> evex prefix, removing the need to rely on global MXCSR register. Thanks.

Hi @merykitty , you are correct, we can embed the RC mode in the instruction 
encoding of round instructions (towards -inf, +inf, zero). But to match the 
semantics of the Math.round API one needs to add 0.5[f] to the input value and 
then perform rounding over the resultant value, which is why @sviswa7 suggested 
using a global rounding mode driven by MXCSR.RC, so that inexact intermediate 
floating-point values are also resolved as desired. However, OOO execution may 
misplace LDMXCSR and hence may have undesired side effects.
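
Why the inexact intermediate value matters can be seen in a small scalar experiment. This is a minimal sketch, assuming a naive `floor(a + 0.5f)` formulation evaluated under Java's default round-to-nearest-even mode; the class and method names are invented for illustration and are not part of the patch.

```java
final class NaiveRoundSketch {
    // Naive formulation: add 0.5f (the sum is rounded to nearest even), then floor.
    static int naiveRound(float a) {
        float sum = a + 0.5f;            // may already be inexact
        return (int) Math.floor(sum);
    }

    public static void main(String[] args) {
        float x = Math.nextDown(0.5f);   // largest float below 0.5
        System.out.println("naive      : " + naiveRound(x));
        System.out.println("Math.round : " + Math.round(x));
    }
}
```

For this input the exact sum lies halfway between two adjacent floats and is rounded up to 1.0f, so the naive version returns 1 while `Math.round` returns 0. Resolving the addition with a different rounding mode is what the MXCSR.RC suggestion is aiming at.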

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v3]

2022-02-12 Thread Jatin Bhateja
> Summary of changes:
> - Intrinsify Math.round(float) and Math.round(double) APIs.
> - Extend auto-vectorizer to infer vector operations on encountering scalar IR 
> nodes for above intrinsics.
> - Test creation using new IR testing framework.
> 
> Following are the performance numbers of a JMH micro-benchmark included with 
> the patch:
> 
> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
> 
> 
> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
> -- | -- | -- | -- | -- | -- | -- | --
> FpRoundingBenchmark.test_round_double | 1024.00 | 584.99 | 1870.70 | 3.20 | 510.35 | 548.60 | 1.07
> FpRoundingBenchmark.test_round_double | 2048.00 | 257.17 | 965.33 | 3.75 | 293.60 | 273.15 | 0.93
> FpRoundingBenchmark.test_round_float | 1024.00 | 825.69 | 3592.54 | 4.35 | 825.32 | 1836.42 | 2.23
> FpRoundingBenchmark.test_round_float | 2048.00 | 388.55 | 1895.77 | 4.88 | 412.31 | 945.82 | 2.29
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains four additional 
commits since the last revision:

 - 8279508: Adding vectorized algorithms to match the semantics of rounding 
operations.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8279508
 - 8279508: Adding a test for scalar intrinsification.
 - 8279508: Auto-vectorize Math.round API

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/575d2935..2dc364fa

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=01-02

  Stats: 33695 lines in 1192 files changed: 23243 ins; 5703 del; 4749 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v2]

2022-02-12 Thread Jatin Bhateja
On Fri, 21 Jan 2022 00:49:04 GMT, Sandhya Viswanathan 
 wrote:

> The JVM currently initializes the x86 mxcsr to round to nearest even, see 
> below in stubGenerator_x86_64.cpp: // Round to nearest (even), 64-bit mode, 
> exceptions masked StubRoutines::x86::_mxcsr_std = 0x1F80; The above works for 
> Math.rint which is specified to be round to nearest even. Please see: 
> https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
>  : section 4.8.4
> 
> The rounding mode needed for Math.round is round to positive infinity which 
> needs a different x86 mxcsr initialization(0x5F80).

Hi @sviswa7 ,
As per JLS 17 section 15.4, Java follows the round-to-nearest rounding policy 
for all floating-point operations except conversion to integer and remainder, 
where it uses round toward zero.

So it may not be feasible to modify the global MXCSR.RC setting. Modifying the 
MXCSR setting just before rounding and restoring its original value after the 
operation will also not work, as an OOO processor is free to re-order the 
LDMXCSR instruction if it is used without any barriers, and thus it may also 
influence other floating-point operations.
I am pushing an incremental patch which vectorizes the existing rounding APIs 
and shows a significant gain over the existing implementation.
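
The different rounding policies mentioned here are visible from plain Java. The following is a small illustrative sketch; the class and method names are hypothetical and only standard `java.lang.Math` calls are used.

```java
final class RoundingPolicySketch {
    // Conversion to integer rounds toward zero.
    static int truncate(float a) {
        return (int) a;
    }

    // Math.rint rounds to the nearest integer, ties to even.
    static double nearestEven(double a) {
        return Math.rint(a);
    }

    // Math.round rounds to the nearest integer, ties toward positive infinity.
    static int tiesUp(float a) {
        return Math.round(a);
    }

    public static void main(String[] args) {
        System.out.println(truncate(2.7f) + " " + truncate(-2.7f));
        System.out.println(nearestEven(2.5) + " " + nearestEven(3.5));
        System.out.println(tiesUp(2.5f) + " " + tiesUp(-2.5f));
    }
}
```

The tie cases are where the policies diverge: `Math.rint(2.5)` yields 2.0, whereas `Math.round(2.5f)` yields 3 and `Math.round(-2.5f)` yields -2.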

Best Regards,
Jatin

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8278173: [vectorapi] Add x64 intrinsics for unsigned (zero extended) casts

2022-02-10 Thread Jatin Bhateja
On Sat, 5 Feb 2022 15:34:08 GMT, Quan Anh Mai  wrote:

> Hi,
> 
> This patch implements the unsigned upcast intrinsics in x86, which are used 
> in vector lane-wise reinterpreting operations.
> 
> Thank you very much.

src/hotspot/cpu/x86/x86.ad line 7288:

> 7286: break;
> 7287:   default: assert(false, "%s", type2name(to_elem_bt));
> 7288: }

Please move this into a macro assembly routine.

src/hotspot/cpu/x86/x86.ad line 7310:

> 7308:   default: assert(false, "%s", type2name(to_elem_bt));
> 7309: }
> 7310:   %}

Same as above.

-

PR: https://git.openjdk.java.net/jdk/pull/7358


Re: RFR: 8279508: Auto-vectorize Math.round API [v2]

2022-01-19 Thread Jatin Bhateja
> Summary of changes:
> - Intrinsify Math.round(float) and Math.round(double) APIs.
> - Extend auto-vectorizer to infer vector operations on encountering scalar IR 
> nodes for above intrinsics.
> - Test creation using new IR testing framework.
> 
> Following are the performance numbers of a JMH micro-benchmark included with 
> the patch:
> 
> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
> 
> Benchmark | ARRAYLEN | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain (opt/baseline) | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain (opt/baseline)
> -- | -- | -- | -- | -- | -- | -- | --
> FpRoundingBenchmark.test_round_double | 1024 | 518.532 | 1364.066 | 2.630630318 | 512.908 | 4292.11 | 8.368186887
> FpRoundingBenchmark.test_round_double | 2048 | 270.137 | 830.986 | 3.076165057 | 273.159 | 2459.116 | 9.002507697
> FpRoundingBenchmark.test_round_float | 1024 | 752.436 | 7780.905 | 10.34095259 | 752.49 | 9506.694 | 12.63364829
> FpRoundingBenchmark.test_round_float | 2048 | 389.499 | 4113.046 | 10.55983712 | 389.63 | 4863.673 | 12.48279907
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8279508: Adding a test for scalar intrinsification.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7094/files
  - new: https://git.openjdk.java.net/jdk/pull/7094/files/0fe01504..575d2935

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=00-01

  Stats: 2 lines in 2 files changed: 2 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API

2022-01-15 Thread Jatin Bhateja
On Sun, 16 Jan 2022 02:23:15 GMT, Quan Anh Mai  wrote:

> Hi, did we have tests for the scalar intrinsification already? Thanks.

Verification is done against scalar rounding operation.
https://github.com/openjdk/jdk/pull/7094/files#diff-88b1bad16d68808e6c1224fff7773104924bfdabcb23958c2a3e4e6b06844701R369

Thanks

-

PR: https://git.openjdk.java.net/jdk/pull/7094


RFR: 8279508: Auto-vectorize Math.round API

2022-01-14 Thread Jatin Bhateja
Summary of changes:
- Intrinsify Math.round(float) and Math.round(double) APIs.
- Extend auto-vectorizer to infer vector operations on encountering scalar IR 
nodes for above intrinsics.
- Test creation using new IR testing framework.

Following are the performance numbers of a JMH micro-benchmark included with 
the patch:

Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)

Benchmark | ARRAYLEN | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain (opt/baseline) | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain (opt/baseline)
-- | -- | -- | -- | -- | -- | -- | --
FpRoundingBenchmark.test_round_double | 1024 | 518.532 | 1364.066 | 2.630630318 | 512.908 | 4292.11 | 8.368186887
FpRoundingBenchmark.test_round_double | 2048 | 270.137 | 830.986 | 3.076165057 | 273.159 | 2459.116 | 9.002507697
FpRoundingBenchmark.test_round_float | 1024 | 752.436 | 7780.905 | 10.34095259 | 752.49 | 9506.694 | 12.63364829
FpRoundingBenchmark.test_round_float | 2048 | 389.499 | 4113.046 | 10.55983712 | 389.63 | 4863.673 | 12.48279907

Kindly review and share your feedback.

Best Regards,
Jatin

-

Commit messages:
 - 8279508: Auto-vectorize Math.round API

Changes: https://git.openjdk.java.net/jdk/pull/7094/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7094&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8279508
  Stats: 409 lines in 22 files changed: 342 ins; 1 del; 66 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7094.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7094/head:pull/7094

PR: https://git.openjdk.java.net/jdk/pull/7094


Integrated: 8273322: Enhance macro logic optimization for masked logic operations.

2022-01-06 Thread Jatin Bhateja
On Mon, 20 Dec 2021 13:33:01 GMT, Jatin Bhateja  wrote:

> This patch extends the existing macro logic inference algorithm to handle 
> masked logic operations.
> 
> Existing algorithm:
> 
> 1. Identify logic cone roots.
> 2. Pack parent and logic child nodes into a MacroLogic node in a bottom-up 
> traversal if the input constraint is met, i.e. the maximum number of inputs 
> which a macro logic node can have is not exceeded.
> 3. Perform symbolic evaluation of the logic expression tree by assigning the 
> value corresponding to a truth table column to each input.
> 4. The inputs together with the encoded function represent a macro logic 
> node, which mimics a truth table.
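
The symbolic evaluation in steps 3 and 4 can be illustrated with a small sketch. It assumes the conventional truth-table columns for a three-input function (0xF0, 0xCC, 0xAA, the same convention as the x86 VPTERNLOG immediate); the class, constant and method names are hypothetical and are not taken from the patch.

```java
final class TruthTableSketch {
    // Truth-table columns assigned to the three inputs (assumed convention).
    static final int A = 0xF0;
    static final int B = 0xCC;
    static final int C = 0xAA;

    // Applies an 8-bit encoded ternary function bit by bit to 32-bit operands.
    static int apply(int func, int a, int b, int c) {
        int result = 0;
        for (int bit = 0; bit < 32; bit++) {
            // The three input bits select one row of the truth table.
            int row = (((a >>> bit) & 1) << 2)
                    | (((b >>> bit) & 1) << 1)
                    | ((c >>> bit) & 1);
            result |= ((func >>> row) & 1) << bit;
        }
        return result;
    }

    public static void main(String[] args) {
        // Evaluating (a & b) ^ c on the columns yields the encoded function.
        int func = ((A & B) ^ C) & 0xFF;
        int a = 0x12345678, b = 0x0F0F0F0F, c = 0xDEADBEEF;
        System.out.println("func = 0x" + Integer.toHexString(func));
        System.out.println(apply(func, a, b, c) == ((a & b) ^ c));
    }
}
```

Evaluating the expression once on the column constants gives the 8-bit function (0x6a here); the inputs plus that byte are then enough to reproduce the whole expression tree in a single ternary logic operation.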
> 
> Modification:
> Extended the packing algorithm to operate on both predicated and 
> non-predicated logic nodes. The following rules define the criteria under 
> which nodes get packed into a macro logic node:
> 
> 1. Parent and both child nodes are all unmasked or masked with the same 
> predicate.
> 2. A masked parent can be packed with its left child if the child is 
> predicated and both have the same predicate.
> 3. A masked parent can be packed with its right child if the child is 
> un-predicated or has a matching predication condition.
> 4. An unmasked parent can be packed with an unmasked child.
> 
> The new jtreg test case added with the patch exhaustively covers all the 
> different combinations of predication of parent and child nodes.
> 
> Following are the performance numbers for the JMH benchmark included with 
> the patch.
> 
> Machine Configuration:  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S 
> Icelake Server)
> 
> Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain ( 
> withopt/baseline)
> -- | -- | -- | -- | --
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 | 
> 2.171403315
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 
> 2.002547072
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 
> | 1.792558013
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 | 
> 1.882536419
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 
> 1.560787454
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 
> 2.022003377
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 
> 1.63814064
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 
> 1.384211046
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 
> 1.140933774
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 
> 1.121276084
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 
> 1.205791374
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 | 
> 1.087654397
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 
> | 1.002939661
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 
> 1.031267884
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 | 
> 1.030794717
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 
> | 3435.989 | 4418.09 | 1.285827749
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 
> | 1524.803 | 1678.201 | 1.100601848
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 1024 
> | 972.501 | 1166.734 | 1.199725244
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 
> | 5980.85 | 7584.17 | 1.268075608
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 
> | 3258.108 | 3939.23 | 1.209054457
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 1024 
> | 1475.365 | 1511.159 | 1.024261115
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 
> | 4208.766 | 4220.678 | 1.002830283
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 
> | 2056.651 | 2049.489 | 0.99651764
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 1024 
> | 1110.461 | 1116.448 | 1.005391455
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 256 
> | 3259.348 | 3947.94 | 1.211266793
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 512 
> | 1515.147 | 1536.647 | 1.014190042
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
> 1024 | 911.58 | 1030.54 | 1.130498695
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 256 
> | 2034.611 | 2073.764 | 1.019243482
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlend

Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v5]

2022-01-06 Thread Jatin Bhateja
.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
> 1024 | 559.269 | 559.651 | 1.000683034
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 256 
> | 3636.141 | 4446.505 | 1.222863745
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 512 
> | 1433.145 | 1681.261 | 1.173126934
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 1024 
> | 1000.107 | 1172.866 | 1.172740517
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 256 
> | 5568.313 | 7670.259 | 1.37748345
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 512 
> | 3350.108 | 3927.803 | 1.172440709
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 1024 
> | 1495.966 | 1541.56 | 1.030477965
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 256 
> | 4230.379 | 4282.154 | 1.012238856
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 512 
> | 2029.801 | 2049.638 | 1.009772879
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 1024 
> | 1108.738 | 1118.897 | 1.00916267
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 256 
> | 3802.801 | 3783.537 | 0.99493426
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 512 
> | 1546.244 | 1552.691 | 1.004169458
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 
> 1024 | 1017.512 | 1020.075 | 1.002518889
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128
>  | 256 | 4159.835 | 4527.676 | 1.088426825
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128
>  | 512 | 1665.335 | 1733.04 | 1.040655484
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128
>  | 1024 | 1150.319 | 1181.935 | 1.02748455
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256
>  | 256 | 6989.791 | 7382.883 | 1.056238019
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256
>  | 512 | 3711.362 | 3911.921 | 1.054039191
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256
>  | 1024 | 1540.341 | 1554.175 | 1.008981128
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512
>  | 256 | 4164.559 | 4213.546 | 1.01176283
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512
>  | 512 | 2072.91 | 2079.105 | 1.002988552
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512
>  | 1024 | 1112.678 | 1116.675 | 1.003592234
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256
>  | 256 | 3702.998 | 3906.093 | 1.0548461
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256
>  | 512 | 1536.571 | 1546.043 | 1.006164375
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256
>  | 1024 | 996.906 | 1013.649 | 1.016794964
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512
>  | 256 | 2045.594 | 2048.966 | 1.001648421
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512
>  | 512 | 1111.933 | 1117.689 | 1.005176571
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512
>  | 1024 | 559.971 | 561.144 | 1.002094751
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8273322: Adding missing randomness key.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6893/files
  - new: https://git.openjdk.java.net/jdk/pull/6893/files/f101fff7..2d196f71

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6893&range=04
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6893&range=03-04

  Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6893.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6893/head:pull/6893

PR: https://git.openjdk.java.net/jdk/pull/6893


Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v4]

2022-01-06 Thread Jatin Bhateja
On Thu, 6 Jan 2022 17:39:20 GMT, Sandhya Viswanathan  
wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8273322: Review comments resolution.
>
> test/hotspot/jtreg/compiler/vectorapi/TestMaskedMacroLogicVector.java line 26:
> 
>> 24: /**
>> 25:  * @test
>> 26:  * @bug 8273322
> 
> Needs  @key randomness as we use random number without a fixed seed here.
> Please see:
> https://openjdk.java.net/jtreg/faq.html#when-should-i-use-the-intermittent-or-randomness-keyword-in-a-test

DONE

-

PR: https://git.openjdk.java.net/jdk/pull/6893


Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v3]

2022-01-05 Thread Jatin Bhateja
On Tue, 4 Jan 2022 15:11:47 GMT, Jatin Bhateja  wrote:

>> This patch extends the existing macro logic inference algorithm to handle 
>> masked logic operations.
>> 
>> Existing algorithm:
>> 
>> 1. Identify logic cone roots.
>> 2. Pack parent and logic child nodes into a MacroLogic node in a bottom-up 
>> traversal if the input constraint is met, i.e. the maximum number of inputs 
>> which a macro logic node can have is not exceeded.
>> 3. Perform symbolic evaluation of the logic expression tree by assigning the 
>> value corresponding to a truth table column to each input.
>> 4. The inputs together with the encoded function represent a macro logic 
>> node, which mimics a truth table.
>> 
>> Modification:
>> Extended the packing algorithm to operate on both predicated and 
>> non-predicated logic nodes. The following rules define the criteria under 
>> which nodes get packed into a macro logic node:
>> 
>> 1. Parent and both child nodes are all unmasked or masked with the same 
>> predicate.
>> 2. A masked parent can be packed with its left child if the child is 
>> predicated and both have the same predicate.
>> 3. A masked parent can be packed with its right child if the child is 
>> un-predicated or has a matching predication condition.
>> 4. An unmasked parent can be packed with an unmasked child.
>> 
>> The new jtreg test case added with the patch exhaustively covers all the 
>> different combinations of predication of parent and child nodes.
>> 
>> Following are the performance numbers for the JMH benchmark included with 
>> the patch.
>> 
>> Machine Configuration:  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S 
>> Icelake Server)
>> 
>> Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain ( 
>> withopt/baseline)
>> -- | -- | -- | -- | --
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 
>> | 2.171403315
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 
>> 2.002547072
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 
>> | 1.792558013
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 
>> | 1.882536419
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 
>> 1.560787454
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 
>> 2.022003377
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 
>> 1.63814064
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 
>> 1.384211046
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 
>> 1.140933774
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 
>> 1.121276084
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 
>> 1.205791374
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 
>> | 1.087654397
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 
>> | 1.002939661
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 
>> 1.031267884
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 
>> | 1.030794717
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 
>> | 3435.989 | 4418.09 | 1.285827749
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 
>> | 1524.803 | 1678.201 | 1.100601848
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 
>> 1024 | 972.501 | 1166.734 | 1.199725244
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 
>> | 5980.85 | 7584.17 | 1.268075608
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 
>> | 3258.108 | 3939.23 | 1.209054457
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 
>> 1024 | 1475.365 | 1511.159 | 1.024261115
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 
>> | 4208.766 | 4220.678 | 1.002830283
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 
>> | 2056.651 | 2049.489 | 0.99651764
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 
>> 1024 | 1110.461 | 1116.448 | 1.005391455
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
>> 256 | 3259.348 | 3947.94 | 1.211266793
>> o.o.b.jdk.incubator.vector.MaskedLogicOp

Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v4]

2022-01-05 Thread Jatin Bhateja
.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
> 1024 | 559.269 | 559.651 | 1.000683034
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 256 
> | 3636.141 | 4446.505 | 1.222863745
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 512 
> | 1433.145 | 1681.261 | 1.173126934
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 1024 
> | 1000.107 | 1172.866 | 1.172740517
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 256 
> | 5568.313 | 7670.259 | 1.37748345
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 512 
> | 3350.108 | 3927.803 | 1.172440709
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 1024 
> | 1495.966 | 1541.56 | 1.030477965
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 256 
> | 4230.379 | 4282.154 | 1.012238856
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 512 
> | 2029.801 | 2049.638 | 1.009772879
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 1024 
> | 1108.738 | 1118.897 | 1.00916267
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 256 
> | 3802.801 | 3783.537 | 0.99493426
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 512 
> | 1546.244 | 1552.691 | 1.004169458
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 
> 1024 | 1017.512 | 1020.075 | 1.002518889
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128
>  | 256 | 4159.835 | 4527.676 | 1.088426825
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128
>  | 512 | 1665.335 | 1733.04 | 1.040655484
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128
>  | 1024 | 1150.319 | 1181.935 | 1.02748455
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256
>  | 256 | 6989.791 | 7382.883 | 1.056238019
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256
>  | 512 | 3711.362 | 3911.921 | 1.054039191
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256
>  | 1024 | 1540.341 | 1554.175 | 1.008981128
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512
>  | 256 | 4164.559 | 4213.546 | 1.01176283
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512
>  | 512 | 2072.91 | 2079.105 | 1.002988552
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512
>  | 1024 | 1112.678 | 1116.675 | 1.003592234
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256
>  | 256 | 3702.998 | 3906.093 | 1.0548461
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256
>  | 512 | 1536.571 | 1546.043 | 1.006164375
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256
>  | 1024 | 996.906 | 1013.649 | 1.016794964
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512
>  | 256 | 2045.594 | 2048.966 | 1.001648421
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512
>  | 512 | 1111.933 | 1117.689 | 1.005176571
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512
>  | 1024 | 559.971 | 561.144 | 1.002094751
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8273322: Review comments resolution.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6893/files
  - new: https://git.openjdk.java.net/jdk/pull/6893/files/d18f504f..f101fff7

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6893&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6893&range=02-03

  Stats: 15 lines in 4 files changed: 1 ins; 4 del; 10 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6893.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6893/head:pull/6893

PR: https://git.openjdk.java.net/jdk/pull/6893


Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v2]

2022-01-05 Thread Jatin Bhateja
On Tue, 4 Jan 2022 02:25:36 GMT, Vladimir Kozlov  wrote:

> I think the whole "Bitwise operation packing optimization" code should be moved 
> out of `compile.cpp`, maybe to `vectornode.cpp`, where the `MacroLogicVNode` 
> code is located.
> 
Hi @vnkozlov ,
Yes, we could also extend the AndV/OrV/XorV/AndVMask/OrVMask/XorVMask 
idealizations to perform macro logic folding, but the current changes keep the 
implementation clean and limited to one optimization stage.

> Copyright year should be updated to 2022 in all changed files.

-

PR: https://git.openjdk.java.net/jdk/pull/6893


Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v3]

2022-01-04 Thread Jatin Bhateja
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
> 1024 | 559.269 | 559.651 | 1.000683034
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 256 
> | 3636.141 | 4446.505 | 1.222863745
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 512 
> | 1433.145 | 1681.261 | 1.173126934
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 1024 
> | 1000.107 | 1172.866 | 1.172740517
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 256 
> | 5568.313 | 7670.259 | 1.37748345
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 512 
> | 3350.108 | 3927.803 | 1.172440709
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 1024 
> | 1495.966 | 1541.56 | 1.030477965
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 256 
> | 4230.379 | 4282.154 | 1.012238856
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 512 
> | 2029.801 | 2049.638 | 1.009772879
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 1024 
> | 1108.738 | 1118.897 | 1.00916267
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 256 
> | 3802.801 | 3783.537 | 0.99493426
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 512 
> | 1546.244 | 1552.691 | 1.004169458
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 
> 1024 | 1017.512 | 1020.075 | 1.002518889
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128
>  | 256 | 4159.835 | 4527.676 | 1.088426825
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128
>  | 512 | 1665.335 | 1733.04 | 1.040655484
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128
>  | 1024 | 1150.319 | 1181.935 | 1.02748455
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256
>  | 256 | 6989.791 | 7382.883 | 1.056238019
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256
>  | 512 | 3711.362 | 3911.921 | 1.054039191
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256
>  | 1024 | 1540.341 | 1554.175 | 1.008981128
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512
>  | 256 | 4164.559 | 4213.546 | 1.01176283
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512
>  | 512 | 2072.91 | 2079.105 | 1.002988552
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512
>  | 1024 | 1112.678 | 1116.675 | 1.003592234
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256
>  | 256 | 3702.998 | 3906.093 | 1.0548461
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256
>  | 512 | 1536.571 | 1546.043 | 1.006164375
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256
>  | 1024 | 996.906 | 1013.649 | 1.016794964
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512
>  | 256 | 2045.594 | 2048.966 | 1.001648421
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512
>  | 512 | 1111.933 | 1117.689 | 1.005176571
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512
>  | 1024 | 559.971 | 561.144 | 1.002094751
> 
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8273322: Updating copywrite header.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6893/files
  - new: https://git.openjdk.java.net/jdk/pull/6893/files/f8120acb..d18f504f

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6893&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6893&range=01-02

  Stats: 12 lines in 12 files changed: 0 ins; 0 del; 12 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6893.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6893/head:pull/6893

PR: https://git.openjdk.java.net/jdk/pull/6893


Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v2]

2022-01-04 Thread Jatin Bhateja
On Tue, 4 Jan 2022 02:21:35 GMT, Vladimir Kozlov  wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a 
>> merge or a rebase. The incremental webrev excludes the unrelated changes 
>> brought in by the merge/rebase. The pull request contains two additional 
>> commits since the last revision:
>> 
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8273322
>>  - 8273322: Enhance macro logic optimization for masked logic operations.
>
> src/hotspot/cpu/x86/x86.ad line 1900:
> 
>> 1898: 
>> 1899: case Op_MacroLogicV:
>> 1900:   if(bt != T_INT && bt != T_LONG) {
> 
> Missing `VM_Version::supports_evex()` check?

Hi @vnkozlov, we already have that check (UseAVX < 3) in the 
match_rule_supported routine, which gets called from this function.

-

PR: https://git.openjdk.java.net/jdk/pull/6893


Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v2]

2022-01-03 Thread Jatin Bhateja

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains two additional 
commits since the last revision:

 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8273322
 - 8273322: Enhance macro logic optimization for masked logic operations.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6893/files
  - new: https://git.openjdk.java.net/jdk/pull/6893/files/b14079e9..f8120acb

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6893&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6893&range=00-01

  Stats: 6814 lines in 274 files changed: 5024 ins; 944 del; 846 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6893.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6893/head:pull/6893

PR: https://git.openjdk.java.net/jdk/pull/6893


RFR: 8273322: Enhance macro logic optimization for masked logic operations.

2021-12-20 Thread Jatin Bhateja
This patch extends the existing macro logic inferencing algorithm to handle 
masked logic operations.

Existing algorithm:

1. Identify logic cone roots.
2. Pack parent and child logic nodes into a MacroLogic node in a bottom-up 
traversal if the input constraint is met,
i.e. the maximum number of inputs which a macro logic node can have.
3. Perform symbolic evaluation of the logic expression tree by assigning to 
each input the value corresponding to a truth table column.
4. The inputs, together with the encoded function, represent a macro logic node 
which mimics a truth table.
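
As an illustration of steps 3 and 4, the sketch below encodes a three-input 
logic expression as an 8-bit truth table by evaluating it on the column 
constants 0xF0, 0xCC and 0xAA, which is the immediate encoding used by the x86 
`vpternlog` instruction. The names `TruthTableSketch`, `Logic3`, `encode` and 
`lookup` are hypothetical; this is a standalone sketch, not the HotSpot code.

```java
final class TruthTableSketch {
    // Truth table columns for inputs a, b, c. Bit number (a<<2 | b<<1 | c)
    // of each constant holds the value of that input in that table row.
    static final int A = 0xF0;
    static final int B = 0xCC;
    static final int C = 0xAA;

    // Hypothetical functional interface for a three-input bitwise expression.
    interface Logic3 {
        int apply(int a, int b, int c);
    }

    // Symbolic evaluation: run the expression once on the column constants.
    // The low 8 bits of the result are the encoded function.
    static int encode(Logic3 f) {
        return f.apply(A, B, C) & 0xFF;
    }

    // Look up the result for single-bit inputs a, b, c in the encoded table.
    static int lookup(int func, int a, int b, int c) {
        int row = (a << 2) | (b << 1) | c;
        return (func >> row) & 1;
    }

    public static void main(String[] args) {
        int andOr = encode((a, b, c) -> (a & b) | c);
        int blend = encode((a, b, c) -> (a & c) | (b & ~c));
        System.out.printf("0x%02X%n", andOr); // prints 0xEA
        System.out.printf("0x%02X%n", blend); // prints 0xE4
    }
}
```

A whole cone of And/Or/Xor nodes over at most three inputs collapses into the 
three inputs plus this one 8-bit constant.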

Modification:
The packing algorithm is extended to operate on both predicated and 
non-predicated logic nodes. The following
rules define the criteria under which nodes get packed into a macro logic 
node:

1. Parent and both child nodes are all unmasked or all masked with the same 
predicate.
2. A masked parent can be packed with its left child if the child is 
predicated and both have the same predicate.
3. A masked parent can be packed with its right child if the child is 
unpredicated or has a matching predication condition.
4. An unmasked parent can be packed with an unmasked child.
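
One literal reading of rules 2 to 4 can be sketched as two per-child checks. 
This is an assumption-laden illustration: masks are modelled as plain strings 
(with `null` meaning unmasked), and `PackingRulesSketch`, `canPackLeft` and 
`canPackRight` are hypothetical names that do not appear in the patch.

```java
final class PackingRulesSketch {
    // A mask is modelled by its name; null means the node is unmasked.
    static boolean sameMask(String m1, String m2) {
        return m1 != null && m1.equals(m2);
    }

    // Rule 4: an unmasked parent packs only with an unmasked child.
    // Rule 2: a masked parent packs with its left child only if the child
    //         is masked with the same predicate.
    static boolean canPackLeft(String parentMask, String leftMask) {
        if (parentMask == null) {
            return leftMask == null;
        }
        return sameMask(parentMask, leftMask);
    }

    // Rule 4 as above for an unmasked parent.
    // Rule 3: a masked parent packs with its right child if the child is
    //         unmasked or masked with the same predicate.
    static boolean canPackRight(String parentMask, String rightMask) {
        if (parentMask == null) {
            return rightMask == null;
        }
        return rightMask == null || sameMask(parentMask, rightMask);
    }

    public static void main(String[] args) {
        System.out.println(canPackLeft("k1", "k1"));  // same predicate
        System.out.println(canPackLeft("k1", null));  // unmasked left child
        System.out.println(canPackRight("k1", null)); // unmasked right child
    }
}
```

Rule 1 is the case where both checks succeed with identical predication on all 
three nodes.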

A new jtreg test case added with the patch exhaustively covers the different 
combinations of predication of parent and child nodes.

The following are the performance numbers for the JMH benchmarks included with 
the patch.

Machine Configuration:  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S 
Icelake Server)

Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain (withopt/baseline)
-- | -- | -- | -- | --
o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 | 2.171403315
o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 2.002547072
o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 | 1.792558013
o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 | 1.882536419
o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 1.560787454
o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 2.022003377
o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 1.63814064
o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 1.384211046
o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 1.140933774
o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 1.121276084
o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 1.205791374
o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 | 1.087654397
o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 | 1.002939661
o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 1.031267884
o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 | 1.030794717
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 | 3435.989 | 4418.09 | 1.285827749
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 | 1524.803 | 1678.201 | 1.100601848
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 1024 | 972.501 | 1166.734 | 1.199725244
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 | 5980.85 | 7584.17 | 1.268075608
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 | 3258.108 | 3939.23 | 1.209054457
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 1024 | 1475.365 | 1511.159 | 1.024261115
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 | 4208.766 | 4220.678 | 1.002830283
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 | 2056.651 | 2049.489 | 0.99651764
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 1024 | 1110.461 | 1116.448 | 1.005391455
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 256 | 3259.348 | 3947.94 | 1.211266793
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 512 | 1515.147 | 1536.647 | 1.014190042
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 1024 | 911.58 | 1030.54 | 1.130498695
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 256 | 2034.611 | 2073.764 | 1.019243482
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 512 | 1110.659 | 1116.093 | 1.004892591
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 1024 | 559.269 | 559.651 | 1.000683034
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 256 | 3636.141 | 4446.505 | 1.222863745
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 512 | 1433.145 | 1681.261 | 1.173126934
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 1024 | 1000.107 | 1172.866 | 1.172740517

Re: RFR: 8271368: [BACKOUT] JDK-8266054 VectorAPI rotate operation optimization

2021-07-28 Thread Jatin Bhateja
On Wed, 28 Jul 2021 05:35:59 GMT, Vladimir Kozlov  wrote:

> Backout the following changes due to vector tests failures in tier 2 and 
> later: 
> [JDK-8266054](https://bugs.openjdk.java.net/browse/JDK-8266054) VectorAPI 
> rotate operation optimization 
> 
> Changes also caused copyright header validation failure in Tier1 due to 
> missing `,` after copyright year in new test.
> 
> Currently running testing.

Thanks for reporting it. Would it be OK to move those tests to 
ProblemList.txt and let me fix this as a follow-up issue instead of a revert?

-

PR: https://git.openjdk.java.net/jdk/pull/4915


Integrated: 8266054: VectorAPI rotate operation optimization

2021-07-27 Thread Jatin Bhateja
On Tue, 27 Apr 2021 17:56:04 GMT, Jatin Bhateja  wrote:

> The current VectorAPI Java-side implementation expresses the rotateLeft and 
> rotateRight operations using the following operations:
> 
> vec1 = lanewise(VectorOperators.LSHL, n)
> vec2 = lanewise(VectorOperators.LSHR, n)
> res = lanewise(VectorOperators.OR, vec1 , vec2)
> 
> This patch moves the above handling from the Java side to the C2 compiler, 
> which facilitates dismantling the rotate operation if the target ISA does not 
> support a direct rotate instruction.
> 
> AVX512 added vector rotate instructions vpro[rl][v][dq], which operate over 
> long and integer type vectors. For other cases (i.e. sub-word type vectors or 
> targets which do not support direct rotate operations), an instruction 
> sequence comprising vector SHIFT (LEFT/RIGHT) and vector OR is emitted.
> 
> Please find below the performance data for included JMH benchmark.
> Machine:  Cascade Lake Server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz)
> 
> 
> Benchmark | (bits) | (shift) | (size) | Baseline Score (ops/ms) | With Opts 
> (ops/ms) | Gain
> -- | -- | -- | -- | -- | -- | --
> RotateBenchmark.testRotateLeftB | 128 | 7 | 256 | 3939.136 | 3836.133 | 
> 0.973851372
> RotateBenchmark.testRotateLeftB | 128 | 7 | 512 | 1984.231 | 1918.27 | 
> 0.966757399
> RotateBenchmark.testRotateLeftB | 128 | 15 | 256 | 3925.165 | 4043.842 | 
> 1.030234907
> RotateBenchmark.testRotateLeftB | 128 | 15 | 512 | 1962.723 | 1936.551 | 
> 0.986665464
> RotateBenchmark.testRotateLeftB | 128 | 31 | 256 | 3945.6 | 3817.883 | 
> 0.967630525
> RotateBenchmark.testRotateLeftB | 128 | 31 | 512 | 1944.458 | 1914.229 | 
> 0.984453766
> RotateBenchmark.testRotateLeftB | 256 | 7 | 256 | 4612.149 | 4514.874 | 
> 0.978908964
> RotateBenchmark.testRotateLeftB | 256 | 7 | 512 | 2296.252 | 2270.237 | 
> 0.988670669
> RotateBenchmark.testRotateLeftB | 256 | 15 | 256 | 4576.628 | 4515.53 | 
> 0.986649996
> RotateBenchmark.testRotateLeftB | 256 | 15 | 512 | 2288.278 | 2270.923 | 
> 0.992415694
> RotateBenchmark.testRotateLeftB | 256 | 31 | 256 | 4624.243 | 4511.46 | 
> 0.975610495
> RotateBenchmark.testRotateLeftB | 256 | 31 | 512 | 2305.459 | 2273.788 | 
> 0.986262605
> RotateBenchmark.testRotateLeftB | 512 | 7 | 256 | 7748.283 | 7777.105 | 
> 1.003719792
> RotateBenchmark.testRotateLeftB | 512 | 7 | 512 | 3906.214 | 3912.647 | 
> 1.001646863
> RotateBenchmark.testRotateLeftB | 512 | 15 | 256 | 7764.653 | 7763.482 | 
> 0.999849188
> RotateBenchmark.testRotateLeftB | 512 | 15 | 512 | 3916.061 | 3919.363 | 
> 1.000843194
> RotateBenchmark.testRotateLeftB | 512 | 31 | 256 | 7779.754 | 7770.239 | 
> 0.998776954
> RotateBenchmark.testRotateLeftB | 512 | 31 | 512 | 3916.471 | 3912.718 | 
> 0.999041739
> RotateBenchmark.testRotateLeftI | 128 | 7 | 256 | 4043.39 | 13461.814 | 
> 3.329338501
> RotateBenchmark.testRotateLeftI | 128 | 7 | 512 | 1996.217 | 6455.425 | 
> 3.233829288
> RotateBenchmark.testRotateLeftI | 128 | 15 | 256 | 4028.614 | 13077.277 | 
> 3.246098286
> RotateBenchmark.testRotateLeftI | 128 | 15 | 512 | 1997.612 | 6452.918 | 
> 3.230315997
> RotateBenchmark.testRotateLeftI | 128 | 31 | 256 | 4123.357 | 13079.045 | 
> 3.171940969
> RotateBenchmark.testRotateLeftI | 128 | 31 | 512 | 2003.356 | 6452.716 | 
> 3.22095324
> RotateBenchmark.testRotateLeftI | 256 | 7 | 256 | 7666.949 | 25658.625 | 
> 3.34665393
> RotateBenchmark.testRotateLeftI | 256 | 7 | 512 | 3855.826 | 12278.106 | 
> 3.18429981
> RotateBenchmark.testRotateLeftI | 256 | 15 | 256 | 7670.901 | 24625.466 | 
> 3.210244272
> RotateBenchmark.testRotateLeftI | 256 | 15 | 512 | 3765.786 | 12272.771 | 
> 3.259019764
> RotateBenchmark.testRotateLeftI | 256 | 31 | 256 | 7660.599 | 25678.864 | 
> 3.352069988
> RotateBenchmark.testRotateLeftI | 256 | 31 | 512 | 3773.401 | 12006.469 | 
> 3.181869353
> RotateBenchmark.testRotateLeftI | 512 | 7 | 256 | 11900.948 | 31242.989 | 
> 2.625252123
> RotateBenchmark.testRotateLeftI | 512 | 7 | 512 | 5830.878 | 15727.149 | 
> 2.697217983
> RotateBenchmark.testRotateLeftI | 512 | 15 | 256 | 12171.847 | 33180.067 | 
> 2.72596813
> RotateBenchmark.testRotateLeftI | 512 | 15 | 512 | 5830.544 | 16740.182 | 
> 2.871118372
> RotateBenchmark.testRotateLeftI | 512 | 31 | 256 | 11909.553 

Re: RFR: 8266054: VectorAPI rotate operation optimization [v13]

2021-07-27 Thread Jatin Bhateja
On Tue, 27 Jul 2021 00:24:52 GMT, Sandhya Viswanathan 
 wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 19 commits:
>> 
>>  - 8266054: Re-designing benchmark to remove noise.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8266054
>>  - 8266054: Formal argument name change to be more appropriate.
>>  - 8266054: Review comments resolution.
>>  - 8266054: Incorporating styling changes based on reviews.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge branch 'JDK-8266054' of http://github.com/jatin-bhateja/jdk into 
>> JDK-8266054
>>  - ... and 9 more: 
>> https://git.openjdk.java.net/jdk/compare/a8f15427...b20404e2
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1598:
> 
>> 1596:   cnt = elem_bt == T_LONG ? gvn().transform(new ConvI2LNode(cnt)) 
>> : cnt;
>> 1597:   opd2 = gvn().transform(VectorNode::scalar2vector(cnt, num_elem, 
>> type_bt));
>> 1598: } else {
> 
> Why conversion for only T_LONG and not for T_BYTE and T_SHORT? Is there an 
> assumption here that only T_INT and T_LONG elem_bt are supported?

Correcting this: I2L may be needed in the auto-vectorization flow, since the 
Integer/Long.rotate[Right/Left] APIs accept only an int-typed shift, so for 
Long.rotate* operations the int shift value must be converted to long using 
I2L before broadcasting it. For VectorAPI lanewise operations between a vector 
and a scalar, the scalar type already matches the vector type. Since the 
degeneration routine is common to both flows, IR consistency is maintained 
here.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-27 Thread Jatin Bhateja
On Tue, 27 Jul 2021 02:52:13 GMT, Eric Liu  wrote:

>> @sviswa7, SLP flow will either have a constant 8bit shift value or a 
>> variable shift present in vector, this also include broadcasted non-constant 
>> shift value or a shift value beyond 8 bit.
>
> It would be better to comment here, since the correctness relies on some others.

@theRealELiu, @sviswa7, the comment already exists in the code. I guess I 
stated it incorrectly earlier in this thread; I have rectified my comments.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v13]

2021-07-27 Thread Jatin Bhateja
On Tue, 27 Jul 2021 01:54:01 GMT, Sandhya Viswanathan 
 wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 19 commits:
>> 
>>  - 8266054: Re-designing benchmark to remove noise.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8266054
>>  - 8266054: Formal argument name change to be more appropriate.
>>  - 8266054: Review comments resolution.
>>  - 8266054: Incorporating styling changes based on reviews.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge branch 'JDK-8266054' of http://github.com/jatin-bhateja/jdk into 
>> JDK-8266054
>>  - ... and 9 more: 
>> https://git.openjdk.java.net/jdk/compare/a8f15427...b20404e2
>
> src/hotspot/share/opto/vectornode.cpp line 1199:
> 
>> 1197:  
>> (Node*)(phase->intcon(shift_mask + 1));
>> 1198: Node* vector_mask = 
>> phase->transform(VectorNode::scalar2vector(shift_mask_node,vlen, elem_ty));
>> 1199: int subVopc = VectorNode::opcode((bt == T_LONG) ? Op_SubL : 
>> Op_SubI, bt);
> 
> There seems to be an assumption here that the vector type is INT or LONG only 
> and not subword type. From Vector API you can get the sub word types as well.
> Also if this path is coming from auto-vectorizer, don't we need masking here?

The subtype is passed to VectorNode::opcode for correct opcode selection. 
Also, shift_mask_node is a constant value node, so there is no assumption on 
the vector type. Wrap-around (masking) of the shift value may not be needed 
here, since we are degenerating the rotate into shifts (logical left and right).
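
To make the shift-based degeneration concrete, here is a scalar Java sketch of 
a rotate expressed as two logical shifts and an OR. It only illustrates the 
arithmetic and is not the C2 IR; the class and method names (`RotateAsShifts`, 
`rotateLeftByte`) are hypothetical. For `int`, Java shift operators use only 
the low five bits of the shift count, so `-n` behaves like `32 - n`. The 
sub-word case has no such implicit wrap, so the sketch masks the count 
explicitly.

```java
final class RotateAsShifts {
    // int rotate left as shift-left OR logical-shift-right.
    static int rotateLeft(int x, int n) {
        return (x << n) | (x >>> -n);
    }

    // int rotate right, the mirror image of rotateLeft.
    static int rotateRight(int x, int n) {
        return (x >>> n) | (x << -n);
    }

    // Sub-word example: rotate an 8-bit lane held in a byte.
    static byte rotateLeftByte(byte x, int n) {
        int s = n & 7;    // wrap the shift count to the lane width
        int v = x & 0xFF; // zero-extend so the right shift is logical
        return (byte) ((v << s) | (v >>> (8 - s)));
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        System.out.println(rotateLeft(x, 8) == Integer.rotateLeft(x, 8));   // true
        System.out.println(rotateRight(x, 8) == Integer.rotateRight(x, 8)); // true
        System.out.println(rotateLeftByte((byte) 0x81, 1));                 // 3
    }
}
```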

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-26 Thread Jatin Bhateja
On Mon, 26 Jul 2021 17:19:07 GMT, Sandhya Viswanathan 
 wrote:

>> And'ing with shift_mask is already done in the Java API side implementation 
>> before making a call to the intrinsic routine.
>
> @jatin-bhateja  This question is still pending.

@sviswa7, the SLP flow will have either a constant 8-bit shift value or a 
variable shift present in a vector, so the non-constant scalar case will not 
be hit through this route.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-26 Thread Jatin Bhateja
On Mon, 26 Jul 2021 17:19:07 GMT, Sandhya Viswanathan 
 wrote:

>> And'ing with shift_mask is already done in the Java API side implementation 
>> before making a call to the intrinsic routine.
>
> @jatin-bhateja  This question is still pending.

Other than the VectorAPI, SLP also infers vector rotates where the shift is 
either an 8-bit constant or a variable shift present in a vector. So this case 
of a scalar non-constant shift will not be hit for the non-VectorAPI case.
Also, it would be illegal to perform any wrap-around for shifts.
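
For illustration, the two loop shapes described above might look like the 
following scalar Java code. This is only a sketch of candidate input for the 
auto-vectorizer: whether C2 actually turns these loops into vector rotates 
depends on the platform and JVM flags, and the names `SlpRotateLoops`, 
`rotateByConstant` and `rotateByLaneShifts` are hypothetical.

```java
final class SlpRotateLoops {
    // Shape 1: every element is rotated by the same small constant.
    static void rotateByConstant(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.rotateLeft(src[i], 7);
        }
    }

    // Shape 2: each element has its own shift, loaded from a second array,
    // so the shift amounts would already sit in a vector after packing.
    static void rotateByLaneShifts(int[] src, int[] shifts, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.rotateLeft(src[i], shifts[i]);
        }
    }

    public static void main(String[] args) {
        int[] src = {1, 0x80000000};
        int[] dst = new int[src.length];
        rotateByConstant(src, dst);
        System.out.println(dst[0] + " " + dst[1]); // prints 128 64
    }
}
```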

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v12]

2021-07-18 Thread Jatin Bhateja
4.15 | 177.36
> RotateBenchmark.testRotateRightL | 512.00 | 31.00 | 8157.94 | 8466.90 | 3.79 
> | 450.26 | 1221.90 | 171.37
> RotateBenchmark.testRotateRightL | 512.00 | 31.00 | 4039.74 | 4283.33 | 6.03 
> | 224.82 | 612.68 | 172.53
> RotateBenchmark.testRotateRightL | 512.00 | 31.00 | 2066.88 | 2147.51 | 3.90 
> | 110.97 | 303.43 | 173.42
> RotateBenchmark.testRotateRightS | 128.00 | 7.00 | 13548.39 | 13245.87 | 
> -2.23 | 13490.93 | 13084.76 | -3.01
> RotateBenchmark.testRotateRightS | 128.00 | 7.00 | 7020.16 | 6768.85 | -3.58 
> | 6991.39 | 7044.32 | 0.76
> RotateBenchmark.testRotateRightS | 128.00 | 7.00 | 3550.50 | 3505.19 | -1.28 
> | 3507.12 | 3612.86 | 3.01
> RotateBenchmark.testRotateRightS | 128.00 | 15.00 | 13743.43 | 13325.44 | 
> -3.04 | 13696.15 | 13255.80 | -3.22
> RotateBenchmark.testRotateRightS | 128.00 | 15.00 | 6856.02 | 6969.18 | 1.65 
> | 6886.29 | 6834.12 | -0.76
> RotateBenchmark.testRotateRightS | 128.00 | 15.00 | 3569.53 | 3492.76 | -2.15 
> | 3539.02 | 3470.02 | -1.95
> RotateBenchmark.testRotateRightS | 128.00 | 31.00 | 13704.18 | 13495.07 | 
> -1.53 | 13649.14 | 13583.87 | -0.48
> RotateBenchmark.testRotateRightS | 128.00 | 31.00 | 7011.77 | 6953.93 | -0.82 
> | 6978.28 | 6740.30 | -3.41
> RotateBenchmark.testRotateRightS | 128.00 | 31.00 | 3591.62 | 3620.12 | 0.79 
> | 3502.04 | 3510.05 | 0.23
> RotateBenchmark.testRotateRightS | 256.00 | 7.00 | 21950.71 | 22113.60 | 0.74 
> | 21484.27 | 21596.64 | 0.52
> RotateBenchmark.testRotateRightS | 256.00 | 7.00 | 11616.88 | 11099.73 | 
> -4.45 | 11188.29 | 10737.68 | -4.03
> RotateBenchmark.testRotateRightS | 256.00 | 7.00 | 5872.72 | 5579.12 | -5.00 
> | 5784.05 | 5454.57 | -5.70
> RotateBenchmark.testRotateRightS | 256.00 | 15.00 | 22017.83 | 20817.97 | 
> -5.45 | 21934.65 | 21356.90 | -2.63
> RotateBenchmark.testRotateRightS | 256.00 | 15.00 | 11414.27 | 11044.86 | 
> -3.24 | 11454.35 | 11140.34 | -2.74
> RotateBenchmark.testRotateRightS | 256.00 | 15.00 | 5786.64 | 5634.05 | -2.64 
> | 5724.93 | 5639.99 | -1.48
> RotateBenchmark.testRotateRightS | 256.00 | 31.00 | 21754.77 | 21466.01 | 
> -1.33 | 21140.67 | 21970.03 | 3.92
> RotateBenchmark.testRotateRightS | 256.00 | 31.00 | 11676.46 | 11358.64 | 
> -2.72 | 11204.90 | 11213.48 | 0.08
> RotateBenchmark.testRotateRightS | 256.00 | 31.00 | 5728.20 | 5772.49 | 0.77 
> | 5594.33 | 5544.25 | -0.90
> RotateBenchmark.testRotateRightS | 512.00 | 7.00 | 30247.03 | 30179.41 | 
> -0.22 | 1538.75 | 3975.82 | 158.38
> RotateBenchmark.testRotateRightS | 512.00 | 7.00 | 15988.73 | 15621.42 | 
> -2.30 | 776.04 | 1910.91 | 146.24
> RotateBenchmark.testRotateRightS | 512.00 | 7.00 | 8115.84 | 8025.28 | -1.12 
> | 389.12 | 984.46 | 152.99
> RotateBenchmark.testRotateRightS | 512.00 | 15.00 | 30110.91 | 30200.69 | 
> 0.30 | 1532.49 | 3983.77 | 159.95
> RotateBenchmark.testRotateRightS | 512.00 | 15.00 | 15957.90 | 15690.73 | 
> -1.67 | 774.90 | 1931.00 | 149.19
> RotateBenchmark.testRotateRightS | 512.00 | 15.00 | 8113.26 | 8037.93 | -0.93 
> | 391.90 | 965.53 | 146.37
> RotateBenchmark.testRotateRightS | 512.00 | 31.00 | 29816.97 | 29891.54 | 
> 0.25 | 1538.12 | 3881.93 | 152.38
> RotateBenchmark.testRotateRightS | 512.00 | 31.00 | 15405.95 | 15619.17 | 
> 1.38 | 762.49 | 1871.00 | 145.38
> RotateBenchmark.testRotateRightS | 512.00 | 31.00 | 7919.80 | 7957.35 | 0.47 
> | 393.63 | 972.49 | 147.06

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8266054: Formal argument name change to be more appropriate.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3720/files
  - new: https://git.openjdk.java.net/jdk/pull/3720/files/d26caa6a..51c930d7

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3720&range=11
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3720&range=10-11

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3720.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3720/head:pull/3720

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v11]

2021-07-18 Thread Jatin Bhateja
On Sun, 18 Jul 2021 20:28:34 GMT, Jatin Bhateja  wrote:

>> The current VectorAPI Java-side implementation expresses the rotateLeft and 
>> rotateRight operations using the following operations:
>> 
>> vec1 = lanewise(VectorOperators.LSHL, n)
>> vec2 = lanewise(VectorOperators.LSHR, n)
>> res = lanewise(VectorOperators.OR, vec1 , vec2)
>> 
>> This patch moves the above handling from the Java side to the C2 compiler, 
>> which facilitates dismantling the rotate operation if the target ISA does 
>> not support a direct rotate instruction.
>> 
>> AVX512 added vector rotate instructions vpro[rl][v][dq], which operate over 
>> long and integer type vectors. For other cases (i.e. sub-word type vectors 
>> or targets which do not support direct rotate operations), an instruction 
>> sequence comprising vector SHIFT (LEFT/RIGHT) and vector OR is emitted.
>> 
>> Please find below the performance data for included JMH benchmark.
>> Machine:  Cascade Lake Server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz)
>> 
>> 
>> Benchmark | (TESTSIZE) | Shift | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain % | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain %
>> -- | -- | -- | -- | -- | -- | -- | -- | --
>> RotateBenchmark.testRotateLeftB | 128.00 | 7.00 | 17223.35 | 17094.69 | -0.75 | 17008.32 | 17488.06 | 2.82
>> RotateBenchmark.testRotateLeftB | 128.00 | 7.00 | 8944.98 | 8811.34 | -1.49 | 8878.17 | 9218.68 | 3.84
>> RotateBenchmark.testRotateLeftB | 128.00 | 15.00 | 17195.75 | 17137.32 | -0.34 | 16789.01 | 17780.34 | 5.90
>> RotateBenchmark.testRotateLeftB | 128.00 | 15.00 | 9052.67 | 8838.60 | -2.36 | 8814.62 | 9206.01 | 4.44
>> RotateBenchmark.testRotateLeftB | 128.00 | 31.00 | 17100.19 | 16950.64 | -0.87 | 16827.73 | 17720.37 | 5.30
>> RotateBenchmark.testRotateLeftB | 128.00 | 31.00 | 9079.95 | 8471.26 | -6.70 | .44 | 9167.68 | 3.14
>> RotateBenchmark.testRotateLeftB | 256.00 | 7.00 | 21231.33 | 21513.08 | 1.33 | 21824.51 | 21479.48 | -1.58
>> RotateBenchmark.testRotateLeftB | 256.00 | 7.00 | 11103.62 | 11180.16 | 0.69 | 11173.67 | 11529.22 | 3.18
>> RotateBenchmark.testRotateLeftB | 256.00 | 15.00 | 21119.14 | 21552.04 | 2.05 | 21693.05 | 21915.37 | 1.02
>> RotateBenchmark.testRotateLeftB | 256.00 | 15.00 | 11048.68 | 11094.20 | 0.41 | 11049.90 | 11439.07 | 3.52
>> RotateBenchmark.testRotateLeftB | 256.00 | 31.00 | 21506.31 | 21391.41 | -0.53 | 21263.18 | 21986.29 | 3.40
>> RotateBenchmark.testRotateLeftB | 256.00 | 31.00 | 11056.12 | 11232.78 | 1.60 | 10941.59 | 11397.09 | 4.16
>> RotateBenchmark.testRotateLeftB | 512.00 | 7.00 | 17976.56 | 18180.85 | 1.14 | 1212.26 | 2533.34 | 108.98
>> RotateBenchmark.testRotateLeftB | 512.00 | 15.00 | 17553.70 | 18219.07 | 3.79 | 1256.73 | 2537.41 | 101.91
>> RotateBenchmark.testRotateLeftB | 512.00 | 31.00 | 17618.03 | 17738.15 | 0.68 | 1214.69 | 2533.83 | 108.60
>> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 7258.87 | 7468.88 | 2.89 | 7115.12 | 7117.26 | 0.03
>> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 3586.65 | 3950.85 | 10.15 | 3532.17 | 3595.80 | 1.80
>> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 1835.07 | 1999.68 | 8.97 | 1789.90 | 1819.93 | 1.68
>> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 7273.36 | 7410.91 | 1.89 | 7198.60 | 6994.79 | -2.83
>> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 3674.98 | 3926.27 | 6.84 | 3549.90 | 3755.09 | 5.78
>> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 1840.94 | 1882.25 | 2.24 | 1801.56 | 1872.89 | 3.96
>> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 7457.11 | 7361.48 | -1.28 | 6975.33 | 7385.94 | 5.89
>> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 3570.74 | 3929.30 | 10.04 | 3635.37 | 3736.67 | 2.79
>> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 1902.32 | 1960.46 | 3.06 | 1812.32 | 1813.88 | 0.09
>> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 11174.24 | 12044.52 | 7.79 | 11509.87 | 11273.44 | -2.05
>> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 5981.47 | 6073.70 | 1.54 | 5593.66 | 5661.93 | 1.22
>> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 2932.49 | 3069.54 | 4.67 | 2950.86 | 2892.42 | -1.98
>> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 11764.11 | 12098.63 | 2.84 | 11069.52 | 11476.93 | 3.68
>> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 5855.20 | 6080.40 | 3.85 | 5919.11 | 5607.04 | -5.27

Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-18 Thread Jatin Bhateja
On Fri, 16 Jul 2021 00:52:21 GMT, Sandhya Viswanathan wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 15 commits:
>> 
>>  - 8266054: Incorporating styling changes based on reviews.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge http://github.com/openjdk/jdk into JDK-8266054
>>  - Merge branch 'JDK-8266054' of http://github.com/jatin-bhateja/jdk into 
>> JDK-8266054
>>  - 8266054: Removing redundant teat templates.
>>  - 8266054: Code reorganization for efficient sharing of logic to check 
>> rotate operation support on a target platform.
>>  - 8266054: Removing redundant test templates.
>>  - 8266054: Review comments resolution.
>>  - ... and 5 more: 
>> https://git.openjdk.java.net/jdk/compare/07e90524...609c4143
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 84:
> 
>> 82: arch_supports_vector(Op_OrV, num_elem, elem_bt, VecMaskNotUsed)) {
>> 83:   is_supported = true;
>> 84: }
> 
> When check_bcast is set, is_supported could be false when replicate is not 
> supported. Is replicate not needed for shift+or sequence?

check_bcast is true only when the shift value is a non-constant scalar. In 
that case we need to check that the broadcast operation for the shift count is 
supported; in all other cases no broadcast is needed. A constant shift value is 
an optimized case, since AVX512 offers instructions which directly accept an 
8-bit immediate shift value.
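
To illustrate the distinction, here is a minimal scalar sketch in Java. It is not HotSpot code, and the class and method names are hypothetical. It models a runtime scalar count being replicated into every lane before a lane-wise rotate, which is the step a constant count encoded as an immediate would not need.

```java
import java.util.Arrays;

// Scalar model of a vector rotate with a runtime scalar count.
// All names here are hypothetical and only illustrate the idea.
final class BroadcastRotateSketch {

    // Models a vector broadcast: copy one scalar into every lane.
    static int[] broadcast(int count, int lanes) {
        int[] v = new int[lanes];
        Arrays.fill(v, count);
        return v;
    }

    // Models a variable rotate: each lane is rotated by its own count.
    static int[] rotateLeftLanes(int[] a, int[] counts) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = Integer.rotateLeft(a[i], counts[i]);
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 0x80000000, -1, 42};
        // The count is only known at run time, so it has to be broadcast.
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 7;
        int[] r = rotateLeftLanes(a, broadcast(n, a.length));
        System.out.println(Arrays.toString(r));
    }
}
```

With a compile-time constant count the broadcast array would be unnecessary, since the same count could be baked into the operation itself.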

> src/hotspot/share/opto/vectorIntrinsics.cpp line 86:
> 
>> 84: }
>> 85: return is_supported;
>> 86: }
> 
> Please add comments here why the Left/Right shift and Or opcodes are being 
> checked here. Also add comments why for left shift we are only checking for 
> int and long left shift opcodes whereas for right shift sub word opcodes are 
> being checked.

Both left and right shift opcodes are selected for all integral types 
(byte/short/int/long). VectorNode::opcode returns the granular left shift 
opcode based on the sub-type, i.e. elem_bt, in the case of LeftShiftI. I am 
re-organizing the code for better readability.
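
As a scalar illustration of why a left shift, a right shift and an OR are needed for every lane width, the sketch below builds a rotate-left from those three operations for byte, short, int and long. The helper names are hypothetical, and the counts are assumed to be already reduced to the range [0, lane size in bits). The zero-extension of sub-word values is a consequence of Java's int promotion, not a statement about the code C2 generates.

```java
// Rotate-left built only from shift-left, unsigned shift-right and or.
// Hypothetical helper names; counts are assumed to lie in [0, lane bits).
final class ShiftOrRotateSketch {

    static byte rotl(byte x, int s) {
        int u = x & 0xFF;                       // zero-extend the 8-bit lane
        return (byte) ((u << s) | (u >>> (Byte.SIZE - s)));
    }

    static short rotl(short x, int s) {
        int u = x & 0xFFFF;                     // zero-extend the 16-bit lane
        return (short) ((u << s) | (u >>> (Short.SIZE - s)));
    }

    static int rotl(int x, int s) {
        // For s == 0 Java reduces the shift count 32 to 0, so the result is x.
        return (x << s) | (x >>> (Integer.SIZE - s));
    }

    static long rotl(long x, int s) {
        // For s == 0 Java reduces the shift count 64 to 0, so the result is x.
        return (x << s) | (x >>> (Long.SIZE - s));
    }

    public static void main(String[] args) {
        System.out.println(rotl((byte) 0x81, 1));
        System.out.println(rotl(0x80000001, 4) == Integer.rotateLeft(0x80000001, 4));
    }
}
```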

> src/hotspot/share/opto/vectorIntrinsics.cpp line 338:
> 
>> 336: // TODO When mask usage is supported, VecMaskNotUsed needs to be VecMaskUseLoad.
>> 337: if ((sopc != 0) &&
>> 338: !arch_supports_vector(sopc, num_elem, elem_bt, is_vector_mask(vbox_klass) ? VecMaskUseAll : VecMaskNotUsed)) {
> 
> Could we not call arch_supports_vector_rotate from arch_supports_vector?

DONE

> src/hotspot/share/opto/vectorIntrinsics.cpp line 1563:
> 
>> 1561:  -0x80 <= cnt_type->get_con() && cnt_type->get_con() < 0x80;
>> 1562:   if (is_rotate) {
>> 1563: if (!arch_supports_vector_rotate(sopc, num_elem, elem_bt, !is_const_rotate)) {
> 
> What is the relationship between check_bcast and !is_const_rotate? Some 
> comments here on this would help.

A constant shift value is an optimized case, since AVX512 offers instructions 
which directly accept constant shifts in the range (-256, 255). Similar 
handling is done in SLP:
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L2493

But I feel this is a very x86-specific check in generic code, so I am moving 
the decision to a new target-specific matcher routine.
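
The relationship asked about can be sketched in Java as follows. This is a hypothetical model, not the HotSpot source. It assumes that the truncated line 1561 quoted above defines is_const_rotate as "the count is a constant and passes the range test shown there", and it takes from line 1563 that check_bcast is passed as !is_const_rotate. The range used is the one in the quoted line 1561 (-0x80 <= count < 0x80).

```java
// Hypothetical model of the constant-rotate decision in the quoted snippet.
final class ConstRotateSketch {

    // Same range test as the quoted line 1561: count fits a signed 8-bit value.
    static boolean fitsImm8(int count) {
        return -0x80 <= count && count < 0x80;
    }

    // Assumed meaning of is_const_rotate: a constant count that passes the test.
    static boolean isConstRotate(boolean isConstant, int count) {
        return isConstant && fitsImm8(count);
    }

    // Stand-in for check_bcast, passed as !is_const_rotate in line 1563.
    static boolean checkBcast(boolean isConstant, int count) {
        return !isConstRotate(isConstant, count);
    }

    public static void main(String[] args) {
        System.out.println(checkBcast(true, 5));    // constant, small count
        System.out.println(checkBcast(false, 5));   // runtime scalar count
    }
}
```

Under this reading, a broadcast check is requested for every count that is not a small compile-time constant, including constants that fall outside the tested range.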

> src/hotspot/share/opto/vectorIntrinsics.cpp line 1590:
> 
>> 1588:   opd2 = gvn().transform(VectorNode::scalar2vector(cnt, num_elem, type_bt));
>> 1589: } else {
>> 1590:   // constant shift.
> 
> Did you mean constant rotate here?

Yes.

> src/hotspot/share/opto/vectornode.cpp line 1180:
> 
>> 1178:   cnt = cnt->in(1);
>> 1179: }
>> 1180: shiftRCnt = cnt;
> 
> Why do we remove the And with mask here?

And'ing with shift_mask is already done in the Java API side implementation 
before making a call to the intrinsic routine.
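
A minimal scalar sketch of that division of work, with hypothetical names: the API-level method masks the count once, and the inner routine (standing in for the intrinsic) relies on the count already being in range and does not mask again. The mask value laneBits - 1 is an assumption that holds for power-of-two lane sizes.

```java
// Hypothetical model: mask the count once at the API level, not again inside.
final class ShiftMaskSketch {

    // Reduce an arbitrary count to [0, laneBits); laneBits must be a power of two.
    static int maskCount(int n, int laneBits) {
        return n & (laneBits - 1);
    }

    // Stand-in for the intrinsic: expects 0 <= s < 8 and does no masking itself.
    static byte rotlPremasked(byte x, int s) {
        int u = x & 0xFF;                       // zero-extend the 8-bit lane
        return (byte) ((u << s) | (u >>> (Byte.SIZE - s)));
    }

    // API-level entry point: mask first, then call the unmasked routine.
    static byte rotl(byte x, int n) {
        return rotlPremasked(x, maskCount(n, Byte.SIZE));
    }

    public static void main(String[] args) {
        // Counts 1 and 9 are equivalent for an 8-bit lane after masking.
        System.out.println(rotl((byte) 0x81, 1) == rotl((byte) 0x81, 9));
    }
}
```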

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v11]

2021-07-18 Thread Jatin Bhateja
4.15 | 177.36
> RotateBenchmark.testRotateRightL | 512.00 | 31.00 | 8157.94 | 8466.90 | 3.79 | 450.26 | 1221.90 | 171.37
> RotateBenchmark.testRotateRightL | 512.00 | 31.00 | 4039.74 | 4283.33 | 6.03 | 224.82 | 612.68 | 172.53
> RotateBenchmark.testRotateRightL | 512.00 | 31.00 | 2066.88 | 2147.51 | 3.90 | 110.97 | 303.43 | 173.42
> RotateBenchmark.testRotateRightS | 128.00 | 7.00 | 13548.39 | 13245.87 | -2.23 | 13490.93 | 13084.76 | -3.01
> RotateBenchmark.testRotateRightS | 128.00 | 7.00 | 7020.16 | 6768.85 | -3.58 | 6991.39 | 7044.32 | 0.76
> RotateBenchmark.testRotateRightS | 128.00 | 7.00 | 3550.50 | 3505.19 | -1.28 | 3507.12 | 3612.86 | 3.01
> RotateBenchmark.testRotateRightS | 128.00 | 15.00 | 13743.43 | 13325.44 | -3.04 | 13696.15 | 13255.80 | -3.22
> RotateBenchmark.testRotateRightS | 128.00 | 15.00 | 6856.02 | 6969.18 | 1.65 | 6886.29 | 6834.12 | -0.76
> RotateBenchmark.testRotateRightS | 128.00 | 15.00 | 3569.53 | 3492.76 | -2.15 | 3539.02 | 3470.02 | -1.95
> RotateBenchmark.testRotateRightS | 128.00 | 31.00 | 13704.18 | 13495.07 | -1.53 | 13649.14 | 13583.87 | -0.48
> RotateBenchmark.testRotateRightS | 128.00 | 31.00 | 7011.77 | 6953.93 | -0.82 | 6978.28 | 6740.30 | -3.41
> RotateBenchmark.testRotateRightS | 128.00 | 31.00 | 3591.62 | 3620.12 | 0.79 | 3502.04 | 3510.05 | 0.23
> RotateBenchmark.testRotateRightS | 256.00 | 7.00 | 21950.71 | 22113.60 | 0.74 | 21484.27 | 21596.64 | 0.52
> RotateBenchmark.testRotateRightS | 256.00 | 7.00 | 11616.88 | 11099.73 | -4.45 | 11188.29 | 10737.68 | -4.03
> RotateBenchmark.testRotateRightS | 256.00 | 7.00 | 5872.72 | 5579.12 | -5.00 | 5784.05 | 5454.57 | -5.70
> RotateBenchmark.testRotateRightS | 256.00 | 15.00 | 22017.83 | 20817.97 | -5.45 | 21934.65 | 21356.90 | -2.63
> RotateBenchmark.testRotateRightS | 256.00 | 15.00 | 11414.27 | 11044.86 | -3.24 | 11454.35 | 11140.34 | -2.74
> RotateBenchmark.testRotateRightS | 256.00 | 15.00 | 5786.64 | 5634.05 | -2.64 | 5724.93 | 5639.99 | -1.48
> RotateBenchmark.testRotateRightS | 256.00 | 31.00 | 21754.77 | 21466.01 | -1.33 | 21140.67 | 21970.03 | 3.92
> RotateBenchmark.testRotateRightS | 256.00 | 31.00 | 11676.46 | 11358.64 | -2.72 | 11204.90 | 11213.48 | 0.08
> RotateBenchmark.testRotateRightS | 256.00 | 31.00 | 5728.20 | 5772.49 | 0.77 | 5594.33 | 5544.25 | -0.90
> RotateBenchmark.testRotateRightS | 512.00 | 7.00 | 30247.03 | 30179.41 | -0.22 | 1538.75 | 3975.82 | 158.38
> RotateBenchmark.testRotateRightS | 512.00 | 7.00 | 15988.73 | 15621.42 | -2.30 | 776.04 | 1910.91 | 146.24
> RotateBenchmark.testRotateRightS | 512.00 | 7.00 | 8115.84 | 8025.28 | -1.12 | 389.12 | 984.46 | 152.99
> RotateBenchmark.testRotateRightS | 512.00 | 15.00 | 30110.91 | 30200.69 | 0.30 | 1532.49 | 3983.77 | 159.95
> RotateBenchmark.testRotateRightS | 512.00 | 15.00 | 15957.90 | 15690.73 | -1.67 | 774.90 | 1931.00 | 149.19
> RotateBenchmark.testRotateRightS | 512.00 | 15.00 | 8113.26 | 8037.93 | -0.93 | 391.90 | 965.53 | 146.37
> RotateBenchmark.testRotateRightS | 512.00 | 31.00 | 29816.97 | 29891.54 | 0.25 | 1538.12 | 3881.93 | 152.38
> RotateBenchmark.testRotateRightS | 512.00 | 31.00 | 15405.95 | 15619.17 | 1.38 | 762.49 | 1871.00 | 145.38
> RotateBenchmark.testRotateRightS | 512.00 | 31.00 | 7919.80 | 7957.35 | 0.47 | 393.63 | 972.49 | 147.06

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  8266054: Review comments resolution.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3720/files
  - new: https://git.openjdk.java.net/jdk/pull/3720/files/609c4143..d26caa6a

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3720&range=10
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3720&range=09-10

  Stats: 165 lines in 10 files changed: 68 ins; 23 del; 74 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3720.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3720/head:pull/3720

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-15 Thread Jatin Bhateja

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 15 commits:

 - 8266054: Incorporating styling changes based on reviews.
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8266054
 - Merge http://github.com/openjdk/jdk into JDK-8266054
 - Merge http://github.com/openjdk/jdk into JDK-8266054
 - Merge http://github.com/openjdk/jdk into JDK-8266054
 - Merge branch 'JDK-8266054' of http://github.com/jatin-bhateja/jdk into 
JDK-8266054
 - 8266054: Removing redundant teat templates.
 - 8266054: Code reorganization for efficient sharing of logic to check rotate 
operation support on a target platform.
 - 8266054: Removing redundant test templates.
 - 8266054: Review comments resolution.
 - ... and 5 more: https://git.openjdk.java.net/jdk/compare/07e90524...609c4143

-

Changes: https://git.openjdk.java.net/jdk/pull/3720/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3720&range=09
  Stats: 4393 lines in 52 files changed: 4172 ins; 60 del; 161 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3720.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3720/head:pull/3720

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v9]

2021-06-30 Thread Jatin Bhateja

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 13 commits:

 - Merge http://github.com/openjdk/jdk into JDK-8266054
 - Merge http://github.com/openjdk/jdk into JDK-8266054
 - Merge http://github.com/openjdk/jdk into JDK-8266054
 - Merge branch 'JDK-8266054' of http://github.com/jatin-bhateja/jdk into 
JDK-8266054
 - 8266054: Removing redundant teat templates.
 - 8266054: Code reorganization for efficient sharing of logic to check rotate 
operation support on a target platform.
 - 8266054: Removing redundant test templates.
 - 8266054: Review comments resolution.
 - 8266054: Review comments resolution.
 - 8266054: Review comments resolution.
 - ... and 3 more: https://git.openjdk.java.net/jdk/compare/a0f32cb1...c60355d7

-

Changes: https://git.openjdk.java.net/jdk/pull/3720/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3720&range=08
  Stats: 4393 lines in 52 files changed: 4172 ins; 60 del; 161 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3720.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3720/head:pull/3720

PR: https://git.openjdk.java.net/jdk/pull/3720

