Re: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long

2022-06-09 Thread Sandhya Viswanathan
On Wed, 8 Jun 2022 09:39:04 GMT, Quan Anh Mai  wrote:

>> Hi,
>> 
>> This patch implements intrinsics for `Integer/Long::compareUnsigned` using 
>> the same approach as the JVM does for long and floating-point comparisons. 
>> This allows efficient and reliable usage of unsigned comparison in Java, 
>> which is a basic operation and is important for range checks such as those 
>> discussed in #8620.
>> 
>> Thank you very much.
>
> I have added a benchmark for the intrinsic. The result is as follows, thanks 
> a lot:
> 
>                                              Before          After
> Benchmark                 (size)  Mode  Cnt  Score   Error   Score   Error   Units
> Integers.compareUnsigned     500  avgt   15  0.527 ± 0.002   0.498 ± 0.011   us/op
> Longs.compareUnsigned        500  avgt   15  0.677 ± 0.014   0.561 ± 0.006   us/op

@merykitty Could you please also add a micro benchmark where the compareUnsigned 
result is stored directly in an integer, and show the performance of that?
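As background for why `compareUnsigned` matters for range checks (per the #8620 discussion): a single unsigned comparison folds the `index >= 0` and `index < length` tests into one branch. A minimal sketch, with class and method names of my own choosing rather than anything from the patch:

```java
public class UnsignedRangeCheck {
    // Sketch: a bounds check using one unsigned comparison. A negative
    // index wraps to a large unsigned value and fails the same test
    // that catches index >= length.
    static boolean inRange(int index, int length) {
        return Integer.compareUnsigned(index, length) < 0;
    }

    public static void main(String[] args) {
        System.out.println(inRange(3, 10));   // true
        System.out.println(inRange(-1, 10));  // false: -1 is 4294967295 unsigned
        System.out.println(inRange(10, 10));  // false
    }
}
```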

-

PR: https://git.openjdk.org/jdk/pull/9068


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-02 Thread Sandhya Viswanathan
On Thu, 2 Jun 2022 03:24:07 GMT, Xiaohong Gong  wrote:

>>> @XiaohongGong Could you please rebase the branch and resolve conflicts?
>> 
>> Sure, I'm working on this now. The patch will be updated soon. Thanks.
>
> Resolved the conflicts. Thanks!

@XiaohongGong You need one more review approval.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-02 Thread Sandhya Viswanathan
On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong  wrote:

>> Currently, a masked vector load whose given index range falls outside the 
>> array boundary is implemented with pure Java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundsException). This is necessary for architectures that do 
>> not support the predicate feature, because there the masked load is implemented 
>> with a full vector load and a vector blend applied on it, and a full vector 
>> load would definitely cause the IOOBE, which is not valid. However, for 
>> architectures that support the predicate feature, like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise an exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original Java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies<Byte> species,
>>                        byte[] a, int offset,
>>                        VectorMask<Byte> m) {
>>       ByteSpecies vsp = (ByteSpecies) species;
>>       if (offset >= 0 && offset <= (a.length - species.length())) {
>>           return vsp.dummyVector().fromArray0(a, offset, m);
>>       }
>> 
>>       // FIXME: optimize
>>       checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>       return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicated load, HotSpot must 
>> check whether the current backend supports it and fall back to the Java 
>> scalar version if not. This is different from the normal masked vector load, 
>> for which the compiler will generate a full vector load and a vector blend if the 
>> predicated load is not supported. So, to let the compiler take the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part and 
>> "false" for the normal load. The compiler will fail to intrinsify if the 
>> flag is "true" and the predicated load is not supported by the backend, which 
>> means that the normal Java path will be executed.
>> 
>> This patch also adds the same vectorization support for the masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` 
>> on an x86 AVX-512 system:
>> 
>> Benchmark                                          Before    After   Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069  ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776  ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026  ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE    233.816 7075.923  ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587  ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301  ops/ms
>> 
>> Similar performance gain can also be observed on a 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
> predicate feature

Marked as reviewed by sviswanathan (Reviewer).
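For readers following the thread, the scalar fallback semantics being intrinsified can be pictured with this sketch (names are mine, not the actual Vector API internals): only masked-on lanes touch memory, which is why a predicated load is safe near the array tail while a full-width load is not.

```java
import java.util.Arrays;

public class MaskedLoadSketch {
    // Scalar semantics of a masked load: lanes with mask[i] == false are
    // left at zero and never read a[offset + i], so no IOOBE occurs even
    // when the vector extends past the end of the array.
    static int[] maskedLoad(int[] a, int offset, boolean[] mask) {
        int[] lanes = new int[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                lanes[i] = a[offset + i];
            }
        }
        return lanes;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};                            // array shorter than the vector
        boolean[] tailMask = {true, true, true, false}; // last lane masked off
        System.out.println(Arrays.toString(maskedLoad(a, 0, tailMask)));
        // -> [1, 2, 3, 0]: the out-of-bounds lane is masked off
    }
}
```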

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-06-01 Thread Sandhya Viswanathan
On Fri, 13 May 2022 08:58:12 GMT, Xiaohong Gong  wrote:

>> Yes, the tests were run in debug mode. The reporting of the missing constant 
>> occurs for the compiled method that is called from the method where the 
>> constants are declared e.g.:
>> 
>> 719  240    b  jdk.incubator.vector.Int256Vector::fromArray0 (15 bytes)
>>   ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it
>>   ** missing constant: offsetInRange=Parm
>>      @ 11  jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline by annotation
>> 
>> 
>> So it appears to be working as expected. A similar pattern occurs at a 
>> lower-level for the passing of the mask class. `Int256Vector::fromArray0` 
>> passes a constant class to `IntVector::fromArray0Template` (the compilation 
>> of which bails out before checking that the `offsetInRange` is constant).
>
> You are right @PaulSandoz! I ran the tests and benchmarks with your patch, 
> and no failures or performance regressions were found. I will update the patch 
> soon. Thanks for the help!

@XiaohongGong Could you please rebase the branch and resolve conflicts?

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v3]

2022-05-23 Thread Sandhya Viswanathan
On Sat, 21 May 2022 10:31:25 GMT, Quan Anh Mai  wrote:

>> Hi,
>> 
>> This patch optimises the matching rules for floating-point comparison with 
>> respect to eq/ne on x86-64:
>> 
>> 1. When the inputs of a comparison are the same (i.e. `isNaN` patterns), `ZF` 
>> is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which 
>> improves the sequence of `If (CmpF x x) (Bool ne)` from
>> 
>> ucomiss xmm0, xmm0
>> jp  label
>> jne label
>> 
>> into
>> 
>> ucomiss xmm0, xmm0
>> jp  label
>> 
>> 2. The move rules for `cmpOpUCF2` are missing, which makes patterns such as 
>> `x == y ? 1 : 0` fall back to `cmpOpU`, which has a really high cost of 
>> fixing the flags, such as
>> 
>> xorl    ecx, ecx
>> ucomiss xmm0, xmm1
>> jnp     done
>> pushf
>> andq    [rsp], 0xff2b
>> popf
>> done:
>> movl    eax, 1
>> cmovel  eax, ecx
>> 
>> The patch changes this sequence into
>> 
>> xorl    ecx, ecx
>> ucomiss xmm0, xmm1
>> movl    eax, 1
>> cmovpl  eax, ecx
>> cmovnel eax, ecx
>> 
>> 3. The patch also changes the pattern of `isInfinite` to be more optimised: 
>> it uses `Math.abs` to remove one comparison and compares the result with 
>> `MAX_VALUE`, since `>` is more efficient than `==` for floating-point types.
>> 
>> The benchmark results are as follows:
>> 
>>                                               Before                    After
>> Benchmark                      Mode  Cnt     Score     Error      Score    Error   Unit   Ratio
>> FPComparison.equalDouble       avgt    5  2876.242 ±  58.875    594.636 ±   8.922  ns/op   4.84
>> FPComparison.equalFloat        avgt    5  3062.430 ±  31.371    663.849 ±   3.656  ns/op   4.61
>> FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027    518.309 ± 107.352  ns/op   0.92
>> FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417    515.576 ±  14.669  ns/op   0.98
>> FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677    621.185 ±  11.935  ns/op   1.98
>> FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239    623.566 ±  15.206  ns/op   1.98
>> FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238    400.124 ±   0.762  ns/op   5.64
>> FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078    546.486 ±   1.509  ns/op   4.70
>> 
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   comments

Marked as reviewed by sviswanathan (Reviewer).
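The `isInfinite` rewrite described in point 3 of the quoted summary boils down to the following sketch (this mirrors the idea under review; it is not the exact library code):

```java
public class InfinityCheck {
    // One abs plus one compare instead of two equality checks against
    // +/-Infinity. NaN fails the '>' comparison (all ordered comparisons
    // with NaN are false), so it is correctly reported as not infinite.
    static boolean isInfinite(double v) {
        return Math.abs(v) > Double.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(isInfinite(Double.POSITIVE_INFINITY)); // true
        System.out.println(isInfinite(Double.NEGATIVE_INFINITY)); // true
        System.out.println(isInfinite(Double.MAX_VALUE));         // false
        System.out.println(isInfinite(Double.NaN));               // false
    }
}
```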

-

PR: https://git.openjdk.java.net/jdk/pull/8525


Re: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v2]

2022-05-20 Thread Sandhya Viswanathan
On Wed, 4 May 2022 23:16:41 GMT, Vladimir Kozlov  wrote:

>> src/hotspot/cpu/x86/x86_64.ad line 6998:
>> 
>>> 6996:   ins_encode %{
>>> 6997: __ cmovl(Assembler::parity, $dst$$Register, $src$$Register);
>>> 6998: __ cmovl(Assembler::notEqual, $dst$$Register, $src$$Register);
>> 
>> Should this be `equal`?
>
> I see that you swapped `src, dst` in `match()`, but `format` is still incorrect 
> and the code is confusing.

I agree with @vnkozlov that this needs explanation. Could you please add 
comments here with the IR and the example code generated for both the eq case and 
the ne case? You have some explanation in the PR description but not in the code; 
the description needs to be in the code as well for maintenance.

-

PR: https://git.openjdk.java.net/jdk/pull/8525


Re: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne

2022-05-20 Thread Sandhya Viswanathan
On Wed, 18 May 2022 14:59:33 GMT, Quan Anh Mai  wrote:

>> Hi,
>> 
>> This patch optimises the matching rules for floating-point comparison with 
>> respect to eq/ne on x86-64:
>> 
>> 1. When the inputs of a comparison are the same (i.e. `isNaN` patterns), `ZF` 
>> is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which 
>> improves the sequence of `If (CmpF x x) (Bool ne)` from
>> 
>> ucomiss xmm0, xmm0
>> jp  label
>> jne label
>> 
>> into
>> 
>> ucomiss xmm0, xmm0
>> jp  label
>> 
>> 2. The move rules for `cmpOpUCF2` are missing, which makes patterns such as 
>> `x == y ? 1 : 0` fall back to `cmpOpU`, which has a really high cost of 
>> fixing the flags, such as
>> 
>> xorl    ecx, ecx
>> ucomiss xmm0, xmm1
>> jnp     done
>> pushf
>> andq    [rsp], 0xff2b
>> popf
>> done:
>> movl    eax, 1
>> cmovel  eax, ecx
>> 
>> The patch changes this sequence into
>> 
>> xorl    ecx, ecx
>> ucomiss xmm0, xmm1
>> movl    eax, 1
>> cmovpl  eax, ecx
>> cmovnel eax, ecx
>> 
>> 3. The patch also changes the pattern of `isInfinite` to be more optimised: 
>> it uses `Math.abs` to remove one comparison and compares the result with 
>> `MAX_VALUE`, since `>` is more efficient than `==` for floating-point types.
>> 
>> The benchmark results are as follows:
>> 
>> Before:
>> Benchmark  Mode  Cnt Score Error  Units
>> FPComparison.equalDouble   avgt5  2876.242 ±  58.875  ns/op
>> FPComparison.equalFloatavgt5  3062.430 ±  31.371  ns/op
>> FPComparison.isFiniteDoubleavgt5   475.749 ±  19.027  ns/op
>> FPComparison.isFiniteFloat avgt5   506.525 ±  14.417  ns/op
>> FPComparison.isInfiniteDouble  avgt5  1232.800 ±  31.677  ns/op
>> FPComparison.isInfiniteFloat   avgt5  1234.708 ±  70.239  ns/op
>> FPComparison.isNanDouble   avgt5  2255.847 ±   7.238  ns/op
>> FPComparison.isNanFloatavgt5  2567.044 ±  36.078  ns/op
>> 
>> After:
>> Benchmark  Mode  Cnt Score Error  Units
>> FPComparison.equalDouble   avgt5   594.636 ±   8.922  ns/op
>> FPComparison.equalFloatavgt5   663.849 ±   3.656  ns/op
>> FPComparison.isFiniteDoubleavgt5   518.309 ± 107.352  ns/op
>> FPComparison.isFiniteFloat avgt5   515.576 ±  14.669  ns/op
>> FPComparison.isInfiniteDouble  avgt5   621.185 ±  11.935  ns/op
>> FPComparison.isInfiniteFloat   avgt5   623.566 ±  15.206  ns/op
>> FPComparison.isNanDouble   avgt5   400.124 ±   0.762  ns/op
>> FPComparison.isNanFloatavgt5   546.486 ±   1.509  ns/op
>> 
>> Thank you very much.
>
> I have reverted the changes to `java.lang.Float` and `java.lang.Double` to 
> not interfere with the intrinsic PR. More tests are added to cover all cases 
> regarding floating-point comparison of compiled code.
> 
> The rules for fp comparison that output the result to `rFlagRegsU` are 
> expensive and should be avoided. As a result, I removed the shortcut rules 
> with memory or constant operands to reduce the number of match rules. Only 
> the basic rules are kept.
> 
> Thanks.

@merykitty Very nice work! The patch looks good to me.


-

PR: https://git.openjdk.java.net/jdk/pull/8525


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]

2022-05-05 Thread Sandhya Viswanathan
On Fri, 6 May 2022 03:47:47 GMT, Xiaohong Gong  wrote:

>> src/hotspot/share/opto/vectorIntrinsics.cpp line 1238:
>> 
>>> 1236: } else {
>>> 1237:   // Masked vector load with IOOBE always uses the predicated 
>>> load.
>>> 1238:   const TypeInt* offset_in_range = 
>>> gvn().type(argument(8))->isa_int();
>> 
>> Should it be `argument(7)`? (and adjustments later to access the container).
>
> I'm afraid it's `argument(8)` for the load operation, since `argument(7)` 
> is the mask input. The argument numbering doesn't seem right beginning from the 
> mask input, which I expected to be `6`. But it's not. Actually I don't 
> quite understand why.

The offset is a long, so it uses two argument slots (5 and 6). 
The mask is argument(7). 
offsetInRange is argument(8).
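The slot arithmetic can be sketched as follows (the parameter names below are illustrative, not the actual intrinsic signature): each JVM argument occupies one slot, except long and double which occupy two, which is why a long offset at slot 5 pushes the mask to 7 and offsetInRange to 8.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SlotIndex {
    // Sketch: compute JVM argument slot indices; long and double each
    // take two slots, every other type takes one.
    static Map<String, Integer> slotIndices(String[] names, Class<?>[] types) {
        Map<String, Integer> slots = new LinkedHashMap<>();
        int slot = 0;
        for (int i = 0; i < names.length; i++) {
            slots.put(names[i], slot);
            slot += (types[i] == long.class || types[i] == double.class) ? 2 : 1;
        }
        return slots;
    }

    public static void main(String[] args) {
        // Illustrative layout only: five one-slot arguments, then a long
        // "offset" at slot 5, so "mask" lands at 7 and "offsetInRange" at 8.
        Map<String, Integer> s = slotIndices(
            new String[] {"a0", "a1", "a2", "a3", "a4", "offset", "mask", "offsetInRange"},
            new Class<?>[] {Object.class, Object.class, Object.class, Object.class,
                            Object.class, long.class, Object.class, Object.class});
        System.out.println(s.get("offset"));        // 5
        System.out.println(s.get("mask"));          // 7
        System.out.println(s.get("offsetInRange")); // 8
    }
}
```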

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8286029: Add classpath exemption to globals_vectorApiSupport_***.S.inc

2022-05-04 Thread Sandhya Viswanathan
On Mon, 2 May 2022 20:05:36 GMT, Tyler Steele  wrote:

> Adds missing classpath exception to the header of two GPLv2 files.
> 
> Requested 
> [here](https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2022-April/013988.html).

src/jdk.incubator.vector/linux/native/libjsvml/globals_vectorApiSupport_linux.S.inc line 4:

> 2:  * Copyright (c) 1997, 2021, Oracle and/or its affiliates. All rights 
> reserved.
> 3:  * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
> 4:  *

Please update the copyright year to 2022.

-

PR: https://git.openjdk.java.net/jdk/pull/8508


Re: RFR: 8286029: Add classpath exemption to globals_vectorApiSupport_***.S.inc

2022-05-04 Thread Sandhya Viswanathan
On Mon, 2 May 2022 20:05:36 GMT, Tyler Steele  wrote:

> Adds missing classpath exception to the header of two GPLv2 files.
> 
> Requested 
> [here](https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2022-April/013988.html).

Marked as reviewed by sviswanathan (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/8508


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-28 Thread Sandhya Viswanathan
On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong  wrote:

>> Currently, a masked vector load whose given index range falls outside the 
>> array boundary is implemented with pure Java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundsException). This is necessary for architectures that do 
>> not support the predicate feature, because there the masked load is implemented 
>> with a full vector load and a vector blend applied on it, and a full vector 
>> load would definitely cause the IOOBE, which is not valid. However, for 
>> architectures that support the predicate feature, like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise an exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original Java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies<Byte> species,
>>                        byte[] a, int offset,
>>                        VectorMask<Byte> m) {
>>       ByteSpecies vsp = (ByteSpecies) species;
>>       if (offset >= 0 && offset <= (a.length - species.length())) {
>>           return vsp.dummyVector().fromArray0(a, offset, m);
>>       }
>> 
>>       // FIXME: optimize
>>       checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>       return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicated load, HotSpot must 
>> check whether the current backend supports it and fall back to the Java 
>> scalar version if not. This is different from the normal masked vector load, 
>> for which the compiler will generate a full vector load and a vector blend if the 
>> predicated load is not supported. So, to let the compiler take the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part and 
>> "false" for the normal load. The compiler will fail to intrinsify if the 
>> flag is "true" and the predicated load is not supported by the backend, which 
>> means that the normal Java path will be executed.
>> 
>> This patch also adds the same vectorization support for the masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` 
>> on an x86 AVX-512 system:
>> 
>> Benchmark                                          Before    After   Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069  ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776  ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026  ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE    233.816 7075.923  ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587  ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301  ops/ms
>> 
>> Similar performance gain can also be observed on a 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rename the "usePred" to "offsetInRange"

@PaulSandoz Could you please take a look at the Java changes when you find 
time? This PR from @XiaohongGong is a very good step towards the long-standing 
Vector API wish list item of better tail-loop handling.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]

2022-04-27 Thread Sandhya Viswanathan
On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong  wrote:

>> Currently, a masked vector load whose given index range falls outside the 
>> array boundary is implemented with pure Java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundsException). This is necessary for architectures that do 
>> not support the predicate feature, because there the masked load is implemented 
>> with a full vector load and a vector blend applied on it, and a full vector 
>> load would definitely cause the IOOBE, which is not valid. However, for 
>> architectures that support the predicate feature, like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise an exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original Java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies<Byte> species,
>>                        byte[] a, int offset,
>>                        VectorMask<Byte> m) {
>>       ByteSpecies vsp = (ByteSpecies) species;
>>       if (offset >= 0 && offset <= (a.length - species.length())) {
>>           return vsp.dummyVector().fromArray0(a, offset, m);
>>       }
>> 
>>       // FIXME: optimize
>>       checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>       return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicated load, HotSpot must 
>> check whether the current backend supports it and fall back to the Java 
>> scalar version if not. This is different from the normal masked vector load, 
>> for which the compiler will generate a full vector load and a vector blend if the 
>> predicated load is not supported. So, to let the compiler take the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part and 
>> "false" for the normal load. The compiler will fail to intrinsify if the 
>> flag is "true" and the predicated load is not supported by the backend, which 
>> means that the normal Java path will be executed.
>> 
>> This patch also adds the same vectorization support for the masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` 
>> on an x86 AVX-512 system:
>> 
>> Benchmark                                          Before    After   Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069  ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776  ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026  ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE    233.816 7075.923  ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587  ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301  ops/ms
>> 
>> Similar performance gain can also be observed on a 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Rename the "usePred" to "offsetInRange"

Rest of the patch looks good to me.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1232:

> 1230:   // out when current case uses the predicate feature.
> 1231:   if (!supports_predicate) {
> 1232: bool use_predicate = false;

If we rename this to `needs_predicate`, it will be easier to understand.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

2022-04-08 Thread Sandhya Viswanathan
On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong  wrote:

> Currently, a masked vector load whose given index range falls outside the 
> array boundary is implemented with pure Java scalar code to avoid the IOOBE 
> (IndexOutOfBoundsException). This is necessary for architectures that do 
> not support the predicate feature, because there the masked load is implemented 
> with a full vector load and a vector blend applied on it, and a full vector 
> load would definitely cause the IOOBE, which is not valid. However, for 
> architectures that support the predicate feature, like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise an exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original Java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies<Byte> species,
>                        byte[] a, int offset,
>                        VectorMask<Byte> m) {
>       ByteSpecies vsp = (ByteSpecies) species;
>       if (offset >= 0 && offset <= (a.length - species.length())) {
>           return vsp.dummyVector().fromArray0(a, offset, m);
>       }
> 
>       // FIXME: optimize
>       checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>       return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicated load, HotSpot must 
> check whether the current backend supports it and fall back to the Java 
> scalar version if not. This is different from the normal masked vector load, 
> for which the compiler will generate a full vector load and a vector blend if the 
> predicated load is not supported. So, to let the compiler take the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part and 
> "false" for the normal load. The compiler will fail to intrinsify if the 
> flag is "true" and the predicated load is not supported by the backend, which 
> means that the normal Java path will be executed.
> 
> This patch also adds the same vectorization support for the masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` 
> on an x86 AVX-512 system:
> 
> Benchmark                                          Before    After   Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069  ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776  ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026  ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE    233.816 7075.923  ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587  ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301  ops/ms
> 
> Similar performance gain can also be observed on a 512-bit SVE system.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
line 2861:

> 2859: ByteSpecies vsp = (ByteSpecies) species;
> 2860: if (offset >= 0 && offset <= (a.length - 
> species.vectorByteSize())) {
> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* 
> usePred */ false).maybeSwap(bo);

Instead of `usePred`, a term like `inRange`, `offsetInRange`, or `offsetInVectorRange` 
would be easier to follow.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12]

2022-04-08 Thread Sandhya Viswanathan
On Fri, 8 Apr 2022 01:05:33 GMT, Srinivas Vamsi Parasa  
wrote:

>> Optimizes the divideUnsigned() and remainderUnsigned() methods in the 
>> java.lang.Integer and java.lang.Long classes using x86 intrinsics. This 
>> change shows a 3x improvement for the Integer methods and up to a 25% improvement 
>> for Long. This change also implements the DivMod optimization, which fuses the 
>> division and modulus operations if needed. The DivMod optimization shows a 3x 
>> improvement for Integer and a ~65% improvement for Long.
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   uncomment zero in integer div, mod test

My suggestion is to keep the negative-path assembly optimization in this patch.
When the optimization is later introduced at the IR level, the assembly could then be 
simplified as suggested by @merykitty.
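For reference, the pure-Java semantics the intrinsic must preserve can be sketched with the classic widening trick for int (this is an illustrative reference implementation, not the JDK's actual code path):

```java
public class UnsignedDiv {
    // Sketch: unsigned int division via widening to long; the x86
    // intrinsic replaces this with an unsigned DIV instruction.
    static int divideUnsigned(int dividend, int divisor) {
        return (int) ((dividend & 0xFFFFFFFFL) / (divisor & 0xFFFFFFFFL));
    }

    static int remainderUnsigned(int dividend, int divisor) {
        return (int) ((dividend & 0xFFFFFFFFL) % (divisor & 0xFFFFFFFFL));
    }

    public static void main(String[] args) {
        System.out.println(divideUnsigned(-1, 2));          // 2147483647
        System.out.println(remainderUnsigned(-1, 2));       // 1
        // The library method agrees with the sketch:
        System.out.println(Integer.divideUnsigned(-1, 2));  // 2147483647
    }
}
```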

-

PR: https://git.openjdk.java.net/jdk/pull/7572


Re: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]

2022-04-05 Thread Sandhya Viswanathan
On Tue, 5 Apr 2022 20:26:18 GMT, Vamsi Parasa  wrote:

>> Optimizes the divideUnsigned() and remainderUnsigned() methods in the 
>> java.lang.Integer and java.lang.Long classes using x86 intrinsics. This 
>> change shows a 3x improvement for the Integer methods and up to a 25% improvement 
>> for Long. This change also implements the DivMod optimization, which fuses the 
>> division and modulus operations if needed. The DivMod optimization shows a 3x 
>> improvement for Integer and a ~65% improvement for Long.
>
> Vamsi Parasa has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   add error msg for jtreg test

Marked as reviewed by sviswanathan (Reviewer).

Looks good to me. You need one more review.

@vnkozlov Could you please help review this patch?

-

PR: https://git.openjdk.java.net/jdk/pull/7572


Re: RFR: 8279508: Auto-vectorize Math.round API [v2]

2022-03-11 Thread Sandhya Viswanathan
On Thu, 3 Mar 2022 05:42:23 GMT, Jatin Bhateja  wrote:

>> The testing for this PR doesn't look adequate to me. I don't see any testing 
>> for the values where the behavior of round has been redefined at points in 
>> the last decade. See JDK-8010430 and JDK-6430675, both of which have 
>> regression tests in the core libs area. Thanks.
>
> Hi @jddarcy, could you kindly verify that your feedback has been incorporated?

@jatin-bhateja There is a failure reported in the Pre-submit tests on Windows 
x64 for compiler/vectorization/TestRoundVect.java. Could you please take a look?

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v9]

2022-03-02 Thread Sandhya Viswanathan
On Sat, 26 Feb 2022 01:07:47 GMT, Sandhya Viswanathan 
 wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8279508: Adding descriptive comments.
>
> src/hotspot/cpu/x86/x86.ad line 7295:
> 
>> 7293: __ vector_round_double_evex($dst$$XMMRegister, $src$$XMMRegister, 
>> $xtmp1$$XMMRegister,
>> 7294: $xtmp2$$XMMRegister, 
>> $ktmp1$$KRegister, $ktmp2$$KRegister,
>> 7295: 
>> ExternalAddress(vector_double_signflip()), new_mxcsr, $scratch$$Register, 
>> vlen_enc);
> 
> The vector_double_signflip() here should be replaced by vector_all_bits_set().
> From the vcvtpd2qq description:
> If a converted result cannot be represented in the destination
> format, the floating-point invalid exception is raised, and if this exception 
> is masked, the indefinite integer value (2^w-1, where w represents the number 
> of bits in the destination format) is returned.

The overflow value observed is 2^(w-1), so using vector_double_signflip() is 
correct; please ignore this comment.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v9]

2022-03-02 Thread Sandhya Viswanathan
On Sat, 26 Feb 2022 04:55:08 GMT, Jatin Bhateja  wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8279508: Adding descriptive comments.
>
> As per the SDM, if post conversion a floating point number is non-representable 
> in the destination format, e.g. a floating point value of 3.4028235E10 will, 
> post integer conversion, overflow the value range of the integer primitive type, 
> a -0.0 value, i.e. 0x80000000, is returned here. Similarly, for +/-NaN and +/-Inf 
> the post-conversion value returned is -0.0. All these cases, i.e. post-conversion 
> non-representable floating point values and NaN/Inf values, are handled in a 
> special manner: the algorithm first performs an unordered comparison involving the 
> original source value and returns 0 in the NaN case; this weeds out NaN, 
> and for the rest of the special values we check the MSB of the source 
> and return either Integer.MAX_VALUE for +ve numbers or Integer.MIN_VALUE 
> for -ve numbers, to adhere to the semantics of the Math.round API.
> 
> Existing tests were enhanced to cover various special cases (NaN/Inf/+ve/-ve 
> values/values which may be inexact after adding 0.5/values which post 
> conversion overflow the integer value range).

@jatin-bhateja The patch looks good to me.
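The special cases under discussion (the ones JDK-8010430 and JDK-6430675 added regression tests for) are directly observable from Java; a quick sketch of the NaN and saturation behavior the vectorized code must match:

```java
public class RoundSpecialCases {
    public static void main(String[] args) {
        // NaN rounds to 0; infinities saturate to the int range bounds.
        System.out.println(Math.round(Float.NaN));               // 0
        System.out.println(Math.round(Float.POSITIVE_INFINITY)); // 2147483647
        System.out.println(Math.round(Float.NEGATIVE_INFINITY)); // -2147483648
        // Values beyond the int range also saturate rather than wrap.
        System.out.println(Math.round(1.0e10f));                 // 2147483647
        System.out.println(Math.round(-1.0e10f));                // -2147483648
    }
}
```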

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v11]

2022-03-02 Thread Sandhya Viswanathan
On Wed, 2 Mar 2022 02:44:41 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain 
>> ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 
>> 510.36 | 548.39 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 
>> 293.48 | 274.01 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 
>> 751.83 | 2274.13 | 3.02
>> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 
>> 388.52 | 1334.18 | 3.43
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Removing +LogCompilation flag.

Marked as reviewed by sviswanathan (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v9]

2022-03-02 Thread Sandhya Viswanathan
On Sat, 26 Feb 2022 03:38:32 GMT, Quan Anh Mai  wrote:

>> I believe the indefinite value should be 2^(w - 1) (a.k.a 0x8000) and 
>> the documentation is typoed. If you look at `cvtss2si`, the indefinite value 
>> is also written as 2^w - 1 but yet in `MacroAssembler::convert_f2i` we 
>> compare it with 0x8000. In addition, choosing -1 as an indefinite value 
>> is weird enough and to complicate it as 2^w - 1 is really unusual.
>
> `MacroAssembler::convert_f2i`
> 
> https://github.com/openjdk/jdk/blob/c5c6058fd57d4b594012035eaf18a57257f4ad85/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L8919

@jatin-bhateja @merykitty You are right, on overflow we observe 2^(w - 1), i.e. 
0x8000, so using vector_float_signflip() is correct.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v9]

2022-02-25 Thread Sandhya Viswanathan
On Sat, 26 Feb 2022 01:06:21 GMT, Sandhya Viswanathan 
 wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8279508: Adding descriptive comments.
>
> src/hotspot/cpu/x86/x86.ad line 7263:
> 
>> 7261: __ vector_round_float_avx($dst$$XMMRegister, $src$$XMMRegister, 
>> $xtmp1$$XMMRegister,
>> 7262:   $xtmp2$$XMMRegister, 
>> $xtmp3$$XMMRegister, $xtmp4$$XMMRegister,
>> 7263:   
>> ExternalAddress(vector_float_signflip()), new_mxcsr, $scratch$$Register, 
>> vlen_enc);
> 
> The vector_float_signflip() here should be replaced by vector_all_bits_set().
> cvtps2dq description:
> If a converted result cannot be represented in the destination
> format, the floating-point invalid exception is raised, and if this exception 
> is masked, the indefinite integer value
> (2w-1, where w represents the number of bits in the destination format) is 
> returned.

Clarification: the number in my comments above is (2^w - 1). This is from the 
Intel SDM 
(https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html).
Also, you will need to take care of the case where the valid, unoverflowed result 
is -1, i.e. 0x (2^32 - 1).

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v9]

2022-02-25 Thread Sandhya Viswanathan
On Fri, 25 Feb 2022 06:22:42 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain 
>> ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 
>> 510.36 | 548.39 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 
>> 293.48 | 274.01 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 
>> 751.83 | 2274.13 | 3.02
>> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 
>> 388.52 | 1334.18 | 3.43
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Adding descriptive comments.

Other than this the patch looks good to me. What testing have you done?

src/hotspot/cpu/x86/x86.ad line 7263:

> 7261: __ vector_round_float_avx($dst$$XMMRegister, $src$$XMMRegister, 
> $xtmp1$$XMMRegister,
> 7262:   $xtmp2$$XMMRegister, $xtmp3$$XMMRegister, 
> $xtmp4$$XMMRegister,
> 7263:   ExternalAddress(vector_float_signflip()), 
> new_mxcsr, $scratch$$Register, vlen_enc);

The vector_float_signflip() here should be replaced by vector_all_bits_set().
cvtps2dq description:
If a converted result cannot be represented in the destination format, the 
floating-point invalid exception is raised, and if this exception is masked, the 
indefinite integer value (2^w - 1, where w represents the number of bits in the 
destination format) is returned.

src/hotspot/cpu/x86/x86.ad line 7280:

> 7278: __ vector_round_float_evex($dst$$XMMRegister, $src$$XMMRegister, 
> $xtmp1$$XMMRegister,
> 7279:$xtmp2$$XMMRegister, $ktmp1$$KRegister, 
> $ktmp2$$KRegister,
> 7280:
> ExternalAddress(vector_float_signflip()), new_mxcsr, $scratch$$Register, 
> vlen_enc);

The vector_float_signflip() here should be replaced by vector_all_bits_set().

src/hotspot/cpu/x86/x86.ad line 7295:

> 7293: __ vector_round_double_evex($dst$$XMMRegister, $src$$XMMRegister, 
> $xtmp1$$XMMRegister,
> 7294: $xtmp2$$XMMRegister, $ktmp1$$KRegister, 
> $ktmp2$$KRegister,
> 7295: 
> ExternalAddress(vector_double_signflip()), new_mxcsr, $scratch$$Register, 
> vlen_enc);

The vector_double_signflip() here should be replaced by vector_all_bits_set().
vcvtpd2qq description:
If a converted result cannot be represented in the destination format, the 
floating-point invalid exception is raised, and if this exception is masked, the 
indefinite integer value (2^w - 1, where w represents the number of bits in the 
destination format) is returned.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v7]

2022-02-23 Thread Sandhya Viswanathan
On Wed, 23 Feb 2022 09:03:37 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | 
>> Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | --
>> 1024.00 | 510.41 | 1811.66 | 3.55 | 510.40 | 502.65 | 0.98
>> 2048.00 | 293.52 | 984.37 | 3.35 | 304.96 | 177.88 | 0.58
>> 1024.00 | 825.94 | 3387.64 | 4.10 | 750.77 | 1925.15 | 2.56
>> 2048.00 | 411.91 | 1942.87 | 4.72 | 412.22 | 1034.13 | 2.51
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Review comments resolved.

test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java line 441:

> 439:   errn += verify("test_round: ", 1, l0[1], Long.MAX_VALUE);
> 440:   errn += verify("test_round: ", 2, l0[2], Long.MIN_VALUE);
> 441:   errn += verify("test_round: ", 3, l0[3], Long.MAX_VALUE);

It would be good to add additional test cases:
  A case with an a1 value >= Long.MAX_VALUE and < infinity.
  A case with an a1 value <= Long.MIN_VALUE and > -infinity.
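The suggested boundary cases correspond to specified Math.round(double) results, which a standalone check can confirm (an illustrative sketch, not part of the test file):

```java
public class RoundBoundary {
    public static void main(String[] args) {
        // a finite value above Long.MAX_VALUE (~9.22e18) clamps to Long.MAX_VALUE
        System.out.println(Math.round(1.0e19) == Long.MAX_VALUE);
        // a finite value below Long.MIN_VALUE clamps to Long.MIN_VALUE
        System.out.println(Math.round(-1.0e19) == Long.MIN_VALUE);
    }
}
```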

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v7]

2022-02-23 Thread Sandhya Viswanathan
On Wed, 23 Feb 2022 09:03:37 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | 
>> Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | --
>> 1024.00 | 510.41 | 1811.66 | 3.55 | 510.40 | 502.65 | 0.98
>> 2048.00 | 293.52 | 984.37 | 3.35 | 304.96 | 177.88 | 0.58
>> 1024.00 | 825.94 | 3387.64 | 4.10 | 750.77 | 1925.15 | 2.56
>> 2048.00 | 411.91 | 1942.87 | 4.72 | 412.22 | 1034.13 | 2.51
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Review comments resolved.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 8984:

> 8982: }
> 8983: 
> 8984: void MacroAssembler::round_double(Register dst, XMMRegister src, 
> Register rtmp, Register rcx) {

Is it possible to implement this using a similar mxcsr change? In any case, 
comments would help in reviewing the round_double and round_float code.

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v7]

2022-02-23 Thread Sandhya Viswanathan
On Wed, 23 Feb 2022 09:03:37 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | 
>> Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | --
>> 1024.00 | 510.41 | 1811.66 | 3.55 | 510.40 | 502.65 | 0.98
>> 2048.00 | 293.52 | 984.37 | 3.35 | 304.96 | 177.88 | 0.58
>> 1024.00 | 825.94 | 3387.64 | 4.10 | 750.77 | 1925.15 | 2.56
>> 2048.00 | 411.91 | 1942.87 | 4.72 | 412.22 | 1034.13 | 2.51
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Review comments resolved.

Also curious: how does the performance look with all these changes?

src/hotspot/cpu/x86/assembler_x86.hpp line 2254:

> 2252:   void vroundps(XMMRegister dst, XMMRegister src, int32_t rmode, int 
> vector_len);
> 2253:   void vrndscaleps(XMMRegister dst,  XMMRegister src,  int32_t rmode, 
> int vector_len);
> 2254: 

These instructions are not used anymore and can be removed.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4116:

> 4114: KRegister ktmp1, KRegister 
> ktmp2, AddressLiteral double_sign_flip,
> 4115: Register scratch, int 
> vec_enc) {
> 4116:   evcvttpd2qq(dst, src, vec_enc);

On overflow, the vcvttpd2qq instruction sets the result to 2^w - 1, where w is 
64, whereas the special-case handling expects 0x8.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4145:

> 4143:   evpbroadcastq(xtmp1, scratch, vec_enc);
> 4144:   vaddpd(xtmp1, src , xtmp1, vec_enc);
> 4145:   evcvtpd2qq(dst, xtmp1, vec_enc);

On overflow, the vcvtpd2qq instruction also sets the result to 2^w - 1, where w 
is 64, whereas the special-case handling expects 0x8.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4176:

> 4174:   vpbroadcastd(xtmp1, xtmp1, vec_enc);
> 4175:   vaddps(xtmp1, src , xtmp1, vec_enc);
> 4176:   vcvtps2dq(dst, xtmp1, vec_enc);

The vcvtps2dq instruction returns 0x7FFF in case of overflow, whereas the 
special-case handling expects 0x8000. The same question applies to the 
corresponding vector_round_float_avx() implementation as well.
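For context, the special-case fixup exists because Java's narrowing conversions (JLS 5.1.3) mandate saturation to MIN/MAX and NaN-to-zero regardless of the raw instruction result; a minimal standalone check of the required Java-level behavior:

```java
public class NarrowingConversion {
    public static void main(String[] args) {
        // JLS 5.1.3: out-of-range values saturate to the extremes
        System.out.println((long) 1.0e30 == Long.MAX_VALUE);
        System.out.println((long) -1.0e30 == Long.MIN_VALUE);
        // JLS 5.1.3: NaN converts to zero
        System.out.println((int) Double.NaN);
    }
}
```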

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8279508: Auto-vectorize Math.round API [v6]

2022-02-22 Thread Sandhya Viswanathan
On Thu, 17 Feb 2022 17:43:43 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | 
>> Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | --
>> 1024.00 | 510.41 | 1811.66 | 3.55 | 510.40 | 502.65 | 0.98
>> 2048.00 | 293.52 | 984.37 | 3.35 | 304.96 | 177.88 | 0.58
>> 1024.00 | 825.94 | 3387.64 | 4.10 | 750.77 | 1925.15 | 2.56
>> 2048.00 | 411.91 | 1942.87 | 4.72 | 412.22 | 1034.13 | 2.51
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Fixing for windows failure.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4146:

> 4144:   vaddpd(xtmp1, src , xtmp1, vec_enc);
> 4145:   vrndscalepd(dst, xtmp1, 0x4, vec_enc);
> 4146:   evcvtpd2qq(dst, dst, vec_enc);

Why do we need vrndscalepd in between? Could we not directly use cvtpd2qq after 
vaddpd?

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8278173: [vectorapi] Add x64 intrinsics for unsigned (zero extended) casts [v3]

2022-02-14 Thread Sandhya Viswanathan
On Sun, 13 Feb 2022 05:18:34 GMT, Quan Anh Mai  wrote:

>> Hi,
>> 
>> This patch implements the unsigned upcast intrinsics in x86, which are used 
>> in vector lane-wise reinterpreting operations.
>> 
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   missing ForceInline

Marked as reviewed by sviswanathan (Reviewer).

Hotspot changes look good to me.

-

PR: https://git.openjdk.java.net/jdk/pull/7358


Re: RFR: 8278173: [vectorapi] Add x64 intrinsics for unsigned (zero extended) casts

2022-02-09 Thread Sandhya Viswanathan
On Sat, 5 Feb 2022 15:34:08 GMT, Quan Anh Mai  wrote:

> Hi,
> 
> This patch implements the unsigned upcast intrinsics in x86, which are used 
> in vector lane-wise reinterpreting operations.
> 
> Thank you very much.

src/hotspot/cpu/x86/assembler_x86.cpp line 4782:

> 4780:   vector_len == AVX_256bit? VM_Version::supports_avx2() :
> 4781:   vector_len == AVX_512bit? VM_Version::supports_evex() : 0, " ");
> 4782:   InstructionAttr attributes(vector_len, /* rex_w */ false, /* 
> legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ true, /* uses_vl */ true);

legacy_mode should be false here instead of _legacy_mode_bw.

-

PR: https://git.openjdk.java.net/jdk/pull/7358


Re: RFR: 8279508: Auto-vectorize Math.round API [v2]

2022-01-20 Thread Sandhya Viswanathan
On Wed, 19 Jan 2022 17:38:25 GMT, Jatin Bhateja  wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar 
>> IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>>   |   | BASELINE AVX2 | WithOpt AVX2 | Gain (opt/baseline) | Baseline AVX3 | 
>> Withopt AVX3 | Gain (opt/baseline)
>> -- | -- | -- | -- | -- | -- | -- | --
>> Benchmark | ARRAYLEN | Score (ops/ms) | Score (ops/ms) |   | Score (ops/ms) 
>> | Score (ops/ms) |  
>> FpRoundingBenchmark.test_round_double | 1024 | 518.532 | 1364.066 | 
>> 2.630630318 | 512.908 | 4292.11 | 8.368186887
>> FpRoundingBenchmark.test_round_double | 2048 | 270.137 | 830.986 | 
>> 3.076165057 | 273.159 | 2459.116 | 9.002507697
>> FpRoundingBenchmark.test_round_float | 1024 | 752.436 | 7780.905 | 
>> 10.34095259 | 752.49 | 9506.694 | 12.63364829
>> FpRoundingBenchmark.test_round_float | 2048 | 389.499 | 4113.046 | 
>> 10.55983712 | 389.63 | 4863.673 | 12.48279907
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   8279508: Adding a test for scalar intrinsification.

The JVM currently initializes the x86 mxcsr to round to nearest (even); see 
below in stubGenerator_x86_64.cpp:
// Round to nearest (even), 64-bit mode, exceptions masked
StubRoutines::x86::_mxcsr_std = 0x1F80;
The above works for Math.rint, which is specified as round to nearest even.
Please see section 4.8.4 of the Intel SDM:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

The rounding mode needed for Math.round is round toward positive infinity, which 
needs a different x86 mxcsr initialization (0x5F80).
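The semantic difference between the two rounding modes is visible from plain Java (a minimal standalone check, not part of the patch):

```java
public class RoundingModes {
    public static void main(String[] args) {
        // Math.rint: round half to even (matches the default mxcsr setting)
        System.out.println(Math.rint(0.5));   // ties go to the even neighbor
        System.out.println(Math.rint(2.5));
        // Math.round: ties round toward positive infinity
        System.out.println(Math.round(0.5));
        System.out.println(Math.round(2.5));
        System.out.println(Math.round(-2.5));
    }
}
```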

-

PR: https://git.openjdk.java.net/jdk/pull/7094


Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v5]

2022-01-06 Thread Sandhya Viswanathan
On Thu, 6 Jan 2022 18:26:32 GMT, Jatin Bhateja  wrote:

>> Patch extends existing macrologic inferencing algorithm to handle masked 
>> logic operations.
>> 
>> Existing algorithm:
>> 
>> 1. Identify logic cone roots.
>> 2. Packs parent and logic child nodes into a MacroLogic node in bottom up 
>> traversal if input constraint are met.
>> i.e. maximum number of inputs which a macro logic node can have.
>> 3. Perform symbolic evaluation of logic expression tree by assigning value 
>> corresponding to a truth table column
>> to each input.
>> 4. Inputs along with encoded function together represents a macro logic node 
>> which mimics a truth table.
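As an editorial aside on step 3 quoted above: the symbolic evaluation can be sketched in plain Java by assigning each input its truth-table column byte (A=0xF0, B=0xCC, C=0xAA) and evaluating the expression over those bytes; the resulting byte is the 8-bit function encoding. The bitwise-blend expression below is an illustrative choice, not taken from the patch:

```java
public class TernaryLogicImm {
    // Evaluate the logic expression over the truth-table column bytes;
    // the result is the encoded function (e.g. a vpternlog-style immediate).
    static int encode(int a, int b, int c) {
        return ((a & b) | (~a & c)) & 0xFF;   // bitwise blend: A ? B : C
    }

    public static void main(String[] args) {
        System.out.printf("0x%02X%n", encode(0xF0, 0xCC, 0xAA));
    }
}
```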
>> 
>> Modification:
>> Extended the packing algorithm to operate on both predicated or 
>> non-predicated logic nodes. Following
>> rules define the criteria under which nodes gets packed into a macro logic 
>> node:-
>> 
>> 1. Parent and both child nodes are all unmasked or masked with same 
>> predicates.
>> 2. Masked parent can be packed with left child if it is predicated and both 
>> have same prediates.
>> 3. Masked parent can be packed with right child if its un-predicated or has 
>> matching predication condition.
>> 4. An unmasked parent can be packed with an unmasked child.
>> 
>> New jtreg test case added with the patch exhaustively covers all the 
>> different combinations of predications of parent and
>> child nodes.
>> 
>> Following are the performance number for JMH benchmark included with the 
>> patch.
>> 
>> Machine Configuration:  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S 
>> Icelake Server)
>> 
>> Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain ( 
>> withopt/baseline)
>> -- | -- | -- | -- | --
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 
>> | 2.171403315
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 
>> 2.002547072
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 
>> | 1.792558013
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 
>> | 1.882536419
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 
>> 1.560787454
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 
>> 2.022003377
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 
>> 1.63814064
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 
>> 1.384211046
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 
>> 1.140933774
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 
>> 1.121276084
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 
>> 1.205791374
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 
>> | 1.087654397
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 
>> | 1.002939661
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 
>> 1.031267884
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 
>> | 1.030794717
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 
>> | 3435.989 | 4418.09 | 1.285827749
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 
>> | 1524.803 | 1678.201 | 1.100601848
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 
>> 1024 | 972.501 | 1166.734 | 1.199725244
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 
>> | 5980.85 | 7584.17 | 1.268075608
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 
>> | 3258.108 | 3939.23 | 1.209054457
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 
>> 1024 | 1475.365 | 1511.159 | 1.024261115
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 
>> | 4208.766 | 4220.678 | 1.002830283
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 
>> | 2056.651 | 2049.489 | 0.99651764
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 
>> 1024 | 1110.461 | 1116.448 | 1.005391455
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
>> 256 | 3259.348 | 3947.94 | 1.211266793
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
>> 512 | 1515.147 | 1536.647 | 1.014190042
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
>> 1024 | 911.58 | 1030.54 | 1.130498695
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
>> 256 | 2034.611 | 2073.764 | 1.019243482
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
>> 512 | 1110.659 | 1116.093 | 1.004892591
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
>> 1024 | 559.269 | 559.651 | 1.000683034
>> 

Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v4]

2022-01-06 Thread Sandhya Viswanathan
On Wed, 5 Jan 2022 08:59:00 GMT, Jatin Bhateja  wrote:

>> Patch extends existing macrologic inferencing algorithm to handle masked 
>> logic operations.
>> 
>> Existing algorithm:
>> 
>> 1. Identify logic cone roots.
>> 2. Packs parent and logic child nodes into a MacroLogic node in bottom up 
>> traversal if input constraint are met.
>> i.e. maximum number of inputs which a macro logic node can have.
>> 3. Perform symbolic evaluation of logic expression tree by assigning value 
>> corresponding to a truth table column
>> to each input.
>> 4. Inputs along with encoded function together represents a macro logic node 
>> which mimics a truth table.
>> 
>> Modification:
>> Extended the packing algorithm to operate on both predicated or 
>> non-predicated logic nodes. Following
>> rules define the criteria under which nodes gets packed into a macro logic 
>> node:-
>> 
>> 1. Parent and both child nodes are all unmasked or masked with same 
>> predicates.
>> 2. Masked parent can be packed with left child if it is predicated and both 
>> have same prediates.
>> 3. Masked parent can be packed with right child if its un-predicated or has 
>> matching predication condition.
>> 4. An unmasked parent can be packed with an unmasked child.
>> 
>> New jtreg test case added with the patch exhaustively covers all the 
>> different combinations of predications of parent and
>> child nodes.
>> 
>> Following are the performance number for JMH benchmark included with the 
>> patch.
>> 
>> Machine Configuration:  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S 
>> Icelake Server)
>> 
>> Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain ( 
>> withopt/baseline)
>> -- | -- | -- | -- | --
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 
>> | 2.171403315
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 
>> 2.002547072
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 
>> | 1.792558013
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 
>> | 1.882536419
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 
>> 1.560787454
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 
>> 2.022003377
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 
>> 1.63814064
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 
>> 1.384211046
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 
>> 1.140933774
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 
>> 1.121276084
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 
>> 1.205791374
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 
>> | 1.087654397
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 
>> | 1.002939661
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 
>> 1.031267884
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 
>> | 1.030794717
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 
>> | 3435.989 | 4418.09 | 1.285827749
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 
>> | 1524.803 | 1678.201 | 1.100601848
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 
>> 1024 | 972.501 | 1166.734 | 1.199725244
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 
>> | 5980.85 | 7584.17 | 1.268075608
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 
>> | 3258.108 | 3939.23 | 1.209054457
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 
>> 1024 | 1475.365 | 1511.159 | 1.024261115
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 
>> | 4208.766 | 4220.678 | 1.002830283
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 
>> | 2056.651 | 2049.489 | 0.99651764
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 
>> 1024 | 1110.461 | 1116.448 | 1.005391455
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
>> 256 | 3259.348 | 3947.94 | 1.211266793
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
>> 512 | 1515.147 | 1536.647 | 1.014190042
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
>> 1024 | 911.58 | 1030.54 | 1.130498695
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
>> 256 | 2034.611 | 2073.764 | 1.019243482
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
>> 512 | 1110.659 | 1116.093 | 1.004892591
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
>> 1024 | 559.269 | 559.651 | 1.000683034
>> 

Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v3]

2022-01-04 Thread Sandhya Viswanathan
On Tue, 4 Jan 2022 15:11:47 GMT, Jatin Bhateja  wrote:

>> Patch extends existing macrologic inferencing algorithm to handle masked 
>> logic operations.
>> 
>> Existing algorithm:
>> 
>> 1. Identify logic cone roots.
>> 2. Packs parent and logic child nodes into a MacroLogic node in bottom up 
>> traversal if input constraint are met.
>> i.e. maximum number of inputs which a macro logic node can have.
>> 3. Perform symbolic evaluation of logic expression tree by assigning value 
>> corresponding to a truth table column
>> to each input.
>> 4. Inputs along with encoded function together represents a macro logic node 
>> which mimics a truth table.
>> 
>> Modification:
>> Extended the packing algorithm to operate on both predicated or 
>> non-predicated logic nodes. Following
>> rules define the criteria under which nodes gets packed into a macro logic 
>> node:-
>> 
>> 1. Parent and both child nodes are all unmasked or masked with same 
>> predicates.
>> 2. Masked parent can be packed with left child if it is predicated and both 
>> have same prediates.
>> 3. Masked parent can be packed with right child if its un-predicated or has 
>> matching predication condition.
>> 4. An unmasked parent can be packed with an unmasked child.
>> 
>> New jtreg test case added with the patch exhaustively covers all the 
>> different combinations of predications of parent and
>> child nodes.
>> 
>> Following are the performance number for JMH benchmark included with the 
>> patch.
>> 
>> Machine Configuration:  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S 
>> Icelake Server)
>> 
>> Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain ( 
>> withopt/baseline)
>> -- | -- | -- | -- | --
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 
>> | 2.171403315
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 
>> 2.002547072
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 
>> | 1.792558013
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 
>> | 1.882536419
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 
>> 1.560787454
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 
>> 2.022003377
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 
>> 1.63814064
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 
>> 1.384211046
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 
>> 1.140933774
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 
>> 1.121276084
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 
>> 1.205791374
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 
>> | 1.087654397
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 
>> | 1.002939661
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 
>> 1.031267884
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 
>> | 1.030794717
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 
>> | 3435.989 | 4418.09 | 1.285827749
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 
>> | 1524.803 | 1678.201 | 1.100601848
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 
>> 1024 | 972.501 | 1166.734 | 1.199725244
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 
>> | 5980.85 | 7584.17 | 1.268075608
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 
>> | 3258.108 | 3939.23 | 1.209054457
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 
>> 1024 | 1475.365 | 1511.159 | 1.024261115
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 
>> | 4208.766 | 4220.678 | 1.002830283
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 
>> | 2056.651 | 2049.489 | 0.99651764
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 
>> 1024 | 1110.461 | 1116.448 | 1.005391455
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
>> 256 | 3259.348 | 3947.94 | 1.211266793
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
>> 512 | 1515.147 | 1536.647 | 1.014190042
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 
>> 1024 | 911.58 | 1030.54 | 1.130498695
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
>> 256 | 2034.611 | 2073.764 | 1.019243482
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
>> 512 | 1110.659 | 1116.093 | 1.004892591
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 
>> 1024 | 559.269 | 559.651 | 1.000683034
>> 

Re: RFR: 8275821: Optimize random number generators developed in JDK-8248862 using Math.unsignedMultiplyHigh() [v4]

2021-12-02 Thread Sandhya Viswanathan
On Thu, 2 Dec 2021 20:43:56 GMT, Vamsi Parasa  wrote:

>> This change optimizes random number generators using 
>> Math.unsignedMultiplyHigh()
>
> Vamsi Parasa has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   add seeds for the random generators to eliminate run-to-run variance

@PaulSandoz Could you please also review this small PR.

-

PR: https://git.openjdk.java.net/jdk/pull/6206
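[Editorial note, not part of the thread: the `Math.unsignedMultiplyHigh(long, long)` method being intrinsified here (added during JDK 18 development) can be expressed in terms of the older signed `Math.multiplyHigh` plus a standard two's-complement correction. A minimal sketch of that identity, checked against the built-in method; the class and method names are illustrative only.]

```java
public class UnsignedMulHighDemo {
    // For two's-complement longs:
    // unsignedHigh(x, y) = signedHigh(x, y) + (x < 0 ? y : 0) + (y < 0 ? x : 0),
    // where the conditional adds are done branch-free via arithmetic shift masks.
    static long unsignedMulHigh(long x, long y) {
        return Math.multiplyHigh(x, y) + ((x >> 63) & y) + ((y >> 63) & x);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, -1L, Long.MAX_VALUE, Long.MIN_VALUE, 0x9E3779B97F4A7C15L};
        for (long x : samples) {
            for (long y : samples) {
                if (unsignedMulHigh(x, y) != Math.unsignedMultiplyHigh(x, y)) {
                    throw new AssertionError(x + ", " + y);
                }
            }
        }
        System.out.println("ok");
    }
}
```

This is why an intrinsic helps: without it, the unsigned variant costs a signed high multiply plus two masked adds.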


Re: RFR: JDK-8278014: [vectorapi] Remove test run script

2021-11-30 Thread Sandhya Viswanathan
On Tue, 30 Nov 2021 19:22:53 GMT, Paul Sandoz  wrote:

> Remove Vector API scripts for building and running tests. `jtreg` should be 
> used instead.
> 
> Also updated the test generation script to remove options that assume 
> mercurial as the code repository.

Looks good to me.

-

Marked as reviewed by sviswanathan (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6621


Re: RFR: 8275167: x86 intrinsic for unsignedMultiplyHigh [v2]

2021-10-20 Thread Sandhya Viswanathan
On Tue, 19 Oct 2021 20:34:55 GMT, Vamsi Parasa  wrote:

>> Optimize the new Math.unsignedMultiplyHigh using the x86 mul instruction. 
>> This change show 1.87X improvement on a micro benchmark.
>
> Vamsi Parasa has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   refactoring to remove code duplication by using a common routine for 
> UMulHiLNode and MulHiLNode

Marked as reviewed by sviswanathan (Reviewer).

The patch looks good to me.

-

PR: https://git.openjdk.java.net/jdk/pull/5933


Re: RFR: 8275167: x86 intrinsic for unsignedMultiplyHigh [v2]

2021-10-20 Thread Sandhya Viswanathan
On Fri, 15 Oct 2021 20:19:31 GMT, Vladimir Kozlov  wrote:

>>> How you verified correctness of results? I suggest to extend 
>>> `test/jdk//java/lang/Math/MultiplicationTests.java` test to cover unsigned 
>>> method.
>> 
>> Tests for unsignedMultiplyHigh were already added in 
>> test/jdk//java/lang/Math/MultiplicationTests.java (in July 2021 by Brian 
>> Burkhalter). Used that test to verify the correctness of the results.
>
>> > How you verified correctness of results? I suggest to extend 
>> > `test/jdk//java/lang/Math/MultiplicationTests.java` test to cover unsigned 
>> > method.
>> 
>> Tests for unsignedMultiplyHigh were already added in 
>> test/jdk//java/lang/Math/MultiplicationTests.java (in July 2021 by Brian 
>> Burkhalter). Used that test to verify the correctness of the results.
> 
> Good. It seems I have old version of the test.
> Did you run it with -Xcomp? How you verified that intrinsic is used?

@vnkozlov if the patch looks ok to you, could you please run this through your 
testing?

-

PR: https://git.openjdk.java.net/jdk/pull/5933
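[Editorial note: the verification approach discussed above, cross-checking the intrinsic against an independent reference, can be sketched with BigInteger arithmetic. This is a hypothetical illustration in the spirit of the MultiplicationTests jtreg test, not the actual test code.]

```java
import java.math.BigInteger;
import java.util.SplittableRandom;

public class UnsignedMulHighCheck {
    // Reference result: treat both operands as unsigned 64-bit values,
    // form the full 128-bit product, and keep the high 64 bits.
    static long expectedHigh(long x, long y) {
        BigInteger ux = new BigInteger(Long.toUnsignedString(x));
        BigInteger uy = new BigInteger(Long.toUnsignedString(y));
        return ux.multiply(uy).shiftRight(64).longValue();
    }

    public static void main(String[] args) {
        SplittableRandom r = new SplittableRandom(42);  // fixed seed for reproducibility
        for (int i = 0; i < 100_000; i++) {
            long x = r.nextLong(), y = r.nextLong();
            if (Math.unsignedMultiplyHigh(x, y) != expectedHigh(x, y)) {
                throw new AssertionError(x + " * " + y);
            }
        }
        System.out.println("ok");
    }
}
```

Running such a check with `-Xcomp` (or enough iterations to trigger C2) exercises the compiled intrinsic rather than the interpreter.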


Re: RFR: 8271515: Integration of JEP 417: Vector API (Third Incubator) [v3]

2021-10-19 Thread Sandhya Viswanathan
On Tue, 19 Oct 2021 22:34:13 GMT, Paul Sandoz  wrote:

>> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorMask.java 
>> line 574:
>> 
>>> 572:  * @throws ClassCastException if the species is wrong
>>> 573:  */
>>> 574: abstract  VectorMask check(Class> 
>>> maskClass, Vector vector);
>> 
>> This is a package-private method so the java doc style comments are not 
>> needed here.
>
> I think that is fine, documentation for us :-) I converted to package private 
> since it only really makes sense for internal use right now.

Sounds good.

-

PR: https://git.openjdk.java.net/jdk/pull/5873


Re: RFR: 8271515: Integration of JEP 417: Vector API (Third Incubator) [v3]

2021-10-19 Thread Sandhya Viswanathan
On Sat, 16 Oct 2021 00:56:14 GMT, Paul Sandoz  wrote:

>> This PR improves the performance of vector operations that accept masks on 
>> architectures that support masking in hardware, specifically Intel AVX512 
>> and ARM SVE.
>> 
>> On architectures that do not support masking in hardware the same technique 
>> as before is applied to most operations, specifically composition using 
>> blend.
>> 
>> Masked loads/stores are a special form of masked operation that require 
>> additional care to ensure out-of-bounds access throw exceptions. The range 
>> checking has not been fully optimized and will require further work.
>> 
>> No API enhancements were required and only a few additional tests were 
>> needed.
>
> Paul Sandoz has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains seven commits:
> 
>  - Merge branch 'master' into JDK-8271515-vector-api
>  - Apply patch from https://github.com/openjdk/panama-vector/pull/152
>  - Apply patch from https://github.com/openjdk/panama-vector/pull/142
>  - Apply patch from https://github.com/openjdk/panama-vector/pull/139
>  - Apply patch from https://github.com/openjdk/panama-vector/pull/151
>  - Add new files.
>  - 8271515: Integration of JEP 417: Vector API (Third Incubator)

The Java changes look good to me.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorMask.java 
line 574:

> 572:  * @throws ClassCastException if the species is wrong
> 573:  */
> 574: abstract  VectorMask check(Class> 
> maskClass, Vector vector);

This is a package-private method so the java doc style comments are not needed 
here.

-

PR: https://git.openjdk.java.net/jdk/pull/5873


Re: RFR: 8271515: Integration of JEP 417: Vector API (Third Incubator) [v4]

2021-10-19 Thread Sandhya Viswanathan
On Tue, 19 Oct 2021 22:37:10 GMT, Paul Sandoz  wrote:

>> This PR improves the performance of vector operations that accept masks on 
>> architectures that support masking in hardware, specifically Intel AVX512 
>> and ARM SVE.
>> 
>> On architectures that do not support masking in hardware the same technique 
>> as before is applied to most operations, specifically composition using 
>> blend.
>> 
>> Masked loads/stores are a special form of masked operation that require 
>> additional care to ensure out-of-bounds access throw exceptions. The range 
>> checking has not been fully optimized and will require further work.
>> 
>> No API enhancements were required and only a few additional tests were 
>> needed.
>
> Paul Sandoz has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Resolve review comments.

Marked as reviewed by sviswanathan (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/5873


Re: RFR: 8271515: Integration of JEP 417: Vector API (Third Incubator) [v3]

2021-10-19 Thread Sandhya Viswanathan
On Tue, 19 Oct 2021 19:51:54 GMT, Paul Sandoz  wrote:

>> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
>> line 603:
>> 
>>> 601: if (opKind(op, VO_SPECIAL)) {
>>> 602: if (op == ZOMO) {
>>> 603: return blend(broadcast(-1), compare(NE, 0, m));
>> 
>> This doesn't look correct. The lanes where mask is false should get the 
>> original lane value in this vector.
>
> That should work, since `compare(NE, 0, m) === compare(NE, 0).and(m)`, so 
> when an `m` lane is unset the lane element of `this` vector will be selected.
> 
> Running jshell against a build of PR:
> 
> $ ~/Projects/jdk/jdk/build/macosx-x86_64-server-release/images/jdk/bin/jshell 
> --add-modules jdk.incubator.vector
> |  Welcome to JShell -- Version 18-internal
> |  For an introduction type: /help intro
> 
> jshell> import jdk.incubator.vector.*
> 
> jshell> var s = IntVector.SPECIES_256;
> s ==> Species[int, 8, S_256_BIT]
> 
> jshell> var v = IntVector.fromArray(s, new int[]{0, 1, 0, -2, 0, 3, 0, -4}, 
> 0);
> v ==> [0, 1, 0, -2, 0, 3, 0, -4]
> 
> jshell> var z = v.lanewise(VectorOperators.ZOMO);
> z ==> [0, -1, 0, -1, 0, -1, 0, -1]
> 
> jshell> z = v.lanewise(VectorOperators.ZOMO, s.loadMask(new boolean[]{false, 
> false, false, false, true, true, true, true}, 0));
> z ==> [0, 1, 0, -2, 0, -1, 0, -1]
> 
> jshell>

Yes, you are correct. There is no problem here.

-

PR: https://git.openjdk.java.net/jdk/pull/5873
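[Editorial note: the point resolved above, that `compare(NE, 0, m) === compare(NE, 0).and(m)`, can be modeled per lane in plain scalar Java. This is an illustrative model only, not Vector API code; the array-based helper below is hypothetical.]

```java
public class MaskedZomoModel {
    // Masked ZOMO per lane: where m is set, map zero -> 0 and nonzero -> -1;
    // where m is unset, keep the original lane value.
    static int[] zomoMasked(int[] v, boolean[] m) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            // blend(broadcast(-1), compare(NE, 0, m)): the masked compare is
            // (v[i] != 0) && m[i], so lanes with an unset mask select v[i] unchanged.
            boolean sel = (v[i] != 0) && m[i];
            r[i] = sel ? -1 : v[i];
        }
        return r;
    }

    public static void main(String[] args) {
        // Same inputs as the jshell session quoted above.
        int[] v = {0, 1, 0, -2, 0, 3, 0, -4};
        boolean[] m = {false, false, false, false, true, true, true, true};
        int[] r = zomoMasked(v, m);
        int[] expected = {0, 1, 0, -2, 0, -1, 0, -1};
        if (!java.util.Arrays.equals(r, expected)) {
            throw new AssertionError(java.util.Arrays.toString(r));
        }
        System.out.println("ok");
    }
}
```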


Re: RFR: 8271515: Integration of JEP 417: Vector API (Third Incubator) [v3]

2021-10-19 Thread Sandhya Viswanathan
On Tue, 19 Oct 2021 18:54:01 GMT, Paul Sandoz  wrote:

>> src/hotspot/share/utilities/globalDefinitions_vecApi.hpp line 29:
>> 
>>> 27: // the intent of this file to provide a header that can be included in 
>>> .s files.
>>> 28: 
>>> 29: #ifndef SHARE_VM_UTILITIES_GLOBALDEFINITIONS_VECAPI_HPP
>> 
>> The file src/hotspot/share/utilities/globalDefinitions_vecApi.hpp is not 
>> needed.
>
> I notice 
> src/jdk.incubator.vector/windows/native/libsvml/globals_vectorApiSupport_windows.S.inc
>  contains a refence in comments to that file, I presume i can remove that 
> comment too?

Yes, that comment can also be removed. It is a leftover from when svml was 
built as part of libjvm.so.

>> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java
>>  line 278:
>> 
>>> 276: @Override
>>> 277: @ForceInline
>>> 278: public Byte128Vector lanewise(Unary op, VectorMask m) {
>> 
>> Should this method be final as well?
>
> It's actually redundant because the class is final. Better to drop final from 
> all declarations, at the risk of creating a larger diff.

Got it. I am ok with leaving things as is if it makes it easier.

-

PR: https://git.openjdk.java.net/jdk/pull/5873


Re: RFR: 8271515: Integration of JEP 417: Vector API (Third Incubator) [v3]

2021-10-18 Thread Sandhya Viswanathan
On Sat, 16 Oct 2021 00:56:14 GMT, Paul Sandoz  wrote:

>> This PR improves the performance of vector operations that accept masks on 
>> architectures that support masking in hardware, specifically Intel AVX512 
>> and ARM SVE.
>> 
>> On architectures that do not support masking in hardware the same technique 
>> as before is applied to most operations, specifically composition using 
>> blend.
>> 
>> Masked loads/stores are a special form of masked operation that require 
>> additional care to ensure out-of-bounds access throw exceptions. The range 
>> checking has not been fully optimized and will require further work.
>> 
>> No API enhancements were required and only a few additional tests were 
>> needed.
>
> Paul Sandoz has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains seven commits:
> 
>  - Merge branch 'master' into JDK-8271515-vector-api
>  - Apply patch from https://github.com/openjdk/panama-vector/pull/152
>  - Apply patch from https://github.com/openjdk/panama-vector/pull/142
>  - Apply patch from https://github.com/openjdk/panama-vector/pull/139
>  - Apply patch from https://github.com/openjdk/panama-vector/pull/151
>  - Add new files.
>  - 8271515: Integration of JEP 417: Vector API (Third Incubator)

src/hotspot/share/utilities/globalDefinitions.hpp line 36:

> 34: 
> 35: #include COMPILER_HEADER(utilities/globalDefinitions)
> 36: #include "utilities/globalDefinitions_vecApi.hpp"

This change is not needed.

src/hotspot/share/utilities/globalDefinitions_vecApi.hpp line 29:

> 27: // the intent of this file to provide a header that can be included in .s 
> files.
> 28: 
> 29: #ifndef SHARE_VM_UTILITIES_GLOBALDEFINITIONS_VECAPI_HPP

The file src/hotspot/share/utilities/globalDefinitions_vecApi.hpp is not needed.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java 
line 67:

> 65: 
> 66: @Override
> 67: public boolean laneIsSet(int i) {

Missing ForceInline.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java 
line 278:

> 276: @Override
> 277: @ForceInline
> 278: public Byte128Vector lanewise(Unary op, VectorMask m) {

Should this method be final as well?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java 
line 290:

> 288: @Override
> 289: @ForceInline
> 290: public Byte128Vector lanewise(Binary op, Vector v, 
> VectorMask m) {

Should this method be final as well?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java 
line 313:

> 311: public final
> 312: Byte128Vector
> 313: lanewise(Ternary op, Vector v1, Vector v2) {

For unary and binary operator above, we use VectorOperators.Unary and 
VectorOperators.Binary.
Should we use VectorOperators.Ternary here as well then?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java 
line 321:

> 319: public final
> 320: Byte128Vector
> 321: lanewise(Ternary op, Vector v1, Vector v2, 
> VectorMask m) {

Should we use VectorOperators.Ternary here?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java 
line 731:

> 729: @Override
> 730: @ForceInline
> 731: public long toLong() {

Should this and other mask operation methods be final methods?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
line 603:

> 601: if (opKind(op, VO_SPECIAL)) {
> 602: if (op == ZOMO) {
> 603: return blend(broadcast(-1), compare(NE, 0, m));

This doesn't look correct. The lanes where mask is false should get the 
original lane value in this vector.

-

PR: https://git.openjdk.java.net/jdk/pull/5873


Re: RFR: 8274242: Implement fast-path for ASCII-compatible CharsetEncoders on x86

2021-09-24 Thread Sandhya Viswanathan
On Tue, 21 Sep 2021 21:58:48 GMT, Claes Redestad  wrote:

> This patch extends the `ISO_8859_1.implEncodeISOArray` intrinsic on x86 to 
> work also for ASCII encoding, which makes for example the `UTF_8$Encoder` 
> perform on par with (or outperform) similarly getting charset encoded bytes 
> from a String. The former took a small performance hit in JDK 9, and the 
> latter improved greatly in the same release.
> 
> Extending the `EncodeIsoArray` intrinsics on other platforms should be 
> possible, but I'm unfamiliar with the macro assembler in general and unlike 
> the x86 intrinsic they don't use a simple vectorized mask to implement the 
> latin-1 check. For example aarch64 seem to filter out the low bytes and then 
> check if there's any bits set in the high bytes. Clever, but very different 
> to the 0xFF80 2-byte mask that an ASCII test wants.

x86 part of changes look good.

-

PR: https://git.openjdk.java.net/jdk/pull/5621
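[Editorial note: the mask distinction behind extending the latin-1 intrinsic to ASCII can be shown in scalar Java. A vectorized check applies the same 2-byte mask lane-wise across many chars at once; the helper names below are illustrative, not the JDK's.]

```java
public class CharClassMasks {
    // A char encodes in latin-1 iff its value fits in one byte (high byte zero).
    static boolean isLatin1(char c) { return (c & 0xFF00) == 0; }

    // A char is ASCII iff it fits in 7 bits; mask 0xFF80 additionally
    // requires the top bit of the low byte to be clear.
    static boolean isAscii(char c) { return (c & 0xFF80) == 0; }

    public static void main(String[] args) {
        if (!(isAscii('A') && isLatin1('A'))) throw new AssertionError();
        // U+00E9 (e-acute): latin-1 but not ASCII
        if (!(isLatin1('\u00E9') && !isAscii('\u00E9'))) throw new AssertionError();
        // U+20AC (euro sign): neither
        if (isLatin1('\u20AC') || isAscii('\u20AC')) throw new AssertionError();
        System.out.println("ok");
    }
}
```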


Integrated: 8273450: Fix the copyright header of SVML files

2021-09-08 Thread Sandhya Viswanathan
On Tue, 7 Sep 2021 20:25:25 GMT, Sandhya Viswanathan  
wrote:

> Fix the copyright header of SVML files to match others.
> 
> This was brought up on jdk-dev mailing list:
> https://mail.openjdk.java.net/pipermail/jdk-dev/2021-September/005992.html

This pull request has now been integrated.

Changeset: d7efd0e8
Author:Sandhya Viswanathan 
URL:   
https://git.openjdk.java.net/jdk/commit/d7efd0e8cf14c732427d2c1363b60278bebdc06a
Stats: 288 lines in 72 files changed: 144 ins; 0 del; 144 mod

8273450: Fix the copyright header of SVML files

Reviewed-by: dholmes, psandoz

-

PR: https://git.openjdk.java.net/jdk/pull/5399


Re: RFR: 8273450: Fix the copyright header of SVML files

2021-09-08 Thread Sandhya Viswanathan
On Wed, 8 Sep 2021 02:03:12 GMT, Paul Sandoz  wrote:

>> Fix the copyright header of SVML files to match others.
>> 
>> This was brought up on jdk-dev mailing list:
>> https://mail.openjdk.java.net/pipermail/jdk-dev/2021-September/005992.html
>
> Marked as reviewed by psandoz (Reviewer).

Thanks a lot @PaulSandoz for the review.

-

PR: https://git.openjdk.java.net/jdk/pull/5399


Re: RFR: 8273450: Fix the copyright header of SVML files

2021-09-07 Thread Sandhya Viswanathan
On Tue, 7 Sep 2021 23:39:54 GMT, David Holmes  wrote:

>> @dholmes-ora I am from Intel so editing the Intel copyright line should be 
>> ok?
>
> @sviswa7 My apologies, I hadn't realized you worked for Intel. But note that 
> other Intel files i.e. ./hotspot/cpu/x86/macroAssembler_x86_*.cpp also do not 
> have "All rights reserved".
> 
> David

Thanks a lot @dholmes-ora for the review.

-

PR: https://git.openjdk.java.net/jdk/pull/5399


Re: RFR: 8273450: Fix the copyright header of SVML files

2021-09-07 Thread Sandhya Viswanathan
On Tue, 7 Sep 2021 23:08:08 GMT, David Holmes  wrote:

>> Fix the copyright header of SVML files to match others.
>> 
>> This was brought up on jdk-dev mailing list:
>> https://mail.openjdk.java.net/pipermail/jdk-dev/2021-September/005992.html
>
> Hi Sandhya,
> 
> You must not change another company's copyright line, so "All rights 
> reserved" needs to be removed again.
> 
> Thanks,
> David

@dholmes-ora I am from Intel so editing the Intel copyright line should be ok?

-

PR: https://git.openjdk.java.net/jdk/pull/5399


RFR: 8273450: Fix the copyright header of SVML files

2021-09-07 Thread Sandhya Viswanathan
Fix the copyright header of SVML files to match others.

This was brought up on jdk-dev mailing list:
https://mail.openjdk.java.net/pipermail/jdk-dev/2021-September/005992.html

-

Commit messages:
 - 8273450: Fix the copyright header of SVML file

Changes: https://git.openjdk.java.net/jdk/pull/5399/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=5399&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8273450
  Stats: 288 lines in 72 files changed: 144 ins; 0 del; 144 mod
  Patch: https://git.openjdk.java.net/jdk/pull/5399.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/5399/head:pull/5399

PR: https://git.openjdk.java.net/jdk/pull/5399


Integrated: 8272861: Add a micro benchmark for vector api

2021-08-30 Thread Sandhya Viswanathan
On Mon, 23 Aug 2021 23:18:28 GMT, Sandhya Viswanathan 
 wrote:

> This pull request adds a micro benchmark for Vector API.
> The Black Scholes algorithm is implemented with and without Vector API.
> We see about ~6x gain with Vector API for this micro benchmark using 256 bit 
> vectors.

This pull request has now been integrated.

Changeset: 5aaa20f8
Author:Sandhya Viswanathan 
URL:   
https://git.openjdk.java.net/jdk/commit/5aaa20f898e8679ef1c2c36bd01d48c17be0aacf
Stats: 189 lines in 1 file changed: 189 ins; 0 del; 0 mod

8272861: Add a micro benchmark for vector api

Reviewed-by: psandoz

-

PR: https://git.openjdk.java.net/jdk/pull/5234


Re: RFR: 8272861: Add a micro benchmark for vector api [v4]

2021-08-26 Thread Sandhya Viswanathan
> This pull request adds a micro benchmark for Vector API.
> The Black Scholes algorithm is implemented with and without Vector API.
> We see about ~6x gain with Vector API for this micro benchmark using 256 bit 
> vectors.

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  No need to normalize nextFloat

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/5234/files
  - new: https://git.openjdk.java.net/jdk/pull/5234/files/5b4abbf9..df22def3

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=5234&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=5234&range=02-03

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/5234.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/5234/head:pull/5234

PR: https://git.openjdk.java.net/jdk/pull/5234


Re: RFR: 8272861: Add a micro benchmark for vector api [v3]

2021-08-25 Thread Sandhya Viswanathan
On Tue, 24 Aug 2021 20:49:52 GMT, Sandhya Viswanathan 
 wrote:

>> This pull request adds a micro benchmark for Vector API.
>> The Black Scholes algorithm is implemented with and without Vector API.
>> We see about ~6x gain with Vector API for this micro benchmark using 256 bit 
>> vectors.
>
> Sandhya Viswanathan has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Make constants as static final

@PaulSandoz  @ericcaspole Looking forward to your review and approval for this 
vector api micro benchmark.

-

PR: https://git.openjdk.java.net/jdk/pull/5234


Re: RFR: 8272861: Add a micro benchmark for vector api [v3]

2021-08-24 Thread Sandhya Viswanathan
On Tue, 24 Aug 2021 10:09:13 GMT, Aleksey Shipilev  wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Make constants as static final
>
> Some benchmark comments.

@shipilev @nsjian Thanks a lot for the feedback. I have implemented your review 
comments.

-

PR: https://git.openjdk.java.net/jdk/pull/5234


Re: RFR: 8272861: Add a micro benchmark for vector api [v3]

2021-08-24 Thread Sandhya Viswanathan
> This pull request adds a micro benchmark for Vector API.
> The Black Scholes algorithm is implemented with and without Vector API.
> We see about ~6x gain with Vector API for this micro benchmark using 256 bit 
> vectors.

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  Make constants as static final

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/5234/files
  - new: https://git.openjdk.java.net/jdk/pull/5234/files/f92994cd..5b4abbf9

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=5234&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=5234&range=01-02

  Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod
  Patch: https://git.openjdk.java.net/jdk/pull/5234.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/5234/head:pull/5234

PR: https://git.openjdk.java.net/jdk/pull/5234


Re: RFR: 8272861: Add a micro benchmark for vector api [v2]

2021-08-24 Thread Sandhya Viswanathan
> This pull request adds a micro benchmark for Vector API.
> The Black Scholes algorithm is implemented with and without Vector API.
> We see about ~6x gain with Vector API for this micro benchmark using 256 bit 
> vectors.

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  Implement review comments

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/5234/files
  - new: https://git.openjdk.java.net/jdk/pull/5234/files/ca688faa..f92994cd

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=5234&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=5234&range=00-01

  Stats: 4 lines in 1 file changed: 0 ins; 1 del; 3 mod
  Patch: https://git.openjdk.java.net/jdk/pull/5234.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/5234/head:pull/5234

PR: https://git.openjdk.java.net/jdk/pull/5234


RFR: 8272861: Add a micro benchmark for vector api

2021-08-23 Thread Sandhya Viswanathan
This pull request adds a micro benchmark for Vector API.
The Black Scholes algorithm is implemented with and without Vector API.
We see about ~6x gain with Vector API for this micro benchmark using 256 bit 
vectors.

-

Commit messages:
 - whitespace
 - 8272861: Add a micro benchmark for vector api

Changes: https://git.openjdk.java.net/jdk/pull/5234/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=5234&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8272861
  Stats: 190 lines in 1 file changed: 190 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/5234.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/5234/head:pull/5234

PR: https://git.openjdk.java.net/jdk/pull/5234
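[Editorial note: for readers unfamiliar with the workload, a minimal scalar Black-Scholes call-price sketch using the Abramowitz-Stegun normal-CDF approximation. This is my own illustration; the benchmark's actual scalar and vector kernels are in the patch, and the vector version applies the same arithmetic lane-wise with FloatVector.]

```java
public class BlackScholesSketch {
    // Abramowitz-Stegun approximation of the standard normal CDF
    // (absolute error on the order of 1e-7).
    static double cnd(double x) {
        double k = 1.0 / (1.0 + 0.2316419 * Math.abs(x));
        double poly = k * (0.319381530 + k * (-0.356563782
                    + k * (1.781477937 + k * (-1.821255978 + k * 1.330274429))));
        double n = 1.0 - Math.exp(-0.5 * x * x) / Math.sqrt(2.0 * Math.PI) * poly;
        return x >= 0 ? n : 1.0 - n;
    }

    // European call price: spot s, strike xk, rate r, maturity t, volatility sig.
    static double callPrice(double s, double xk, double r, double t, double sig) {
        double d1 = (Math.log(s / xk) + (r + 0.5 * sig * sig) * t) / (sig * Math.sqrt(t));
        double d2 = d1 - sig * Math.sqrt(t);
        return s * cnd(d1) - xk * Math.exp(-r * t) * cnd(d2);
    }

    public static void main(String[] args) {
        double c = callPrice(100, 100, 0.05, 1.0, 0.2);
        // Textbook value for these inputs is about 10.45
        if (Math.abs(c - 10.45) > 0.01) throw new AssertionError(c);
        System.out.println("ok");
    }
}
```

The exp/log-heavy inner loop is exactly where the SVML-backed Vector API transcendental calls give the reported ~6x speedup.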


Re: RFR: 8266054: VectorAPI rotate operation optimization [v13]

2021-07-28 Thread Sandhya Viswanathan
On Wed, 28 Jul 2021 04:48:35 GMT, Vladimir Kozlov  wrote:

>> Looks good to me.
>
> @sviswa7 and @jatin-bhateja
> The push caused https://bugs.openjdk.java.net/browse/JDK-8271366
> I am strongly suggest in a future to ask an Oracle's engineer to test Intel's 
> changes before pushing.

@vnkozlov  @PaulSandoz Sorry for the inconvenience. @jatin-bhateja Please don't 
be in a hurry to push; reach out to Oracle engineers for testing before pushing.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v13]

2021-07-27 Thread Sandhya Viswanathan
On Tue, 27 Jul 2021 18:05:49 GMT, Sandhya Viswanathan 
 wrote:

>> Correcting this,  I2L may be needed in auto-vectorization flow since 
>> Integer/Long.rotate[Right/Left] APIs accept only integral shift, so for 
>> Long.rotate* operations integral shift value must be converted to long using 
>> I2L before broadcasting it. VectorAPI lanewise operations between 
>> vector-scalar, scalar type already matches with vector basic type.  Since 
>> degeneration routine is common b/w both the flows so maintaining IR 
>> consistency here.
>
> For Vector API the shift is always coming in as int type for rotate by scalar 
> (lanewiseShiftTemplate). The down conversion to byte or short needs to be 
> done before scalar2vector.

I see that similar thing is done before for shift, so down conversion to sub 
type is not required.

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v13]

2021-07-27 Thread Sandhya Viswanathan
On Tue, 20 Jul 2021 09:57:07 GMT, Jatin Bhateja  wrote:

>> Current VectorAPI Java side implementation expresses rotateLeft and 
>> rotateRight operation using following operations:-
>> 
>> vec1 = lanewise(VectorOperators.LSHL, n)
>> vec2 = lanewise(VectorOperators.LSHR, n)
>> res = lanewise(VectorOperations.OR, vec1 , vec2)
>> 
>> This patch moves above handling from Java side to C2 compiler which 
>> facilitates dismantling the rotate operation if target ISA does not support 
>> a direct rotate instruction.
>> 
>> AVX512 added vector rotate instructions vpro[rl][v][dq] which operate over 
>> long and integer type vectors. For other cases (i.e. sub-word type vectors 
>> or for targets which do not support direct rotate operations )   instruction 
>> sequence comprising of vector SHIFT (LEFT/RIGHT) and vector OR is emitted.
>> 
>> Please find below the performance data for included JMH benchmark.
>> Machine:  Cascade Lake Server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz)
>> 
>> 
>> Benchmark | (bits) | (shift) | (size) | Baseline Score (ops/ms) | With Opts 
>> (ops/ms) | Gain
>> -- | -- | -- | -- | -- | -- | --
>> RotateBenchmark.testRotateLeftB | 128 | 7 | 256 | 3939.136 | 3836.133 | 
>> 0.973851372
>> RotateBenchmark.testRotateLeftB | 128 | 7 | 512 | 1984.231 | 1918.27 | 
>> 0.966757399
>> RotateBenchmark.testRotateLeftB | 128 | 15 | 256 | 3925.165 | 4043.842 | 
>> 1.030234907
>> RotateBenchmark.testRotateLeftB | 128 | 15 | 512 | 1962.723 | 1936.551 | 
>> 0.986665464
>> RotateBenchmark.testRotateLeftB | 128 | 31 | 256 | 3945.6 | 3817.883 | 
>> 0.967630525
>> RotateBenchmark.testRotateLeftB | 128 | 31 | 512 | 1944.458 | 1914.229 | 
>> 0.984453766
>> RotateBenchmark.testRotateLeftB | 256 | 7 | 256 | 4612.149 | 4514.874 | 
>> 0.978908964
>> RotateBenchmark.testRotateLeftB | 256 | 7 | 512 | 2296.252 | 2270.237 | 
>> 0.988670669
>> RotateBenchmark.testRotateLeftB | 256 | 15 | 256 | 4576.628 | 4515.53 | 
>> 0.986649996
>> RotateBenchmark.testRotateLeftB | 256 | 15 | 512 | 2288.278 | 2270.923 | 
>> 0.992415694
>> RotateBenchmark.testRotateLeftB | 256 | 31 | 256 | 4624.243 | 4511.46 | 
>> 0.975610495
>> RotateBenchmark.testRotateLeftB | 256 | 31 | 512 | 2305.459 | 2273.788 | 
>> 0.986262605
>> RotateBenchmark.testRotateLeftB | 512 | 7 | 256 | 7748.283 | 7777.105 | 
>> 1.003719792
>> RotateBenchmark.testRotateLeftB | 512 | 7 | 512 | 3906.214 | 3912.647 | 
>> 1.001646863
>> RotateBenchmark.testRotateLeftB | 512 | 15 | 256 | 7764.653 | 7763.482 | 
>> 0.999849188
>> RotateBenchmark.testRotateLeftB | 512 | 15 | 512 | 3916.061 | 3919.363 | 
>> 1.000843194
>> RotateBenchmark.testRotateLeftB | 512 | 31 | 256 | 7779.754 | 7770.239 | 
>> 0.998776954
>> RotateBenchmark.testRotateLeftB | 512 | 31 | 512 | 3916.471 | 3912.718 | 
>> 0.999041739
>> RotateBenchmark.testRotateLeftI | 128 | 7 | 256 | 4043.39 | 13461.814 | 
>> 3.329338501
>> RotateBenchmark.testRotateLeftI | 128 | 7 | 512 | 1996.217 | 6455.425 | 
>> 3.233829288
>> RotateBenchmark.testRotateLeftI | 128 | 15 | 256 | 4028.614 | 13077.277 | 
>> 3.246098286
>> RotateBenchmark.testRotateLeftI | 128 | 15 | 512 | 1997.612 | 6452.918 | 
>> 3.230315997
>> RotateBenchmark.testRotateLeftI | 128 | 31 | 256 | 4123.357 | 13079.045 | 
>> 3.171940969
>> RotateBenchmark.testRotateLeftI | 128 | 31 | 512 | 2003.356 | 6452.716 | 
>> 3.22095324
>> RotateBenchmark.testRotateLeftI | 256 | 7 | 256 | 7666.949 | 25658.625 | 
>> 3.34665393
>> RotateBenchmark.testRotateLeftI | 256 | 7 | 512 | 3855.826 | 12278.106 | 
>> 3.18429981
>> RotateBenchmark.testRotateLeftI | 256 | 15 | 256 | 7670.901 | 24625.466 | 
>> 3.210244272
>> RotateBenchmark.testRotateLeftI | 256 | 15 | 512 | 3765.786 | 12272.771 | 
>> 3.259019764
>> RotateBenchmark.testRotateLeftI | 256 | 31 | 256 | 7660.599 | 25678.864 | 
>> 3.352069988
>> RotateBenchmark.testRotateLeftI | 256 | 31 | 512 | 3773.401 | 12006.469 | 
>> 3.181869353
>> RotateBenchmark.testRotateLeftI | 512 | 7 | 256 | 11900.948 | 31242.989 | 
>> 2.625252123
>> RotateBenchmark.testRotateLeftI | 512 | 7 | 512 | 5830.878 | 15727.149 | 
>> 2.697217983
>> RotateBenchmark.testRotateLeftI | 512 | 15 | 256 | 12171.847 | 33180.067 | 
>> 2.72596813
>> RotateBenchmark.testRotateLeftI | 512 | 15 | 512 | 5830.544 | 16740.182 | 
>> 2.871118372
>> RotateBenchmark.testRotateLeftI | 512 | 31 | 256 | 11909.553 | 31250.882 | 
>> 2.624018047
>> RotateBenchmark.testRotateLeftI | 512 | 31 | 512 | 5846.747 | 15738.831 | 
>> 2.691895339
>> RotateBenchmark.testRotateLeftL | 128 | 7 | 256 | 2047.243 | 6888.484 | 
>> 3.364761291
>> RotateBenchmark.testRotateLeftL | 128 | 7 | 512 | 1005.029 | 3245.931 

Re: RFR: 8266054: VectorAPI rotate operation optimization [v13]

2021-07-27 Thread Sandhya Viswanathan
On Tue, 27 Jul 2021 08:17:55 GMT, Jatin Bhateja  wrote:

>> src/hotspot/share/opto/vectorIntrinsics.cpp line 1598:
>> 
>>> 1596:   cnt = elem_bt == T_LONG ? gvn().transform(new ConvI2LNode(cnt)) 
>>> : cnt;
>>> 1597:   opd2 = gvn().transform(VectorNode::scalar2vector(cnt, num_elem, 
>>> type_bt));
>>> 1598: } else {
>> 
>> Why conversion for only T_LONG and not for T_BYTE and T_SHORT? Is there an 
>> assumption here that only T_INT and T_LONG elem_bt are supported?
>
> Correcting this,  I2L may be needed in auto-vectorization flow since 
> Integer/Long.rotate[Right/Left] APIs accept only integral shift, so for 
> Long.rotate* operations integral shift value must be converted to long using 
> I2L before broadcasting it. VectorAPI lanewise operations between 
> vector-scalar, scalar type already matches with vector basic type.  Since 
> degeneration routine is common b/w both the flows so maintaining IR 
> consistency here.

For Vector API the shift is always coming in as int type for rotate by scalar 
(lanewiseShiftTemplate). The down conversion to byte or short needs to be done 
before scalar2vector.
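
For readers following the thread: the fallback sequence under discussion (LSHL, LSHR, OR) reduces rotate to shifts, and the scalar definition it must match is easy to check. A minimal illustrative sketch (not the JDK implementation; `rotl` is a hypothetical helper):

```java
public class RotlSketch {
    // Scalar reference for the vector fallback sequence (LSHL, LSHR, OR).
    // Java masks shift counts to the lane width, so (v >>> -n) is
    // (v >>> (32 - n)) for int lanes -- the same wrap the API side applies
    // before the count reaches the intrinsic.
    static int rotl(int v, int n) {
        return (v << n) | (v >>> -n);
    }

    public static void main(String[] args) {
        for (int n = 0; n < 64; n++) {
            if (rotl(0xDEADBEEF, n) != Integer.rotateLeft(0xDEADBEEF, n))
                throw new AssertionError("mismatch at n=" + n);
        }
        System.out.println("matches Integer.rotateLeft for all counts");
    }
}
```

The same identity holds for long lanes with a 6-bit count mask, which is why a scalar int count must be widened (I2L) before being broadcast for `Long.rotate*`.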

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v13]

2021-07-26 Thread Sandhya Viswanathan
On Tue, 20 Jul 2021 09:57:07 GMT, Jatin Bhateja  wrote:

>> Current VectorAPI Java side implementation expresses rotateLeft and 
>> rotateRight operation using following operations:-
>> 
>> vec1 = lanewise(VectorOperators.LSHL, n)
>> vec2 = lanewise(VectorOperators.LSHR, n)
>> res = lanewise(VectorOperations.OR, vec1 , vec2)
>> 
>> This patch moves above handling from Java side to C2 compiler which 
>> facilitates dismantling the rotate operation if target ISA does not support 
>> a direct rotate instruction.
>> 
>> AVX512 added vector rotate instructions vpro[rl][v][dq] which operate over 
>> long and integer type vectors. For other cases (i.e. sub-word type vectors 
>> or for targets which do not support direct rotate operations )   instruction 
>> sequence comprising of vector SHIFT (LEFT/RIGHT) and vector OR is emitted.
>> 
>> Please find below the performance data for included JMH benchmark.
>> Machine:  Cascade Lake Server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz)
>> 
>> 
>> Benchmark | (bits) | (shift) | (size) | Baseline Score (ops/ms) | With Opts 
>> (ops/ms) | Gain
>> -- | -- | -- | -- | -- | -- | --
>> RotateBenchmark.testRotateLeftB | 128 | 7 | 256 | 3939.136 | 3836.133 | 
>> 0.973851372
>> RotateBenchmark.testRotateLeftB | 128 | 7 | 512 | 1984.231 | 1918.27 | 
>> 0.966757399
>> RotateBenchmark.testRotateLeftB | 128 | 15 | 256 | 3925.165 | 4043.842 | 
>> 1.030234907
>> RotateBenchmark.testRotateLeftB | 128 | 15 | 512 | 1962.723 | 1936.551 | 
>> 0.986665464
>> RotateBenchmark.testRotateLeftB | 128 | 31 | 256 | 3945.6 | 3817.883 | 
>> 0.967630525
>> RotateBenchmark.testRotateLeftB | 128 | 31 | 512 | 1944.458 | 1914.229 | 
>> 0.984453766
>> RotateBenchmark.testRotateLeftB | 256 | 7 | 256 | 4612.149 | 4514.874 | 
>> 0.978908964
>> RotateBenchmark.testRotateLeftB | 256 | 7 | 512 | 2296.252 | 2270.237 | 
>> 0.988670669
>> RotateBenchmark.testRotateLeftB | 256 | 15 | 256 | 4576.628 | 4515.53 | 
>> 0.986649996
>> RotateBenchmark.testRotateLeftB | 256 | 15 | 512 | 2288.278 | 2270.923 | 
>> 0.992415694
>> RotateBenchmark.testRotateLeftB | 256 | 31 | 256 | 4624.243 | 4511.46 | 
>> 0.975610495
>> RotateBenchmark.testRotateLeftB | 256 | 31 | 512 | 2305.459 | 2273.788 | 
>> 0.986262605
>> RotateBenchmark.testRotateLeftB | 512 | 7 | 256 | 7748.283 | .105 | 
>> 1.003719792
>> RotateBenchmark.testRotateLeftB | 512 | 7 | 512 | 3906.214 | 3912.647 | 
>> 1.001646863
>> RotateBenchmark.testRotateLeftB | 512 | 15 | 256 | 7764.653 | 7763.482 | 
>> 0.999849188
>> RotateBenchmark.testRotateLeftB | 512 | 15 | 512 | 3916.061 | 3919.363 | 
>> 1.000843194
>> RotateBenchmark.testRotateLeftB | 512 | 31 | 256 | 7779.754 | 7770.239 | 
>> 0.998776954
>> RotateBenchmark.testRotateLeftB | 512 | 31 | 512 | 3916.471 | 3912.718 | 
>> 0.999041739
>> RotateBenchmark.testRotateLeftI | 128 | 7 | 256 | 4043.39 | 13461.814 | 
>> 3.329338501
>> RotateBenchmark.testRotateLeftI | 128 | 7 | 512 | 1996.217 | 6455.425 | 
>> 3.233829288
>> RotateBenchmark.testRotateLeftI | 128 | 15 | 256 | 4028.614 | 13077.277 | 
>> 3.246098286
>> RotateBenchmark.testRotateLeftI | 128 | 15 | 512 | 1997.612 | 6452.918 | 
>> 3.230315997
>> RotateBenchmark.testRotateLeftI | 128 | 31 | 256 | 4123.357 | 13079.045 | 
>> 3.171940969
>> RotateBenchmark.testRotateLeftI | 128 | 31 | 512 | 2003.356 | 6452.716 | 
>> 3.22095324
>> RotateBenchmark.testRotateLeftI | 256 | 7 | 256 | 7666.949 | 25658.625 | 
>> 3.34665393
>> RotateBenchmark.testRotateLeftI | 256 | 7 | 512 | 3855.826 | 12278.106 | 
>> 3.18429981
>> RotateBenchmark.testRotateLeftI | 256 | 15 | 256 | 7670.901 | 24625.466 | 
>> 3.210244272
>> RotateBenchmark.testRotateLeftI | 256 | 15 | 512 | 3765.786 | 12272.771 | 
>> 3.259019764
>> RotateBenchmark.testRotateLeftI | 256 | 31 | 256 | 7660.599 | 25678.864 | 
>> 3.352069988
>> RotateBenchmark.testRotateLeftI | 256 | 31 | 512 | 3773.401 | 12006.469 | 
>> 3.181869353
>> RotateBenchmark.testRotateLeftI | 512 | 7 | 256 | 11900.948 | 31242.989 | 
>> 2.625252123
>> RotateBenchmark.testRotateLeftI | 512 | 7 | 512 | 5830.878 | 15727.149 | 
>> 2.697217983
>> RotateBenchmark.testRotateLeftI | 512 | 15 | 256 | 12171.847 | 33180.067 | 
>> 2.72596813
>> RotateBenchmark.testRotateLeftI | 512 | 15 | 512 | 5830.544 | 16740.182 | 
>> 2.871118372
>> RotateBenchmark.testRotateLeftI | 512 | 31 | 256 | 11909.553 | 31250.882 | 
>> 2.624018047
>> RotateBenchmark.testRotateLeftI | 512 | 31 | 512 | 5846.747 | 15738.831 | 
>> 2.691895339
>> RotateBenchmark.testRotateLeftL | 128 | 7 | 256 | 2047.243 | 6888.484 | 
>> 3.364761291
>> RotateBenchmark.testRotateLeftL | 128 | 7 | 512 | 1005.029 | 3245.931 

Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-26 Thread Sandhya Viswanathan
On Sun, 18 Jul 2021 20:22:18 GMT, Jatin Bhateja  wrote:

>> src/hotspot/share/opto/vectornode.cpp line 1180:
>> 
>>> 1178:   cnt = cnt->in(1);
>>> 1179: }
>>> 1180: shiftRCnt = cnt;
>> 
>> Why do we remove the And with mask here?
>
> And'ing with shift_mask is already done on the Java API side implementation 
> before making a call to the intrinsic routine.

@jatin-bhateja  This question is still pending.
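
For context, the Java-side masking referred to in the quoted reply wraps the shift count to the lane width before the intrinsic is invoked. A hedged sketch of that wrap (the helper name is hypothetical, not the JDK source):

```java
public class ShiftWrap {
    // Illustrative stand-in for the API-side And with the shift mask:
    // the count is reduced to [0, laneBits) before reaching the intrinsic,
    // so the C2 node need not repeat the And.
    static int wrapShiftCount(int count, int laneBits) {
        return count & (laneBits - 1);  // e.g. & 31 for int lanes, & 63 for long
    }

    public static void main(String[] args) {
        if (wrapShiftCount(33, 32) != 1) throw new AssertionError();
        if (wrapShiftCount(-1, 64) != 63) throw new AssertionError();
        System.out.println("counts wrap as expected");
    }
}
```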

-

PR: https://git.openjdk.java.net/jdk/pull/3720


Re: RFR: 8266054: VectorAPI rotate operation optimization [v10]

2021-07-15 Thread Sandhya Viswanathan
On Thu, 15 Jul 2021 08:34:42 GMT, Jatin Bhateja  wrote:

>> Current VectorAPI Java side implementation expresses rotateLeft and 
>> rotateRight operation using following operations:-
>> 
>> vec1 = lanewise(VectorOperators.LSHL, n)
>> vec2 = lanewise(VectorOperators.LSHR, n)
>> res = lanewise(VectorOperations.OR, vec1 , vec2)
>> 
>> This patch moves above handling from Java side to C2 compiler which 
>> facilitates dismantling the rotate operation if target ISA does not support 
>> a direct rotate instruction.
>> 
>> AVX512 added vector rotate instructions vpro[rl][v][dq] which operate over 
>> long and integer type vectors. For other cases (i.e. sub-word type vectors 
>> or for targets which do not support direct rotate operations )   instruction 
>> sequence comprising of vector SHIFT (LEFT/RIGHT) and vector OR is emitted.
>> 
>> Please find below the performance data for included JMH benchmark.
>> Machine:  Cascade Lake Server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz)
>> 
>> 
>> Benchmark | (TESTSIZE) | Shift | Baseline AVX3 (ops/ms) | Withopt  AVX3 
>> (ops/ms) | Gain % | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain %
>> -- | -- | -- | -- | -- | -- | -- | -- | --
>>   |   |   |   |   |   |   |   |  
>> RotateBenchmark.testRotateLeftB | 128.00 | 7.00 | 17223.35 | 17094.69 | 
>> -0.75 | 17008.32 | 17488.06 | 2.82
>> RotateBenchmark.testRotateLeftB | 128.00 | 7.00 | 8944.98 | 8811.34 | -1.49 
>> | 8878.17 | 9218.68 | 3.84
>> RotateBenchmark.testRotateLeftB | 128.00 | 15.00 | 17195.75 | 17137.32 | 
>> -0.34 | 16789.01 | 17780.34 | 5.90
>> RotateBenchmark.testRotateLeftB | 128.00 | 15.00 | 9052.67 | 8838.60 | -2.36 
>> | 8814.62 | 9206.01 | 4.44
>> RotateBenchmark.testRotateLeftB | 128.00 | 31.00 | 17100.19 | 16950.64 | 
>> -0.87 | 16827.73 | 17720.37 | 5.30
>> RotateBenchmark.testRotateLeftB | 128.00 | 31.00 | 9079.95 | 8471.26 | -6.70 
>> | .44 | 9167.68 | 3.14
>> RotateBenchmark.testRotateLeftB | 256.00 | 7.00 | 21231.33 | 21513.08 | 1.33 
>> | 21824.51 | 21479.48 | -1.58
>> RotateBenchmark.testRotateLeftB | 256.00 | 7.00 | 11103.62 | 11180.16 | 0.69 
>> | 11173.67 | 11529.22 | 3.18
>> RotateBenchmark.testRotateLeftB | 256.00 | 15.00 | 21119.14 | 21552.04 | 
>> 2.05 | 21693.05 | 21915.37 | 1.02
>> RotateBenchmark.testRotateLeftB | 256.00 | 15.00 | 11048.68 | 11094.20 | 
>> 0.41 | 11049.90 | 11439.07 | 3.52
>> RotateBenchmark.testRotateLeftB | 256.00 | 31.00 | 21506.31 | 21391.41 | 
>> -0.53 | 21263.18 | 21986.29 | 3.40
>> RotateBenchmark.testRotateLeftB | 256.00 | 31.00 | 11056.12 | 11232.78 | 
>> 1.60 | 10941.59 | 11397.09 | 4.16
>> RotateBenchmark.testRotateLeftB | 512.00 | 7.00 | 17976.56 | 18180.85 | 1.14 
>> | 1212.26 | 2533.34 | 108.98
>> RotateBenchmark.testRotateLeftB | 512.00 | 15.00 | 17553.70 | 18219.07 | 
>> 3.79 | 1256.73 | 2537.41 | 101.91
>> RotateBenchmark.testRotateLeftB | 512.00 | 31.00 | 17618.03 | 17738.15 | 
>> 0.68 | 1214.69 | 2533.83 | 108.60
>> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 7258.87 | 7468.88 | 2.89 | 
>> 7115.12 | 7117.26 | 0.03
>> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 3586.65 | 3950.85 | 10.15 
>> | 3532.17 | 3595.80 | 1.80
>> RotateBenchmark.testRotateLeftI | 128.00 | 7.00 | 1835.07 | 1999.68 | 8.97 | 
>> 1789.90 | 1819.93 | 1.68
>> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 7273.36 | 7410.91 | 1.89 
>> | 7198.60 | 6994.79 | -2.83
>> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 3674.98 | 3926.27 | 6.84 
>> | 3549.90 | 3755.09 | 5.78
>> RotateBenchmark.testRotateLeftI | 128.00 | 15.00 | 1840.94 | 1882.25 | 2.24 
>> | 1801.56 | 1872.89 | 3.96
>> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 7457.11 | 7361.48 | -1.28 
>> | 6975.33 | 7385.94 | 5.89
>> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 3570.74 | 3929.30 | 10.04 
>> | 3635.37 | 3736.67 | 2.79
>> RotateBenchmark.testRotateLeftI | 128.00 | 31.00 | 1902.32 | 1960.46 | 3.06 
>> | 1812.32 | 1813.88 | 0.09
>> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 11174.24 | 12044.52 | 7.79 
>> | 11509.87 | 11273.44 | -2.05
>> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 5981.47 | 6073.70 | 1.54 | 
>> 5593.66 | 5661.93 | 1.22
>> RotateBenchmark.testRotateLeftI | 256.00 | 7.00 | 2932.49 | 3069.54 | 4.67 | 
>> 2950.86 | 2892.42 | -1.98
>> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 11764.11 | 12098.63 | 
>> 2.84 | 11069.52 | 11476.93 | 3.68
>> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 5855.20 | 6080.40 | 3.85 
>> | 5919.11 | 5607.04 | -5.27
>> RotateBenchmark.testRotateLeftI | 256.00 | 15.00 | 2989.05 | 3048.56 | 1.99 
>> | 2902.63 | 2821.83 | -2.78
>> RotateBenchmark.testRotateLeftI | 256.00 | 31.00 | 11652.84 | 11965.40 | 
>> 2.68 | 11525.62 | 11459.83 | -0.57
>> RotateBenchmark.testRotateLeftI | 256.00 | 31.00 | 5851.82 | 6164.94 | 5.35 
>> | 5882.60 | 5842.30 | -0.69
>> RotateBenchmark.testRotateLeftI | 256.00 | 31.00 | 3015.99 | 3043.79 | 0.92 
>> | 2963.71 | 2947.97 | -0.53
>> RotateBenchmark.testRotateLeftI | 512.00 | 7.00 | 

Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-24 Thread Sandhya Viswanathan
On Thu, 24 Jun 2021 14:50:01 GMT, Vladimir Kozlov  wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Fixing Windows build warnings
>
> The rest of testing hs-tier1-4 and xcomp is finished and clean.
> So this is the only failure. I attached hs_err file to RFE.

Thanks a lot @vnkozlov for the review and test.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v6]

2021-06-22 Thread Sandhya Viswanathan
On Tue, 22 Jun 2021 20:47:55 GMT, Scott Gibbons wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 2 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 5 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 2 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 5 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size   2 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size   5 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Addressing review comments.
>   
>   1. Changed errorvec handling
>   2. Removed unnecessary register copies and aliasing
>   3. Streamlined mask generation

@asgibbons The patch looks good to me.

@vnkozlov We need one more review for this patch. Could you please help?

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v6]

2021-06-22 Thread Sandhya Viswanathan
On Tue, 22 Jun 2021 20:47:55 GMT, Scott Gibbons wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 2 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 5 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 2 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 5 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size   2 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size   5 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Addressing review comments.
>   
>   1. Changed errorvec handling
>   2. Removed unnecessary register copies and aliasing
>   3. Streamlined mask generation

Marked as reviewed by sviswanathan (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v5]

2021-06-19 Thread Sandhya Viswanathan
On Fri, 18 Jun 2021 22:12:11 GMT, Scott Gibbons wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 2 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 5 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 2 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 5 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size   2 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size   5 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Added comments.  Streamlined flow for decode.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6155:

> 6153:   __ subl(output_size, length);
> 6154:   __ movq(rax, -1);
> 6155:   __ shrxq(rax, rax, output_size);// Input mask in rax

I think this could also be implemented as:
__ movq(rax, -1);
__ bzhiq(rax, rax, length);
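
The two sequences produce the same low-bit mask: BZHI keeps the low n bits of its source, which for an all-ones source equals shifting -1 logically right by (64 - n). A quick Java check of that bit identity (illustrative only; the review comment itself is about the x86 instructions, and `length` here stands in for the register value):

```java
public class MaskIdentity {
    public static void main(String[] args) {
        // For n in 1..63: SHRX(-1, 64 - n) == BZHI(-1, n) == low-n-bit mask.
        for (int n = 1; n < 64; n++) {
            long viaShr  = -1L >>> (64 - n);        // shift all-ones right
            long viaBzhi = -1L & ((1L << n) - 1);   // bzhi: zero bits >= n
            if (viaShr != viaBzhi) throw new AssertionError("n=" + n);
        }
        System.out.println("identity holds for n in 1..63");
    }
}
```

The BZHI form saves the subtraction that computes `64 - length`, which is the point of the suggestion.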

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6173:

> 6171:   __ movq(rax, 64);
> 6172:   __ subq(rax, output_size);
> 6173:   __ shrxq(output_mask, output_mask, rax);

The output mask can also be computed using bzhiq:
__ movq(output_mask, -1);
__ bzhiq(output_mask, output_mask, output_size);

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6243:

> 6241: 
> 6242:   __ BIND(L_padding);
> 6243:   __ decrementq(r13, 1);

It will be good to use output_size here instead of r13.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6249:

> 6247:   __ jcc(Assembler::notEqual, L_donePadding);
> 6248: 
> 6249:   __ decrementq(r13, 1);

It will be good to 

Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v5]

2021-06-18 Thread Sandhya Viswanathan
On Fri, 18 Jun 2021 22:12:11 GMT, Scott Gibbons wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 2 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 5 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 2 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 5 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size   2 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size   5 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Added comments.  Streamlined flow for decode.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6004:

> 6002:   __ BIND(L_continue);
> 6003: 
> 6004:   __ vpxor(errorvec, errorvec, errorvec, Assembler::AVX_512bit);

Why is clearing errorvec needed here?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6023:

> 6021:   __ evmovdquq(tmp16_op3, pack16_op, Assembler::AVX_512bit);
> 6022:   __ evmovdquq(tmp16_op2, pack16_op, Assembler::AVX_512bit);
> 6023:   __ evmovdquq(tmp16_op1, pack16_op, Assembler::AVX_512bit);

Why do we need 3 additional copies of pack16_op?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6026:

> 6024:   __ evmovdquq(tmp32_op3, pack32_op, Assembler::AVX_512bit);
> 6025:   __ evmovdquq(tmp32_op2, pack32_op, Assembler::AVX_512bit);
> 6026:   __ evmovdquq(tmp32_op1, pack32_op, Assembler::AVX_512bit);

Why do we need 3 additional copies of pack32_op?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6051:

> 6049:   __ vpternlogd(t0, 0xfe, input1, input2, 

Re: [jdk17] RFR: 8266518: Refactor and expand scatter/gather tests [v2]

2021-06-17 Thread Sandhya Viswanathan
On Thu, 17 Jun 2021 15:09:17 GMT, Paul Sandoz  wrote:

>> Refactor scatter/gather tests to be included in the load/store test classes 
>> and expand to support access between `ShortVector` and and `char[]`, and 
>> access between `ByteVector` and `boolean[]`.
>> 
>> Vector tests pass on linux-x64 linux-aarch64 macosx-x64, and windows-x64.
>
> Paul Sandoz has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Remove redundant data providers.

Looks good to me.

-

PR: https://git.openjdk.java.net/jdk17/pull/48


Re: [jdk17] RFR: 8266518: Refactor and expand scatter/gather tests

2021-06-16 Thread Sandhya Viswanathan
On Mon, 14 Jun 2021 16:26:17 GMT, Paul Sandoz  wrote:

> Refactor scatter/gather tests to be included in the load/store test classes 
> and expand to support access between `ShortVector` and and `char[]`, and 
> access between `ByteVector` and `boolean[]`.
> 
> Vector tests pass on linux-x64 linux-aarch64 macosx-x64, and windows-x64.

Thanks for refactoring and adding new scatter/gather tests for boolean and 
char. Other than the duplicate data provider in byte/boolean tests, the rest 
looks good to me.

test/jdk/jdk/incubator/vector/Byte128VectorLoadStoreTests.java line 1248:

> 1246: toArray(Object[][]::new);
> 1247: }
> 1248: 

The byteGatherScatterProvider and byteGatherScatterMaskProvider seem to be the 
same as gatherScatterProvider and gatherScatterMaskProvider above.

-

Marked as reviewed by sviswanathan (Reviewer).

PR: https://git.openjdk.java.net/jdk17/pull/48


Re: [jdk17] RFR: 8268353: Test libsvml.so is and is not present in jdk image

2021-06-15 Thread Sandhya Viswanathan
On Mon, 14 Jun 2021 16:06:04 GMT, Paul Sandoz  wrote:

> Test that when the jdk.incubator.vector module is present, libsvml.so is 
> present, and test the opposite case.

Looks good to me.

-

Marked as reviewed by sviswanathan (Reviewer).

PR: https://git.openjdk.java.net/jdk17/pull/47


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

2021-06-08 Thread Sandhya Viswanathan
On Tue, 8 Jun 2021 00:30:38 GMT, Scott Gibbons 
 wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 2 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 5 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 2 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 5 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size   2 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size   5 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing review comments.  Adding notes about isMIME parameter for other 
> architectures; clarifying decodeBlock comments.

@asgibbons Thanks a lot for contributing this. The performance gain is 
impressive. I have some minor comments. Please take a look.
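The decodeBlock changes described in the PR (an extra isMIME parameter, and returning the actual number of bytes decoded rather than a forced multiple of 3) can be sketched as a simplified scalar analogue; the class name and the simplifications (no padding handling, full quads only) are illustrative and not the JDK implementation:

```java
import java.util.Arrays;

public class Base64DecodeSketch {
    private static final int[] DECODE = new int[256];
    static {
        Arrays.fill(DECODE, -1);
        String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        + "abcdefghijklmnopqrstuvwxyz0123456789+/";
        for (int i = 0; i < alphabet.length(); i++) {
            DECODE[alphabet.charAt(i)] = i;
        }
    }

    // Scalar analogue of the intrinsified decodeBlock: isMIME signals that
    // invalid characters (e.g. line breaks every ~76 chars) may appear and
    // should be skipped rather than ending the decode. Returns the actual
    // number of bytes written, not a multiple of 3.
    static int decodeBlock(byte[] src, int sp, int sl,
                           byte[] dst, int dp, boolean isMIME) {
        int bits = 0, shift = 18, written = 0;
        while (sp < sl) {
            int b = DECODE[src[sp++] & 0xff];
            if (b < 0) {
                if (isMIME) continue;  // MIME: skip stray characters
                break;                 // otherwise: stop at first invalid char
            }
            bits |= b << shift;
            if (shift == 0) {          // four chars accumulated -> emit 3 bytes
                dst[dp + written++] = (byte) (bits >> 16);
                dst[dp + written++] = (byte) (bits >> 8);
                dst[dp + written++] = (byte) bits;
                bits = 0;
                shift = 18;
            } else {
                shift -= 6;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        byte[] dst = new byte[8];
        int n = decodeBlock("QUJD".getBytes(), 0, 4, dst, 0, false);
        System.out.println(n + " " + new String(dst, 0, n)); // 3 ABC
    }
}
```

The isMIME fast path in the intrinsic exists precisely because this "skip stray characters" branch makes the fully vectorized setup unprofitable for MIME inputs.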

src/hotspot/cpu/x86/assembler_x86.cpp line 4555:

> 4553: void Assembler::evpmaddubsw(XMMRegister dst, XMMRegister src1, 
> XMMRegister src2, int vector_len) {
> 4554:   assert(VM_Version::supports_avx512bw(), "");
> 4555:   InstructionAttr attributes(vector_len, /* rex_w */ false, /* 
> legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ true, /* uses_vl */ true);

This instruction is also supported on AVX platforms. The assert check could be 
as follows:
  assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
         vector_len == AVX_256bit ? VM_Version::supports_avx2() :
         vector_len == AVX_512bit ? VM_Version::supports_avx512bw() : 0, "");
Accordingly, the instruction could be named vpmaddubsw.
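The selection logic in that suggested assert (one required CPU feature per vector length) can be modeled standalone; the boolean flags below are illustrative stand-ins for the VM_Version queries, not real CPU detection:

```java
public class VpmaddubswCheck {
    // Stand-ins for VM_Version::supports_avx()/supports_avx2()/supports_avx512bw();
    // the flag values chosen here are assumptions for illustration.
    static boolean supportsAvx = true;
    static boolean supportsAvx2 = true;
    static boolean supportsAvx512bw = false;

    static final int AVX_128bit = 0, AVX_256bit = 1, AVX_512bit = 2;

    // Mirrors the chained ternary from the review comment: each vector length
    // requires a progressively newer CPU feature, and any other length fails.
    static boolean vpmaddubswSupported(int vectorLen) {
        return vectorLen == AVX_128bit ? supportsAvx
             : vectorLen == AVX_256bit ? supportsAvx2
             : vectorLen == AVX_512bit ? supportsAvx512bw
             : false;
    }

    public static void main(String[] args) {
        System.out.println(vpmaddubswSupported(AVX_128bit)); // true
        System.out.println(vpmaddubswSupported(AVX_512bit)); // false (flag off above)
    }
}
```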

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp 

Integrated: 8268151: Vector API toShuffle optimization

2021-06-04 Thread Sandhya Viswanathan
On Thu, 3 Jun 2021 00:29:00 GMT, Sandhya Viswanathan  
wrote:

> The Vector API toShuffle method can be optimized using existing vector 
> conversion intrinsic.
> 
> The following changes are made:
> 1) vector.toShuffle java implementation is changed to call 
> VectorSupport.convert.
> 2) The conversion intrinsic (inline_vector_convert()) in vectorIntrinsics.cpp 
> is changed to allow shuffle as a destination type.
> 3) The shuffle.toVector intrinsic (inline_vector_shuffle_to_vector()) in 
> vectorIntrinsics.cpp now explicitly generates conversion node instead of 
> performing conversion during unbox. This is to remove unnecessary boxing 
> during back to back vector.toShuffle and shuffle.toVector calls. 
> 
> Best Regards,
> Sandhya

This pull request has now been integrated.

Changeset: 20b63127
Author: Sandhya Viswanathan 
URL:   
https://git.openjdk.java.net/jdk/commit/20b631278c0c89ccd9c16f2a29d47eb8414aacd5
Stats: 399 lines in 41 files changed: 165 ins; 197 del; 37 mod

8268151: Vector API toShuffle optimization

Reviewed-by: psandoz, vlivanov

-

PR: https://git.openjdk.java.net/jdk/pull/4326


Re: RFR: 8268151: Vector API toShuffle optimization [v2]

2021-06-04 Thread Sandhya Viswanathan
On Fri, 4 Jun 2021 13:03:24 GMT, Vladimir Ivanov  wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Implement review comments
>
> Looks good. 
> 
> One inefficiency I noticed is that repeated `toVector()`/`toShuffle` leave a 
> trail of redundant `VectorCastB2X`/`VectorCast[S..L]2X` nodes behind.

@iwanowww @XiaohongGong Thanks a lot for the review.
@iwanowww I will take up the redundant VectorCastB2X/VectorCast[S..L]2X 
conversion optimizations separately.

-

PR: https://git.openjdk.java.net/jdk/pull/4326


Re: RFR: 8268151: Vector API toShuffle optimization [v2]

2021-06-03 Thread Sandhya Viswanathan
On Thu, 3 Jun 2021 22:01:12 GMT, Paul Sandoz  wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Implement review comments
>
> Java changes look good.

@PaulSandoz Thanks a lot for the review.

-

PR: https://git.openjdk.java.net/jdk/pull/4326


Re: RFR: 8268151: Vector API toShuffle optimization [v2]

2021-06-03 Thread Sandhya Viswanathan
On Thu, 3 Jun 2021 02:31:51 GMT, Xiaohong Gong  wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Implement review comments
>
> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java
>  line 335:
> 
>> 333: @ForceInline
>> 334: private final
>> 335: VectorShuffle toShuffleTemplate(AbstractSpecies dsp) {
> 
> Is it better to move this template method to the super class like other APIs?

Yes, can be moved to super class. Done in the updated commit.

> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java
>  line 350:
> 
>> 348:  Byte128Shuffle.class, byte.class, 
>> VLENGTH,
>> 349:  this, VSPECIES,
>> 350:  Byte128Vector::toShuffleTemplate);
> 
> ditto

Yes, can be moved to super class. Done in the updated commit.

-

PR: https://git.openjdk.java.net/jdk/pull/4326


Re: RFR: 8268151: Vector API toShuffle optimization [v2]

2021-06-03 Thread Sandhya Viswanathan
> The Vector API toShuffle method can be optimized using existing vector 
> conversion intrinsic.
> 
> The following changes are made:
> 1) vector.toShuffle java implementation is changed to call 
> VectorSupport.convert.
> 2) The conversion intrinsic (inline_vector_convert()) in vectorIntrinsics.cpp 
> is changed to allow shuffle as a destination type.
> 3) The shuffle.toVector intrinsic (inline_vector_shuffle_to_vector()) in 
> vectorIntrinsics.cpp now explicitly generates conversion node instead of 
> performing conversion during unbox. This is to remove unnecessary boxing 
> during back to back vector.toShuffle and shuffle.toVector calls. 
> 
> Best Regards,
> Sandhya

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  Implement review comments

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4326/files
  - new: https://git.openjdk.java.net/jdk/pull/4326/files/d5652051..ab582a1c

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4326&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4326&range=00-01

  Stats: 657 lines in 38 files changed: 161 ins; 465 del; 31 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4326.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4326/head:pull/4326

PR: https://git.openjdk.java.net/jdk/pull/4326


Integrated: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics

2021-06-03 Thread Sandhya Viswanathan
On Thu, 22 Apr 2021 19:07:28 GMT, Sandhya Viswanathan 
 wrote:

> This PR contains Short Vector Math Library support related changes for 
> [JEP-414 Vector API (Second Incubator)](https://openjdk.java.net/jeps/414), 
> in preparation for when targeted.
> 
> Intel Short Vector Math Library (SVML) based intrinsics in native x86 
> assembly provide optimized implementation for Vector API transcendental and 
> trigonometric methods.
> These methods are built into a separate library instead of being part of 
> libjvm.so or jvm.dll.
> 
> The following changes are made:
>The source for these methods is placed in the jdk.incubator.vector module 
> under src/jdk.incubator.vector/linux/native/libsvml and 
> src/jdk.incubator.vector/windows/native/libsvml.
>The assembly source files are named as “*.S” and include files are named 
> as “*.S.inc”.
>The corresponding build script is placed at 
> make/modules/jdk.incubator.vector/Lib.gmk.
>Changes are made to build system to support dependency tracking for 
> assembly files with includes.
>The built native libraries (libsvml.so/svml.dll) are placed in bin 
> directory of JDK on Windows and lib directory of JDK on Linux.
> The C2 JIT uses dll_load and dll_lookup to get the addresses of 
> optimized methods from this library.
> 
> Build system changes and module library build scripts are contributed by 
> Magnus (magnus.ihse.bur...@oracle.com).
> 
> Looking forward to your review and feedback.
> 
> Performance:
> Micro benchmark Base Optimized Unit Gain(Optimized/Base)
> Double128Vector.ACOS 45.91 87.34 ops/ms 1.90
> Double128Vector.ASIN 45.06 92.36 ops/ms 2.05
> Double128Vector.ATAN 19.92 118.36 ops/ms 5.94
> Double128Vector.ATAN2 15.24 88.17 ops/ms 5.79
> Double128Vector.CBRT 45.77 208.36 ops/ms 4.55
> Double128Vector.COS 49.94 245.89 ops/ms 4.92
> Double128Vector.COSH 26.91 126.00 ops/ms 4.68
> Double128Vector.EXP 71.64 379.65 ops/ms 5.30
> Double128Vector.EXPM1 35.95 150.37 ops/ms 4.18
> Double128Vector.HYPOT 50.67 174.10 ops/ms 3.44
> Double128Vector.LOG 61.95 279.84 ops/ms 4.52
> Double128Vector.LOG10 59.34 239.05 ops/ms 4.03
> Double128Vector.LOG1P 18.56 200.32 ops/ms 10.79
> Double128Vector.SIN 49.36 240.79 ops/ms 4.88
> Double128Vector.SINH 26.59 103.75 ops/ms 3.90
> Double128Vector.TAN 41.05 152.39 ops/ms 3.71
> Double128Vector.TANH 45.29 169.53 ops/ms 3.74
> Double256Vector.ACOS 54.21 106.39 ops/ms 1.96
> Double256Vector.ASIN 53.60 107.99 ops/ms 2.01
> Double256Vector.ATAN 21.53 189.11 ops/ms 8.78
> Double256Vector.ATAN2 16.67 140.76 ops/ms 8.44
> Double256Vector.CBRT 56.45 397.13 ops/ms 7.04
> Double256Vector.COS 58.26 389.77 ops/ms 6.69
> Double256Vector.COSH 29.44 151.11 ops/ms 5.13
> Double256Vector.EXP 86.67 564.68 ops/ms 6.52
> Double256Vector.EXPM1 41.96 201.28 ops/ms 4.80
> Double256Vector.HYPOT 66.18 305.74 ops/ms 4.62
> Double256Vector.LOG 71.52 394.90 ops/ms 5.52
> Double256Vector.LOG10 65.43 362.32 ops/ms 5.54
> Double256Vector.LOG1P 19.99 300.88 ops/ms 15.05
> Double256Vector.SIN 57.06 380.98 ops/ms 6.68
> Double256Vector.SINH 29.40 117.37 ops/ms 3.99
> Double256Vector.TAN 44.90 279.90 ops/ms 6.23
> Double256Vector.TANH 54.08 274.71 ops/ms 5.08
> Double512Vector.ACOS 55.65 687.54 ops/ms 12.35
> Double512Vector.ASIN 57.31 777.72 ops/ms 13.57
> Double512Vector.ATAN 21.42 729.21 ops/ms 34.04
> Double512Vector.ATAN2 16.37 414.33 ops/ms 25.32
> Double512Vector.CBRT 56.78 834.38 ops/ms 14.69
> Double512Vector.COS 59.88 837.04 ops/ms 13.98
> Double512Vector.COSH 30.34 172.76 ops/ms 5.70
> Double512Vector.EXP 99.66 1608.12 ops/ms 16.14
> Double512Vector.EXPM1 43.39 318.61 ops/ms 7.34
> Double512Vector.HYPOT 73.87 1502.72 ops/ms 20.34
> Double512Vector.LOG 74.84 996.00 ops/ms 13.31
> Double512Vector.LOG10 71.12 1046.52 ops/ms 14.72
> Double512Vector.LOG1P 19.75 776.87 ops/ms 39.34
> Double512Vector.POW 37.42 384.13 ops/ms 10.26
> Double512Vector.SIN 59.74 728.45 ops/ms 12.19
> Double512Vector.SINH 29.47 143.38 ops/ms 4.87
> Double512Vector.TAN 46.20 587.21 ops/ms 12.71
> Double512Vector.TANH 57.36 495.42 ops/ms 8.64
> Double64Vector.ACOS 24.04 73.67 ops/ms 3.06
> Double64Vector.ASIN 23.78 75.11 ops/ms 3.16
> Double64Vector.ATAN 14.14 62.81 ops/ms 4.44
> Double64Vector.ATAN2 10.38 44.43 ops/ms 4.28
> Double64Vector.CBRT 16.47 107.50 ops/ms 6.53
> Double64Vector.COS 23.42 152.01 ops/ms 6.49
> Double64Vector.COSH 17.34 113.34 ops/ms 6.54
> Double64Vector.EXP 27.08 203.53 ops/ms 7.52
> Double64Vector.EXPM1 18.77 96.73 ops/ms 5.15
> Double64Vector.HYPOT 18.54 103.62 ops/ms 5.59
> Double64Vector.LOG 26.75 142.63 ops/ms 5.33
> Double64Vector.LOG10 25.85 139.71 ops/ms 5.40
> Double64Vector.LOG1P 13.26 97.94 ops/ms 7.38
> Double64Vector.SIN 23.28 146.

Re: RFR: 8268151: Vector API toShuffle optimization

2021-06-03 Thread Sandhya Viswanathan
On Thu, 3 Jun 2021 02:27:35 GMT, Xiaohong Gong  wrote:

>> The Vector API toShuffle method can be optimized using existing vector 
>> conversion intrinsic.
>> 
>> The following changes are made:
>> 1) vector.toShuffle java implementation is changed to call 
>> VectorSupport.convert.
>> 2) The conversion intrinsic (inline_vector_convert()) in 
>> vectorIntrinsics.cpp is changed to allow shuffle as a destination type.
>> 3) The shuffle.toVector intrinsic (inline_vector_shuffle_to_vector()) in 
>> vectorIntrinsics.cpp now explicitly generates conversion node instead of 
>> performing conversion during unbox. This is to remove unnecessary boxing 
>> during back to back vector.toShuffle and shuffle.toVector calls. 
>> 
>> Best Regards,
>> Sandhya
>
> src/hotspot/share/opto/vectornode.cpp line 1246:
> 
>> 1244:   return new VectorLoadMaskNode(value, out_vt);
>> 1245: } else if (is_vector_shuffle) {
>> 1246:   if (!is_shuffle_to_vector()) {
> 
> Hi @sviswa7 , thanks for this change! I'm just curious whether 
> `is_shuffle_to_vector()` is still needed for `VectorUnboxNode` with this 
> change? It seems this flag can be removed, doesn't it?

@XiaohongGong is_shuffle_to_vector is still needed as we shouldn't generate 
VectorLoadShuffleNode for shuffle.toVector.

-

PR: https://git.openjdk.java.net/jdk/pull/4326


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v17]

2021-06-03 Thread Sandhya Viswanathan
59 ops/ms 5.03
> Double64Vector.TAN 21.00 86.43 ops/ms 4.12
> Double64Vector.TANH 23.75 111.35 ops/ms 4.69
> Float128Vector.ACOS 57.52 110.65 ops/ms 1.92
> Float128Vector.ASIN 57.15 117.95 ops/ms 2.06
> Float128Vector.ATAN 22.52 318.74 ops/ms 14.15
> Float128Vector.ATAN2 17.06 246.07 ops/ms 14.42
> Float128Vector.CBRT 29.72 443.74 ops/ms 14.93
> Float128Vector.COS 42.82 803.02 ops/ms 18.75
> Float128Vector.COSH 31.44 118.34 ops/ms 3.76
> Float128Vector.EXP 72.43 855.33 ops/ms 11.81
> Float128Vector.EXPM1 37.82 127.85 ops/ms 3.38
> Float128Vector.HYPOT 53.20 591.68 ops/ms 11.12
> Float128Vector.LOG 52.95 877.94 ops/ms 16.58
> Float128Vector.LOG10 49.26 603.72 ops/ms 12.26
> Float128Vector.LOG1P 20.89 430.59 ops/ms 20.61
> Float128Vector.SIN 43.38 745.31 ops/ms 17.18
> Float128Vector.SINH 31.11 112.91 ops/ms 3.63
> Float128Vector.TAN 37.25 332.13 ops/ms 8.92
> Float128Vector.TANH 57.63 453.77 ops/ms 7.87
> Float256Vector.ACOS 65.23 123.73 ops/ms 1.90
> Float256Vector.ASIN 63.41 132.86 ops/ms 2.10
> Float256Vector.ATAN 23.51 649.02 ops/ms 27.61
> Float256Vector.ATAN2 18.19 455.95 ops/ms 25.07
> Float256Vector.CBRT 45.99 594.81 ops/ms 12.93
> Float256Vector.COS 43.75 926.69 ops/ms 21.18
> Float256Vector.COSH 33.52 130.46 ops/ms 3.89
> Float256Vector.EXP 75.70 1366.72 ops/ms 18.05
> Float256Vector.EXPM1 39.00 149.72 ops/ms 3.84
> Float256Vector.HYPOT 52.91 1023.18 ops/ms 19.34
> Float256Vector.LOG 53.31 1545.77 ops/ms 29.00
> Float256Vector.LOG10 50.31 863.80 ops/ms 17.17
> Float256Vector.LOG1P 21.51 616.59 ops/ms 28.66
> Float256Vector.SIN 44.07 911.04 ops/ms 20.67
> Float256Vector.SINH 33.16 122.50 ops/ms 3.69
> Float256Vector.TAN 37.85 497.75 ops/ms 13.15
> Float256Vector.TANH 64.27 537.20 ops/ms 8.36
> Float512Vector.ACOS 67.33 1718.00 ops/ms 25.52
> Float512Vector.ASIN 66.12 1780.85 ops/ms 26.93
> Float512Vector.ATAN 22.63 1780.31 ops/ms 78.69
> Float512Vector.ATAN2 17.52 1113.93 ops/ms 63.57
> Float512Vector.CBRT 54.78 2087.58 ops/ms 38.11
> Float512Vector.COS 40.92 1567.93 ops/ms 38.32
> Float512Vector.COSH 33.42 138.36 ops/ms 4.14
> Float512Vector.EXP 70.51 3835.97 ops/ms 54.41
> Float512Vector.EXPM1 38.06 279.80 ops/ms 7.35
> Float512Vector.HYPOT 50.99 3287.55 ops/ms 64.47
> Float512Vector.LOG 49.61 3156.99 ops/ms 63.64
> Float512Vector.LOG10 46.94 2489.16 ops/ms 53.02
> Float512Vector.LOG1P 20.66 1689.86 ops/ms 81.81
> Float512Vector.POW 32.73 1015.85 ops/ms 31.04
> Float512Vector.SIN 41.17 1587.71 ops/ms 38.56
> Float512Vector.SINH 33.05 129.39 ops/ms 3.91
> Float512Vector.TAN 35.60 1336.11 ops/ms 37.53
> Float512Vector.TANH 65.77 2295.28 ops/ms 34.90
> Float64Vector.ACOS 48.41 89.34 ops/ms 1.85
> Float64Vector.ASIN 47.30 95.72 ops/ms 2.02
> Float64Vector.ATAN 20.62 49.45 ops/ms 2.40
> Float64Vector.ATAN2 15.95 112.35 ops/ms 7.04
> Float64Vector.CBRT 24.03 134.57 ops/ms 5.60
> Float64Vector.COS 44.28 394.33 ops/ms 8.91
> Float64Vector.COSH 28.35 95.27 ops/ms 3.36
> Float64Vector.EXP 65.80 486.37 ops/ms 7.39
> Float64Vector.EXPM1 34.61 85.99 ops/ms 2.48
> Float64Vector.HYPOT 50.40 147.82 ops/ms 2.93
> Float64Vector.LOG 51.93 163.25 ops/ms 3.14
> Float64Vector.LOG10 49.53 147.98 ops/ms 2.99
> Float64Vector.LOG1P 19.20 206.81 ops/ms 10.77
> Float64Vector.SIN 44.41 382.09 ops/ms 8.60
> Float64Vector.SINH 28.20 90.68 ops/ms 3.22
> Float64Vector.TAN 36.29 160.89 ops/ms 4.43
> Float64Vector.TANH 47.65 214.04 ops/ms 4.49

Sandhya Viswanathan has updated the pull request with a new target base due to 
a merge or a rebase. The pull request now contains 21 commits:

 - Merge master
 - update javadoc
 - correct javadoc
 - Javadoc changes
 - correct ppc.ad
 - Merge master
 - Commit missing changes
 - Implement Vladimir Ivanov and Paul Sandoz review comments
 - fix 32-bit build
 - Add comments explaining naming convention
 - ... and 11 more: https://git.openjdk.java.net/jdk/compare/52d8215a...03ac3197

-

Changes: https://git.openjdk.java.net/jdk/pull/3638/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=16
  Stats: 416073 lines in 119 files changed: 415886 ins; 124 del; 63 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


RFR: 8268151: Vector API toShuffle optimization

2021-06-02 Thread Sandhya Viswanathan
The Vector API toShuffle method can be optimized using existing vector 
conversion intrinsic.

The following changes are made:
1) vector.toShuffle java implementation is changed to call 
VectorSupport.convert.
2) The conversion intrinsic (inline_vector_convert()) in vectorIntrinsics.cpp 
is changed to allow shuffle as a destination type.
3) The shuffle.toVector intrinsic (inline_vector_shuffle_to_vector()) in 
vectorIntrinsics.cpp now explicitly generates conversion node instead of 
performing conversion during unbox. This is to remove unnecessary boxing during 
back to back vector.toShuffle and shuffle.toVector calls. 

Best Regards,
Sandhya
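The toShuffle/toVector round trip this patch optimizes can be modeled with plain arrays (no incubator module needed); treating lane values as shuffle indexes wrapped into [0, length) via floorMod is an illustrative assumption here, not the exact Vector API wrapping rule:

```java
import java.util.Arrays;

public class ShuffleRoundTrip {
    // vector.toShuffle: lane values become shuffle indexes (wrapping into
    // [0, length) is an assumption made for this sketch).
    static int[] toShuffle(int[] lanes) {
        int[] idx = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            idx[i] = Math.floorMod(lanes[i], lanes.length);
        }
        return idx;
    }

    // shuffle.toVector: the shuffle's indexes viewed again as lane values.
    // The patch recognizes a back-to-back toShuffle/toVector pair as a plain
    // conversion, so no box/unbox round trip is needed between the two calls.
    static int[] toVector(int[] shuffle) {
        return shuffle.clone();
    }

    // rearrange: apply the shuffle to another vector.
    static int[] rearrange(int[] v, int[] shuffle) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[shuffle[i]];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] s = toShuffle(new int[] {3, 0, 1, 2});
        System.out.println(Arrays.toString(rearrange(new int[] {10, 20, 30, 40}, s)));
        // [40, 10, 20, 30]
    }
}
```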

-

Commit messages:
 - toShuffle optimization

Changes: https://git.openjdk.java.net/jdk/pull/4326/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=4326&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8268151
  Stats: 393 lines in 34 files changed: 314 ins; 42 del; 37 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4326.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4326/head:pull/4326

PR: https://git.openjdk.java.net/jdk/pull/4326


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v16]

2021-06-02 Thread Sandhya Viswanathan
59 ops/ms 5.03
> Double64Vector.TAN 21.00 86.43 ops/ms 4.12
> Double64Vector.TANH 23.75 111.35 ops/ms 4.69
> Float128Vector.ACOS 57.52 110.65 ops/ms 1.92
> Float128Vector.ASIN 57.15 117.95 ops/ms 2.06
> Float128Vector.ATAN 22.52 318.74 ops/ms 14.15
> Float128Vector.ATAN2 17.06 246.07 ops/ms 14.42
> Float128Vector.CBRT 29.72 443.74 ops/ms 14.93
> Float128Vector.COS 42.82 803.02 ops/ms 18.75
> Float128Vector.COSH 31.44 118.34 ops/ms 3.76
> Float128Vector.EXP 72.43 855.33 ops/ms 11.81
> Float128Vector.EXPM1 37.82 127.85 ops/ms 3.38
> Float128Vector.HYPOT 53.20 591.68 ops/ms 11.12
> Float128Vector.LOG 52.95 877.94 ops/ms 16.58
> Float128Vector.LOG10 49.26 603.72 ops/ms 12.26
> Float128Vector.LOG1P 20.89 430.59 ops/ms 20.61
> Float128Vector.SIN 43.38 745.31 ops/ms 17.18
> Float128Vector.SINH 31.11 112.91 ops/ms 3.63
> Float128Vector.TAN 37.25 332.13 ops/ms 8.92
> Float128Vector.TANH 57.63 453.77 ops/ms 7.87
> Float256Vector.ACOS 65.23 123.73 ops/ms 1.90
> Float256Vector.ASIN 63.41 132.86 ops/ms 2.10
> Float256Vector.ATAN 23.51 649.02 ops/ms 27.61
> Float256Vector.ATAN2 18.19 455.95 ops/ms 25.07
> Float256Vector.CBRT 45.99 594.81 ops/ms 12.93
> Float256Vector.COS 43.75 926.69 ops/ms 21.18
> Float256Vector.COSH 33.52 130.46 ops/ms 3.89
> Float256Vector.EXP 75.70 1366.72 ops/ms 18.05
> Float256Vector.EXPM1 39.00 149.72 ops/ms 3.84
> Float256Vector.HYPOT 52.91 1023.18 ops/ms 19.34
> Float256Vector.LOG 53.31 1545.77 ops/ms 29.00
> Float256Vector.LOG10 50.31 863.80 ops/ms 17.17
> Float256Vector.LOG1P 21.51 616.59 ops/ms 28.66
> Float256Vector.SIN 44.07 911.04 ops/ms 20.67
> Float256Vector.SINH 33.16 122.50 ops/ms 3.69
> Float256Vector.TAN 37.85 497.75 ops/ms 13.15
> Float256Vector.TANH 64.27 537.20 ops/ms 8.36
> Float512Vector.ACOS 67.33 1718.00 ops/ms 25.52
> Float512Vector.ASIN 66.12 1780.85 ops/ms 26.93
> Float512Vector.ATAN 22.63 1780.31 ops/ms 78.69
> Float512Vector.ATAN2 17.52 1113.93 ops/ms 63.57
> Float512Vector.CBRT 54.78 2087.58 ops/ms 38.11
> Float512Vector.COS 40.92 1567.93 ops/ms 38.32
> Float512Vector.COSH 33.42 138.36 ops/ms 4.14
> Float512Vector.EXP 70.51 3835.97 ops/ms 54.41
> Float512Vector.EXPM1 38.06 279.80 ops/ms 7.35
> Float512Vector.HYPOT 50.99 3287.55 ops/ms 64.47
> Float512Vector.LOG 49.61 3156.99 ops/ms 63.64
> Float512Vector.LOG10 46.94 2489.16 ops/ms 53.02
> Float512Vector.LOG1P 20.66 1689.86 ops/ms 81.81
> Float512Vector.POW 32.73 1015.85 ops/ms 31.04
> Float512Vector.SIN 41.17 1587.71 ops/ms 38.56
> Float512Vector.SINH 33.05 129.39 ops/ms 3.91
> Float512Vector.TAN 35.60 1336.11 ops/ms 37.53
> Float512Vector.TANH 65.77 2295.28 ops/ms 34.90
> Float64Vector.ACOS 48.41 89.34 ops/ms 1.85
> Float64Vector.ASIN 47.30 95.72 ops/ms 2.02
> Float64Vector.ATAN 20.62 49.45 ops/ms 2.40
> Float64Vector.ATAN2 15.95 112.35 ops/ms 7.04
> Float64Vector.CBRT 24.03 134.57 ops/ms 5.60
> Float64Vector.COS 44.28 394.33 ops/ms 8.91
> Float64Vector.COSH 28.35 95.27 ops/ms 3.36
> Float64Vector.EXP 65.80 486.37 ops/ms 7.39
> Float64Vector.EXPM1 34.61 85.99 ops/ms 2.48
> Float64Vector.HYPOT 50.40 147.82 ops/ms 2.93
> Float64Vector.LOG 51.93 163.25 ops/ms 3.14
> Float64Vector.LOG10 49.53 147.98 ops/ms 2.99
> Float64Vector.LOG1P 19.20 206.81 ops/ms 10.77
> Float64Vector.SIN 44.41 382.09 ops/ms 8.60
> Float64Vector.SINH 28.20 90.68 ops/ms 3.22
> Float64Vector.TAN 36.29 160.89 ops/ms 4.43
> Float64Vector.TANH 47.65 214.04 ops/ms 4.49

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  update javadoc

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3638/files
  - new: https://git.openjdk.java.net/jdk/pull/3638/files/e5208a18..b229e8b4

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=15
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=14-15

  Stats: 18 lines in 1 file changed: 0 ins; 0 del; 18 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v15]

2021-05-25 Thread Sandhya Viswanathan
59 ops/ms 5.03
> Double64Vector.TAN 21.00 86.43 ops/ms 4.12
> Double64Vector.TANH 23.75 111.35 ops/ms 4.69
> Float128Vector.ACOS 57.52 110.65 ops/ms 1.92
> Float128Vector.ASIN 57.15 117.95 ops/ms 2.06
> Float128Vector.ATAN 22.52 318.74 ops/ms 14.15
> Float128Vector.ATAN2 17.06 246.07 ops/ms 14.42
> Float128Vector.CBRT 29.72 443.74 ops/ms 14.93
> Float128Vector.COS 42.82 803.02 ops/ms 18.75
> Float128Vector.COSH 31.44 118.34 ops/ms 3.76
> Float128Vector.EXP 72.43 855.33 ops/ms 11.81
> Float128Vector.EXPM1 37.82 127.85 ops/ms 3.38
> Float128Vector.HYPOT 53.20 591.68 ops/ms 11.12
> Float128Vector.LOG 52.95 877.94 ops/ms 16.58
> Float128Vector.LOG10 49.26 603.72 ops/ms 12.26
> Float128Vector.LOG1P 20.89 430.59 ops/ms 20.61
> Float128Vector.SIN 43.38 745.31 ops/ms 17.18
> Float128Vector.SINH 31.11 112.91 ops/ms 3.63
> Float128Vector.TAN 37.25 332.13 ops/ms 8.92
> Float128Vector.TANH 57.63 453.77 ops/ms 7.87
> Float256Vector.ACOS 65.23 123.73 ops/ms 1.90
> Float256Vector.ASIN 63.41 132.86 ops/ms 2.10
> Float256Vector.ATAN 23.51 649.02 ops/ms 27.61
> Float256Vector.ATAN2 18.19 455.95 ops/ms 25.07
> Float256Vector.CBRT 45.99 594.81 ops/ms 12.93
> Float256Vector.COS 43.75 926.69 ops/ms 21.18
> Float256Vector.COSH 33.52 130.46 ops/ms 3.89
> Float256Vector.EXP 75.70 1366.72 ops/ms 18.05
> Float256Vector.EXPM1 39.00 149.72 ops/ms 3.84
> Float256Vector.HYPOT 52.91 1023.18 ops/ms 19.34
> Float256Vector.LOG 53.31 1545.77 ops/ms 29.00
> Float256Vector.LOG10 50.31 863.80 ops/ms 17.17
> Float256Vector.LOG1P 21.51 616.59 ops/ms 28.66
> Float256Vector.SIN 44.07 911.04 ops/ms 20.67
> Float256Vector.SINH 33.16 122.50 ops/ms 3.69
> Float256Vector.TAN 37.85 497.75 ops/ms 13.15
> Float256Vector.TANH 64.27 537.20 ops/ms 8.36
> Float512Vector.ACOS 67.33 1718.00 ops/ms 25.52
> Float512Vector.ASIN 66.12 1780.85 ops/ms 26.93
> Float512Vector.ATAN 22.63 1780.31 ops/ms 78.69
> Float512Vector.ATAN2 17.52 1113.93 ops/ms 63.57
> Float512Vector.CBRT 54.78 2087.58 ops/ms 38.11
> Float512Vector.COS 40.92 1567.93 ops/ms 38.32
> Float512Vector.COSH 33.42 138.36 ops/ms 4.14
> Float512Vector.EXP 70.51 3835.97 ops/ms 54.41
> Float512Vector.EXPM1 38.06 279.80 ops/ms 7.35
> Float512Vector.HYPOT 50.99 3287.55 ops/ms 64.47
> Float512Vector.LOG 49.61 3156.99 ops/ms 63.64
> Float512Vector.LOG10 46.94 2489.16 ops/ms 53.02
> Float512Vector.LOG1P 20.66 1689.86 ops/ms 81.81
> Float512Vector.POW 32.73 1015.85 ops/ms 31.04
> Float512Vector.SIN 41.17 1587.71 ops/ms 38.56
> Float512Vector.SINH 33.05 129.39 ops/ms 3.91
> Float512Vector.TAN 35.60 1336.11 ops/ms 37.53
> Float512Vector.TANH 65.77 2295.28 ops/ms 34.90
> Float64Vector.ACOS 48.41 89.34 ops/ms 1.85
> Float64Vector.ASIN 47.30 95.72 ops/ms 2.02
> Float64Vector.ATAN 20.62 49.45 ops/ms 2.40
> Float64Vector.ATAN2 15.95 112.35 ops/ms 7.04
> Float64Vector.CBRT 24.03 134.57 ops/ms 5.60
> Float64Vector.COS 44.28 394.33 ops/ms 8.91
> Float64Vector.COSH 28.35 95.27 ops/ms 3.36
> Float64Vector.EXP 65.80 486.37 ops/ms 7.39
> Float64Vector.EXPM1 34.61 85.99 ops/ms 2.48
> Float64Vector.HYPOT 50.40 147.82 ops/ms 2.93
> Float64Vector.LOG 51.93 163.25 ops/ms 3.14
> Float64Vector.LOG10 49.53 147.98 ops/ms 2.99
> Float64Vector.LOG1P 19.20 206.81 ops/ms 10.77
> Float64Vector.SIN 44.41 382.09 ops/ms 8.60
> Float64Vector.SINH 28.20 90.68 ops/ms 3.22
> Float64Vector.TAN 36.29 160.89 ops/ms 4.43
> Float64Vector.TANH 47.65 214.04 ops/ms 4.49

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  correct javadoc

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3638/files
  - new: https://git.openjdk.java.net/jdk/pull/3638/files/6cd50248..e5208a18

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=14
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=13-14

  Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v14]

2021-05-25 Thread Sandhya Viswanathan
59 ops/ms 5.03
> Double64Vector.TAN 21.00 86.43 ops/ms 4.12
> Double64Vector.TANH 23.75 111.35 ops/ms 4.69
> Float128Vector.ACOS 57.52 110.65 ops/ms 1.92
> Float128Vector.ASIN 57.15 117.95 ops/ms 2.06
> Float128Vector.ATAN 22.52 318.74 ops/ms 14.15
> Float128Vector.ATAN2 17.06 246.07 ops/ms 14.42
> Float128Vector.CBRT 29.72 443.74 ops/ms 14.93
> Float128Vector.COS 42.82 803.02 ops/ms 18.75
> Float128Vector.COSH 31.44 118.34 ops/ms 3.76
> Float128Vector.EXP 72.43 855.33 ops/ms 11.81
> Float128Vector.EXPM1 37.82 127.85 ops/ms 3.38
> Float128Vector.HYPOT 53.20 591.68 ops/ms 11.12
> Float128Vector.LOG 52.95 877.94 ops/ms 16.58
> Float128Vector.LOG10 49.26 603.72 ops/ms 12.26
> Float128Vector.LOG1P 20.89 430.59 ops/ms 20.61
> Float128Vector.SIN 43.38 745.31 ops/ms 17.18
> Float128Vector.SINH 31.11 112.91 ops/ms 3.63
> Float128Vector.TAN 37.25 332.13 ops/ms 8.92
> Float128Vector.TANH 57.63 453.77 ops/ms 7.87
> Float256Vector.ACOS 65.23 123.73 ops/ms 1.90
> Float256Vector.ASIN 63.41 132.86 ops/ms 2.10
> Float256Vector.ATAN 23.51 649.02 ops/ms 27.61
> Float256Vector.ATAN2 18.19 455.95 ops/ms 25.07
> Float256Vector.CBRT 45.99 594.81 ops/ms 12.93
> Float256Vector.COS 43.75 926.69 ops/ms 21.18
> Float256Vector.COSH 33.52 130.46 ops/ms 3.89
> Float256Vector.EXP 75.70 1366.72 ops/ms 18.05
> Float256Vector.EXPM1 39.00 149.72 ops/ms 3.84
> Float256Vector.HYPOT 52.91 1023.18 ops/ms 19.34
> Float256Vector.LOG 53.31 1545.77 ops/ms 29.00
> Float256Vector.LOG10 50.31 863.80 ops/ms 17.17
> Float256Vector.LOG1P 21.51 616.59 ops/ms 28.66
> Float256Vector.SIN 44.07 911.04 ops/ms 20.67
> Float256Vector.SINH 33.16 122.50 ops/ms 3.69
> Float256Vector.TAN 37.85 497.75 ops/ms 13.15
> Float256Vector.TANH 64.27 537.20 ops/ms 8.36
> Float512Vector.ACOS 67.33 1718.00 ops/ms 25.52
> Float512Vector.ASIN 66.12 1780.85 ops/ms 26.93
> Float512Vector.ATAN 22.63 1780.31 ops/ms 78.69
> Float512Vector.ATAN2 17.52 1113.93 ops/ms 63.57
> Float512Vector.CBRT 54.78 2087.58 ops/ms 38.11
> Float512Vector.COS 40.92 1567.93 ops/ms 38.32
> Float512Vector.COSH 33.42 138.36 ops/ms 4.14
> Float512Vector.EXP 70.51 3835.97 ops/ms 54.41
> Float512Vector.EXPM1 38.06 279.80 ops/ms 7.35
> Float512Vector.HYPOT 50.99 3287.55 ops/ms 64.47
> Float512Vector.LOG 49.61 3156.99 ops/ms 63.64
> Float512Vector.LOG10 46.94 2489.16 ops/ms 53.02
> Float512Vector.LOG1P 20.66 1689.86 ops/ms 81.81
> Float512Vector.POW 32.73 1015.85 ops/ms 31.04
> Float512Vector.SIN 41.17 1587.71 ops/ms 38.56
> Float512Vector.SINH 33.05 129.39 ops/ms 3.91
> Float512Vector.TAN 35.60 1336.11 ops/ms 37.53
> Float512Vector.TANH 65.77 2295.28 ops/ms 34.90
> Float64Vector.ACOS 48.41 89.34 ops/ms 1.85
> Float64Vector.ASIN 47.30 95.72 ops/ms 2.02
> Float64Vector.ATAN 20.62 49.45 ops/ms 2.40
> Float64Vector.ATAN2 15.95 112.35 ops/ms 7.04
> Float64Vector.CBRT 24.03 134.57 ops/ms 5.60
> Float64Vector.COS 44.28 394.33 ops/ms 8.91
> Float64Vector.COSH 28.35 95.27 ops/ms 3.36
> Float64Vector.EXP 65.80 486.37 ops/ms 7.39
> Float64Vector.EXPM1 34.61 85.99 ops/ms 2.48
> Float64Vector.HYPOT 50.40 147.82 ops/ms 2.93
> Float64Vector.LOG 51.93 163.25 ops/ms 3.14
> Float64Vector.LOG10 49.53 147.98 ops/ms 2.99
> Float64Vector.LOG1P 19.20 206.81 ops/ms 10.77
> Float64Vector.SIN 44.41 382.09 ops/ms 8.60
> Float64Vector.SINH 28.20 90.68 ops/ms 3.22
> Float64Vector.TAN 36.29 160.89 ops/ms 4.43
> Float64Vector.TANH 47.65 214.04 ops/ms 4.49

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  Javadoc changes

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3638/files
  - new: https://git.openjdk.java.net/jdk/pull/3638/files/4d59af0a..6cd50248

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=13
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=12-13

  Stats: 58 lines in 1 file changed: 38 ins; 0 del; 20 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


Integrated: 8267190: Optimize Vector API test operations

2021-05-21 Thread Sandhya Viswanathan
On Fri, 14 May 2021 23:58:38 GMT, Sandhya Viswanathan 
 wrote:

> Vector API test operations (IS_DEFAULT, IS_FINITE, IS_INFINITE, IS_NAN and 
> IS_NEGATIVE) are computed in three steps:
> 1) reinterpreting the floating point vectors as integral vectors (int/long)
> 2) perform the test in integer domain to get an int/long mask
> 3) reinterpret the int/long mask as float/double mask
> Step 3) currently is very slow. It can be optimized by modifying the Java 
> code to utilize the existing reinterpret intrinsic.
> 
> For the VectorTestPerf attached to the JBS for JDK-8267190, the performance 
> improves as follows:
> 
> Base:
> Benchmark (size) Mode Cnt Score Error Units
> VectorTestPerf.IS_DEFAULT 1024 thrpt 5 223.156 ± 90.452 ops/ms
> VectorTestPerf.IS_FINITE 1024 thrpt 5 223.841 ± 91.685 ops/ms
> VectorTestPerf.IS_INFINITE 1024 thrpt 5 224.561 ± 83.890 ops/ms
> VectorTestPerf.IS_NAN 1024 thrpt 5 223.777 ± 70.629 ops/ms
> VectorTestPerf.IS_NEGATIVE 1024 thrpt 5 218.392 ± 79.806 ops/ms
> 
> With patch:
> Benchmark (size) Mode Cnt Score Error Units
> VectorTestPerf.IS_DEFAULT 1024 thrpt 5 8812.357 ± 40.477 ops/ms
> VectorTestPerf.IS_FINITE 1024 thrpt 5 7425.739 ± 296.622 ops/ms
> VectorTestPerf.IS_INFINITE 1024 thrpt 5 8932.730 ± 269.988 ops/ms
> VectorTestPerf.IS_NAN 1024 thrpt 5 8574.872 ± 498.649 ops/ms
> VectorTestPerf.IS_NEGATIVE 1024 thrpt 5 8838.400 ± 11.849 ops/ms
> 
> Best Regards,
> Sandhya

This pull request has now been integrated.

Changeset: 8f10c5a8
Author: Sandhya Viswanathan 
URL:   
https://git.openjdk.java.net/jdk/commit/8f10c5a8900517cfa04256eab909e18535086b98
Stats: 1274 lines in 32 files changed: 652 ins; 279 del; 343 mod

8267190: Optimize Vector API test operations

Reviewed-by: psandoz, kvn

-

PR: https://git.openjdk.java.net/jdk/pull/4039
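The integer-domain test in step 2) of the quoted description can be sketched in plain scalar Java. This is an illustrative sketch only, not code from the patch; the helper name `isNanViaBits` and the use of `Float.floatToRawIntBits` are assumptions about how the IS_NAN test maps to the integer domain (a float is NaN iff its masked bit pattern exceeds that of positive infinity):

```java
public class NanBitTest {
    // Scalar equivalent of the integer-domain IS_NAN test:
    // clear the sign bit, then compare against the bit pattern of
    // +Infinity (0x7f800000); any value strictly greater is a NaN,
    // because NaNs have an all-ones exponent and a non-zero mantissa.
    static boolean isNanViaBits(float f) {
        int bits = Float.floatToRawIntBits(f);
        return (bits & 0x7fffffff) > 0x7f800000;
    }

    public static void main(String[] args) {
        System.out.println(isNanViaBits(Float.NaN));               // true
        System.out.println(isNanViaBits(1.0f));                    // false
        System.out.println(isNanViaBits(Float.POSITIVE_INFINITY)); // false
    }
}
```

The vectorized version performs the same comparison on all lanes at once after reinterpreting the float vector as an int vector; the optimization in this change is making step 3), the reinterpretation of the resulting int mask back to a float mask, use the existing reinterpret intrinsic instead of a slow fallback.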


Re: RFR: 8267190: Optimize Vector API test operations [v3]

2021-05-20 Thread Sandhya Viswanathan
On Thu, 20 May 2021 23:19:01 GMT, Sandhya Viswanathan 
 wrote:

>> Vector API test operations (IS_DEFAULT, IS_FINITE, IS_INFINITE, IS_NAN and 
>> IS_NEGATIVE) are computed in three steps:
>> 1) reinterpreting the floating point vectors as integral vectors (int/long)
>> 2) perform the test in integer domain to get an int/long mask
>> 3) reinterpret the int/long mask as float/double mask
>> Step 3) currently is very slow. It can be optimized by modifying the Java 
>> code to utilize the existing reinterpret intrinsic.
>> 
>> For the VectorTestPerf attached to the JBS for JDK-8267190, the performance 
>> improves as follows:
>> 
>> Base:
>> Benchmark (size) Mode Cnt Score Error Units
>> VectorTestPerf.IS_DEFAULT 1024 thrpt 5 223.156 ± 90.452 ops/ms
>> VectorTestPerf.IS_FINITE 1024 thrpt 5 223.841 ± 91.685 ops/ms
>> VectorTestPerf.IS_INFINITE 1024 thrpt 5 224.561 ± 83.890 ops/ms
>> VectorTestPerf.IS_NAN 1024 thrpt 5 223.777 ± 70.629 ops/ms
>> VectorTestPerf.IS_NEGATIVE 1024 thrpt 5 218.392 ± 79.806 ops/ms
>> 
>> With patch:
>> Benchmark (size) Mode Cnt Score Error Units
>> VectorTestPerf.IS_DEFAULT 1024 thrpt 5 8812.357 ± 40.477 ops/ms
>> VectorTestPerf.IS_FINITE 1024 thrpt 5 7425.739 ± 296.622 ops/ms
>> VectorTestPerf.IS_INFINITE 1024 thrpt 5 8932.730 ± 269.988 ops/ms
>> VectorTestPerf.IS_NAN 1024 thrpt 5 8574.872 ± 498.649 ops/ms
>> VectorTestPerf.IS_NEGATIVE 1024 thrpt 5 8838.400 ± 11.849 ops/ms
>> 
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Implement review comments

Thanks Paul. I have implemented these two suggestions as well.

If no objections from any one else, I plan to integrate this tomorrow, Friday 
May 21.

-

PR: https://git.openjdk.java.net/jdk/pull/4039


Re: RFR: 8267190: Optimize Vector API test operations [v3]

2021-05-20 Thread Sandhya Viswanathan
> Vector API test operations (IS_DEFAULT, IS_FINITE, IS_INFINITE, IS_NAN and 
> IS_NEGATIVE) are computed in three steps:
> 1) reinterpreting the floating point vectors as integral vectors (int/long)
> 2) perform the test in integer domain to get an int/long mask
> 3) reinterpret the int/long mask as float/double mask
> Step 3) currently is very slow. It can be optimized by modifying the Java 
> code to utilize the existing reinterpret intrinsic.
> 
> For the VectorTestPerf attached to the JBS for JDK-8267190, the performance 
> improves as follows:
> 
> Base:
> Benchmark (size) Mode Cnt Score Error Units
> VectorTestPerf.IS_DEFAULT 1024 thrpt 5 223.156 ± 90.452 ops/ms
> VectorTestPerf.IS_FINITE 1024 thrpt 5 223.841 ± 91.685 ops/ms
> VectorTestPerf.IS_INFINITE 1024 thrpt 5 224.561 ± 83.890 ops/ms
> VectorTestPerf.IS_NAN 1024 thrpt 5 223.777 ± 70.629 ops/ms
> VectorTestPerf.IS_NEGATIVE 1024 thrpt 5 218.392 ± 79.806 ops/ms
> 
> With patch:
> Benchmark (size) Mode Cnt Score Error Units
> VectorTestPerf.IS_DEFAULT 1024 thrpt 5 8812.357 ± 40.477 ops/ms
> VectorTestPerf.IS_FINITE 1024 thrpt 5 7425.739 ± 296.622 ops/ms
> VectorTestPerf.IS_INFINITE 1024 thrpt 5 8932.730 ± 269.988 ops/ms
> VectorTestPerf.IS_NAN 1024 thrpt 5 8574.872 ± 498.649 ops/ms
> VectorTestPerf.IS_NEGATIVE 1024 thrpt 5 8838.400 ± 11.849 ops/ms
> 
> Best Regards,
> Sandhya

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  Implement review comments

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4039/files
  - new: https://git.openjdk.java.net/jdk/pull/4039/files/b506fc45..f318c0ee

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4039&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4039&range=01-02

  Stats: 372 lines in 31 files changed: 31 ins; 62 del; 279 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4039.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4039/head:pull/4039

PR: https://git.openjdk.java.net/jdk/pull/4039


Re: RFR: 8267190: Optimize Vector API test operations [v2]

2021-05-19 Thread Sandhya Viswanathan
> Vector API test operations (IS_DEFAULT, IS_FINITE, IS_INFINITE, IS_NAN and 
> IS_NEGATIVE) are computed in three steps:
> 1) reinterpreting the floating point vectors as integral vectors (int/long)
> 2) perform the test in integer domain to get an int/long mask
> 3) reinterpret the int/long mask as float/double mask
> Step 3) currently is very slow. It can be optimized by modifying the Java 
> code to utilize the existing reinterpret intrinsic.
> 
> For the VectorTestPerf attached to the JBS for JDK-8267190, the performance 
> improves as follows:
> 
> Base:
> Benchmark (size) Mode Cnt Score Error Units
> VectorTestPerf.IS_DEFAULT 1024 thrpt 5 223.156 ± 90.452 ops/ms
> VectorTestPerf.IS_FINITE 1024 thrpt 5 223.841 ± 91.685 ops/ms
> VectorTestPerf.IS_INFINITE 1024 thrpt 5 224.561 ± 83.890 ops/ms
> VectorTestPerf.IS_NAN 1024 thrpt 5 223.777 ± 70.629 ops/ms
> VectorTestPerf.IS_NEGATIVE 1024 thrpt 5 218.392 ± 79.806 ops/ms
> 
> With patch:
> Benchmark (size) Mode Cnt Score Error Units
> VectorTestPerf.IS_DEFAULT 1024 thrpt 5 8812.357 ± 40.477 ops/ms
> VectorTestPerf.IS_FINITE 1024 thrpt 5 7425.739 ± 296.622 ops/ms
> VectorTestPerf.IS_INFINITE 1024 thrpt 5 8932.730 ± 269.988 ops/ms
> VectorTestPerf.IS_NAN 1024 thrpt 5 8574.872 ± 498.649 ops/ms
> VectorTestPerf.IS_NEGATIVE 1024 thrpt 5 8838.400 ± 11.849 ops/ms
> 
> Best Regards,
> Sandhya

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  Implement Paul's review comments

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4039/files
  - new: https://git.openjdk.java.net/jdk/pull/4039/files/bb0d4000..b506fc45

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4039&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4039&range=00-01

  Stats: 806 lines in 31 files changed: 0 ins; 310 del; 496 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4039.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4039/head:pull/4039

PR: https://git.openjdk.java.net/jdk/pull/4039


Re: RFR: 8267190: Optimize Vector API test operations

2021-05-19 Thread Sandhya Viswanathan
On Wed, 19 May 2021 16:51:33 GMT, Paul Sandoz  wrote:

>> Vector API test operations (IS_DEFAULT, IS_FINITE, IS_INFINITE, IS_NAN and 
>> IS_NEGATIVE) are computed in three steps:
>> 1) reinterpreting the floating point vectors as integral vectors (int/long)
>> 2) perform the test in integer domain to get an int/long mask
>> 3) reinterpret the int/long mask as float/double mask
>> Step 3) currently is very slow. It can be optimized by modifying the Java 
>> code to utilize the existing reinterpret intrinsic.
>> 
>> For the VectorTestPerf attached to the JBS for JDK-8267190, the performance 
>> improves as follows:
>> 
>> Base:
>> Benchmark (size) Mode Cnt Score Error Units
>> VectorTestPerf.IS_DEFAULT 1024 thrpt 5 223.156 ± 90.452 ops/ms
>> VectorTestPerf.IS_FINITE 1024 thrpt 5 223.841 ± 91.685 ops/ms
>> VectorTestPerf.IS_INFINITE 1024 thrpt 5 224.561 ± 83.890 ops/ms
>> VectorTestPerf.IS_NAN 1024 thrpt 5 223.777 ± 70.629 ops/ms
>> VectorTestPerf.IS_NEGATIVE 1024 thrpt 5 218.392 ± 79.806 ops/ms
>> 
>> With patch:
>> Benchmark (size) Mode Cnt Score Error Units
>> VectorTestPerf.IS_DEFAULT 1024 thrpt 5 8812.357 ± 40.477 ops/ms
>> VectorTestPerf.IS_FINITE 1024 thrpt 5 7425.739 ± 296.622 ops/ms
>> VectorTestPerf.IS_INFINITE 1024 thrpt 5 8932.730 ± 269.988 ops/ms
>> VectorTestPerf.IS_NAN 1024 thrpt 5 8574.872 ± 498.649 ops/ms
>> VectorTestPerf.IS_NEGATIVE 1024 thrpt 5 8838.400 ± 11.849 ops/ms
>> 
>> Best Regards,
>> Sandhya
>
> Tier 1 to 3 tests pass on supported platforms

@PaulSandoz @vnkozlov Thanks a lot for the review. 
Paul, I have implemented your review comments. I also changed the switch to a 
switch expression. Please take a look.

-

PR: https://git.openjdk.java.net/jdk/pull/4039


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v13]

2021-05-19 Thread Sandhya Viswanathan
59 ops/ms 5.03
> Double64Vector.TAN 21.00 86.43 ops/ms 4.12
> Double64Vector.TANH 23.75 111.35 ops/ms 4.69
> Float128Vector.ACOS 57.52 110.65 ops/ms 1.92
> Float128Vector.ASIN 57.15 117.95 ops/ms 2.06
> Float128Vector.ATAN 22.52 318.74 ops/ms 14.15
> Float128Vector.ATAN2 17.06 246.07 ops/ms 14.42
> Float128Vector.CBRT 29.72 443.74 ops/ms 14.93
> Float128Vector.COS 42.82 803.02 ops/ms 18.75
> Float128Vector.COSH 31.44 118.34 ops/ms 3.76
> Float128Vector.EXP 72.43 855.33 ops/ms 11.81
> Float128Vector.EXPM1 37.82 127.85 ops/ms 3.38
> Float128Vector.HYPOT 53.20 591.68 ops/ms 11.12
> Float128Vector.LOG 52.95 877.94 ops/ms 16.58
> Float128Vector.LOG10 49.26 603.72 ops/ms 12.26
> Float128Vector.LOG1P 20.89 430.59 ops/ms 20.61
> Float128Vector.SIN 43.38 745.31 ops/ms 17.18
> Float128Vector.SINH 31.11 112.91 ops/ms 3.63
> Float128Vector.TAN 37.25 332.13 ops/ms 8.92
> Float128Vector.TANH 57.63 453.77 ops/ms 7.87
> Float256Vector.ACOS 65.23 123.73 ops/ms 1.90
> Float256Vector.ASIN 63.41 132.86 ops/ms 2.10
> Float256Vector.ATAN 23.51 649.02 ops/ms 27.61
> Float256Vector.ATAN2 18.19 455.95 ops/ms 25.07
> Float256Vector.CBRT 45.99 594.81 ops/ms 12.93
> Float256Vector.COS 43.75 926.69 ops/ms 21.18
> Float256Vector.COSH 33.52 130.46 ops/ms 3.89
> Float256Vector.EXP 75.70 1366.72 ops/ms 18.05
> Float256Vector.EXPM1 39.00 149.72 ops/ms 3.84
> Float256Vector.HYPOT 52.91 1023.18 ops/ms 19.34
> Float256Vector.LOG 53.31 1545.77 ops/ms 29.00
> Float256Vector.LOG10 50.31 863.80 ops/ms 17.17
> Float256Vector.LOG1P 21.51 616.59 ops/ms 28.66
> Float256Vector.SIN 44.07 911.04 ops/ms 20.67
> Float256Vector.SINH 33.16 122.50 ops/ms 3.69
> Float256Vector.TAN 37.85 497.75 ops/ms 13.15
> Float256Vector.TANH 64.27 537.20 ops/ms 8.36
> Float512Vector.ACOS 67.33 1718.00 ops/ms 25.52
> Float512Vector.ASIN 66.12 1780.85 ops/ms 26.93
> Float512Vector.ATAN 22.63 1780.31 ops/ms 78.69
> Float512Vector.ATAN2 17.52 1113.93 ops/ms 63.57
> Float512Vector.CBRT 54.78 2087.58 ops/ms 38.11
> Float512Vector.COS 40.92 1567.93 ops/ms 38.32
> Float512Vector.COSH 33.42 138.36 ops/ms 4.14
> Float512Vector.EXP 70.51 3835.97 ops/ms 54.41
> Float512Vector.EXPM1 38.06 279.80 ops/ms 7.35
> Float512Vector.HYPOT 50.99 3287.55 ops/ms 64.47
> Float512Vector.LOG 49.61 3156.99 ops/ms 63.64
> Float512Vector.LOG10 46.94 2489.16 ops/ms 53.02
> Float512Vector.LOG1P 20.66 1689.86 ops/ms 81.81
> Float512Vector.POW 32.73 1015.85 ops/ms 31.04
> Float512Vector.SIN 41.17 1587.71 ops/ms 38.56
> Float512Vector.SINH 33.05 129.39 ops/ms 3.91
> Float512Vector.TAN 35.60 1336.11 ops/ms 37.53
> Float512Vector.TANH 65.77 2295.28 ops/ms 34.90
> Float64Vector.ACOS 48.41 89.34 ops/ms 1.85
> Float64Vector.ASIN 47.30 95.72 ops/ms 2.02
> Float64Vector.ATAN 20.62 49.45 ops/ms 2.40
> Float64Vector.ATAN2 15.95 112.35 ops/ms 7.04
> Float64Vector.CBRT 24.03 134.57 ops/ms 5.60
> Float64Vector.COS 44.28 394.33 ops/ms 8.91
> Float64Vector.COSH 28.35 95.27 ops/ms 3.36
> Float64Vector.EXP 65.80 486.37 ops/ms 7.39
> Float64Vector.EXPM1 34.61 85.99 ops/ms 2.48
> Float64Vector.HYPOT 50.40 147.82 ops/ms 2.93
> Float64Vector.LOG 51.93 163.25 ops/ms 3.14
> Float64Vector.LOG10 49.53 147.98 ops/ms 2.99
> Float64Vector.LOG1P 19.20 206.81 ops/ms 10.77
> Float64Vector.SIN 44.41 382.09 ops/ms 8.60
> Float64Vector.SINH 28.20 90.68 ops/ms 3.22
> Float64Vector.TAN 36.29 160.89 ops/ms 4.43
> Float64Vector.TANH 47.65 214.04 ops/ms 4.49

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  correct ppc.ad

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3638/files
  - new: https://git.openjdk.java.net/jdk/pull/3638/files/7b959b67..4d59af0a

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=12
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=11-12

  Stats: 4 lines in 1 file changed: 0 ins; 4 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v12]

2021-05-19 Thread Sandhya Viswanathan
59 ops/ms 5.03
> Double64Vector.TAN 21.00 86.43 ops/ms 4.12
> Double64Vector.TANH 23.75 111.35 ops/ms 4.69
> Float128Vector.ACOS 57.52 110.65 ops/ms 1.92
> Float128Vector.ASIN 57.15 117.95 ops/ms 2.06
> Float128Vector.ATAN 22.52 318.74 ops/ms 14.15
> Float128Vector.ATAN2 17.06 246.07 ops/ms 14.42
> Float128Vector.CBRT 29.72 443.74 ops/ms 14.93
> Float128Vector.COS 42.82 803.02 ops/ms 18.75
> Float128Vector.COSH 31.44 118.34 ops/ms 3.76
> Float128Vector.EXP 72.43 855.33 ops/ms 11.81
> Float128Vector.EXPM1 37.82 127.85 ops/ms 3.38
> Float128Vector.HYPOT 53.20 591.68 ops/ms 11.12
> Float128Vector.LOG 52.95 877.94 ops/ms 16.58
> Float128Vector.LOG10 49.26 603.72 ops/ms 12.26
> Float128Vector.LOG1P 20.89 430.59 ops/ms 20.61
> Float128Vector.SIN 43.38 745.31 ops/ms 17.18
> Float128Vector.SINH 31.11 112.91 ops/ms 3.63
> Float128Vector.TAN 37.25 332.13 ops/ms 8.92
> Float128Vector.TANH 57.63 453.77 ops/ms 7.87
> Float256Vector.ACOS 65.23 123.73 ops/ms 1.90
> Float256Vector.ASIN 63.41 132.86 ops/ms 2.10
> Float256Vector.ATAN 23.51 649.02 ops/ms 27.61
> Float256Vector.ATAN2 18.19 455.95 ops/ms 25.07
> Float256Vector.CBRT 45.99 594.81 ops/ms 12.93
> Float256Vector.COS 43.75 926.69 ops/ms 21.18
> Float256Vector.COSH 33.52 130.46 ops/ms 3.89
> Float256Vector.EXP 75.70 1366.72 ops/ms 18.05
> Float256Vector.EXPM1 39.00 149.72 ops/ms 3.84
> Float256Vector.HYPOT 52.91 1023.18 ops/ms 19.34
> Float256Vector.LOG 53.31 1545.77 ops/ms 29.00
> Float256Vector.LOG10 50.31 863.80 ops/ms 17.17
> Float256Vector.LOG1P 21.51 616.59 ops/ms 28.66
> Float256Vector.SIN 44.07 911.04 ops/ms 20.67
> Float256Vector.SINH 33.16 122.50 ops/ms 3.69
> Float256Vector.TAN 37.85 497.75 ops/ms 13.15
> Float256Vector.TANH 64.27 537.20 ops/ms 8.36
> Float512Vector.ACOS 67.33 1718.00 ops/ms 25.52
> Float512Vector.ASIN 66.12 1780.85 ops/ms 26.93
> Float512Vector.ATAN 22.63 1780.31 ops/ms 78.69
> Float512Vector.ATAN2 17.52 1113.93 ops/ms 63.57
> Float512Vector.CBRT 54.78 2087.58 ops/ms 38.11
> Float512Vector.COS 40.92 1567.93 ops/ms 38.32
> Float512Vector.COSH 33.42 138.36 ops/ms 4.14
> Float512Vector.EXP 70.51 3835.97 ops/ms 54.41
> Float512Vector.EXPM1 38.06 279.80 ops/ms 7.35
> Float512Vector.HYPOT 50.99 3287.55 ops/ms 64.47
> Float512Vector.LOG 49.61 3156.99 ops/ms 63.64
> Float512Vector.LOG10 46.94 2489.16 ops/ms 53.02
> Float512Vector.LOG1P 20.66 1689.86 ops/ms 81.81
> Float512Vector.POW 32.73 1015.85 ops/ms 31.04
> Float512Vector.SIN 41.17 1587.71 ops/ms 38.56
> Float512Vector.SINH 33.05 129.39 ops/ms 3.91
> Float512Vector.TAN 35.60 1336.11 ops/ms 37.53
> Float512Vector.TANH 65.77 2295.28 ops/ms 34.90
> Float64Vector.ACOS 48.41 89.34 ops/ms 1.85
> Float64Vector.ASIN 47.30 95.72 ops/ms 2.02
> Float64Vector.ATAN 20.62 49.45 ops/ms 2.40
> Float64Vector.ATAN2 15.95 112.35 ops/ms 7.04
> Float64Vector.CBRT 24.03 134.57 ops/ms 5.60
> Float64Vector.COS 44.28 394.33 ops/ms 8.91
> Float64Vector.COSH 28.35 95.27 ops/ms 3.36
> Float64Vector.EXP 65.80 486.37 ops/ms 7.39
> Float64Vector.EXPM1 34.61 85.99 ops/ms 2.48
> Float64Vector.HYPOT 50.40 147.82 ops/ms 2.93
> Float64Vector.LOG 51.93 163.25 ops/ms 3.14
> Float64Vector.LOG10 49.53 147.98 ops/ms 2.99
> Float64Vector.LOG1P 19.20 206.81 ops/ms 10.77
> Float64Vector.SIN 44.41 382.09 ops/ms 8.60
> Float64Vector.SINH 28.20 90.68 ops/ms 3.22
> Float64Vector.TAN 36.29 160.89 ops/ms 4.43
> Float64Vector.TANH 47.65 214.04 ops/ms 4.49

Sandhya Viswanathan has updated the pull request with a new target base due to 
a merge or a rebase. The pull request now contains 16 commits:

 - Merge master
 - Commit missing changes
 - Implement Vladimir Ivanov and Paul Sandoz review comments
 - fix 32-bit build
 - Add comments explaining naming convention
 - jcheck fixes
 - Print intrinsic fix
 - Implement review comments
 - Add missing Lib.gmk
 - Merge master
 - ... and 6 more: https://git.openjdk.java.net/jdk/compare/b961f253...7b959b67

-

Changes: https://git.openjdk.java.net/jdk/pull/3638/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=11
  Stats: 416021 lines in 119 files changed: 415854 ins; 124 del; 43 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v2]

2021-05-19 Thread Sandhya Viswanathan
On Wed, 19 May 2021 22:02:14 GMT, Paul Sandoz  wrote:

>> Tier 1 to 3 tests pass for the default set of build profiles.
>
>> Thanks a lot for the review @PaulSandoz @iwanowww @erikj79.
>> Paul and Vladimir, I have implemented your review comments. Please take a 
>> look.
> 
> `case VECTOR_OP_OR` is still present.

@PaulSandoz Thanks for pointing that out. I had missed git add for some of the 
files.

-

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v11]

2021-05-19 Thread Sandhya Viswanathan
59 ops/ms 5.03
> Double64Vector.TAN 21.00 86.43 ops/ms 4.12
> Double64Vector.TANH 23.75 111.35 ops/ms 4.69
> Float128Vector.ACOS 57.52 110.65 ops/ms 1.92
> Float128Vector.ASIN 57.15 117.95 ops/ms 2.06
> Float128Vector.ATAN 22.52 318.74 ops/ms 14.15
> Float128Vector.ATAN2 17.06 246.07 ops/ms 14.42
> Float128Vector.CBRT 29.72 443.74 ops/ms 14.93
> Float128Vector.COS 42.82 803.02 ops/ms 18.75
> Float128Vector.COSH 31.44 118.34 ops/ms 3.76
> Float128Vector.EXP 72.43 855.33 ops/ms 11.81
> Float128Vector.EXPM1 37.82 127.85 ops/ms 3.38
> Float128Vector.HYPOT 53.20 591.68 ops/ms 11.12
> Float128Vector.LOG 52.95 877.94 ops/ms 16.58
> Float128Vector.LOG10 49.26 603.72 ops/ms 12.26
> Float128Vector.LOG1P 20.89 430.59 ops/ms 20.61
> Float128Vector.SIN 43.38 745.31 ops/ms 17.18
> Float128Vector.SINH 31.11 112.91 ops/ms 3.63
> Float128Vector.TAN 37.25 332.13 ops/ms 8.92
> Float128Vector.TANH 57.63 453.77 ops/ms 7.87
> Float256Vector.ACOS 65.23 123.73 ops/ms 1.90
> Float256Vector.ASIN 63.41 132.86 ops/ms 2.10
> Float256Vector.ATAN 23.51 649.02 ops/ms 27.61
> Float256Vector.ATAN2 18.19 455.95 ops/ms 25.07
> Float256Vector.CBRT 45.99 594.81 ops/ms 12.93
> Float256Vector.COS 43.75 926.69 ops/ms 21.18
> Float256Vector.COSH 33.52 130.46 ops/ms 3.89
> Float256Vector.EXP 75.70 1366.72 ops/ms 18.05
> Float256Vector.EXPM1 39.00 149.72 ops/ms 3.84
> Float256Vector.HYPOT 52.91 1023.18 ops/ms 19.34
> Float256Vector.LOG 53.31 1545.77 ops/ms 29.00
> Float256Vector.LOG10 50.31 863.80 ops/ms 17.17
> Float256Vector.LOG1P 21.51 616.59 ops/ms 28.66
> Float256Vector.SIN 44.07 911.04 ops/ms 20.67
> Float256Vector.SINH 33.16 122.50 ops/ms 3.69
> Float256Vector.TAN 37.85 497.75 ops/ms 13.15
> Float256Vector.TANH 64.27 537.20 ops/ms 8.36
> Float512Vector.ACOS 67.33 1718.00 ops/ms 25.52
> Float512Vector.ASIN 66.12 1780.85 ops/ms 26.93
> Float512Vector.ATAN 22.63 1780.31 ops/ms 78.69
> Float512Vector.ATAN2 17.52 1113.93 ops/ms 63.57
> Float512Vector.CBRT 54.78 2087.58 ops/ms 38.11
> Float512Vector.COS 40.92 1567.93 ops/ms 38.32
> Float512Vector.COSH 33.42 138.36 ops/ms 4.14
> Float512Vector.EXP 70.51 3835.97 ops/ms 54.41
> Float512Vector.EXPM1 38.06 279.80 ops/ms 7.35
> Float512Vector.HYPOT 50.99 3287.55 ops/ms 64.47
> Float512Vector.LOG 49.61 3156.99 ops/ms 63.64
> Float512Vector.LOG10 46.94 2489.16 ops/ms 53.02
> Float512Vector.LOG1P 20.66 1689.86 ops/ms 81.81
> Float512Vector.POW 32.73 1015.85 ops/ms 31.04
> Float512Vector.SIN 41.17 1587.71 ops/ms 38.56
> Float512Vector.SINH 33.05 129.39 ops/ms 3.91
> Float512Vector.TAN 35.60 1336.11 ops/ms 37.53
> Float512Vector.TANH 65.77 2295.28 ops/ms 34.90
> Float64Vector.ACOS 48.41 89.34 ops/ms 1.85
> Float64Vector.ASIN 47.30 95.72 ops/ms 2.02
> Float64Vector.ATAN 20.62 49.45 ops/ms 2.40
> Float64Vector.ATAN2 15.95 112.35 ops/ms 7.04
> Float64Vector.CBRT 24.03 134.57 ops/ms 5.60
> Float64Vector.COS 44.28 394.33 ops/ms 8.91
> Float64Vector.COSH 28.35 95.27 ops/ms 3.36
> Float64Vector.EXP 65.80 486.37 ops/ms 7.39
> Float64Vector.EXPM1 34.61 85.99 ops/ms 2.48
> Float64Vector.HYPOT 50.40 147.82 ops/ms 2.93
> Float64Vector.LOG 51.93 163.25 ops/ms 3.14
> Float64Vector.LOG10 49.53 147.98 ops/ms 2.99
> Float64Vector.LOG1P 19.20 206.81 ops/ms 10.77
> Float64Vector.SIN 44.41 382.09 ops/ms 8.60
> Float64Vector.SINH 28.20 90.68 ops/ms 3.22
> Float64Vector.TAN 36.29 160.89 ops/ms 4.43
> Float64Vector.TANH 47.65 214.04 ops/ms 4.49

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  Commit missing changes

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3638/files
  - new: https://git.openjdk.java.net/jdk/pull/3638/files/0b4a1c9c..1b0367ac

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=10
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=09-10

  Stats: 55 lines in 16 files changed: 2 ins; 42 del; 11 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v2]

2021-05-19 Thread Sandhya Viswanathan
On Mon, 3 May 2021 21:41:26 GMT, Paul Sandoz  wrote:

>> Sandhya Viswanathan has updated the pull request with a new target base due 
>> to a merge or a rebase. The pull request now contains six commits:
>> 
>>  - Merge master
>>  - remove whitespace
>>  - Merge master
>>  - Small fix
>>  - cleanup
>>  - x86 short vector math optimization for Vector API
>
> Tier 1 to 3 tests pass for the default set of build profiles.

Thanks a lot for the review @PaulSandoz @iwanowww @erikj79.
Paul and Vladimir, I have implemented your review comments. Please take a look.

-

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v10]

2021-05-19 Thread Sandhya Viswanathan
59 ops/ms 5.03
> Double64Vector.TAN 21.00 86.43 ops/ms 4.12
> Double64Vector.TANH 23.75 111.35 ops/ms 4.69
> Float128Vector.ACOS 57.52 110.65 ops/ms 1.92
> Float128Vector.ASIN 57.15 117.95 ops/ms 2.06
> Float128Vector.ATAN 22.52 318.74 ops/ms 14.15
> Float128Vector.ATAN2 17.06 246.07 ops/ms 14.42
> Float128Vector.CBRT 29.72 443.74 ops/ms 14.93
> Float128Vector.COS 42.82 803.02 ops/ms 18.75
> Float128Vector.COSH 31.44 118.34 ops/ms 3.76
> Float128Vector.EXP 72.43 855.33 ops/ms 11.81
> Float128Vector.EXPM1 37.82 127.85 ops/ms 3.38
> Float128Vector.HYPOT 53.20 591.68 ops/ms 11.12
> Float128Vector.LOG 52.95 877.94 ops/ms 16.58
> Float128Vector.LOG10 49.26 603.72 ops/ms 12.26
> Float128Vector.LOG1P 20.89 430.59 ops/ms 20.61
> Float128Vector.SIN 43.38 745.31 ops/ms 17.18
> Float128Vector.SINH 31.11 112.91 ops/ms 3.63
> Float128Vector.TAN 37.25 332.13 ops/ms 8.92
> Float128Vector.TANH 57.63 453.77 ops/ms 7.87
> Float256Vector.ACOS 65.23 123.73 ops/ms 1.90
> Float256Vector.ASIN 63.41 132.86 ops/ms 2.10
> Float256Vector.ATAN 23.51 649.02 ops/ms 27.61
> Float256Vector.ATAN2 18.19 455.95 ops/ms 25.07
> Float256Vector.CBRT 45.99 594.81 ops/ms 12.93
> Float256Vector.COS 43.75 926.69 ops/ms 21.18
> Float256Vector.COSH 33.52 130.46 ops/ms 3.89
> Float256Vector.EXP 75.70 1366.72 ops/ms 18.05
> Float256Vector.EXPM1 39.00 149.72 ops/ms 3.84
> Float256Vector.HYPOT 52.91 1023.18 ops/ms 19.34
> Float256Vector.LOG 53.31 1545.77 ops/ms 29.00
> Float256Vector.LOG10 50.31 863.80 ops/ms 17.17
> Float256Vector.LOG1P 21.51 616.59 ops/ms 28.66
> Float256Vector.SIN 44.07 911.04 ops/ms 20.67
> Float256Vector.SINH 33.16 122.50 ops/ms 3.69
> Float256Vector.TAN 37.85 497.75 ops/ms 13.15
> Float256Vector.TANH 64.27 537.20 ops/ms 8.36
> Float512Vector.ACOS 67.33 1718.00 ops/ms 25.52
> Float512Vector.ASIN 66.12 1780.85 ops/ms 26.93
> Float512Vector.ATAN 22.63 1780.31 ops/ms 78.69
> Float512Vector.ATAN2 17.52 1113.93 ops/ms 63.57
> Float512Vector.CBRT 54.78 2087.58 ops/ms 38.11
> Float512Vector.COS 40.92 1567.93 ops/ms 38.32
> Float512Vector.COSH 33.42 138.36 ops/ms 4.14
> Float512Vector.EXP 70.51 3835.97 ops/ms 54.41
> Float512Vector.EXPM1 38.06 279.80 ops/ms 7.35
> Float512Vector.HYPOT 50.99 3287.55 ops/ms 64.47
> Float512Vector.LOG 49.61 3156.99 ops/ms 63.64
> Float512Vector.LOG10 46.94 2489.16 ops/ms 53.02
> Float512Vector.LOG1P 20.66 1689.86 ops/ms 81.81
> Float512Vector.POW 32.73 1015.85 ops/ms 31.04
> Float512Vector.SIN 41.17 1587.71 ops/ms 38.56
> Float512Vector.SINH 33.05 129.39 ops/ms 3.91
> Float512Vector.TAN 35.60 1336.11 ops/ms 37.53
> Float512Vector.TANH 65.77 2295.28 ops/ms 34.90
> Float64Vector.ACOS 48.41 89.34 ops/ms 1.85
> Float64Vector.ASIN 47.30 95.72 ops/ms 2.02
> Float64Vector.ATAN 20.62 49.45 ops/ms 2.40
> Float64Vector.ATAN2 15.95 112.35 ops/ms 7.04
> Float64Vector.CBRT 24.03 134.57 ops/ms 5.60
> Float64Vector.COS 44.28 394.33 ops/ms 8.91
> Float64Vector.COSH 28.35 95.27 ops/ms 3.36
> Float64Vector.EXP 65.80 486.37 ops/ms 7.39
> Float64Vector.EXPM1 34.61 85.99 ops/ms 2.48
> Float64Vector.HYPOT 50.40 147.82 ops/ms 2.93
> Float64Vector.LOG 51.93 163.25 ops/ms 3.14
> Float64Vector.LOG10 49.53 147.98 ops/ms 2.99
> Float64Vector.LOG1P 19.20 206.81 ops/ms 10.77
> Float64Vector.SIN 44.41 382.09 ops/ms 8.60
> Float64Vector.SINH 28.20 90.68 ops/ms 3.22
> Float64Vector.TAN 36.29 160.89 ops/ms 4.43
> Float64Vector.TANH 47.65 214.04 ops/ms 4.49

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  Implement Vladimir Ivanov and Paul Sandoz review comments

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3638/files
  - new: https://git.openjdk.java.net/jdk/pull/3638/files/f7e39913..0b4a1c9c

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=09
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=08-09

  Stats: 45 lines in 1 file changed: 0 ins; 45 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v9]

2021-05-18 Thread Sandhya Viswanathan
59 ops/ms 5.03
> Double64Vector.TAN 21.00 86.43 ops/ms 4.12
> Double64Vector.TANH 23.75 111.35 ops/ms 4.69
> Float128Vector.ACOS 57.52 110.65 ops/ms 1.92
> Float128Vector.ASIN 57.15 117.95 ops/ms 2.06
> Float128Vector.ATAN 22.52 318.74 ops/ms 14.15
> Float128Vector.ATAN2 17.06 246.07 ops/ms 14.42
> Float128Vector.CBRT 29.72 443.74 ops/ms 14.93
> Float128Vector.COS 42.82 803.02 ops/ms 18.75
> Float128Vector.COSH 31.44 118.34 ops/ms 3.76
> Float128Vector.EXP 72.43 855.33 ops/ms 11.81
> Float128Vector.EXPM1 37.82 127.85 ops/ms 3.38
> Float128Vector.HYPOT 53.20 591.68 ops/ms 11.12
> Float128Vector.LOG 52.95 877.94 ops/ms 16.58
> Float128Vector.LOG10 49.26 603.72 ops/ms 12.26
> Float128Vector.LOG1P 20.89 430.59 ops/ms 20.61
> Float128Vector.SIN 43.38 745.31 ops/ms 17.18
> Float128Vector.SINH 31.11 112.91 ops/ms 3.63
> Float128Vector.TAN 37.25 332.13 ops/ms 8.92
> Float128Vector.TANH 57.63 453.77 ops/ms 7.87
> Float256Vector.ACOS 65.23 123.73 ops/ms 1.90
> Float256Vector.ASIN 63.41 132.86 ops/ms 2.10
> Float256Vector.ATAN 23.51 649.02 ops/ms 27.61
> Float256Vector.ATAN2 18.19 455.95 ops/ms 25.07
> Float256Vector.CBRT 45.99 594.81 ops/ms 12.93
> Float256Vector.COS 43.75 926.69 ops/ms 21.18
> Float256Vector.COSH 33.52 130.46 ops/ms 3.89
> Float256Vector.EXP 75.70 1366.72 ops/ms 18.05
> Float256Vector.EXPM1 39.00 149.72 ops/ms 3.84
> Float256Vector.HYPOT 52.91 1023.18 ops/ms 19.34
> Float256Vector.LOG 53.31 1545.77 ops/ms 29.00
> Float256Vector.LOG10 50.31 863.80 ops/ms 17.17
> Float256Vector.LOG1P 21.51 616.59 ops/ms 28.66
> Float256Vector.SIN 44.07 911.04 ops/ms 20.67
> Float256Vector.SINH 33.16 122.50 ops/ms 3.69
> Float256Vector.TAN 37.85 497.75 ops/ms 13.15
> Float256Vector.TANH 64.27 537.20 ops/ms 8.36
> Float512Vector.ACOS 67.33 1718.00 ops/ms 25.52
> Float512Vector.ASIN 66.12 1780.85 ops/ms 26.93
> Float512Vector.ATAN 22.63 1780.31 ops/ms 78.69
> Float512Vector.ATAN2 17.52 1113.93 ops/ms 63.57
> Float512Vector.CBRT 54.78 2087.58 ops/ms 38.11
> Float512Vector.COS 40.92 1567.93 ops/ms 38.32
> Float512Vector.COSH 33.42 138.36 ops/ms 4.14
> Float512Vector.EXP 70.51 3835.97 ops/ms 54.41
> Float512Vector.EXPM1 38.06 279.80 ops/ms 7.35
> Float512Vector.HYPOT 50.99 3287.55 ops/ms 64.47
> Float512Vector.LOG 49.61 3156.99 ops/ms 63.64
> Float512Vector.LOG10 46.94 2489.16 ops/ms 53.02
> Float512Vector.LOG1P 20.66 1689.86 ops/ms 81.81
> Float512Vector.POW 32.73 1015.85 ops/ms 31.04
> Float512Vector.SIN 41.17 1587.71 ops/ms 38.56
> Float512Vector.SINH 33.05 129.39 ops/ms 3.91
> Float512Vector.TAN 35.60 1336.11 ops/ms 37.53
> Float512Vector.TANH 65.77 2295.28 ops/ms 34.90
> Float64Vector.ACOS 48.41 89.34 ops/ms 1.85
> Float64Vector.ASIN 47.30 95.72 ops/ms 2.02
> Float64Vector.ATAN 20.62 49.45 ops/ms 2.40
> Float64Vector.ATAN2 15.95 112.35 ops/ms 7.04
> Float64Vector.CBRT 24.03 134.57 ops/ms 5.60
> Float64Vector.COS 44.28 394.33 ops/ms 8.91
> Float64Vector.COSH 28.35 95.27 ops/ms 3.36
> Float64Vector.EXP 65.80 486.37 ops/ms 7.39
> Float64Vector.EXPM1 34.61 85.99 ops/ms 2.48
> Float64Vector.HYPOT 50.40 147.82 ops/ms 2.93
> Float64Vector.LOG 51.93 163.25 ops/ms 3.14
> Float64Vector.LOG10 49.53 147.98 ops/ms 2.99
> Float64Vector.LOG1P 19.20 206.81 ops/ms 10.77
> Float64Vector.SIN 44.41 382.09 ops/ms 8.60
> Float64Vector.SINH 28.20 90.68 ops/ms 3.22
> Float64Vector.TAN 36.29 160.89 ops/ms 4.43
> Float64Vector.TANH 47.65 214.04 ops/ms 4.49

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  fix 32-bit build

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3638/files
  - new: https://git.openjdk.java.net/jdk/pull/3638/files/45f20a34..f7e39913

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=08
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=07-08

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v8]

2021-05-18 Thread Sandhya Viswanathan
On Wed, 19 May 2021 00:58:15 GMT, Sandhya Viswanathan 
 wrote:

>> This PR contains Short Vector Math Library support related changes for 
>> [JEP-414 Vector API (Second Incubator)](https://openjdk.java.net/jeps/414), 
>> in preparation for when targeted.
>> 
>> Intel Short Vector Math Library (SVML) based intrinsics in native x86 
>> assembly provide optimized implementation for Vector API transcendental and 
>> trigonometric methods.
>> These methods are built into a separate library instead of being part of 
>> libjvm.so or jvm.dll.
>> 
>> The following changes are made:
>>The source for these methods is placed in the jdk.incubator.vector module 
>> under src/jdk.incubator.vector/linux/native/libsvml and 
>> src/jdk.incubator.vector/windows/native/libsvml.
>>The assembly source files are named as “*.S” and include files are named 
>> as “*.S.inc”.
>>The corresponding build script is placed at 
>> make/modules/jdk.incubator.vector/Lib.gmk.
>>Changes are made to build system to support dependency tracking for 
>> assembly files with includes.
>>The built native libraries (libsvml.so/svml.dll) are placed in bin 
>> directory of JDK on Windows and lib directory of JDK on Linux.
>>The C2 JIT uses the dll_load and dll_lookup to get the addresses of 
>> optimized methods from this library.
>> 
>> Build system changes and module library build scripts are contributed by 
>> Magnus (magnus.ihse.bur...@oracle.com).
>> 
>> Looking forward to your review and feedback.
>> 
>> Performance:
>> Micro benchmark Base Optimized Unit Gain(Optimized/Base)
>> Double128Vector.ACOS 45.91 87.34 ops/ms 1.90
>> Double128Vector.ASIN 45.06 92.36 ops/ms 2.05
>> Double128Vector.ATAN 19.92 118.36 ops/ms 5.94
>> Double128Vector.ATAN2 15.24 88.17 ops/ms 5.79
>> Double128Vector.CBRT 45.77 208.36 ops/ms 4.55
>> Double128Vector.COS 49.94 245.89 ops/ms 4.92
>> Double128Vector.COSH 26.91 126.00 ops/ms 4.68
>> Double128Vector.EXP 71.64 379.65 ops/ms 5.30
>> Double128Vector.EXPM1 35.95 150.37 ops/ms 4.18
>> Double128Vector.HYPOT 50.67 174.10 ops/ms 3.44
>> Double128Vector.LOG 61.95 279.84 ops/ms 4.52
>> Double128Vector.LOG10 59.34 239.05 ops/ms 4.03
>> Double128Vector.LOG1P 18.56 200.32 ops/ms 10.79
>> Double128Vector.SIN 49.36 240.79 ops/ms 4.88
>> Double128Vector.SINH 26.59 103.75 ops/ms 3.90
>> Double128Vector.TAN 41.05 152.39 ops/ms 3.71
>> Double128Vector.TANH 45.29 169.53 ops/ms 3.74
>> Double256Vector.ACOS 54.21 106.39 ops/ms 1.96
>> Double256Vector.ASIN 53.60 107.99 ops/ms 2.01
>> Double256Vector.ATAN 21.53 189.11 ops/ms 8.78
>> Double256Vector.ATAN2 16.67 140.76 ops/ms 8.44
>> Double256Vector.CBRT 56.45 397.13 ops/ms 7.04
>> Double256Vector.COS 58.26 389.77 ops/ms 6.69
>> Double256Vector.COSH 29.44 151.11 ops/ms 5.13
>> Double256Vector.EXP 86.67 564.68 ops/ms 6.52
>> Double256Vector.EXPM1 41.96 201.28 ops/ms 4.80
>> Double256Vector.HYPOT 66.18 305.74 ops/ms 4.62
>> Double256Vector.LOG 71.52 394.90 ops/ms 5.52
>> Double256Vector.LOG10 65.43 362.32 ops/ms 5.54
>> Double256Vector.LOG1P 19.99 300.88 ops/ms 15.05
>> Double256Vector.SIN 57.06 380.98 ops/ms 6.68
>> Double256Vector.SINH 29.40 117.37 ops/ms 3.99
>> Double256Vector.TAN 44.90 279.90 ops/ms 6.23
>> Double256Vector.TANH 54.08 274.71 ops/ms 5.08
>> Double512Vector.ACOS 55.65 687.54 ops/ms 12.35
>> Double512Vector.ASIN 57.31 777.72 ops/ms 13.57
>> Double512Vector.ATAN 21.42 729.21 ops/ms 34.04
>> Double512Vector.ATAN2 16.37 414.33 ops/ms 25.32
>> Double512Vector.CBRT 56.78 834.38 ops/ms 14.69
>> Double512Vector.COS 59.88 837.04 ops/ms 13.98
>> Double512Vector.COSH 30.34 172.76 ops/ms 5.70
>> Double512Vector.EXP 99.66 1608.12 ops/ms 16.14
>> Double512Vector.EXPM1 43.39 318.61 ops/ms 7.34
>> Double512Vector.HYPOT 73.87 1502.72 ops/ms 20.34
>> Double512Vector.LOG 74.84 996.00 ops/ms 13.31
>> Double512Vector.LOG10 71.12 1046.52 ops/ms 14.72
>> Double512Vector.LOG1P 19.75 776.87 ops/ms 39.34
>> Double512Vector.POW 37.42 384.13 ops/ms 10.26
>> Double512Vector.SIN 59.74 728.45 ops/ms 12.19
>> Double512Vector.SINH 29.47 143.38 ops/ms 4.87
>> Double512Vector.TAN 46.20 587.21 ops/ms 12.71
>> Double512Vector.TANH 57.36 495.42 ops/ms 8.64
>> Double64Vector.ACOS 24.04 73.67 ops/ms 3.06
>> Double64Vector.ASIN 23.78 75.11 ops/ms 3.16
>> Double64Vector.ATAN 14.14 62.81 ops/ms 4.44
>> Double64Vector.ATAN2 10.38 44.43 ops/ms 4.28
>> Double64Vector.CBRT 16.47 107.50 ops/ms 6.53
>> Double64Vector.COS 23.42 152.01 ops/ms 6.49
>> Double64Vector

Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v7]

2021-05-18 Thread Sandhya Viswanathan
On Wed, 19 May 2021 00:26:48 GMT, Vladimir Kozlov  wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   jcheck fixes
>
> This is much much better! Thank you for changing it. I am only asking now to 
> add comment explaining names.

@vnkozlov I have added comments explaining naming convention. Please let me 
know if this looks ok.

-

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v8]

2021-05-18 Thread Sandhya Viswanathan

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  Add comments explaining naming convention

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3638/files
  - new: https://git.openjdk.java.net/jdk/pull/3638/files/0d1d0382..45f20a34

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=07
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=06-07

  Stats: 15 lines in 1 file changed: 15 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v7]

2021-05-18 Thread Sandhya Viswanathan

Sandhya Viswanathan has updated the pull request incrementally with one 
additional commit since the last revision:

  jcheck fixes

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3638/files
  - new: https://git.openjdk.java.net/jdk/pull/3638/files/11528426..0d1d0382

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=06
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3638&range=05-06

  Stats: 4 lines in 3 files changed: 0 ins; 0 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3638.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3638/head:pull/3638

PR: https://git.openjdk.java.net/jdk/pull/3638


Re: RFR: 8265783: Create a separate library for x86 Intel SVML assembly intrinsics [v6]

2021-05-18 Thread Sandhya Viswanathan
On Tue, 18 May 2021 23:43:13 GMT, Sandhya Viswanathan 
 wrote:

>> This PR contains Short Vector Math Library support related changes for 
>> [JEP-414 Vector API (Second Incubator)](https://openjdk.java.net/jeps/414), 
>> in preparation for when targeted.
>> 
>> Intel Short Vector Math Library (SVML) based intrinsics in native x86 
>> assembly provide optimized implementation for Vector API transcendental and 
>> trigonometric methods.
>> These methods are built into a separate library instead of being part of 
>> libjvm.so or jvm.dll.
>> 
>> The following changes are made:
>>The source for these methods is placed in the jdk.incubator.vector module 
>> under src/jdk.incubator.vector/linux/native/libsvml and 
>> src/jdk.incubator.vector/windows/native/libsvml.
>>The assembly source files are named as “*.S” and include files are named 
>> as “*.S.inc”.
>>The corresponding build script is placed at 
>> make/modules/jdk.incubator.vector/Lib.gmk.
>>Changes are made to build system to support dependency tracking for 
>> assembly files with includes.
>>The built native libraries (libsvml.so/svml.dll) are placed in bin 
>> directory of JDK on Windows and lib directory of JDK on Linux.
>>The C2 JIT uses the dll_load and dll_lookup to get the addresses of 
>> optimized methods from this library.
>> 
>> Build system changes and module library build scripts are contributed by 
>> Magnus (magnus.ihse.bur...@oracle.com).
>> 
>> Looking forward to your review and feedback.
>> 
>> Performance:
>> Micro benchmark Base Optimized Unit Gain(Optimized/Base)
>> Double128Vector.ACOS 45.91 87.34 ops/ms 1.90
>> Double128Vector.ASIN 45.06 92.36 ops/ms 2.05
>> Double128Vector.ATAN 19.92 118.36 ops/ms 5.94
>> Double128Vector.ATAN2 15.24 88.17 ops/ms 5.79
>> Double128Vector.CBRT 45.77 208.36 ops/ms 4.55
>> Double128Vector.COS 49.94 245.89 ops/ms 4.92
>> Double128Vector.COSH 26.91 126.00 ops/ms 4.68
>> Double128Vector.EXP 71.64 379.65 ops/ms 5.30
>> Double128Vector.EXPM1 35.95 150.37 ops/ms 4.18
>> Double128Vector.HYPOT 50.67 174.10 ops/ms 3.44
>> Double128Vector.LOG 61.95 279.84 ops/ms 4.52
>> Double128Vector.LOG10 59.34 239.05 ops/ms 4.03
>> Double128Vector.LOG1P 18.56 200.32 ops/ms 10.79
>> Double128Vector.SIN 49.36 240.79 ops/ms 4.88
>> Double128Vector.SINH 26.59 103.75 ops/ms 3.90
>> Double128Vector.TAN 41.05 152.39 ops/ms 3.71
>> Double128Vector.TANH 45.29 169.53 ops/ms 3.74
>> Double256Vector.ACOS 54.21 106.39 ops/ms 1.96
>> Double256Vector.ASIN 53.60 107.99 ops/ms 2.01
>> Double256Vector.ATAN 21.53 189.11 ops/ms 8.78
>> Double256Vector.ATAN2 16.67 140.76 ops/ms 8.44
>> Double256Vector.CBRT 56.45 397.13 ops/ms 7.04
>> Double256Vector.COS 58.26 389.77 ops/ms 6.69
>> Double256Vector.COSH 29.44 151.11 ops/ms 5.13
>> Double256Vector.EXP 86.67 564.68 ops/ms 6.52
>> Double256Vector.EXPM1 41.96 201.28 ops/ms 4.80
>> Double256Vector.HYPOT 66.18 305.74 ops/ms 4.62
>> Double256Vector.LOG 71.52 394.90 ops/ms 5.52
>> Double256Vector.LOG10 65.43 362.32 ops/ms 5.54
>> Double256Vector.LOG1P 19.99 300.88 ops/ms 15.05
>> Double256Vector.SIN 57.06 380.98 ops/ms 6.68
>> Double256Vector.SINH 29.40 117.37 ops/ms 3.99
>> Double256Vector.TAN 44.90 279.90 ops/ms 6.23
>> Double256Vector.TANH 54.08 274.71 ops/ms 5.08
>> Double512Vector.ACOS 55.65 687.54 ops/ms 12.35
>> Double512Vector.ASIN 57.31 777.72 ops/ms 13.57
>> Double512Vector.ATAN 21.42 729.21 ops/ms 34.04
>> Double512Vector.ATAN2 16.37 414.33 ops/ms 25.32
>> Double512Vector.CBRT 56.78 834.38 ops/ms 14.69
>> Double512Vector.COS 59.88 837.04 ops/ms 13.98
>> Double512Vector.COSH 30.34 172.76 ops/ms 5.70
>> Double512Vector.EXP 99.66 1608.12 ops/ms 16.14
>> Double512Vector.EXPM1 43.39 318.61 ops/ms 7.34
>> Double512Vector.HYPOT 73.87 1502.72 ops/ms 20.34
>> Double512Vector.LOG 74.84 996.00 ops/ms 13.31
>> Double512Vector.LOG10 71.12 1046.52 ops/ms 14.72
>> Double512Vector.LOG1P 19.75 776.87 ops/ms 39.34
>> Double512Vector.POW 37.42 384.13 ops/ms 10.26
>> Double512Vector.SIN 59.74 728.45 ops/ms 12.19
>> Double512Vector.SINH 29.47 143.38 ops/ms 4.87
>> Double512Vector.TAN 46.20 587.21 ops/ms 12.71
>> Double512Vector.TANH 57.36 495.42 ops/ms 8.64
>> Double64Vector.ACOS 24.04 73.67 ops/ms 3.06
>> Double64Vector.ASIN 23.78 75.11 ops/ms 3.16
>> Double64Vector.ATAN 14.14 62.81 ops/ms 4.44
>> Double64Vector.ATAN2 10.38 44.43 ops/ms 4.28
>> Double64Vector.CBRT 16.47 107.50 ops/ms 6.53
>> Double64Vector.COS 23.42 152.01 ops/ms 6.49
>> Double64Vector
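For context, the Float/Double*Vector microbenchmark rows quoted above exercise lanewise transcendental operations of the Vector API, which these SVML stubs intrinsify. A sketch of what such a kernel looks like; this assumes a JDK with the incubator module enabled via --add-modules jdk.incubator.vector, and the class and variable names are illustrative, not taken from the actual benchmark sources:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SvmlSketch {
    // A 256-bit double species: 4 lanes per vector on AVX2-class hardware.
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_256;

    public static void main(String[] args) {
        double[] in = new double[SPECIES.length()];
        double[] out = new double[SPECIES.length()];
        for (int i = 0; i < in.length; i++) {
            in[i] = i * 0.25;
        }
        // Load a vector, apply SIN to every lane, and store the result.
        // This lanewise call is the operation C2 can compile down to a
        // single call into the SVML stub resolved from the library.
        DoubleVector v = DoubleVector.fromArray(SPECIES, in, 0);
        v.lanewise(VectorOperators.SIN).intoArray(out, 0);
        for (int i = 0; i < out.length; i++) {
            System.out.printf("sin(%.2f) = %.6f%n", in[i], out[i]);
        }
    }
}
```

Run with `java --add-modules jdk.incubator.vector SvmlSketch`; on hardware where the stubs apply, the SIN lanewise operation is what the Float256Vector.SIN/Double256Vector.SIN rows above measure.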
