Re: RFR: 8282162: [vector] Optimize vector negation API [v2]

2022-03-28 Thread Xiaohong Gong
On Mon, 28 Mar 2022 07:43:29 GMT, Jie Fu  wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Add a superclass for vector negation
>
> src/hotspot/share/opto/vectornode.cpp line 1592:
> 
>> 1590: 
>> 1591: // Generate other vector nodes to implement the masked/non-masked 
>> vector negation.
>> 1592: Node* VectorNode::degenerate_vector_integral_negate(Node* n, int vlen, 
>> BasicType bt, PhaseGVN* phase, bool is_predicated) {
> 
> Shall we move this declaration in `class  NegVNode` since it is only used by  
> NegVNode::Ideal ?

I think it can be. Thanks for pointing out this. I will change this later.

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API [v2]

2022-03-28 Thread Jie Fu
On Tue, 22 Mar 2022 09:58:23 GMT, Xiaohong Gong  wrote:

>> The current vector `"NEG"` is implemented with substraction a vector by zero 
>> in case the architecture does not support the negation instruction. And to 
>> fit the predicate feature for architectures that support it, the masked 
>> vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They 
>> both can be optimized to a single negation instruction for ARM SVE.
>> And so does the non-masked "NEG" for NEON. Besides, implementing the masked 
>> "NEG" with substraction for architectures that support neither negation 
>> instruction nor predicate feature can also save several instructions than 
>> the current pattern.
>> 
>> To optimize the VectorAPI negation, this patch moves the implementation from 
>> Java side to hotspot. The compiler will generate different nodes according 
>> to the architecture:
>>   - Generate the (predicated) negation node if architecture supports it, 
>> otherwise, generate "`zero.sub(v)`" pattern for non-masked operation.
>>   - Generate `"zero.sub(v, m)"` for masked operation if the architecture 
>> does not have predicate feature, otherwise generate the original pattern 
>> `"v.xor(-1, m).add(1, m)"`.
>> 
>> So with this patch, the following transformations are applied:
>> 
>> For non-masked negation with NEON:
>> 
>>   moviv16.4s, #0x0
>>   sub v17.4s, v16.4s, v17.4s   ==> neg v17.4s, v17.4s
>> 
>> and with SVE:
>> 
>>   mov z16.s, #0
>>   sub z18.s, z16.s, z17.s  ==> neg z16.s, p7/m, z16.s
>> 
>> For masked negation with NEON:
>> 
>>   moviv17.4s, #0x1
>>   mvn v19.16b, v18.16b
>>   mov v20.16b, v16.16b ==>  neg v18.4s, v17.4s
>>   bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b
>>   add v19.4s, v20.4s, v17.4s
>>   mov v18.16b, v16.16b
>>   bsl v18.16b, v19.16b, v20.16b
>> 
>> and with SVE:
>> 
>>   mov z16.s, #-1
>>   mov z17.s, #1==> neg z16.s, p0/m, z16.s
>>   eor z18.s, p0/m, z18.s, z16.s
>>   add z18.s, p0/m, z18.s, z17.s
>> 
>> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 
>> machines(note that the non-masked negation benchmarks do not have any 
>> improvement on X86 since no instructions are changed):
>> 
>> NEON:
>> BenchmarkGain
>> Byte128Vector.NEG1.029
>> Byte128Vector.NEGMasked  1.757
>> Short128Vector.NEG   1.041
>> Short128Vector.NEGMasked 1.659
>> Int128Vector.NEG 1.005
>> Int128Vector.NEGMasked   1.513
>> Long128Vector.NEG1.003
>> Long128Vector.NEGMasked  1.878
>> 
>> SVE with 512-bits:
>> BenchmarkGain
>> ByteMaxVector.NEG1.10
>> ByteMaxVector.NEGMasked  1.165
>> ShortMaxVector.NEG   1.056
>> ShortMaxVector.NEGMasked 1.195
>> IntMaxVector.NEG 1.002
>> IntMaxVector.NEGMasked   1.239
>> LongMaxVector.NEG1.031
>> LongMaxVector.NEGMasked  1.191
>> 
>> X86 (non AVX-512):
>> BenchmarkGain
>> ByteMaxVector.NEGMasked  1.254
>> ShortMaxVector.NEGMasked 1.359
>> IntMaxVector.NEGMasked   1.431
>> LongMaxVector.NEGMasked  1.989
>> 
>> [1] 
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
>> [2] 
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Add a superclass for vector negation

src/hotspot/share/opto/vectornode.cpp line 1592:

> 1590: 
> 1591: // Generate other vector nodes to implement the masked/non-masked 
> vector negation.
> 1592: Node* VectorNode::degenerate_vector_integral_negate(Node* n, int vlen, 
> BasicType bt, PhaseGVN* phase, bool is_predicated) {

Shall we move this declaration in `class  NegVNode` since it is only used by  
NegVNode::Ideal ?

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API [v2]

2022-03-28 Thread Xiaohong Gong
On Mon, 28 Mar 2022 07:40:48 GMT, Jie Fu  wrote:

>> The compiler can get the real type info from `Op_NegVI` that can also handle 
>> the `BYTE ` and `SHORT ` basic type. I just don't want to add more new IRs 
>> which also need more match rules in the ad files.
>> 
>>> Is there any performance drop for byte/short negation operation if both of 
>>> them are handled as a NegVI vector?
>> 
>> From the benchmark results I showed in the commit message, I didn't see not 
>> any performance drop for byte/short.  Thanks!
>
>> The compiler can get the real type info from `Op_NegVI` that can also handle 
>> the `BYTE ` and `SHORT ` basic type. I just don't want to add more new IRs 
>> which also need more match rules in the ad files.
>> 
>> > Is there any performance drop for byte/short negation operation if both of 
>> > them are handled as a NegVI vector?
>> 
>> From the benchmark results I showed in the commit message, I didn't see not 
>> any performance drop for byte/short. Thanks!
> 
> There seems no vectorized negation instructions for {byte, short, int, long} 
> on x86, so this should be fine on x86.
> I tested the patch on x86 and the performance number looks good.

Thanks for doing this! Yeah, I think the performance for masked negation 
operations might improve on non avx-512 systems.

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API [v2]

2022-03-28 Thread Jie Fu
On Mon, 21 Mar 2022 01:19:57 GMT, Xiaohong Gong  wrote:

> The compiler can get the real type info from `Op_NegVI` that can also handle 
> the `BYTE ` and `SHORT ` basic type. I just don't want to add more new IRs 
> which also need more match rules in the ad files.
> 
> > Is there any performance drop for byte/short negation operation if both of 
> > them are handled as a NegVI vector?
> 
> From the benchmark results I showed in the commit message, I didn't see not 
> any performance drop for byte/short. Thanks!

There seems no vectorized negation instructions for {byte, short, int, long} on 
x86, so this should be fine on x86.
I tested the patch on x86 and the performance number looks good.

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API [v2]

2022-03-23 Thread Xiaohong Gong
On Tue, 22 Mar 2022 09:58:23 GMT, Xiaohong Gong  wrote:

>> The current vector `"NEG"` is implemented with substraction a vector by zero 
>> in case the architecture does not support the negation instruction. And to 
>> fit the predicate feature for architectures that support it, the masked 
>> vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They 
>> both can be optimized to a single negation instruction for ARM SVE.
>> And so does the non-masked "NEG" for NEON. Besides, implementing the masked 
>> "NEG" with substraction for architectures that support neither negation 
>> instruction nor predicate feature can also save several instructions than 
>> the current pattern.
>> 
>> To optimize the VectorAPI negation, this patch moves the implementation from 
>> Java side to hotspot. The compiler will generate different nodes according 
>> to the architecture:
>>   - Generate the (predicated) negation node if architecture supports it, 
>> otherwise, generate "`zero.sub(v)`" pattern for non-masked operation.
>>   - Generate `"zero.sub(v, m)"` for masked operation if the architecture 
>> does not have predicate feature, otherwise generate the original pattern 
>> `"v.xor(-1, m).add(1, m)"`.
>> 
>> So with this patch, the following transformations are applied:
>> 
>> For non-masked negation with NEON:
>> 
>>   moviv16.4s, #0x0
>>   sub v17.4s, v16.4s, v17.4s   ==> neg v17.4s, v17.4s
>> 
>> and with SVE:
>> 
>>   mov z16.s, #0
>>   sub z18.s, z16.s, z17.s  ==> neg z16.s, p7/m, z16.s
>> 
>> For masked negation with NEON:
>> 
>>   moviv17.4s, #0x1
>>   mvn v19.16b, v18.16b
>>   mov v20.16b, v16.16b ==>  neg v18.4s, v17.4s
>>   bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b
>>   add v19.4s, v20.4s, v17.4s
>>   mov v18.16b, v16.16b
>>   bsl v18.16b, v19.16b, v20.16b
>> 
>> and with SVE:
>> 
>>   mov z16.s, #-1
>>   mov z17.s, #1==> neg z16.s, p0/m, z16.s
>>   eor z18.s, p0/m, z18.s, z16.s
>>   add z18.s, p0/m, z18.s, z17.s
>> 
>> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 
>> machines(note that the non-masked negation benchmarks do not have any 
>> improvement on X86 since no instructions are changed):
>> 
>> NEON:
>> BenchmarkGain
>> Byte128Vector.NEG1.029
>> Byte128Vector.NEGMasked  1.757
>> Short128Vector.NEG   1.041
>> Short128Vector.NEGMasked 1.659
>> Int128Vector.NEG 1.005
>> Int128Vector.NEGMasked   1.513
>> Long128Vector.NEG1.003
>> Long128Vector.NEGMasked  1.878
>> 
>> SVE with 512-bits:
>> BenchmarkGain
>> ByteMaxVector.NEG1.10
>> ByteMaxVector.NEGMasked  1.165
>> ShortMaxVector.NEG   1.056
>> ShortMaxVector.NEGMasked 1.195
>> IntMaxVector.NEG 1.002
>> IntMaxVector.NEGMasked   1.239
>> LongMaxVector.NEG1.031
>> LongMaxVector.NEGMasked  1.191
>> 
>> X86 (non AVX-512):
>> BenchmarkGain
>> ByteMaxVector.NEGMasked  1.254
>> ShortMaxVector.NEGMasked 1.359
>> IntMaxVector.NEGMasked   1.431
>> LongMaxVector.NEGMasked  1.989
>> 
>> [1] 
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
>> [2] 
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896
>
> Xiaohong Gong has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Add a superclass for vector negation

Hi @PaulSandoz @jatin-bhateja, could you please help to take a look at this PR? 
Thanks so much!

-

PR: https://git.openjdk.java.net/jdk/pull/7782


Re: RFR: 8282162: [vector] Optimize vector negation API [v2]

2022-03-22 Thread Xiaohong Gong
> The current vector `"NEG"` is implemented with substraction a vector by zero 
> in case the architecture does not support the negation instruction. And to 
> fit the predicate feature for architectures that support it, the masked 
> vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They both 
> can be optimized to a single negation instruction for ARM SVE.
> And so does the non-masked "NEG" for NEON. Besides, implementing the masked 
> "NEG" with substraction for architectures that support neither negation 
> instruction nor predicate feature can also save several instructions than the 
> current pattern.
> 
> To optimize the VectorAPI negation, this patch moves the implementation from 
> Java side to hotspot. The compiler will generate different nodes according to 
> the architecture:
>   - Generate the (predicated) negation node if architecture supports it, 
> otherwise, generate "`zero.sub(v)`" pattern for non-masked operation.
>   - Generate `"zero.sub(v, m)"` for masked operation if the architecture does 
> not have predicate feature, otherwise generate the original pattern 
> `"v.xor(-1, m).add(1, m)"`.
> 
> So with this patch, the following transformations are applied:
> 
> For non-masked negation with NEON:
> 
>   moviv16.4s, #0x0
>   sub v17.4s, v16.4s, v17.4s   ==> neg v17.4s, v17.4s
> 
> and with SVE:
> 
>   mov z16.s, #0
>   sub z18.s, z16.s, z17.s  ==> neg z16.s, p7/m, z16.s
> 
> For masked negation with NEON:
> 
>   moviv17.4s, #0x1
>   mvn v19.16b, v18.16b
>   mov v20.16b, v16.16b ==>  neg v18.4s, v17.4s
>   bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b
>   add v19.4s, v20.4s, v17.4s
>   mov v18.16b, v16.16b
>   bsl v18.16b, v19.16b, v20.16b
> 
> and with SVE:
> 
>   mov z16.s, #-1
>   mov z17.s, #1==> neg z16.s, p0/m, z16.s
>   eor z18.s, p0/m, z18.s, z16.s
>   add z18.s, p0/m, z18.s, z17.s
> 
> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 
> machines(note that the non-masked negation benchmarks do not have any 
> improvement on X86 since no instructions are changed):
> 
> NEON:
> BenchmarkGain
> Byte128Vector.NEG1.029
> Byte128Vector.NEGMasked  1.757
> Short128Vector.NEG   1.041
> Short128Vector.NEGMasked 1.659
> Int128Vector.NEG 1.005
> Int128Vector.NEGMasked   1.513
> Long128Vector.NEG1.003
> Long128Vector.NEGMasked  1.878
> 
> SVE with 512-bits:
> BenchmarkGain
> ByteMaxVector.NEG1.10
> ByteMaxVector.NEGMasked  1.165
> ShortMaxVector.NEG   1.056
> ShortMaxVector.NEGMasked 1.195
> IntMaxVector.NEG 1.002
> IntMaxVector.NEGMasked   1.239
> LongMaxVector.NEG1.031
> LongMaxVector.NEGMasked  1.191
> 
> X86 (non AVX-512):
> BenchmarkGain
> ByteMaxVector.NEGMasked  1.254
> ShortMaxVector.NEGMasked 1.359
> IntMaxVector.NEGMasked   1.431
> LongMaxVector.NEGMasked  1.989
> 
> [1] 
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
> [2] 
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896

Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Add a superclass for vector negation

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7782/files
  - new: https://git.openjdk.java.net/jdk/pull/7782/files/828866f8..97c8119a

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=00-01

  Stats: 64 lines in 4 files changed: 16 ins; 13 del; 35 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7782.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7782/head:pull/7782

PR: https://git.openjdk.java.net/jdk/pull/7782