Re: RFR: 8282162: [vector] Optimize vector negation API [v2]
On Mon, 28 Mar 2022 07:43:29 GMT, Jie Fu wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Add a superclass for vector negation
>
> src/hotspot/share/opto/vectornode.cpp line 1592:
>
>> 1590:
>> 1591: // Generate other vector nodes to implement the masked/non-masked vector negation.
>> 1592: Node* VectorNode::degenerate_vector_integral_negate(Node* n, int vlen, BasicType bt, PhaseGVN* phase, bool is_predicated) {
>
> Shall we move this declaration into `class NegVNode`, since it is only used by `NegVNode::Ideal`?

I think it can be. Thanks for pointing this out. I will change it later.

- PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API [v2]
On Tue, 22 Mar 2022 09:58:23 GMT, Xiaohong Gong wrote:

>> The current vector `"NEG"` is implemented by subtracting the vector from zero when the architecture does not support a negation instruction. And to fit the predicate feature for architectures that support it, the masked vector `"NEG"` is implemented with the pattern `"v.not(m).add(1, m)"`. Both can be optimized to a single negation instruction for ARM SVE, and so can the non-masked "NEG" for NEON. Besides, implementing the masked "NEG" with subtraction for architectures that support neither a negation instruction nor the predicate feature also saves several instructions compared with the current pattern.
>>
>> To optimize the VectorAPI negation, this patch moves the implementation from the Java side to hotspot. The compiler generates different nodes according to the architecture:
>> - Generate the (predicated) negation node if the architecture supports it; otherwise, generate the "`zero.sub(v)`" pattern for the non-masked operation.
>> - Generate `"zero.sub(v, m)"` for the masked operation if the architecture does not have the predicate feature; otherwise generate the original pattern `"v.xor(-1, m).add(1, m)"`.
>>
>> So with this patch, the following transformations are applied:
>>
>> For non-masked negation with NEON:
>>
>>     movi v16.4s, #0x0
>>     sub  v17.4s, v16.4s, v17.4s      ==>  neg v17.4s, v17.4s
>>
>> and with SVE:
>>
>>     mov z16.s, #0
>>     sub z18.s, z16.s, z17.s          ==>  neg z16.s, p7/m, z16.s
>>
>> For masked negation with NEON:
>>
>>     movi v17.4s, #0x1
>>     mvn  v19.16b, v18.16b
>>     mov  v20.16b, v16.16b            ==>  neg v18.4s, v17.4s
>>     bsl  v20.16b, v19.16b, v18.16b        bsl v19.16b, v18.16b, v17.16b
>>     add  v19.4s, v20.4s, v17.4s
>>     mov  v18.16b, v16.16b
>>     bsl  v18.16b, v19.16b, v20.16b
>>
>> and with SVE:
>>
>>     mov z16.s, #-1
>>     mov z17.s, #1                    ==>  neg z16.s, p0/m, z16.s
>>     eor z18.s, p0/m, z18.s, z16.s
>>     add z18.s, p0/m, z18.s, z17.s
>>
>> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 machines (note that the non-masked negation benchmarks show no improvement on x86 since no instructions are changed):
>>
>> NEON:
>>
>>     Benchmark                   Gain
>>     Byte128Vector.NEG           1.029
>>     Byte128Vector.NEGMasked     1.757
>>     Short128Vector.NEG          1.041
>>     Short128Vector.NEGMasked    1.659
>>     Int128Vector.NEG            1.005
>>     Int128Vector.NEGMasked      1.513
>>     Long128Vector.NEG           1.003
>>     Long128Vector.NEGMasked     1.878
>>
>> SVE with 512-bit vectors:
>>
>>     Benchmark                   Gain
>>     ByteMaxVector.NEG           1.10
>>     ByteMaxVector.NEGMasked     1.165
>>     ShortMaxVector.NEG          1.056
>>     ShortMaxVector.NEGMasked    1.195
>>     IntMaxVector.NEG            1.002
>>     IntMaxVector.NEGMasked      1.239
>>     LongMaxVector.NEG           1.031
>>     LongMaxVector.NEGMasked     1.191
>>
>> x86 (non-AVX-512):
>>
>>     Benchmark                   Gain
>>     ByteMaxVector.NEGMasked     1.254
>>     ShortMaxVector.NEGMasked    1.359
>>     IntMaxVector.NEGMasked      1.431
>>     LongMaxVector.NEGMasked     1.989
>>
>> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
>> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>
>   Add a superclass for vector negation

src/hotspot/share/opto/vectornode.cpp line 1592:

> 1590:
> 1591: // Generate other vector nodes to implement the masked/non-masked vector negation.
> 1592: Node* VectorNode::degenerate_vector_integral_negate(Node* n, int vlen, BasicType bt, PhaseGVN* phase, bool is_predicated) {

Shall we move this declaration into `class NegVNode`, since it is only used by `NegVNode::Ideal`?

- PR: https://git.openjdk.java.net/jdk/pull/7782
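[Editorial note: the `zero.sub(v)` and xor/add rewrites discussed in this thread rest on standard two's-complement identities. The following is an illustrative scalar sketch of those identities only (class name `NegIdentity` is hypothetical); it is not JDK or hotspot code.]

```java
// Illustrative check (not JDK code) of the two's-complement identities
// behind the rewrites discussed in this thread:
//   -v == 0 - v            (the "zero.sub(v)" pattern)
//   -v == (v ^ -1) + 1     (the "not then add 1" pattern behind v.xor(-1).add(1))
public class NegIdentity {
    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int v : samples) {
            if (-v != 0 - v || -v != (v ^ -1) + 1) {
                throw new AssertionError("identity failed for v=" + v);
            }
        }
        System.out.println("ok");
    }
}
```

Note that both identities hold even for `Integer.MIN_VALUE`, where negation wraps around in two's complement.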
Re: RFR: 8282162: [vector] Optimize vector negation API [v2]
On Mon, 28 Mar 2022 07:40:48 GMT, Jie Fu wrote:

>> The compiler can get the real type info from `Op_NegVI`, which can also handle the `BYTE` and `SHORT` basic types. I just don't want to add more new IRs, which would also need more match rules in the ad files.
>>
>>> Is there any performance drop for byte/short negation operations if both of them are handled as a NegVI vector?
>>
>> From the benchmark results I showed in the commit message, I didn't see any performance drop for byte/short. Thanks!
>
> There seem to be no vectorized negation instructions for {byte, short, int, long} on x86, so this should be fine on x86.
> I tested the patch on x86 and the performance numbers look good.

Thanks for doing this! Yeah, I think the performance for masked negation operations might improve on non-AVX-512 systems.

- PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API [v2]
On Mon, 21 Mar 2022 01:19:57 GMT, Xiaohong Gong wrote:

> The compiler can get the real type info from `Op_NegVI`, which can also handle the `BYTE` and `SHORT` basic types. I just don't want to add more new IRs, which would also need more match rules in the ad files.
>
>> Is there any performance drop for byte/short negation operations if both of them are handled as a NegVI vector?
>
> From the benchmark results I showed in the commit message, I didn't see any performance drop for byte/short. Thanks!

There seem to be no vectorized negation instructions for {byte, short, int, long} on x86, so this should be fine on x86. I tested the patch on x86 and the performance numbers look good.

- PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API [v2]
On Tue, 22 Mar 2022 09:58:23 GMT, Xiaohong Gong wrote:

>> [...]
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>
>   Add a superclass for vector negation

Hi @PaulSandoz @jatin-bhateja, could you please help take a look at this PR? Thanks so much!

- PR: https://git.openjdk.java.net/jdk/pull/7782
Re: RFR: 8282162: [vector] Optimize vector negation API [v2]
> The current vector `"NEG"` is implemented by subtracting the vector from zero when the architecture does not support a negation instruction. And to fit the predicate feature for architectures that support it, the masked vector `"NEG"` is implemented with the pattern `"v.not(m).add(1, m)"`. Both can be optimized to a single negation instruction for ARM SVE, and so can the non-masked "NEG" for NEON. Besides, implementing the masked "NEG" with subtraction for architectures that support neither a negation instruction nor the predicate feature also saves several instructions compared with the current pattern.
>
> To optimize the VectorAPI negation, this patch moves the implementation from the Java side to hotspot. The compiler generates different nodes according to the architecture:
> - Generate the (predicated) negation node if the architecture supports it; otherwise, generate the "`zero.sub(v)`" pattern for the non-masked operation.
> - Generate `"zero.sub(v, m)"` for the masked operation if the architecture does not have the predicate feature; otherwise generate the original pattern `"v.xor(-1, m).add(1, m)"`.
>
> So with this patch, the following transformations are applied:
>
> For non-masked negation with NEON:
>
>     movi v16.4s, #0x0
>     sub  v17.4s, v16.4s, v17.4s      ==>  neg v17.4s, v17.4s
>
> and with SVE:
>
>     mov z16.s, #0
>     sub z18.s, z16.s, z17.s          ==>  neg z16.s, p7/m, z16.s
>
> For masked negation with NEON:
>
>     movi v17.4s, #0x1
>     mvn  v19.16b, v18.16b
>     mov  v20.16b, v16.16b            ==>  neg v18.4s, v17.4s
>     bsl  v20.16b, v19.16b, v18.16b        bsl v19.16b, v18.16b, v17.16b
>     add  v19.4s, v20.4s, v17.4s
>     mov  v18.16b, v16.16b
>     bsl  v18.16b, v19.16b, v20.16b
>
> and with SVE:
>
>     mov z16.s, #-1
>     mov z17.s, #1                    ==>  neg z16.s, p0/m, z16.s
>     eor z18.s, p0/m, z18.s, z16.s
>     add z18.s, p0/m, z18.s, z17.s
>
> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 machines (note that the non-masked negation benchmarks show no improvement on x86 since no instructions are changed):
>
> NEON:
>
>     Benchmark                   Gain
>     Byte128Vector.NEG           1.029
>     Byte128Vector.NEGMasked     1.757
>     Short128Vector.NEG          1.041
>     Short128Vector.NEGMasked    1.659
>     Int128Vector.NEG            1.005
>     Int128Vector.NEGMasked      1.513
>     Long128Vector.NEG           1.003
>     Long128Vector.NEGMasked     1.878
>
> SVE with 512-bit vectors:
>
>     Benchmark                   Gain
>     ByteMaxVector.NEG           1.10
>     ByteMaxVector.NEGMasked     1.165
>     ShortMaxVector.NEG          1.056
>     ShortMaxVector.NEGMasked    1.195
>     IntMaxVector.NEG            1.002
>     IntMaxVector.NEGMasked      1.239
>     LongMaxVector.NEG           1.031
>     LongMaxVector.NEGMasked     1.191
>
> x86 (non-AVX-512):
>
>     Benchmark                   Gain
>     ByteMaxVector.NEGMasked     1.254
>     ShortMaxVector.NEGMasked    1.359
>     IntMaxVector.NEGMasked      1.431
>     LongMaxVector.NEGMasked     1.989
>
> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896

Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:

  Add a superclass for vector negation

- Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7782/files
  - new: https://git.openjdk.java.net/jdk/pull/7782/files/828866f8..97c8119a

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7782&range=00-01

Stats: 64 lines in 4 files changed: 16 ins; 13 del; 35 mod
Patch: https://git.openjdk.java.net/jdk/pull/7782.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7782/head:pull/7782
PR: https://git.openjdk.java.net/jdk/pull/7782
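[Editorial note: as context for the masked patterns above, a masked lanewise negation negates only the lanes where the mask is set and leaves the other lanes unchanged. The following is a hypothetical scalar model of that semantics (class and method names are illustrative, not VectorAPI or hotspot code), showing lane by lane that the `"v.xor(-1, m).add(1, m)"` pattern computes the same result as a direct masked negate.]

```java
import java.util.Arrays;

// Scalar model (illustrative only) of a masked lanewise negation:
// lanes where the mask is set are negated; other lanes pass through.
public class MaskedNegModel {
    // Direct model: r[i] = m[i] ? -v[i] : v[i]
    static int[] negMasked(int[] v, boolean[] m) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = m[i] ? -v[i] : v[i];
        }
        return r;
    }

    // The same result via the masked xor/add pattern from the thread.
    static int[] negMaskedViaXorAdd(int[] v, boolean[] m) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            int x = m[i] ? (v[i] ^ -1) : v[i];  // models v.xor(-1, m)
            r[i]  = m[i] ? (x + 1)     : x;     // models .add(1, m)
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v = {1, -2, 3, -4};
        boolean[] m = {true, false, true, false};
        int[] direct  = negMasked(v, m);
        int[] pattern = negMaskedViaXorAdd(v, m);
        if (!Arrays.equals(direct, pattern)) {
            throw new AssertionError("patterns disagree");
        }
        System.out.println(Arrays.toString(direct));  // prints [-1, -2, -3, -4]
    }
}
```

With the example inputs, lanes 0 and 2 are negated while lanes 1 and 3 are passed through, so both formulations produce `[-1, -2, -3, -4]`.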