On Sat, 19 Mar 2022 02:34:55 GMT, Jie Fu <ji...@openjdk.org> wrote: >> The current vector `"NEG"` is implemented with substraction a vector by zero >> in case the architecture does not support the negation instruction. And to >> fit the predicate feature for architectures that support it, the masked >> vector `"NEG" ` is implemented with pattern `"v.not(m).add(1, m)"`. They >> both can be optimized to a single negation instruction for ARM SVE. >> And so does the non-masked "NEG" for NEON. Besides, implementing the masked >> "NEG" with substraction for architectures that support neither negation >> instruction nor predicate feature can also save several instructions than >> the current pattern. >> >> To optimize the VectorAPI negation, this patch moves the implementation from >> Java side to hotspot. The compiler will generate different nodes according >> to the architecture: >> - Generate the (predicated) negation node if architecture supports it, >> otherwise, generate "`zero.sub(v)`" pattern for non-masked operation. >> - Generate `"zero.sub(v, m)"` for masked operation if the architecture >> does not have predicate feature, otherwise generate the original pattern >> `"v.xor(-1, m).add(1, m)"`. >> >> So with this patch, the following transformations are applied: >> >> For non-masked negation with NEON: >> >> movi v16.4s, #0x0 >> sub v17.4s, v16.4s, v17.4s ==> neg v17.4s, v17.4s >> >> and with SVE: >> >> mov z16.s, #0 >> sub z18.s, z16.s, z17.s ==> neg z16.s, p7/m, z16.s >> >> For masked negation with NEON: >> >> movi v17.4s, #0x1 >> mvn v19.16b, v18.16b >> mov v20.16b, v16.16b ==> neg v18.4s, v17.4s >> bsl v20.16b, v19.16b, v18.16b bsl v19.16b, v18.16b, v17.16b >> add v19.4s, v20.4s, v17.4s >> mov v18.16b, v16.16b >> bsl v18.16b, v19.16b, v20.16b >> >> and with SVE: >> >> mov z16.s, #-1 >> mov z17.s, #1 ==> neg z16.s, p0/m, z16.s >> eor z18.s, p0/m, z18.s, z16.s >> add z18.s, p0/m, z18.s, z17.s >> >> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86 >> machines(note that the non-masked negation benchmarks do not have any >> improvement on X86 since no instructions are changed): >> >> NEON: >> Benchmark Gain >> Byte128Vector.NEG 1.029 >> Byte128Vector.NEGMasked 1.757 >> Short128Vector.NEG 1.041 >> Short128Vector.NEGMasked 1.659 >> Int128Vector.NEG 1.005 >> Int128Vector.NEGMasked 1.513 >> Long128Vector.NEG 1.003 >> Long128Vector.NEGMasked 1.878 >> >> SVE with 512-bits: >> Benchmark Gain >> ByteMaxVector.NEG 1.10 >> ByteMaxVector.NEGMasked 1.165 >> ShortMaxVector.NEG 1.056 >> ShortMaxVector.NEGMasked 1.195 >> IntMaxVector.NEG 1.002 >> IntMaxVector.NEGMasked 1.239 >> LongMaxVector.NEG 1.031 >> LongMaxVector.NEGMasked 1.191 >> >> X86 (non AVX-512): >> Benchmark Gain >> ByteMaxVector.NEGMasked 1.254 >> ShortMaxVector.NEGMasked 1.359 >> IntMaxVector.NEGMasked 1.431 >> LongMaxVector.NEGMasked 1.989 >> >> [1] >> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881 >> [2] >> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896 > > src/hotspot/share/opto/vectornode.cpp line 141: > >> 139: case T_BYTE: >> 140: case T_SHORT: >> 141: case T_INT: return Op_NegVI; > > Why not add `Op_NegVB` for `BYTE` and `Op_NegVS` for `SHORT`? > Is there any performance drop for byte/short negation operation if both of > them are handled as a NegVI vector?
The compiler can get the real type info from `Op_NegVI` that can also handle the `BYTE ` and `SHORT ` basic type. I just don't want to add more new IRs which also need more match rules in the ad files. > Is there any performance drop for byte/short negation operation if both of > them are handled as a NegVI vector? >From the benchmark results I showed in the commit message, I didn't see not >any performance drop for byte/short. Thanks! > src/hotspot/share/opto/vectornode.cpp line 1635: > >> 1633: } >> 1634: >> 1635: Node* NegVINode::Ideal(PhaseGVN* phase, bool can_reshape) { > > Much duplication in `NegVINode::Ideal` and `NegVLNode::Ideal`. > Is it possible to refactor the implementation? Yeah, maybe we need a superclass for `NegVINode` and `NegVLNode`. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/7782