On Mon, 15 Aug 2022 01:10:54 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
>> Vector API binary op "`FIRST_NONZERO`" represents the vector operation of
>> "`a != 0 ? a : b`", which can be implemented with existing APIs like
>> "`compare + blend`". The current implementation is more complex, especially
>> for the floating point type vectors. The main idea is:
>>
>>
>>   1) mask = a.compare(0, ne);
>>   2) b = b.blend(0, mask);
>>   3) result = a | b;
>>
>>
>> And for the floating point types, it needs the vector reinterpretation
>> between the floating point type and the corresponding integral type, since
>> the final "`OR`" operation is only valid for bitwise integral types.
>>
>> A simpler implementation is:
>>
>>
>>   1) mask = a.compare(0, eq);
>>   2) result = a.blend(b, mask);
>>
>>
>> This could save the final "`OR`" operation and the related reinterpretation
>> between FP and integral types.
>>
>> Here is the performance data of the "`FIRST_NONZERO`" benchmarks (please
>> see the benchmark details for byte vector from [1]) on an ARM NEON system:
>>
>> Benchmark                            (size)  Mode   Cnt     Before      After  Units
>> ByteMaxVector.FIRST_NONZERO            1024  thrpt   15  12107.422  18385.157  ops/ms
>> ByteMaxVector.FIRST_NONZEROMasked      1024  thrpt   15   9765.282  14739.775  ops/ms
>> DoubleMaxVector.FIRST_NONZERO          1024  thrpt   15   1798.545   2331.214  ops/ms
>> DoubleMaxVector.FIRST_NONZEROMasked    1024  thrpt   15   1211.838   1810.644  ops/ms
>> FloatMaxVector.FIRST_NONZERO           1024  thrpt   15   3491.924   4377.167  ops/ms
>> FloatMaxVector.FIRST_NONZEROMasked     1024  thrpt   15   2307.085   3606.576  ops/ms
>> IntMaxVector.FIRST_NONZERO             1024  thrpt   15   3602.727   5610.258  ops/ms
>> IntMaxVector.FIRST_NONZEROMasked       1024  thrpt   15   2726.843   4210.741  ops/ms
>> LongMaxVector.FIRST_NONZERO            1024  thrpt   15   1819.886   2974.655  ops/ms
>> LongMaxVector.FIRST_NONZEROMasked      1024  thrpt   15   1337.737   2315.094  ops/ms
>> ShortMaxVector.FIRST_NONZERO           1024  thrpt   15   6603.642   9586.320  ops/ms
>> ShortMaxVector.FIRST_NONZEROMasked     1024  thrpt   15   5222.006   7991.443  ops/ms
>>
>> We can also observe a similar improvement on x86 systems.
>>
>> [1]
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L266
>
> ping again. Could anyone please take a look at this simple patch? Thanks so
> much for your time!

@XiaohongGong looking... (just back from vacation).

-------------

PR: https://git.openjdk.org/jdk/pull/9683
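
For anyone following the thread, the two formulations quoted above can be compared with a small scalar sketch. This is not the patch itself: it models each lane with a plain `int` array instead of the incubating Vector API, and the class and method names (`FirstNonzeroSketch`, `viaCompareBlendOr`, `viaCompareBlend`) are hypothetical, chosen only for illustration. It covers integral lanes only, so the floating point reinterpretation step is not modelled.

```java
import java.util.Arrays;

// Scalar model of the two FIRST_NONZERO formulations described in the
// quoted message. All names here are hypothetical, not from the JDK.
public class FirstNonzeroSketch {

    // Original formulation: mask = (a != 0); b' = b.blend(0, mask); result = a | b'
    static int[] viaCompareBlendOr(int[] a, int[] b) {
        int[] result = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            boolean mask = a[i] != 0;       // 1) mask = a.compare(0, ne)
            int blended = mask ? 0 : b[i];  // 2) b = b.blend(0, mask)
            result[i] = a[i] | blended;     // 3) result = a | b
        }
        return result;
    }

    // Simplified formulation: mask = (a == 0); result = a.blend(b, mask)
    static int[] viaCompareBlend(int[] a, int[] b) {
        int[] result = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            boolean mask = a[i] == 0;         // 1) mask = a.compare(0, eq)
            result[i] = mask ? b[i] : a[i];   // 2) result = a.blend(b, mask)
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {3, 0, -7, 0};
        int[] b = {9, 5, 0, 0};
        // Both lines print [3, 5, -7, 0]
        System.out.println(Arrays.toString(viaCompareBlendOr(a, b)));
        System.out.println(Arrays.toString(viaCompareBlend(a, b)));
    }
}
```

In this model both methods yield `a != 0 ? a : b` per lane; the second simply drops the final `OR` step.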