On Fri, 1 Aug 2025 12:52:21 GMT, Bhavana Kilambi <bkila...@openjdk.org> wrote:
>> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and >> `MoveI2F` nodes. The implementation follows a similar pattern to what is >> done with conversion (`Conv*`) nodes. The tests in >> `TestCompatibleUseDefTypeSize` have been updated with the new expectations. >> >> Also added a JMH benchmark which measures throughput (the higher the number >> the better) for methods that exercise these nodes. On darwin/aarch64 it >> shows: >> >> >> Benchmark (seed) (size) Mode Cnt >> Base Patch Units Diff >> VectorBitConversion.doubleToLongBits 0 2048 thrpt 8 >> 1168.782 1157.717 ops/ms -1% >> VectorBitConversion.doubleToRawLongBits 0 2048 thrpt 8 >> 3999.387 7353.936 ops/ms +83% >> VectorBitConversion.floatToIntBits 0 2048 thrpt 8 >> 1200.338 1188.206 ops/ms -1% >> VectorBitConversion.floatToRawIntBits 0 2048 thrpt 8 >> 4058.248 14792.474 ops/ms +264% >> VectorBitConversion.intBitsToFloat 0 2048 thrpt 8 >> 3050.313 14984.246 ops/ms +391% >> VectorBitConversion.longBitsToDouble 0 2048 thrpt 8 >> 3022.691 7379.360 ops/ms +144% >> >> >> The improvements observed are a result of vectorization. The lack of >> vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that >> these changes do not affect their performance. These methods do not >> vectorize because of flow control. >> >> I've run the tier1-3 tests on linux/aarch64 and didn't observe any >> regressions. > > src/hotspot/share/opto/vectornode.cpp line 1831: > >> 1829: >> 1830: bool VectorReinterpretNode::implemented(int opc, uint vlen, BasicType >> src_type, BasicType dst_type) { >> 1831: if ((src_type == T_FLOAT && dst_type == T_INT) || > > Just a suggestion, do you feel a `switch-case` could be more readable/clear > in this case? Something like this - > > > bool VectorReinterpretNode::implemented(uint vlen, BasicType src_type, > BasicType dst_type) { > switch (src_type) { > case T_FLOAT: > if (dst_type != T_INT) return false; > break; > case T_INT: > if (dst_type != T_FLOAT) return false; > break; > case T_DOUBLE: > if (dst_type != T_LONG) return false; > break; > case T_LONG: > if (dst_type != T_DOUBLE) return false; > break; > default: > return false; > } > return > Matcher::match_rule_supported_auto_vectorization(Op_VectorReinterpret, vlen, > dst_type); > } Both options look just fine to me, but I'm happy to re-write it like that if others also feel the same way. > test/micro/org/openjdk/bench/java/lang/VectorBitConversion.java line 67: > >> 65: >> 66: @Benchmark >> 67: public long[] doubleToLongBits() { > > Would something like this be more concise (and maybe more readable as well) - > > @Benchmark > public long[] doubleToLongBits() { > for (int i = 0; i < doubles.length; i++) { > resultLongs[i] = Double.doubleToLongBits(doubles[i]); > } > return resultLongs; > } > > > The loop should still get vectorized (if vectorizable). > > Same for other benchmarks. Maybe but there's a reason why I wrote these benchmark methods this way. Keeping each line doing one thing makes it easier to map each line to the assembly (e.g. `perfasm`) and related IR nodes (e.g. `PrintIdeal`). That IMO is more important than the conciseness of the benchmark. What do others think? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2251893141 PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2251889884