On Fri, 1 Aug 2025 12:52:21 GMT, Bhavana Kilambi <[email protected]> wrote:
>> I've added support for vectorizing the `MoveD2L`, `MoveL2D`, `MoveF2I` and
>> `MoveI2F` nodes. The implementation follows a similar pattern to the
>> conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have
>> been updated with the new expectations.
>>
>> I've also added a JMH benchmark that measures throughput (higher is better)
>> for methods exercising these nodes. On darwin/aarch64 it shows:
>>
>>
>> Benchmark                                (seed)  (size)   Mode  Cnt      Base      Patch   Units   Diff
>> VectorBitConversion.doubleToLongBits          0    2048  thrpt    8  1168.782   1157.717  ops/ms    -1%
>> VectorBitConversion.doubleToRawLongBits       0    2048  thrpt    8  3999.387   7353.936  ops/ms   +83%
>> VectorBitConversion.floatToIntBits            0    2048  thrpt    8  1200.338   1188.206  ops/ms    -1%
>> VectorBitConversion.floatToRawIntBits         0    2048  thrpt    8  4058.248  14792.474  ops/ms  +264%
>> VectorBitConversion.intBitsToFloat            0    2048  thrpt    8  3050.313  14984.246  ops/ms  +391%
>> VectorBitConversion.longBitsToDouble          0    2048  thrpt    8  3022.691   7379.360  ops/ms  +144%
>>
>>
>> The improvements observed are a result of vectorization. `doubleToLongBits`
>> and `floatToIntBits` do not vectorize because of their control flow (the NaN
>> check), and their essentially unchanged scores show that these changes do not
>> affect their performance.
>>
>> I've run the tier1-3 tests on linux/aarch64 and didn't observe any
>> regressions.
>
> src/hotspot/share/opto/vectornode.cpp line 1831:
>
>> 1829:
>> 1830: bool VectorReinterpretNode::implemented(int opc, uint vlen, BasicType src_type, BasicType dst_type) {
>> 1831:   if ((src_type == T_FLOAT && dst_type == T_INT) ||
>
> Just a suggestion: do you feel a `switch-case` would be more readable/clearer
> in this case? Something like this -
>
>
> bool VectorReinterpretNode::implemented(uint vlen, BasicType src_type, BasicType dst_type) {
>   switch (src_type) {
>     case T_FLOAT:
>       if (dst_type != T_INT)    return false;
>       break;
>     case T_INT:
>       if (dst_type != T_FLOAT)  return false;
>       break;
>     case T_DOUBLE:
>       if (dst_type != T_LONG)   return false;
>       break;
>     case T_LONG:
>       if (dst_type != T_DOUBLE) return false;
>       break;
>     default:
>       return false;
>   }
>   return Matcher::match_rule_supported_auto_vectorization(Op_VectorReinterpret, vlen, dst_type);
> }
Both options look fine to me, but I'm happy to rewrite it like that if others
feel the same way.
> test/micro/org/openjdk/bench/java/lang/VectorBitConversion.java line 67:
>
>> 65:
>> 66: @Benchmark
>> 67: public long[] doubleToLongBits() {
>
> Would something like this be more concise (and maybe more readable as well) -
>
> @Benchmark
> public long[] doubleToLongBits() {
>     for (int i = 0; i < doubles.length; i++) {
>         resultLongs[i] = Double.doubleToLongBits(doubles[i]);
>     }
>     return resultLongs;
> }
>
>
> The loop should still get vectorized (if vectorizable).
>
> Same for other benchmarks.
Maybe, but there's a reason I wrote these benchmark methods this way: keeping
each line doing one thing makes it easier to map each statement to the generated
assembly (e.g. with `perfasm`) and to the corresponding IR nodes (e.g. with
`PrintIdeal`). That, IMO, is more important here than the conciseness of the
benchmark. What do others think?
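
For illustration, here's a rough, self-contained sketch of the shape I have in
mind. It is not the exact benchmark from this PR; the class name, `@Param`
value and setup data are made up for the example, and the IR-node comments
assume the scalar/vector nodes discussed in this PR.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

// Sketch only: names and sizes are illustrative, not taken from the PR.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class BitConversionStyleSketch {

    @Param("2048")
    int size;

    double[] doubles;
    long[] resultLongs;

    @Setup
    public void setup() {
        doubles = new double[size];
        resultLongs = new long[size];
        for (int i = 0; i < size; i++) {
            doubles[i] = i * 1.5;
        }
    }

    @Benchmark
    public long[] doubleToRawLongBits() {
        for (int i = 0; i < doubles.length; i++) {
            double d = doubles[i];                      // load
            long bits = Double.doubleToRawLongBits(d);  // MoveD2L (VectorReinterpret when vectorized)
            resultLongs[i] = bits;                      // store
        }
        return resultLongs;
    }
}

Each statement corresponds to one load, one bit-cast and one store, so a hot
line in the `perfasm` profile or a node in the `PrintIdeal` dump can be
attributed to a specific step. With the compact form those three steps collapse
into a single source line.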
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2251893141
PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2251889884