On Mon, 19 May 2025 03:10:46 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
>> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs
>> for X86 platforms [1]. However, the current implementation is not optimal
>> for the AArch64 SVE platform, which natively supports vector instructions
>> for subword gather load operations using an int vector for indices (see
>> [2][3]).
>>
>> Two key areas require improvement:
>> 1. At the Java level, the vector indices generated for range validation
>>    could be reused for the subsequent gather load operation on
>>    architectures with native vector instructions such as AArch64 SVE.
>>    However, the current implementation prevents the compiler from reusing
>>    these index vectors due to divergent control flow, potentially impacting
>>    performance.
>> 2. At the compiler IR level, the additional `offset` input for
>>    `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases
>>    IR complexity and complicates the backend implementation. Furthermore,
>>    generating `add` instructions before each memory access negatively
>>    impacts performance.
>>
>> This patch refactors the implementation at both the Java level and the
>> compiler mid-end to improve efficiency and maintainability across different
>> architectures.
>>
>> Main changes:
>> 1. Java-side API refactoring:
>>    - Explicitly passes the generated index vectors to hotspot, eliminating
>>      duplicate index vectors for gather load instructions on architectures
>>      like AArch64.
>> 2. C2 compiler IR refactoring:
>>    - Refactors the `LoadVectorGather`/`LoadVectorGatherMasked` IR for
>>      subword types by removing the memory offset input and incorporating it
>>      into the memory base `addr` at the IR level. This simplifies the
>>      backend implementation, reduces add operations, and unifies the IR
>>      across all types.
>> 3. Backend changes:
>>    - Streamlines the X86 implementation of subword gather operations
>>      following the removal of the offset input at the IR level.
>>
>> Performance:
>> The relevant JMH benchmarks improve by up to 27% on an X86 AVX512 system.
>> Please see the data below:
>>
>> Benchmark                                     Mode  Cnt    Unit  SIZE     Before      After  Gain
>> GatherOperationsBenchmark.microByteGather128  thrpt  30  ops/ms    64  53682.012  52650.325  0.98
>> GatherOperationsBenchmark.microByteGather128  thrpt  30  ops/ms   256  14484.252  14255.156  0.98
>> GatherOperationsBenchmark.microByteGather128  thrpt  30  ops/ms  1024   3664.900   3595.615  0.98
>> GatherOperationsBenchmark.microByteGather128  thrpt  30  ops/ms  4096    908.31...
>
> Ping again~ could anyone please take a look at this PR? Thanks a lot!
> Hi @XiaohongGong, very nice work! Looks good to me; I will do some testing
> and get back.
>
> Do you have any idea about the following regression?
>
> ```
> GatherOperationsBenchmark.microByteGather256  thrpt  30  ops/ms    64  55844.814  48311.847  0.86
> GatherOperationsBenchmark.microByteGather256  thrpt  30  ops/ms   256  15139.459  13009.848  0.85
> GatherOperationsBenchmark.microByteGather256  thrpt  30  ops/ms  1024   3861.834   3284.944  0.85
> GatherOperationsBenchmark.microByteGather256  thrpt  30  ops/ms  4096    938.665    817.673  0.87
> ```
>
> Best Regards

Yes, I also observed this regression. After analyzing it, I found that it is
caused by the Java-side changes, which influence the range check elimination
inside `IntVector.fromArray()` and `ByteVector.intoArray()` in the benchmark.
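For context, here is a minimal, hypothetical sketch (simplified names and
shapes, not the actual JDK implementation) of the Java-side data-flow
difference. The only point it illustrates is that the caller's loop offset now
reaches `IntVector.fromArray()` directly, so its range check contributes an
extra cast on the loop variable, as analyzed below:

```
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

class IndexLoadSketch {
    static final VectorSpecies<Integer> ISP = IntVector.SPECIES_256;

    // Before (roughly): the Java implementation consumed the offset inside
    // its own loop, so the caller's loop variable fed a separate loop phi
    // rather than feeding IntVector.fromArray() itself.
    static int[] loadIndexesOld(int[] indexMap, int mapOffset, int vlen) {
        int[] indexes = new int[vlen];
        for (int j = 0; j < vlen; j++) {
            indexes[j] = indexMap[mapOffset + j]; // offset hidden behind phi on j
        }
        return indexes;
    }

    // After (roughly): the offset flows straight into fromArray(), whose
    // range check casts it, adding an extra CastII on the caller's loop phi.
    static IntVector loadIndexesNew(int[] indexMap, int mapOffset) {
        return IntVector.fromArray(ISP, indexMap, mapOffset);
    }
}
```

The real implementation passes the loaded index vector on to the gather
intrinsic; this sketch only mirrors where the offset's bounds check happens.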
The root cause is that the counted loop in the following benchmark case is not
recognized by the compiler as expected:

```
public void microByteGather256() {
    for (int i = 0; i < SIZE; i += B256.length()) {
        ByteVector.fromArray(B256, barr, 0, index, i)
                  .intoArray(bres, i);
    }
}
```

The loop iv phi node is not recognized when C2 matches the counted loop
pattern, because in this case it has been cast twice with `CastII`. The ideal
graph looks like:

```
   Loop
     \
      \ /
   -------------------------
    \ /                 |
    Phi                 |
     |                  |
   CastII               |
     |                  |
   CastII               |
     |                  |
     |   \  ConI        |
     |    \   |         |
      AddVI             |
        |---------------|
```

The relevant code is
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopnode.cpp#L1667.
Before the change, the graph was:

```
   Loop
     \
      \ /
   -------------------------
    \ /                 |
    Phi                 |
     |                  |
   CastII               |
     |                  |
     |   \  ConI        |
     |    \   |         |
      AddVI             |
        |---------------|
```

The difference comes from the index generation in `ByteVector.fromArray()` (I
mean the call to `IntVector.fromArray()` on the Java side). Before, the `i` in
the loop above was not used directly by `IntVector.fromArray()`; instead, it
was used by another loop phi node, so there was no additional `CastII`. After
my change, the loop in the Java implementation of `ByteVector.fromArray()` is
removed and `i` is used directly by `IntVector.fromArray()`, where it is
consumed by the bounds check before the indexes are loaded. Hence another
`CastII` is generated.

When recognizing the loop iv phi node, C2 first checks whether the `Phi` is
used by a `CastII` and, if so, takes the input of the `CastII`. But this check
happens only once. See
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopnode.cpp#L1659:

```
if (xphi->Opcode() == Op_Cast(iv_bt)) {
  xphi = xphi->in(1);
}
```

Once the `PhiNode` is cast more than once, the pattern is not recognized. I
think we should refine the counted loop recognition logic by changing the
above `if` to a `while`, so that the iv phi node is recognized successfully.
The potential change would be:

```
while (xphi->Opcode() == Op_Cast(iv_bt)) {
  xphi = xphi->in(1);
}
```

I've tested this change and found that the regressed benchmarks recover their
previous performance. Since I'm not familiar with C2's loop transformation
code, I would prefer to investigate this issue further and fix it with a
follow-up patch. Any suggestions?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-2892720654