On Mon, 19 May 2025 03:10:46 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
>> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs >> for X86 platforms [1]. However, the current implementation is not optimal >> for AArch64 SVE platform, which natively supports vector instructions for >> subword gather load operations using an int vector for indices (see [2][3]). >> >> Two key areas require improvement: >> 1. At the Java level, vector indices generated for range validation could be >> reused for the subsequent gather load operation on architectures with native >> vector instructions like AArch64 SVE. However, the current implementation >> prevents compiler reuse of these index vectors due to divergent control >> flow, potentially impacting performance. >> 2. At the compiler IR level, the additional `offset` input for >> `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR >> complexity and complicates backend implementation. Furthermore, generating >> `add` instructions before each memory access negatively impacts performance. >> >> This patch refactors the implementation at both the Java level and compiler >> mid-end to improve efficiency and maintainability across different >> architectures. >> >> Main changes: >> 1. Java-side API refactoring: >> - Explicitly passes generated index vectors to hotspot, eliminating >> duplicate index vectors for gather load instructions on >> architectures like AArch64. >> 2. C2 compiler IR refactoring: >> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword >> types by removing the memory offset input and incorporating it into the >> memory base `addr` at the IR level. This simplifies backend implementation, >> reduces add operations, and unifies the IR across all types. >> 3. Backend changes: >> - Streamlines X86 implementation of subword gather operations following >> the removal of the offset input from the IR level. >> >> Performance: >> The performance of the relative JMH improves up to 27% on a X86 AVX512 >> system. Please see the data below: >> >> Benchmark Mode Cnt Unit >> SIZE Before After Gain >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms >> 64 53682.012 52650.325 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms >> 256 14484.252 14255.156 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms >> 1024 3664.900 3595.615 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms >> 4096 908.31... > > Ping again~ could any one please take a look at this PR? Thanks a lot! Hi @XiaohongGong , Very nice work!, Looks good to me, will do some testing and get back. Do you have any idea about following regression? GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms 64 55844.814 48311.847 0.86 GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms 256 15139.459 13009.848 0.85 GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms 1024 3861.834 3284.944 0.85 GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms 4096 938.665 817.673 0.87 Best Regards ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-2890659302