On Thu, 27 Nov 2025 01:42:07 GMT, Xiaohong Gong <[email protected]> wrote:
> The current subword (`byte`/`short`) gather load API implementation is not
> well-suited for platforms that provide native vector instructions for these
> operations. As **discussed in PR [1]**, we'd like to re-implement these APIs
> with a **unified cross-platform** solution.
>
> The main idea is to re-implement the API at the Java level, by performing
> multiple sub-gather operations. Each sub-gather operation loads a portion of
> the elements using a specific index vector by calling the HotSpot intrinsic API.
> The partial results are then merged using vector `slice` and `or` operations.
> This design simplifies the VM compiler intrinsic implementation and better
> aligns with the Vector API design principles.
>
> Key changes:
> 1. Re-implement the subword gather load API at the Java level. The HotSpot
> intrinsic `VectorSupport.loadWithMap` is simplified by reducing the vector
> index parameters from four (vix1-vix4) to a single parameter.
> 2. Adjust the compiler intrinsic implementation to support the new Java API,
> including updates to the x86 backend implementation.
>
> The performance impact varies across different scenarios on X86. I tested the
> performance with different AVX levels on an X86 machine that supports AVX512.
> To achieve optimal performance, I also **applied PR [2]**, which improves the
> performance of the **`slice()`** API on X86. The following summarizes the
> performance gains, where:
>
> - "non masked" means the gather operation is not the masked gather API.
> - "masked" means the gather operation is the masked gather API.
> - "1 gather cases" means the gather API is implemented with a single gather
> operation. E.g. Load `Short128Vector` with `MaxVectorSize=256`.
> - "2 gather cases" means the gather API is implemented with 2 parts of gather
> operations. E.g. Load `Short256Vector` with `MaxVectorSize=256`.
> - "4 gather cases" means the gather API is implemented with 4 parts of gather
> operations. E.g. Load `Byte256Vector` with `MaxVectorSize=256`.
> - "Un-intrinsified" means the gather operation cannot be intrinsified by
> HotSpot. E.g. Load `Byte512Vector` with `MaxVectorSize=256`.
> The significant performance uplift comes from the Java-level changes, which
> remove the vector index generation and range checks for such cases.
>
>
> ----------------------------------------------------------------------------
> | UseAVX=3 | UseAVX=2 |
> |-----------------------------|-----------------------------|
> | non maske...

Hi @iwanowww, @PaulSandoz, @sviswa7, @jatin-bhateja,

this is a refactoring patch for the subword gather-load APIs, together with the X86 changes, as we discussed in https://github.com/openjdk/jdk/pull/26236. Could you please take a look?

Since I'm not very familiar with X86 instructions, any feedback or help from @sviswa7 or @jatin-bhateja would be very helpful. There are performance regressions with the current version, but I think there is still room to improve the X86 codegen, so I'd appreciate any help with that.

Thanks a lot in advance!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28520#issuecomment-3583917222
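
For readers unfamiliar with the approach, here is a rough scalar model of the "multiple sub-gathers, then merge with `slice` and `or`" idea described in the quoted text. It is only a sketch under stated assumptions: it uses plain arrays rather than the Vector API, and the names `subGather`, `shiftUp`, `or`, and `gather` are hypothetical helpers, not the methods in the patch. `subGather` stands in for one call to the intrinsic, and `shiftUp` models a `slice` against a zero vector.

```java
import java.util.Arrays;

public class SubwordGatherSketch {
    // Hypothetical stand-in for one intrinsic sub-gather: loads `count` bytes
    // into the low lanes of a `lanes`-wide vector; the remaining lanes stay zero.
    static byte[] subGather(byte[] src, int offset, int[] indexMap,
                            int mapOffset, int count, int lanes) {
        byte[] part = new byte[lanes];
        for (int i = 0; i < count; i++) {
            part[i] = src[offset + indexMap[mapOffset + i]];
        }
        return part;
    }

    // Scalar model of a slice against a zero vector: moves every lane up by
    // `k` positions and fills the low `k` lanes with zero.
    static byte[] shiftUp(byte[] v, int k) {
        byte[] r = new byte[v.length];
        System.arraycopy(v, 0, r, k, v.length - k);
        return r;
    }

    // Lane-wise bitwise OR of two equally sized vectors.
    static byte[] or(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] | b[i]);
        }
        return r;
    }

    // Gathers `lanes` bytes in `parts` sub-gathers and merges the partial results.
    static byte[] gather(byte[] src, int offset, int[] indexMap,
                         int mapOffset, int lanes, int parts) {
        int perPart = lanes / parts;
        byte[] result = new byte[lanes];
        for (int p = 0; p < parts; p++) {
            byte[] part = subGather(src, offset, indexMap,
                                    mapOffset + p * perPart, perPart, lanes);
            // Move the partial result to its final lane position, then merge.
            result = or(result, shiftUp(part, p * perPart));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] src = new byte[64];
        for (int i = 0; i < src.length; i++) src[i] = (byte) i;
        int[] indexMap = new int[32];
        for (int i = 0; i < indexMap.length; i++) indexMap[i] = 63 - 2 * i;

        // Models a 32-lane byte vector loaded as 4 sub-gathers of 8 lanes each.
        byte[] merged = gather(src, 0, indexMap, 0, 32, 4);

        // Straightforward element-by-element reference gather for comparison.
        byte[] reference = new byte[32];
        for (int i = 0; i < 32; i++) reference[i] = src[indexMap[i]];

        System.out.println(Arrays.equals(merged, reference));
    }
}
```

With 4 parts and 32 lanes this corresponds to the "4 gather cases" shape above; with `parts = 1` the loop degenerates to a single sub-gather with no shifting.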
