Re: RFR: 8382052: VectorAPI: Optimize the lanewise BITWISE_BLEND for AArch64 [v3]

Eric Fang Tue, 09 Jun 2026 21:44:21 -0700

On Tue, 9 Jun 2026 13:19:15 GMT, Emanuel Peter <[email protected]> wrote:


>> Eric Fang has updated the pull request with a new target base due to a merge 
>> or a rebase. The incremental webrev excludes the unrelated changes brought 
>> in by the merge/rebase. The pull request contains five additional commits 
>> since the last revision:
>> 
>>  - Fine-tuning the code
>>  - Merge branch 'master' into JDK-8382052-bitwise-blend
>>  - Implement bitwise_blend in IGVN
>>    
>>    The latest changes:
>>    
>>    1. Defined a new IR `VectorBitwiseBlendNode`
>>    2. Do the optimization in IGVN:
>>    // XorV(a, AndV(sel, XorV(a, b))) => VectorBitwiseBlend(a, b, sel)
>>    // XorV(a, AndV(sel, XorV(a, b)), mask) =>
>>    //   VectorBlend(a, VectorBitwiseBlend(a, b, sel), mask)
>>    
>>    3. Adjust the ad file match rules to match `VectorBitwiseBlendNode`.
>>    4. Adjust the JTReg tests to check `VectorBitwiseBlendNode`.
>>  - Merge branch 'master' into JDK-8382052-bitwise-blend
>>  - 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND 
>> operation with BSL
>>    
>>    Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
>>    generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
>>    provides a more efficient mapping for this operation through the NEON
>>    `BSL` and SVE `BSL` (bitwise select) instructions.
>>    
>>    This change teaches C2 to recognize the `BITWISE_BLEND` patterns and
>>    lower them to the dedicated AArch64 instructions for better performance.
>>    
>>    The change includes the AArch64 match rules and assembler support,
>>    updates the AArch64 asm tests, adds IR framework nodes for the new mach
>>    instructions, introduces a new jtreg IR test and extends the
>>    MaskedLogicOpts JMH benchmark for 128-bit long type.
>>    
>>    JMH results show **11% - 54%** performance improvements for the
>>    optimized cases, and all jtreg tests (tier1, tier2 and tier3) pass on
>>    SVE2, SVE1, and NEON configurations.
>>    
>>    On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>>    ```
>>    Benchmark                         Unit    ARRAYLEN  Before    Error   After     Error   Uplift
>>    bitwiseBlendOperationInt128       ops/s   256.00    3787.49   5.29    4277.64   8.89    1.13
>>    bitwiseBlendOperationInt128       ops/s   512.00    1888.24   11.02   2143.21   6.32    1.14
>>    bitwiseBlendOperationInt128       ops/s   1024.00   938.22    6.24    1053.45   14.68   1.12
>>    bitwiseBlendOperationLong128      ops/s   256.00    1895.45   13.68   2140.31   3.68    1.13
>>    bitwiseBlendOperationLong128      ops/s   512.00    938.71    5.32    1052.16   14.07   1.12
>>    bitwi...
>
> src/hotspot/share/opto/vectornode.cpp line 2798:
> 
>> 2796:   } else if (in(2)->Opcode() == Op_AndV) {
>> 2797:     andv = in(2);
>> 2798:     a = in(1);
> 
> This could be simplified to an `||` condition with the same body, no?
> At least: flip the two lines in the if-branch, in all others you assign 
> `andv` first.

Makes sense, done, thanks!
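
For readers following the thread, the identity behind the quoted IGVN rewrite
`XorV(a, AndV(sel, XorV(a, b))) => VectorBitwiseBlend(a, b, sel)` can be
checked on scalar `long` values. This is only a minimal sketch: the class and
method names below are hypothetical and are not part of the PR, and it assumes
the blend takes bits of `b` where `sel` is set and bits of `a` elsewhere.

```java
import java.util.Random;

// Hypothetical illustration class; not code from the PR.
public class BitwiseBlendSketch {
    // XOR/AND form that the quoted rewrite matches: a ^ (sel & (a ^ b)).
    static long blendXor(long a, long b, long sel) {
        return a ^ (sel & (a ^ b));
    }

    // Bit-select form: bits of b where sel is 1, bits of a where sel is 0.
    static long blendSelect(long a, long b, long sel) {
        return (a & ~sel) | (b & sel);
    }

    public static void main(String[] args) {
        // Compare both forms on pseudo-random inputs (fixed seed for repeatability).
        Random rnd = new Random(42);
        for (int i = 0; i < 1000; i++) {
            long a = rnd.nextLong();
            long b = rnd.nextLong();
            long sel = rnd.nextLong();
            if (blendXor(a, b, sel) != blendSelect(a, b, sel)) {
                throw new AssertionError("forms differ at iteration " + i);
            }
        }
        System.out.println("both forms agree");
    }
}
```

Both forms agree bit by bit: where a `sel` bit is 1, `a ^ (a ^ b)` reduces to
`b`; where it is 0, the AND term vanishes and `a` passes through unchanged.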

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31269#discussion_r3385514963
