Re: RFR: 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND operation with BSL

Eric Fang Thu, 28 May 2026 19:55:11 -0700

On Thu, 28 May 2026 08:58:31 GMT, Andrew Haley <[email protected]> wrote:


>> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a 
>> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64 
>> provides a more efficient mapping for this operation through the NEON `BSL` 
>> and SVE `BSL` (bitwise select) instructions.
>> 
>> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower 
>> them to the dedicated AArch64 instructions for better performance.
>> 
>> The change includes the AArch64 match rules and assembler support, updates 
>> the AArch64 asm tests, adds IR framework nodes for the new mach 
>> instructions, introduces a new jtreg IR test and extends the MaskedLogicOpts 
>> JMH benchmark for 128-bit long type.
>> 
>> JMH results show **11% - 54%** performance improvements for the optimized 
>> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and 
>> NEON configurations.
>> 
>> On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>> 
>> 
>> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
>> bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
>> bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
>> bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
>> bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
>> bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
>> 
>> 
>> On a Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
>> bitwiseBlendOperationInt128   ops/s  512.00    1145.50 ...
>
> src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 4761:
> 
>> 4759:   match(Set dst_src1 (XorV (Binary src3 (AndV dst_src1 (XorV src2 src3))) pg));
>> 4760:   // Second form: inner XorV may have operands (src3, src2) after Ideal/GVN.
>> 4761:   // adlc only auto-swaps commutative ops when at least one operand is a subtree,
> 
> Is there a way to canonicalize this further, so that the two forms are not 
> both necessary?

This is an ADLC limitation, see 
https://github.com/openjdk/jdk/blob/f2f8828188f45d16344c82adfbf951f7409b8825/src/hotspot/share/adlc/formssel.cpp#L3974
and
https://github.com/openjdk/jdk/blob/f2f8828188f45d16344c82adfbf951f7409b8825/src/hotspot/share/adlc/formssel.cpp#L4032

This restriction appears to be intentional rather than accidental: ADLC limits 
auto-swapping to operand pairs where at least one side is a subtree.

I did try lifting that restriction and hit real build failures: some 
"commutative" leaves are commutative as a Java/IR operation but order-sensitive 
in their expand lowering. For example, AArch64's `minI_reg_reg`:

instruct minI_reg_reg(iRegINoSp dst, iRegIorL2I src1, iRegIorL2I src2)
%{
  match(Set dst (MinI src1 src2));
  expand %{
    rFlagsReg cr;
    compI_reg_reg(cr, src1, src2);          // operand order matters here
    cmovI_reg_reg_lt(dst, src1, src2, cr);
  %}
%}

If adlc were to auto-swap the leaves of MinI, the `compI/cmov` expand would 
become incorrect. Several platforms have similar patterns, so a safe fix would 
have to audit them all.
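
To make the ordering hazard concrete, here is a minimal scalar sketch in plain Java. It is only a model, not HotSpot code: `compareLt` and `cmovLt` are hypothetical stand-ins for `compI_reg_reg` and `cmovI_reg_reg_lt`, and the "mismatched" variant assumes a swap that reaches the compare but not the conditional move.

```java
// Scalar model of the minI_reg_reg expand. All names here are hypothetical
// stand-ins for illustration; this is not how C2 or adlc is implemented.
final class MinExpandOrder {
    // Models compI_reg_reg: the flag is set when lhs < rhs (the "lt" condition).
    static boolean compareLt(int lhs, int rhs) {
        return lhs < rhs;
    }

    // Models cmovI_reg_reg_lt: picks ifLt when the flag is set, otherwise the other operand.
    static int cmovLt(int ifLt, int otherwise, boolean lt) {
        return lt ? ifLt : otherwise;
    }

    // Both steps see the operands in the same order: computes min(src1, src2).
    static int minExpand(int src1, int src2) {
        boolean cr = compareLt(src1, src2);
        return cmovLt(src1, src2, cr);
    }

    // Assumption: only the compare sees swapped operands. The result is then max, not min.
    static int minExpandMismatched(int src1, int src2) {
        boolean cr = compareLt(src2, src1);
        return cmovLt(src1, src2, cr);
    }

    public static void main(String[] args) {
        System.out.println(minExpand(3, 7));           // prints 3
        System.out.println(minExpandMismatched(3, 7)); // prints 7
    }
}
```

The operation itself is commutative (`minExpand(7, 3)` also gives 3); only the pairing between the two expanded instructions is order-sensitive.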

But I will try the implementation method suggested by @XiaohongGong first, 
which is to introduce a `BitwiseBlendNode` and then convert this pattern into a 
`BitwiseBlendNode` in IGVN. Perhaps this can avoid writing these match rules. 
Thanks~
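
For reference, a small scalar check in plain Java of why the two match forms describe the same operation. The method names are hypothetical and this is only a sketch on `long` values, not the Vector API or C2 code: it compares the select form `(a & ~c) | (b & c)` with the XOR/AND/XOR form in both operand orders of the inner XOR.

```java
// Scalar sketch of the bitwise-blend identity. Method names are hypothetical.
final class BitwiseBlendIdentity {
    // Select form: take bits of b where c is 1, bits of a where c is 0.
    static long blendSelect(long a, long b, long c) {
        return (a & ~c) | (b & c);
    }

    // XOR/AND/XOR form, inner XOR written as (a ^ b).
    static long blendXorForm(long a, long b, long c) {
        return a ^ ((a ^ b) & c);
    }

    // Same form with the inner XOR and the AND operands written in the other order.
    static long blendXorFormSwapped(long a, long b, long c) {
        return a ^ (c & (b ^ a));
    }

    public static void main(String[] args) {
        long a = 0x0F0F_0F0F_0F0F_0F0FL;
        long b = 0x1234_5678_9ABC_DEF0L;
        long c = 0xFFFF_0000_FFFF_0000L;
        System.out.println(blendSelect(a, b, c) == blendXorForm(a, b, c));        // prints true
        System.out.println(blendSelect(a, b, c) == blendXorFormSwapped(a, b, c)); // prints true
    }
}
```

Since both orderings compute the same value, a single canonical node would only need one match rule per instruction form, assuming IGVN recognizes both shapes when building that node.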

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31269#discussion_r3321849567
