On Tue, 3 Mar 2026 06:12:23 GMT, Xiaohong Gong <[email protected]> wrote:

> Duplicate `ptrue`(`MaskAll`) instructions are generated with different 
> predicate registers on SVE when multiple `VectorMask.not()` operations exist. 
> This increases the predicate register pressure and reduces performance, 
> especially after the loop is unrolled.
> 
> Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e. 
> `(XorVMask (MaskAll m1))`), but SVE has no match rule for that alone. And the 
> cloned `MaskAll` nodes are not shared with each other.
> 
> Since SVE has rules for the `andNot` pattern:
> 
>   match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
> 
> the `MaskAll` node should be cloned only when it is part of that `andNot` 
> pattern.
> 
> A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the 
> matcher's commutative vector op list, so their operands are never swapped. As 
> a result, the `andNot` rule does not match when the `XorVMask` operands 
> appear in the opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
> 
> This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2) 
> adding the three binary mask bitwise IRs to the commutative op list.
> 
> The following are the performance results of the newly added JMH benchmark, 
> tested on V1 and Grace (V2) machines respectively:
> 
> V1 (SVE machine with 256-bit vector length):
> 
> Benchmark                                    Mode   Threads  Samples  Unit    size     Before      After  Gain
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms   256  54465.231  74374.960  1.365
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms   512  29156.881  39601.358  1.358
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  1024  15169.894  20272.379  1.336
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms   256  15408.510  19808.722  1.285
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms   512   7906.952  10297.837  1.302
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  1024   3767.122   5097.853  1.353
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms   256   7762.614  10534.290  1.357
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms   512   3976.759   5123.445  1.288
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  1024   1937.389   2573.394  1.328
> MaskLogicOperationsB...

I'm not an expert in this area, so I can't review it properly, but I've added 
a couple of small comments.

src/hotspot/cpu/aarch64/aarch64.ad line 2687:

> 2685:       VectorNode::is_all_ones_vector(m)) {
> 2686:     // Check whether n is only used by an AndVMask node.
> 2687:     if (n->outcnt() == 1) {

This is not something for this PR, but could this optimization also apply if 
the mask were used by more than one node? Is this something that could be done 
as a follow-up, or would it not work at all? If it doesn't make sense to do so, 
it might be worth adding a comment for future readers.
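
To make the question concrete, here is a toy model in Java. It is not HotSpot 
code: the `Node` and `Op` types and both predicate names are made up for 
illustration. `onlyUsedByAndVMask` mirrors the quoted `outcnt() == 1` check as 
I read it from the comment, and `allUsersAreAndVMask` is the hypothetical 
relaxation I'm asking about. Whether that relaxed form would be valid in the 
matcher is exactly the open question.

```java
import java.util.ArrayList;
import java.util.List;

class MaskAllCloneSketch {
    // Hypothetical opcodes; the names echo the IR nodes mentioned in the PR.
    enum Op { MASK_ALL, XOR_V_MASK, AND_V_MASK, OR_V_MASK, OTHER }

    // Toy IR node that records only its opcode and its users (outputs).
    static final class Node {
        final Op op;
        final List<Node> users = new ArrayList<>();

        Node(Op op, Node... inputs) {
            this.op = op;
            for (Node in : inputs) {
                in.users.add(this); // register this node as a user of each input
            }
        }
    }

    // Models the quoted check: exactly one user, and that user is an AndVMask.
    static boolean onlyUsedByAndVMask(Node xor) {
        return xor.users.size() == 1 && xor.users.get(0).op == Op.AND_V_MASK;
    }

    // Hypothetical relaxation: any number of users, as long as all are AndVMask.
    static boolean allUsersAreAndVMask(Node xor) {
        return !xor.users.isEmpty()
                && xor.users.stream().allMatch(u -> u.op == Op.AND_V_MASK);
    }

    public static void main(String[] args) {
        Node allOnes = new Node(Op.MASK_ALL);
        Node pm = new Node(Op.OTHER); // stand-in for an arbitrary mask input
        Node not = new Node(Op.XOR_V_MASK, pm, allOnes);

        new Node(Op.AND_V_MASK, not); // first AndVMask user
        System.out.println(onlyUsedByAndVMask(not));  // true
        System.out.println(allUsersAreAndVMask(not)); // true

        new Node(Op.AND_V_MASK, not); // second AndVMask user
        System.out.println(onlyUsedByAndVMask(not));  // false
        System.out.println(allUsersAreAndVMask(not)); // true

        new Node(Op.OR_V_MASK, not);  // a user that is not an AndVMask
        System.out.println(allUsersAreAndVMask(not)); // false
    }
}
```

The middle case (two `AndVMask` users) is the one I'm curious about: the 
current check rejects it, while the relaxed predicate would accept it.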

test/micro/org/openjdk/bench/jdk/incubator/vector/MaskLogicOperationsBenchmark.java line 67:

> 65:     @Benchmark
> 66:     public void byteMaskAndNot() {
> 67:         VectorMask<Byte> vm1 = VectorMask.fromArray(B_SPECIES, ma, 0);

If what is being benchmarked is the loop, would it make sense to move this to 
`@Setup`? Same thing for the other benchmarks below.

-------------

PR Review: https://git.openjdk.org/jdk/pull/30013#pullrequestreview-3959513200
PR Review Comment: https://git.openjdk.org/jdk/pull/30013#discussion_r2945674412
PR Review Comment: https://git.openjdk.org/jdk/pull/30013#discussion_r2945630038
