On Tue, 3 Mar 2026 06:12:23 GMT, Xiaohong Gong <[email protected]> wrote:
> Duplicate `ptrue`(`MaskAll`) instructions are generated with different
> predicate registers on SVE when multiple `VectorMask.not()` operations exist.
> This increases the predicate register pressure and reduces performance,
> especially after the loop is unrolled.
>
> Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e.
> `(XorVMask (MaskAll m1))`), but SVE has no match rule for that alone. And the
> cloned `MaskAll` nodes are not shared with each other.
>
> Since SVE has rules for the `andNot` pattern:
>
>     match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
>
> the `MaskAll` node should be cloned only when it is part of the `andNot`
> pattern.
>
> A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the
> matcher's commutative vector op list, so their operands are never swapped. As
> a result, the `andNot` rule does not match when the `XorVMask` operands
> appear in the opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
>
> This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2)
> adding the three binary mask bitwise IRs to the commutative op list.
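
To make the second issue concrete, here is a minimal toy model in plain Java. It is not HotSpot's matcher: `Node`, `matchesAndNot`, `matchesWithSwaps`, and `swaps` are hypothetical names invented for illustration, and `MaskAll` is simplified to a leaf. It only sketches why a rule written as `(AndVMask pn (XorVMask pm (MaskAll m1)))` misses `(XorVMask (MaskAll m1) pm)` unless the ops are treated as commutative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class CommutativeMatchSketch {
    // Toy IR node: an opcode name plus up to two inputs (hypothetical, not HotSpot's Node).
    static final class Node {
        final String op;
        final Node in1, in2;
        Node(String op, Node in1, Node in2) { this.op = op; this.in1 = in1; this.in2 = in2; }
        static Node leaf(String op) { return new Node(op, null, null); }
    }

    // Strict match of (AndVMask pn (XorVMask pm MaskAll)): operand order must agree exactly.
    static boolean matchesAndNot(Node n) {
        return n.op.equals("AndVMask")
            && n.in2 != null && n.in2.op.equals("XorVMask")
            && n.in2.in2 != null && n.in2.in2.op.equals("MaskAll");
    }

    // The node as written, plus its operand-swapped form if its op is listed as commutative.
    static List<Node> swaps(Node n, Set<String> commutativeOps) {
        List<Node> out = new ArrayList<>();
        out.add(n);
        if (n.in1 != null && n.in2 != null && commutativeOps.contains(n.op)) {
            out.add(new Node(n.op, n.in2, n.in1));
        }
        return out;
    }

    // Try the strict rule on every swap variant of the outer node and of its second input.
    static boolean matchesWithSwaps(Node n, Set<String> commutativeOps) {
        for (Node outer : swaps(n, commutativeOps)) {
            if (outer.in2 == null) continue;
            for (Node inner : swaps(outer.in2, commutativeOps)) {
                if (matchesAndNot(new Node(outer.op, outer.in1, inner))) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node pn = Node.leaf("pn"), pm = Node.leaf("pm"), all = Node.leaf("MaskAll");
        // Same computation as the rule, but with the XorVMask operands in the opposite order.
        Node swapped = new Node("AndVMask", pn, new Node("XorVMask", all, pm));
        System.out.println(matchesWithSwaps(swapped, Set.of()));
        System.out.println(matchesWithSwaps(swapped, Set.of("AndVMask", "OrVMask", "XorVMask")));
    }
}
```

In this sketch, an empty commutative set stands in for the behaviour before the patch, and a set containing the three mask ops stands in for the behaviour after it.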
>
> The following are the performance results of the newly added JMH benchmark,
> tested on V1 and Grace (V2) machines respectively:
>
> V1 (SVE machine with 256-bit vector length):
>
> Benchmark                                    Mode   Threads  Samples  Unit    size  Before     After      Gain
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  256   54465.231  74374.960  1.365
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  512   29156.881  39601.358  1.358
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  1024  15169.894  20272.379  1.336
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  256   15408.510  19808.722  1.285
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  512   7906.952   10297.837  1.302
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  1024  3767.122   5097.853   1.353
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  256   7762.614   10534.290  1.357
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  512   3976.759   5123.445   1.288
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  1024  1937.389   2573.394   1.328
> MaskLogicOperationsB...
I'm not an expert in this area, so I can't review it, but I've added a couple of small comments.
src/hotspot/cpu/aarch64/aarch64.ad line 2687:
> 2685: VectorNode::is_all_ones_vector(m)) {
> 2686: // Check whether n is only used by an AndVMask node.
> 2687: if (n->outcnt() == 1) {
This is not something for this PR, but could this optimization also apply if
the mask is used by more than one node? Could that be done as a follow-up, or
would it not work at all? If it does not make sense to do so, it might be
worth adding a comment for future readers.
test/micro/org/openjdk/bench/jdk/incubator/vector/MaskLogicOperationsBenchmark.java line 67:
> 65: @Benchmark
> 66: public void byteMaskAndNot() {
> 67: VectorMask<Byte> vm1 = VectorMask.fromArray(B_SPECIES, ma, 0);
If what is being benchmarked is the loop, would it make sense to move this to
`@Setup`? Same thing for the other benchmarks below.
-------------
PR Review: https://git.openjdk.org/jdk/pull/30013#pullrequestreview-3959513200
PR Review Comment: https://git.openjdk.org/jdk/pull/30013#discussion_r2945674412
PR Review Comment: https://git.openjdk.org/jdk/pull/30013#discussion_r2945630038