On Tue, 17 Mar 2026 10:08:01 GMT, Galder Zamarreño <[email protected]> wrote:

>> Duplicate `ptrue` (`MaskAll`) instructions are generated with different 
>> predicate registers on SVE when multiple `VectorMask.not()` operations 
>> exist. This increases predicate register pressure and reduces 
>> performance, especially after the loop is unrolled.
>> 
>> Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e. 
>> `(XorVMask (MaskAll m1))`), but SVE has no match rule for that alone. And 
>> the cloned `MaskAll` nodes are not shared with each other.
>> 
>> Since SVE has rules for the `andNot` pattern:
>> 
>>   match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
>> 
>> the `MaskAll` node should instead be cloned only when it is part of the 
>> `andNot` pattern.
>> 
>> A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the 
>> matcher's commutative vector op list, so their operands are never swapped. 
>> As a result, the `andNot` rule does not match when the `XorVMask` operands 
>> appear in the opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
>> 
>> This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2) 
>> adding the three binary mask bitwise IRs to the commutative op list.
>> 
>> Below are the performance results of the newly added JMH benchmark, tested 
>> on V1 and Grace (V2) machines respectively:
>> 
>> V1 (SVE machine with 256-bit vector length):
>> 
>> Benchmark                                    Mode   Threads  Samples  Unit    size  Before     After      Gain
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms   256  54465.231  74374.960  1.365
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms   512  29156.881  39601.358  1.358
>> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  1024  15169.894  20272.379  1.336
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms   256  15408.510  19808.722  1.285
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms   512   7906.952  10297.837  1.302
>> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  1024   3767.122   5097.853  1.353
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms   256   7762.614  10534.290  1.357
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms   512   3976.759   5123.445  1.288
>> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/...
>
> src/hotspot/cpu/aarch64/aarch64.ad line 2687:
> 
>> 2685:       VectorNode::is_all_ones_vector(m)) {
>> 2686:     // Check whether n is only used by an AndVMask node.
>> 2687:     if (n->outcnt() == 1) {
> 
> This is not something for this PR, but could this optimization also apply if 
> the mask was used by more than one node? Is this something that could be done 
> as a follow up? Or would it not work at all? If it doesn't make doing so it 
> might be worth adding a comment for future readers?

I'm not quite sure which `mask` you are referring to here.

If the mask is the result of `XorVMask`: when it is used by more than one node, 
I don't think its input `MaskAll` should be cloned. Otherwise, a duplicated 
`MachNode` for `MaskAll` would be generated even though the backend match rule 
fails to match, because the child `XorVMask` is not singly used. That is 
exactly the issue we want to fix here.

If the mask is `MaskAll` (the all-ones vector): when it is used by more than 
one node beyond the `andNot` pattern, the backend match still will not fire, 
because the matcher marks `MaskAll` as shared, and `pd_clone_node()` cannot 
handle such cases at the moment. I investigated how to fix this in the 
matcher, but the solution would add noticeable complexity, which does not 
seem worthwhile given the limited benefit, considering that `MaskAll` is a 
loop-invariant operation. Therefore, I'd prefer to keep the current behavior, 
but adding a comment here makes sense to me.
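
For context, the identity the `andNot` match rule relies on can be sketched 
with plain `long` bitmasks (my own stand-in for predicate registers; `pn`, 
`pm`, `m1` mirror the names in the match rule, and `AndNotSketch` is just an 
illustrative class name):

```java
public class AndNotSketch {
    public static void main(String[] args) {
        long pn = 0b1100L;   // hypothetical predicate values
        long pm = 0b1010L;
        long m1 = -1L;       // MaskAll: all bits set

        // VectorMask.not() is matched as (XorVMask pm (MaskAll m1)),
        // because xor with all-ones is bitwise negation:
        long notPm = pm ^ m1;

        // so (AndVMask pn (XorVMask pm (MaskAll m1))) is simply andNot:
        System.out.println((pn & notPm) == (pn & ~pm));  // prints "true"

        // XorVMask is commutative, which is why the operand order
        // (XorVMask (MaskAll m1) pm) must match the same rule:
        System.out.println((pm ^ m1) == (m1 ^ pm));      // prints "true"
    }
}
```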

> test/micro/org/openjdk/bench/jdk/incubator/vector/MaskLogicOperationsBenchmark.java
>  line 67:
> 
>> 65:     @Benchmark
>> 66:     public void byteMaskAndNot() {
>> 67:         VectorMask<Byte> vm1 = VectorMask.fromArray(B_SPECIES, ma, 0);
> 
> If what is being benchmarked is the loop, would it make sense to move this to 
> `@Setup`? Same thing for the other benchmarks below.

Thanks so much for looking at this PR!

> If what is being benchmarked is the loop, would it make sense to move this to 
> @Setup?

Yeah, moving the mask generation to `@Setup` would work, but it doesn't 
really change much compared to the current version. My concerns are:

1. In `@Setup`, we would have to generate a different vector mask for each 
element type (`byte`, `short`, `int`, `long`) and then use only the compatible 
one in each benchmark. That wouldn't significantly simplify the code.

2. It brings no performance benefit. If `vm1` is created in `@Setup init()`, 
the vector mask still has to be unboxed in the benchmark method before it is 
used. Vector unboxing is heavier than calling the `fromArray()` API directly, 
since it requires two memory loads (loading the vector mask object, then 
loading its payload), whereas the latter needs only one. Even though all of 
this happens outside the loop, it doesn't help the performance measurement, 
right?
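
To illustrate the extra indirection I mean, here is a rough sketch with a 
hypothetical `BoxedMask` class standing in for the real `VectorMask` object; 
this only models the memory accesses, not the actual Vector API 
implementation:

```java
// Hypothetical stand-in for a boxed VectorMask: the payload sits
// behind an object reference, so every use costs two loads.
final class BoxedMask {
    final boolean[] payload;
    BoxedMask(boolean[] payload) { this.payload = payload; }
}

public class UnboxSketch {
    public static void main(String[] args) {
        boolean[] ma = {true, false, true, false};

        // @Setup style: box once, then each use loads the object
        // field first and the payload element second.
        BoxedMask vm1 = new BoxedMask(ma);
        boolean viaBox = vm1.payload[0];   // two dependent loads

        // fromArray() style: one load straight from the backing array.
        boolean direct = ma[0];            // single load

        System.out.println(viaBox == direct);  // prints "true"
    }
}
```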

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30013#discussion_r2950538809
PR Review Comment: https://git.openjdk.org/jdk/pull/30013#discussion_r2950493595
