On Tue, 3 Mar 2026 06:12:23 GMT, Xiaohong Gong <[email protected]> wrote:

> Duplicate `ptrue` (`MaskAll`) instructions are generated with different 
> predicate registers on SVE when multiple `VectorMask.not()` operations exist. 
> This increases predicate register pressure and reduces performance, 
> especially after the loop is unrolled.
> 
> Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e. 
> `(XorVMask (MaskAll m1))`), but SVE has no match rule for that pattern alone, 
> and the cloned `MaskAll` nodes are not shared with each other.
> 
> Since SVE has rules for the `andNot` pattern:
> 
>   match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
> 
> the `MaskAll` node should be cloned only when it is part of the `andNot` 
> pattern.
> 
> A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the 
> matcher's commutative vector op list, so their operands are never swapped. As 
> a result, the `andNot` rule does not match when the `XorVMask` operands 
> appear in the opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
> 
> This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2) 
> adding the three binary mask bitwise IRs to the commutative op list.
> 
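
For readers less familiar with the pattern, here is a minimal sketch of why the two operand orders are semantically the same. It models masks as plain `boolean[]` arrays rather than using the Vector API or any HotSpot code; the class and method names (`MaskNotSketch`, `maskAll`, `xor`, `and`) are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

public class MaskNotSketch {
    // Models MaskAll with a true input: every lane is set.
    static boolean[] maskAll(int lanes) {
        boolean[] m = new boolean[lanes];
        Arrays.fill(m, true);
        return m;
    }

    // Lane-wise XOR, standing in for XorVMask.
    static boolean[] xor(boolean[] a, boolean[] b) {
        boolean[] r = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = a[i] ^ b[i];
        }
        return r;
    }

    // Lane-wise AND, standing in for AndVMask.
    static boolean[] and(boolean[] a, boolean[] b) {
        boolean[] r = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = a[i] & b[i];
        }
        return r;
    }

    public static void main(String[] args) {
        boolean[] pn = {true, true, false, false};
        boolean[] pm = {true, false, true, false};
        boolean[] allTrue = maskAll(pn.length);

        // (AndVMask pn (XorVMask pm (MaskAll m1)))
        boolean[] canonical = and(pn, xor(pm, allTrue));
        // (AndVMask pn (XorVMask (MaskAll m1) pm)), XOR operands swapped
        boolean[] swapped = and(pn, xor(allTrue, pm));

        System.out.println(Arrays.toString(canonical));
        System.out.println(Arrays.equals(canonical, swapped));
    }
}
```

Both shapes compute `pn & ~pm`, so a rule written for one operand order can only cover the other if the matcher is allowed to swap the operands.
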
> The following are the performance results of the newly added JMH benchmark, 
> run on V1 and Grace (V2) machines respectively:
> 
> V1 (SVE machine with 256-bit vector length):
> 
> Benchmark                                     Mode   Threads  Samples  Unit    size  Before     After      Gain
> MaskLogicOperationsBenchmark.byteMaskAndNot   thrpt  1        30       ops/ms   256  54465.231  74374.960  1.365
> MaskLogicOperationsBenchmark.byteMaskAndNot   thrpt  1        30       ops/ms   512  29156.881  39601.358  1.358
> MaskLogicOperationsBenchmark.byteMaskAndNot   thrpt  1        30       ops/ms  1024  15169.894  20272.379  1.336
> MaskLogicOperationsBenchmark.intMaskAndNot    thrpt  1        30       ops/ms   256  15408.510  19808.722  1.285
> MaskLogicOperationsBenchmark.intMaskAndNot    thrpt  1        30       ops/ms   512   7906.952  10297.837  1.302
> MaskLogicOperationsBenchmark.intMaskAndNot    thrpt  1        30       ops/ms  1024   3767.122   5097.853  1.353
> MaskLogicOperationsBenchmark.longMaskAndNot   thrpt  1        30       ops/ms   256   7762.614  10534.290  1.357
> MaskLogicOperationsBenchmark.longMaskAndNot   thrpt  1        30       ops/ms   512   3976.759   5123.445  1.288
> MaskLogicOperationsBenchmark.longMaskAndNot   thrpt  1        30       ops/ms  1024   1937.389   2573.394  1.328
> MaskLogicOperationsB...

Hi, could anyone please help take a look at this PR? Thanks in advance!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30013#issuecomment-4002374980
