Duplicate `ptrue` (`MaskAll`) instructions are generated with different 
predicate registers on SVE when multiple `VectorMask.not()` operations exist. 
This increases predicate register pressure and reduces performance, 
especially after the loop is unrolled.

Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e. 
`(XorVMask (MaskAll m1))`), but SVE has no match rule for that pattern alone. 
The cloned `MaskAll` nodes are not shared with each other, so each clone 
becomes a separate `ptrue`.
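
To make the `not` pattern concrete, here is a minimal Java sketch that models 
a mask as the low bits of a `long`. The class and method names are 
hypothetical and this is not how HotSpot represents masks; it only 
illustrates that negating a mask is an XOR with an all-true mask, which is 
why every `not` needs a `MaskAll` input.

```java
// Toy model (not HotSpot code): a mask with `lanes` lanes lives in the low bits of a long.
final class MaskNotModel {
    // Models MaskAll with all lanes set. Assumes 1 <= lanes <= 64.
    static long maskAll(int lanes) {
        return lanes == 64 ? -1L : (1L << lanes) - 1;
    }

    // Models (XorVMask pm (MaskAll m1)), i.e. VectorMask.not().
    static long not(long pm, int lanes) {
        return pm ^ maskAll(lanes);
    }

    // Models (AndVMask pn (XorVMask pm (MaskAll m1))), i.e. VectorMask.andNot().
    static long andNot(long pn, long pm, int lanes) {
        return pn & not(pm, lanes);
    }

    public static void main(String[] args) {
        int lanes = 8;
        long pn = 0b1111_0000L;
        long pm = 0b1010_1010L;
        System.out.println("not(pm)        = " + Long.toBinaryString(not(pm, lanes)));
        System.out.println("andNot(pn, pm) = " + Long.toBinaryString(andNot(pn, pm, lanes)));
    }
}
```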

Since SVE has rules for the `andNot` pattern:

  match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));

the `MaskAll` node should be cloned only when it is part of the `andNot` 
pattern.
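
The effect of the cloning policy on the number of `ptrue` instructions can be 
sketched with a toy cost model. Everything here is an assumption made for 
illustration: the `NotUse` enum and both counting functions are hypothetical, 
and the model assumes that a `MaskAll` clone under an `andNot` is folded into 
the fused rule while a clone under a standalone `not` is emitted as its own 
`ptrue`.

```java
import java.util.Arrays;
import java.util.List;

// Toy cost model (not HotSpot code): counts ptrue instructions emitted for MaskAll.
final class MaskAllCloneSketch {
    // How a (XorVMask pm (MaskAll m1)) subtree is consumed.
    enum NotUse { STANDALONE_NOT, AND_NOT_OPERAND }

    // Old policy: MaskAll is cloned for every `not`. Assumption: unshared clones
    // under a standalone not each become a ptrue; clones under andNot are folded.
    static int ptrueCountCloneForEveryNot(List<NotUse> uses) {
        int count = 0;
        for (NotUse use : uses) {
            if (use == NotUse.STANDALONE_NOT) {
                count++;
            }
        }
        return count;
    }

    // New policy: clone only under andNot, so all standalone nots share one MaskAll.
    static int ptrueCountCloneForAndNotOnly(List<NotUse> uses) {
        return uses.contains(NotUse.STANDALONE_NOT) ? 1 : 0;
    }

    public static void main(String[] args) {
        List<NotUse> uses = Arrays.asList(
            NotUse.STANDALONE_NOT, NotUse.STANDALONE_NOT,
            NotUse.STANDALONE_NOT, NotUse.AND_NOT_OPERAND);
        System.out.println("clone for every not:   " + ptrueCountCloneForEveryNot(uses));
        System.out.println("clone for andNot only: " + ptrueCountCloneForAndNotOnly(uses));
    }
}
```

In this model the gap between the two policies grows with the number of 
standalone `not` operations, which is consistent with the problem getting 
worse after unrolling.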

A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the matcher's 
commutative vector op list, so their operands are never swapped. As a result, 
the `andNot` rule does not match when the `XorVMask` operands appear in the 
opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
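
The operand-order problem can be illustrated with a small toy matcher. This 
is a sketch under stated assumptions, not the matcher's actual mechanism: the 
`Node` class, the leaf op name `"Mask"`, and both match functions are 
hypothetical, and `MaskAll` is modelled as a leaf for brevity.

```java
// Toy matcher (not HotSpot code) for the rule (AndVMask pn (XorVMask pm (MaskAll m1))).
final class CommutativeMatchSketch {
    static final class Node {
        final String op;
        final Node[] in;

        Node(String op, Node... in) {
            this.op = op;
            this.in = in;
        }
    }

    static boolean isBinary(Node n, String op) {
        return n.op.equals(op) && n.in.length == 2;
    }

    // Accepts only the exact operand order written in the rule.
    static boolean matchesStrict(Node n) {
        return isBinary(n, "AndVMask")
            && isBinary(n.in[1], "XorVMask")
            && n.in[1].in[1].op.equals("MaskAll");
    }

    // Also accepts swapped operands of AndVMask and XorVMask, as a commutative op would.
    static boolean matchesCommutative(Node n) {
        if (!isBinary(n, "AndVMask")) {
            return false;
        }
        for (Node xor : n.in) {
            if (isBinary(xor, "XorVMask")
                    && (xor.in[0].op.equals("MaskAll") || xor.in[1].op.equals("MaskAll"))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node pn = new Node("Mask");
        Node pm = new Node("Mask");
        Node all = new Node("MaskAll");
        // (AndVMask pn (XorVMask (MaskAll m1) pm)): XorVMask operands in the opposite order.
        Node swapped = new Node("AndVMask", pn, new Node("XorVMask", all, pm));
        System.out.println("strict:      " + matchesStrict(swapped));
        System.out.println("commutative: " + matchesCommutative(swapped));
    }
}
```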

This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2) 
adding the three binary mask bitwise IR nodes (`AndVMask`, `OrVMask`, and 
`XorVMask`) to the commutative op list.

The following are the performance results of the newly added JMH benchmark, 
tested on V1 and Grace (V2) machines respectively:

V1 (SVE machine with 256-bit vector length):

Benchmark                                                     Mode  Threads Samples Unit   size    Before      After   Gain
MaskLogicOperationsBenchmark.byteMaskAndNot                   thrpt 1       30      ops/ms  256 54465.231  74374.960  1.365
MaskLogicOperationsBenchmark.byteMaskAndNot                   thrpt 1       30      ops/ms  512 29156.881  39601.358  1.358
MaskLogicOperationsBenchmark.byteMaskAndNot                   thrpt 1       30      ops/ms 1024 15169.894  20272.379  1.336
MaskLogicOperationsBenchmark.intMaskAndNot                    thrpt 1       30      ops/ms  256 15408.510  19808.722  1.285
MaskLogicOperationsBenchmark.intMaskAndNot                    thrpt 1       30      ops/ms  512  7906.952  10297.837  1.302
MaskLogicOperationsBenchmark.intMaskAndNot                    thrpt 1       30      ops/ms 1024  3767.122   5097.853  1.353
MaskLogicOperationsBenchmark.longMaskAndNot                   thrpt 1       30      ops/ms  256  7762.614  10534.290  1.357
MaskLogicOperationsBenchmark.longMaskAndNot                   thrpt 1       30      ops/ms  512  3976.759   5123.445  1.288
MaskLogicOperationsBenchmark.longMaskAndNot                   thrpt 1       30      ops/ms 1024  1937.389   2573.394  1.328
MaskLogicOperationsBenchmark.shortMaskAndNot                  thrpt 1       30      ops/ms  256 30165.102  39632.060  1.313
MaskLogicOperationsBenchmark.shortMaskAndNot                  thrpt 1       30      ops/ms  512 15653.812  20026.600  1.279
MaskLogicOperationsBenchmark.shortMaskAndNot                  thrpt 1       30      ops/ms 1024  7838.684  10795.177  1.377
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots thrpt 1       30      ops/ms  256 20185.546  21548.108  1.067
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots thrpt 1       30      ops/ms  512  9549.994  11097.954  1.162
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots thrpt 1       30      ops/ms 1024  4797.370   5624.987  1.172


Grace (V2, SVE machine with 128-bit vector length):

Benchmark                                                     Mode  Threads Samples Unit   size    Before      After   Gain
MaskLogicOperationsBenchmark.byteMaskAndNot                   thrpt 1       30      ops/ms  256 88221.700 114208.097  1.294
MaskLogicOperationsBenchmark.byteMaskAndNot                   thrpt 1       30      ops/ms  512 46472.948  64268.305  1.382
MaskLogicOperationsBenchmark.byteMaskAndNot                   thrpt 1       30      ops/ms 1024 24367.417  33957.434  1.393
MaskLogicOperationsBenchmark.intMaskAndNot                    thrpt 1       30      ops/ms  256 15774.203  27054.729  1.715
MaskLogicOperationsBenchmark.intMaskAndNot                    thrpt 1       30      ops/ms  512  7938.354  11484.306  1.446
MaskLogicOperationsBenchmark.intMaskAndNot                    thrpt 1       30      ops/ms 1024  3973.106   5658.552  1.424
MaskLogicOperationsBenchmark.longMaskAndNot                   thrpt 1       30      ops/ms  256  7976.768  11533.359  1.445
MaskLogicOperationsBenchmark.longMaskAndNot                   thrpt 1       30      ops/ms  512  4013.574   5662.615  1.410
MaskLogicOperationsBenchmark.longMaskAndNot                   thrpt 1       30      ops/ms 1024  2003.350   2810.982  1.403
MaskLogicOperationsBenchmark.shortMaskAndNot                  thrpt 1       30      ops/ms  256 30464.920  47910.299  1.572
MaskLogicOperationsBenchmark.shortMaskAndNot                  thrpt 1       30      ops/ms  512 15826.314  23330.242  1.474
MaskLogicOperationsBenchmark.shortMaskAndNot                  thrpt 1       30      ops/ms 1024  7936.939  11420.379  1.438
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots thrpt 1       30      ops/ms  256 17008.969  21002.746  1.234
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots thrpt 1       30      ops/ms  512  8159.229  10648.533  1.305
MaskLogicOperationsBenchmark.highMaskRegisterPressureWithNots thrpt 1       30      ops/ms 1024  4004.777   5355.436  1.337

-------------

Commit messages:
 - 8378737: AArch64: Fix SVE match rule issues for VectorMask.andNot()

Changes: https://git.openjdk.org/jdk/pull/30013/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=30013&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8378737
  Stats: 280 lines in 5 files changed: 260 ins; 10 del; 10 mod
  Patch: https://git.openjdk.org/jdk/pull/30013.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/30013/head:pull/30013

PR: https://git.openjdk.org/jdk/pull/30013
