History: This is version 2 of the patch.  In the original patch, all 44
fusion opportunities were lumped together in one patch.  Outside of
fusion.md, the changes are fairly small, in that they add one
alternative to each of the fusion patterns to add xxeval support.
Fusion.md is a generated file (created by genfusion.pl) that contains
all of the fusion combinations.  Because of these automated changes,
fusion.md had 265 lines deleted and 397 lines added.

In version 2 of the patch, I broke the original patch into 45 separate
patches.  The first patch adds the basic support to genfusion.pl,
predicates.md, rs6000.h, and rs6000.md, along with the first fusion
case (vector 'AND' fusing into vector 'AND').  The next 43 patches each
add one more fusion case.  The last patch adds the two test cases.

The multibuff.c benchmark attached to PR target/117251, which
implements SHA3, has a slowdown when compiled for Power10 with the
current trunk and GCC 14 compared to GCC 11 - GCC 13, due to excessive
amounts of spilling.

The main function in the multibuff.c file has 3,747 lines, all of which
use vector unsigned long long.  There are 696 vector rotates (all by
constant amounts), 1,824 vector xor's and 600 vector andc's.

In looking at it, the main thing that stands out is that the reason for
either spilling or moving variables is the support in fusion.md
(generated by genfusion.pl) that tries to fuse a vec_andc feeding into
a vec_xor, and a vec_xor feeding into another vec_xor.

On Power10, there is a special fusion mode that applies when a VANDC
or VXOR instruction is adjacent to a VXOR instruction and the result of
the VANDC/VXOR feeds into the second VXOR instruction.

While the Power10 has 64 VSX vector registers (whose logical
instructions use the XXL prefix), the fusion only works with the older
Altivec instruction set (which uses the V prefix).  The Altivec
instruction set only has 32 vector registers (which are overlaid on the
VSX vector registers 32-63).
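
As a rough illustration of the overlay described above, the following
Python sketch models the register numbering.  The helper names are
hypothetical (they are not GCC functions), and the only facts assumed
are the ones stated above: 64 VSX registers, with the 32 Altivec
registers overlaid on VSX registers 32-63.

```python
# Illustrative model of the register numbering described above.
# The helper names are hypothetical; they are not GCC functions.

NUM_VSX_REGS = 64        # VSX registers 0-63
ALTIVEC_BASE = 32        # Altivec v0-v31 overlay VSX registers 32-63


def altivec_to_vsx(vr):
    """Map Altivec register number vr (0-31) to its VSX register number."""
    if not 0 <= vr < NUM_VSX_REGS - ALTIVEC_BASE:
        raise ValueError(f"not an Altivec register number: {vr}")
    return vr + ALTIVEC_BASE


def usable_by_v_insn(vsx):
    """True if a V-prefix (Altivec) instruction can name this VSX register."""
    if not 0 <= vsx < NUM_VSX_REGS:
        raise ValueError(f"not a VSX register number: {vsx}")
    return vsx >= ALTIVEC_BASE


print(altivec_to_vsx(0), altivec_to_vsx(31))          # 32 63
print(usable_by_v_insn(5), usable_by_v_insn(40))      # False True
```

In this model a value living in VSX register 5 cannot be an operand of
a V-prefix instruction without first being moved into registers 32-63.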

Because the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor do
this fusion, the register allocator has more register pressure on the
traditional Altivec registers instead of on the VSX registers.

In addition, the vector rotates only work on the traditional Altivec
registers, which adds to the Altivec register pressure.

Finally, in addition to doing the explicit xor, andc, and rotates using
the Altivec registers, we also have to load vector constants for the
rotate amounts, and these constants are also allocated to Altivec
registers.

Current trunk and GCC 12-14 have more vector spills than GCC 11, but
GCC 11 has many more vector moves than the later compilers.  Thus even
though it has far fewer spills, the vector moves are why GCC 11 has the
slowest results.

There is an instruction that was added in Power10 (XXEVAL) that works
on the VSX vector registers and can do the combined ANDC->XOR and
XOR->XOR operations in one instruction.
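
To make the XXEVAL immediates concrete, here is a small Python sketch
of how an 8-bit immediate can be derived from a 3-input bitwise
function.  It assumes the immediate is a truth table indexed by the
three input bits, with inputs (0, 0, 0) selecting the most significant
bit.  It also assumes the operand roles a ^ (b & ~c) and a ^ b ^ c;
the actual operand order used in the patterns may differ.  Under these
assumptions the sketch reproduces the immediates 45 and 105 that
appear in the result tables in this message.

```python
# Sketch: derive the 8-bit XXEVAL immediate for a 3-input bitwise function.
# Assumption: inputs (a, b, c) = (0, 0, 0) select the most significant
# bit of the immediate and (1, 1, 1) selects the least significant bit.


def xxeval_imm(func):
    """Return the truth-table immediate for func(a, b, c) on single bits."""
    imm = 0
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                index = (a << 2) | (b << 1) | c       # 0 .. 7
                if func(a, b, c) & 1:
                    imm |= 1 << (7 - index)           # index 0 is the MSB
    return imm


def xxeval(imm, a, b, c, width=128):
    """Model XXEVAL on Python integers treated as width-bit vectors."""
    result = 0
    for bit in range(width):
        index = ((((a >> bit) & 1) << 2)
                 | (((b >> bit) & 1) << 1)
                 | ((c >> bit) & 1))
        if (imm >> (7 - index)) & 1:
            result |= 1 << bit
    return result


andc_xor = xxeval_imm(lambda a, b, c: a ^ (b & ~c))   # ANDC feeding XOR
xor_xor = xxeval_imm(lambda a, b, c: a ^ b ^ c)       # XOR feeding XOR
print(andc_xor, xor_xor)                              # 45 105
```

The xxeval() helper is only a bit-by-bit software model for checking
an immediate against the two-instruction sequence it replaces.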

The latency of XXEVAL is slightly more than the fused VANDC/VXOR or
VXOR/VXOR, so I have written the patch to prefer the Altivec
instructions when they don't need a temporary register.

Here are the results for adding support for XXEVAL for the multibuff.c
benchmark attached to the PR.  Note that with this patch we essentially
recover the speed that was lost with GCC 14 and the current trunk:

                               XXEVAL   Trunk   GCC15   GCC14    GCC13
                               ------   -----   -----   -----    -----
Multibuf time in seconds        5.600   6.151   6.129   6.053    5.539
XXEVAL improvement percentage     ---   +9.8%   +9.4%   +8.1%    -1.1%

Fuse VANDC -> VXOR                209     600     600     600      600
Fuse VXOR -> VXOR                   0     241     241     240      120
XXEVAL to fuse ANDC -> XOR (#45)  391       0       0       0        0
XXEVAL to fuse XOR -> XOR (#105)  240       0       0       0        0

Spill vector to stack             140     417     417     403      226
Load spilled vector from stack    490   1,012   1,012   1,000      766
Vector moves                        8      93     100      70       72

XXLANDC or VANDC                  209     600     600     600      600
XXLXOR or VXOR                    953   1,824   1,824   1,824    1,824
XXEVAL                            631       0       0       0        0
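
For reference, the "XXEVAL improvement percentage" row appears to be
the extra run time of each compiler relative to the XXEVAL build.  The
Python sketch below assumes that formula (it is inferred from the
numbers, and the function name is made up for illustration) and uses
the multibuff.c timings from the table above.

```python
# Assumed formula: percent extra run time relative to the XXEVAL build.
# The function name is hypothetical; the timings come from the table.


def slowdown_vs_xxeval(other_seconds, xxeval_seconds):
    """Percent extra run time of another compiler relative to XXEVAL."""
    return (other_seconds / xxeval_seconds - 1.0) * 100.0


xxeval_time = 5.600
times = {"Trunk": 6.151, "GCC15": 6.129, "GCC14": 6.053, "GCC13": 5.539}

for name, seconds in times.items():
    print(f"{name}: {slowdown_vs_xxeval(seconds, xxeval_time):+.1f}%")
```

A negative value (as for GCC 13 here) means that compiler's code was
faster than the XXEVAL build on this benchmark.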


Here are the results for adding support for XXEVAL for the singlebuff.c
benchmark attached to the PR.  Note that adding XXEVAL greatly speeds
up this particular benchmark:

                               XXEVAL   Trunk   GCC15   GCC14    GCC13
                               ------   -----   -----   -----    -----
Singlebuf time in seconds       4.429   5.330   5.333   5.315    5.270
XXEVAL improvement percentage     ---  +20.3%  +20.4%  +20.0%   +19.0%

Fuse VANDC -> VXOR                210     600     600     600      600
Fuse VXOR -> VXOR                   0     240     240     240      120
XXEVAL to fuse ANDC -> XOR (#45)  390       0       0       0        0
XXEVAL to fuse XOR -> XOR (#105)  240       0       0       0        0

Spill vector to stack             134     388     388     388      391
Load spilled vector from stack    357     808     808     808      769
Vector moves                       34      80      80      80      119

XXLANDC or VANDC                  210     600     600     600      600
XXLXOR or VXOR                    954   1,824   1,824   1,824    1,824
XXEVAL                            630       0       0       0        0


These patches add the following fusion patterns:

        xxland  => xxland       xxlandc => xxland
        xxlxor  => xxland       xxlor   => xxland
        xxlnor  => xxland       xxleqv  => xxland
        xxlorc  => xxland       xxlandc => xxlandc
        xxlnand => xxland       xxlnand => xxlnor
        xxland  => xxlxor       xxland  => xxlor
        xxlandc => xxlxor       xxlandc => xxlor
        xxlorc  => xxlnor       xxlorc  => xxleqv
        xxlorc  => xxlorc       xxleqv  => xxlnor
        xxlxor  => xxlxor       xxlxor  => xxlor
        xxlnor  => xxlnor       xxlor   => xxlxor
        xxlor   => xxlor        xxlor   => xxlnor
        xxlnor  => xxlxor       xxlnor  => xxlor
        xxlxor  => xxlnor       xxleqv  => xxlxor
        xxleqv  => xxlor        xxlorc  => xxlxor
        xxlorc  => xxlor        xxlandc => xxlnor
        xxlandc => xxleqv       xxland  => xxlnor
        xxlnand => xxlxor       xxlnand => xxlor
        xxlnand => xxlnand      xxlorc  => xxlnand
        xxleqv  => xxlnand      xxlnor  => xxlnand
        xxlor   => xxlnand      xxlxor  => xxlnand
        xxlandc => xxlnand      xxland  => xxlnand

I have committed all of the patches in my backlog (dense math registers, other
-mcpu=future instructions, random bug fixes, support for _Float16 and
__bfloat16, and optimizations for vector logical operations on power10/power11)
into the IBM vendor branch:

        vendors/ibm/gcc-17-future

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: [email protected]
