On 7/12/23 14:59, Jivan Hakobyan via Gcc-patches wrote:
Accessing a local array element is turned into a load from address
(fp + (index << C1)) + C2.

When the access happens inside a loop, part of that computation is
loop invariant, but for some reason the loop-invariant motion passes
cannot hoist it out.  We can, however, handle it in the
target-specific hook (legitimize_address), which provides an
opportunity to rewrite the memory access in a form more suitable for
the target architecture.

This patch handles the case above by rewriting the address as
((fp + C2) + (index << C1)).

I have evaluated it on SPEC2017 and got an improvement on leela (over
7 billion instructions, 0.39% of the dynamic count), which dwarfs the
regression on gcc (14 million instructions, 0.0012% of the dynamic
count).
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_legitimize_address): Handle folding.
(mem_shadd_or_shadd_rtx_p): New predicate.
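To make the pattern concrete, here is a minimal hypothetical example
(function and array names are invented for illustration; assumes
n <= 100):

    void
    foo (int *out, int n)
    {
      int a[100];

      for (int i = 0; i < n; i++)
        a[i] = i * 2;        /* store address: (fp + (i << 2)) + C2 */

      for (int i = 0; i < n; i++)
        out[i] = a[i];       /* fp + C2 is invariant in this loop, yet
                                it was recomputed as part of the full
                                address on every iteration.  */
    }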
So I poked a bit more in this space today.
As you may have noted, Manolis's patch still needs another rev. But I
was able to test this patch in conjunction with the f-m-o patch as well
as the additional improvements made to hard register cprop. The net
result was that this patch still shows a nice decrease in instruction
counts on leela. It's a bit of a mixed bag elsewhere.
I dove a bit deeper into the small regression in x264.  In the case I
looked at, the reason the patch regresses is that the original form of
the address calculations exposes a common subexpression, i.e.
addr1 = (reg1 << 2) + fp + C1
addr2 = (reg1 << 2) + fp + C2
(reg1 << 2) + fp is a common subexpression resulting in something like
this as we leave CSE:
t = (reg1 << 2) + fp;
addr1 = t + C1
addr2 = t + C2
mem (addr1)
mem (addr2)
C1 and C2 are small constants, so combine generates
t = (reg1 << 2) + fp;
mem (t+C1)
mem (t+C2)
FP elimination occurs after IRA and we get:
t2 = sp + C3
t = (reg1 << 2) + t2
mem (t + C1)
mem (t + C2)
Not bad. Manolis's work should allow us to improve that a bit more.
With this patch we don't capture the CSE and ultimately generate
slightly worse code.  This kind of issue is fairly inherent in
reassociation -- and given that the regression is two orders of
magnitude smaller than the improvement, my inclination is to go
forward with this patch.
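For concreteness, the x264 situation above corresponds to a source
shape roughly like this (hypothetical code; the names and the use()
helper are invented for illustration):

    extern void use (int *, int *);   /* keeps both arrays live */

    int
    bar (int i)
    {
      int a[64], b[64];

      use (a, b);
      return a[i] + b[i];   /* addr1 = (i << 2) + fp + C1,
                               addr2 = (i << 2) + fp + C2;
                               (i << 2) + fp is the shared
                               subexpression.  */
    }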
I've fixed a few formatting issues, changed one conditional to use
CONST_INT_P rather than checking the code directly, and pushed the
final version to the trunk.
Thanks for your patience.
jeff
commit a16dc729fda9fabd6472d50cce45791cb3b6ada8
Author: Jivan Hakobyan <jivanhakoby...@gmail.com>
Date: Wed Aug 9 13:26:58 2023 -0600
RISC-V: Folding memory for FP + constant case
Accessing a local array element is turned into a load from address
(fp + (index << C1)) + C2.  When the access happens inside a loop,
part of that computation is loop invariant, but for some reason the
loop-invariant motion passes cannot hoist it out.  We can, however,
handle it in the target-specific hook (legitimize_address), which
provides an opportunity to rewrite the memory access in a form more
suitable for the target architecture.

This patch handles the case above by rewriting the address as
((fp + C2) + (index << C1)).

I have evaluated it on SPEC2017 and got an improvement on leela (over
7 billion instructions, 0.39% of the dynamic count), which dwarfs the
regression on gcc (14 million instructions, 0.0012% of the dynamic
count).
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_legitimize_address): Handle folding.
(mem_shadd_or_shadd_rtx_p): New function.
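
Restating the transformation in the notation used earlier in the
thread (reg1/reg2/reg3 are the fresh pseudos created by the new code):

    before:  addr = ((index << C1) + fp) + C2

    after:   reg1 = fp + C2          (riscv_emit_binary PLUS)
             reg2 = index << C1      (riscv_emit_binary ASHIFT)
             addr = reg2 + reg1      (riscv_emit_binary PLUS)

fp + C2 now lives in its own pseudo, so a later pass can hoist it out
of a loop.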
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 77892da2920..7f2041a54ba 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -1805,6 +1805,22 @@ riscv_shorten_lw_offset (rtx base, HOST_WIDE_INT offset)
   return addr;
 }
 
+/* Helper for riscv_legitimize_address.  Given X, return true if it
+   is a left shift by 1, 2 or 3 positions or a multiply by 2, 4 or 8.
+
+   These respectively represent canonical shift-add rtxs or scaled
+   memory addresses.  */
+static bool
+mem_shadd_or_shadd_rtx_p (rtx x)
+{
+  return ((GET_CODE (x) == ASHIFT
+           || GET_CODE (x) == MULT)
+          && CONST_INT_P (XEXP (x, 1))
+          && ((GET_CODE (x) == ASHIFT && IN_RANGE (INTVAL (XEXP (x, 1)), 1, 3))
+              || (GET_CODE (x) == MULT
+                  && IN_RANGE (exact_log2 (INTVAL (XEXP (x, 1))), 1, 3))));
+}
+
 /* This function is used to implement LEGITIMIZE_ADDRESS.  If X can
    be legitimized in a way that the generic machinery might not expect,
    return a new address, otherwise return NULL.  MODE is the mode of
@@ -1830,6 +1846,32 @@ riscv_legitimize_address (rtx x, rtx oldx ATTRIBUTE_UNUSED,
       rtx base = XEXP (x, 0);
       HOST_WIDE_INT offset = INTVAL (XEXP (x, 1));
 
+      /* Handle (plus (plus (mult (a) (mem_shadd_constant)) (fp)) (C)) case.  */
+      if (GET_CODE (base) == PLUS && mem_shadd_or_shadd_rtx_p (XEXP (base, 0))
+          && SMALL_OPERAND (offset))
+        {
+          rtx index = XEXP (base, 0);
+          rtx fp = XEXP (base, 1);
+          if (REGNO (fp) == VIRTUAL_STACK_VARS_REGNUM)
+            {
+
+              /* If we were given a MULT, we must fix the constant
+                 as we're going to create the ASHIFT form.  */
+              int shift_val = INTVAL (XEXP (index, 1));
+              if (GET_CODE (index) == MULT)
+                shift_val = exact_log2 (shift_val);
+
+              rtx reg1 = gen_reg_rtx (Pmode);
+              rtx reg2 = gen_reg_rtx (Pmode);
+              rtx reg3 = gen_reg_rtx (Pmode);
+              riscv_emit_binary (PLUS, reg1, fp, GEN_INT (offset));
+              riscv_emit_binary (ASHIFT, reg2, XEXP (index, 0), GEN_INT (shift_val));
+              riscv_emit_binary (PLUS, reg3, reg2, reg1);
+
+              return reg3;
+            }
+        }
+
       if (!riscv_valid_base_register_p (base, mode, false))
         base = copy_to_mode_reg (Pmode, base);
       if (optimize_function_for_size_p (cfun)