[Bug target/98981] gcc-10.2 for RISC-V has extraneous register moves

2021-02-18 Thread wilson at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

--- Comment #5 from Jim Wilson  ---
Neither of the two patches I mentioned in comment 1 can fix the problem by
themselves, as we still have a mix of SImode and DImode operations.

I looked at REE.  It doesn't work because there is more than one reaching def. 
But even if it did work, I don't think it would completely solve the problem
because it runs after register allocation and hence won't be able to remove
move instructions.

To get the best result, we need the register allocator to take two registers
with different modes with overlapping live ranges, and realize that they can be
allocated to the same hard reg because the overlapping uses are
non-conflicting.  I haven't tried looking at the register allocator, but it
doesn't seem like a good way to try to solve the problem.

We have an inconvenient mix of SImdoe and DImode because we don't have SImode
compare and branch instructions.  That requires sign extending 32-bit values to
64-bit to compare them, which then results in the sign extend and register
allocation optimization issues.  it is unlikely that 32-bit compare and branch
instructions will be added to the ISA though.

One useful thing I noticed is that the program is doing a max operation, and
the B extension adds a max instruction.  Having one instruction instead of a
series of instructions including a branch to compute max makes the optimization
issues easier, and gcc does give the right result in this case.  Using a
compiler with B support I get
lw  a4,0(a5)
lw  a2,0(a3)
addia5,a5,4
addia3,a3,4
addwa4,a4,a2
max a0,a4,a0
bne a5,a1,.L2
which is good code with the extra moves and sign-extends removed.  So I have a
workaround of sorts, but only if you have the B extension.

The -mtune=sifive-7-series can support conditional move via macro fusion, I was
hopeful that this would work as well as max, but unfortunately the sign-extend
that was removed in the max case does't get removed in the conditional move
case.  Also, the conditional move is 2-address, and the register allocator ends
up needing a reload, which gives us the unwanted mv again.  So the code in this
case is the same as without the option.  I didn't check to see if this is
fixable.

[Bug target/98981] gcc-10.2 for RISC-V has extraneous register moves

2021-02-17 Thread wilson at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

--- Comment #4 from Jim Wilson  ---
With this testcase

extern void sub2 (void);

void
sub (int *i, int *j)
{
  int k = *i + 1;
  *j = k;
  if (k == 0)
sub2 ();
}

Compiling without the riscv_rtx_cost patch, I get
lw  a5,0(a0)
addiw   a4,a5,1
sw  a4,0(a1)
beq a4,zero,.L4
compiling with the riscv_rtx_cost patch, I get
lw  a5,0(a0)
addiw   a4,a5,1
sw  a4,0(a1)
addiw   a5,a5,1
beq a5,zero,.L4

The problem here is that we have RTL
(insn 9 7 10 2 (set (reg:SI 76)
(plus:SI (subreg:SI (reg:DI 78 [ *i_4(D) ]) 0)
(const_int 1 [0x1]))) "tmp.c":6:7 3 {addsi3}
 (expr_list:REG_DEAD (reg:DI 78 [ *i_4(D) ])
(nil)))
(insn 10 9 11 2 (set (reg/v:DI 73 [ k ])
(sign_extend:DI (reg:SI 76))) "tmp.c":6:7 90 {extendsidi2}
 (nil))
with the SImode 76 used for the sw and the DImode 73 used for the beq.  With
the riscv_rtx_cost change, which makes the sign_extend add cheap, combine folds
the add into the sign-extend to get
(insn 9 7 10 2 (set (reg:SI 76)
(plus:SI (subreg:SI (reg:DI 78 [ *i_4(D) ]) 0)
(const_int 1 [0x1]))) "tmp.c":6:7 3 {addsi3}
 (nil))
(insn 10 9 11 2 (set (reg/v:DI 73 [ k ])
(sign_extend:DI (plus:SI (subreg:SI (reg:DI 78 [ *i_4(D) ]) 0)
(const_int 1 [0x1] "tmp.c":6:7 5 {*addsi3_extended}
 (expr_list:REG_DEAD (reg:DI 78 [ *i_4(D) ])
(nil)))
and now we have two adds which is wrong.  Without the patch, combine does
nothing, then ree manages to merge the sign_extend into the add and subseg the
sw, so we end up with only the one addw and no explicit sign extend.

My take on this is that we never should have generated the SImode add in the
first place, because we don't actually have that instruction.  If we would have
generated the sign-extended add at rtl generation time, and subreged the
result, then we would have gotten the right result.  In theory.

So I think the riscv_rtx_cost change is useful, but only if combined with the
rtl generation change I proposed in comment 1.

[Bug target/98981] gcc-10.2 for RISC-V has extraneous register moves

2021-02-05 Thread wilson at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

--- Comment #3 from Jim Wilson  ---
I suppose cost model problems could explain why combine didn't do the
optimization.  I didn't have a chance to look at that.

I still think there is a fundmental problem with how we represent SImode
operations, but again cost model problems could explain why my experiments to
fix that didn't work as expected.  I probably didn't look at that when I was
experimenting with riscv.md changes.

Your patch does look useful, but setting cost to 1 for MULT is wrong, and would
be just as wrong for DIV.  That is OK for PLUS, MINUS, and NEG though.  I think
a better option is to set *total = 0 and return false.  That gives no extra
cost to the sign extend, and recurs to get the proper cost for the operation
underneath.  That would work for MUL and DIV.  I found code in the rs6000 port
that does this.

[Bug target/98981] gcc-10.2 for RISC-V has extraneous register moves

2021-02-05 Thread kito at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

Kito Cheng  changed:

   What|Removed |Added

 CC||kito at gcc dot gnu.org

--- Comment #2 from Kito Cheng  ---
Here is a quick patch for fix part of this issue, it seems like because our
cost model is inprecise, but I guess I need run benchmark to make sure the
performance and code size didn't get any regression. 

find_max_i32:
lui a4,%hi(.LANCHOR0)
addia4,a4,%lo(.LANCHOR0)
addia3,a4,1024
addia6,a4,400
li  a0,0
.L3:
lw  a5,0(a4)
lw  a2,0(a3)
addia4,a4,4
addia3,a3,4
addwa1,a5,a2
addwa5,a5,a2
bge a1,a0,.L2
mv  a5,a0
.L2:
sext.w  a0,a5
bne a4,a6,.L3
ret



Patch:
diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
index d489717b2a5..b8c9f7200ce 100644
--- a/gcc/config/riscv/riscv.c
+++ b/gcc/config/riscv/riscv.c
@@ -1879,6 +1879,15 @@ riscv_rtx_costs (rtx x, machine_mode mode, int
outer_code, int opno ATTRIBUTE_UN
}
   /* Fall through.  */
 case SIGN_EXTEND:
+  if (TARGET_64BIT && !REG_P (XEXP (x, 0)))
+   {
+ int code = GET_CODE (XEXP (x, 0));
+ if (code == PLUS || code == MINUS || code == NEG || code == MULT)
+   {
+ *total = COSTS_N_INSNS (1);
+ return true;
+   }
+   }
   *total = riscv_extend_cost (XEXP (x, 0), GET_CODE (x) == ZERO_EXTEND);
   return false;

[Bug target/98981] gcc-10.2 for RISC-V has extraneous register moves

2021-02-05 Thread wilson at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

Jim Wilson  changed:

   What|Removed |Added

 CC||wilson at gcc dot gnu.org

--- Comment #1 from Jim Wilson  ---
The extra move instruction is a side effect of how the riscv64 toolchain
handles 32-bit arithmetic.  We lie to the compiler and tell it that we have
instructions that produce 32-bit results.  In fact, we only have instructions
that produce 64-bit sign-extended 32-bit results.  The lie means that the RTL
has some insns with SImode output and some instructions with DImode outputs,
and sometimes we end up with nop moves to convert between the modes.  In this
case, it is peephole2 after regalloc that notices a SImode add followed by a
sign-extend, and converts it to a sign-extending 32-bit add followed by a move,
but can't eliminate the move because we already did register allocation.

This same problem is also why we get the unnecessary sext after the label, as
peephole can't fix that.

This problem has been on my todo list for a few years, and I have ideas of how
to fix it, but I have no idea when I will have time to try to fix it. I did
document it for the RISC-V International Code Speed Optimization task group.
https://github.com/riscv/riscv-code-speed-optimization/blob/main/projects/gcc-optimizations.adoc
This one is the first one in the list.