[Bug target/55212] [SH] Switch to LRA

olegendo at gcc dot gnu.org Sun, 21 Dec 2014 04:36:38 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55212


--- Comment #91 from Oleg Endo <olegendo at gcc dot gnu.org> ---
Re: [RFC PATCH 9/9] [SH] Split QI/HImode load/store via r0

(In reply to Kazumoto Kojima from comment #83)
> Created attachment 33992 [details]
> a patch for the issue c#77
> 
> Interestingly, this reduces the total text size of CSiBE test ~0.04%
> at -O2 even for the trunk i.e. with the old reload.

I've checked this change to prepare_move_operands without LRA with trunk
r218988, to see whether it should be enabled for non-LRA.

I can confirm the -1140 bytes / -0.04% on the CSiBE set.  However, as mentioned
in comment #82, it results in unnecessary zero extensions before other
logic/arithmetic because combine doesn't (want to) see through the R0 hardreg.  

Unnecessary sign/zero extensions are actually a separate topic (e.g. PR 53987).
 If there was a good sign/zero extension elimination in place, this wouldn't be
an issue here.

I've tried disabling the prepare_move_operands change and instead adding the
following splitters, which are done after combine and before RA:

(define_split
  [(set (match_operand:SI 0 "arith_reg_dest")
    (sign_extend:SI (match_operand:QIHI 1 "displacement_mem_operand")))]
  "TARGET_SH1 && can_create_pseudo_p ()
   && !refers_to_regno_p (R0_REG, R0_REG + 1, operands[1], NULL)"
  [(set (match_dup 2) (reg:SI R0_REG))
   (set (reg:SI R0_REG) (sign_extend:SI (match_dup 1)))
   (set (match_dup 0) (reg:SI R0_REG))
   (set (reg:SI R0_REG) (match_dup 2))]
{
  operands[2] = gen_reg_rtx (SImode);
})

(define_split
  [(set (match_operand:QIHI 0 "arith_reg_dest")
    (match_operand:QIHI 1 "displacement_mem_operand"))]
  "TARGET_SH1 && can_create_pseudo_p ()
   && !refers_to_regno_p (R0_REG, R0_REG + 1, operands[1], NULL)"
  [(set (match_dup 2) (reg:SI R0_REG))
   (set (reg:QIHI R0_REG) (match_dup 1))
   (set (match_dup 0) (reg:QIHI R0_REG))
   (set (reg:SI R0_REG) (match_dup 2))]
{
  operands[2] = gen_reg_rtx (SImode);
})

With these two splitters for mem loads I get exactly the same -1140 bytes /
-0.04% on the CSiBE set.

The simple test case

int test_tst (unsigned char* x, int y, int z)
{
  return x[4] ? y : z;
}

does not contain the extu.b insn anymore, but instead we get this:
        mov.b    @(4,r4),r0
        mov    r0,r1
        tst    r1,r1        << should be: tst r0,r0
        bf    .L4
    mov    r6,r5
.L4:
    rts
    mov    r5,r0


Other cases of new unnessecary zero-extension insns are in e.g. in
jpeg-6b/jdcoefct.s.

In linux-2.4.23-pre3-testplatform/arch/testplatform/kernel/signal.s some mov
reg,reg insns end up as:
    extu.b    r0,r0
    mov.b    r0,@(1,r8)
    mov    r9,r0
    shlr16    r0
    extu.b    r0,r0
    mov.b    r0,@(2,r8)
    mov    r9,r0
    shlr16    r0
    shlr8    r0
    mov.b    r0,@(3,r8)
    extu.b    r1,r0
    mov.b    r0,@(4,r8)
    mov    r1,r0
        ....
those can be wallpapered with peepholes though.

I've also tried using the splitters instead of the prepare_move_operands hunk
with LRA.  But then I get spill errors on QI/HImode stores with displacement
addressing.

I've also tried removing the prepare_move_operands hunk with LRA.  Compared
with trunk no-lra I get:
sum:  3368431 -> 3378515    +10084 / +0.299368 %

And LRA + prepare_move_operands hunk vs. trunk no-lra is:
sum:  3368431 -> 3376507    +8076 / +0.239756 %

Doing this kind of pre-RA R0 pre-allocation seems to result in better RA
choices w.r.t. commutative insns such as addition.

After all, maybe it's worth trying out an SH specific R0 pre-alloc pass that
offloads some of the RA choices.  Of course it will not be able to solve issues
that arise when spilling code is generated that uses QI/HImode mem accesses
during the RA/spilling process.

R0 is the most difficult candidate, but I've also seen reports about FPU code
ICEing due FR0 spill failures when there are lots of (interdependent?) FMAC
insns at -O3 (e.g. FP matrix multiplication).  Another register (class), of
which there is only one on SH, would be the MACH/MACL pair.  Currently
MACH/MACL are fixed hardregs.  Early experiments to allow MACH/MACL to be used
by RA and adding the MAC insn showed some problems (see PR 53949 #c3).

[Bug target/55212] [SH] Switch to LRA

Reply via email to