[Bug target/77308] surprisingly large stack usage for sha512 on arm

wilco at gcc dot gnu.org Tue, 01 Nov 2016 04:34:08 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308


--- Comment #35 from wilco at gcc dot gnu.org ---
(In reply to Richard Earnshaw from comment #30)
> (In reply to wilco from comment #29)
> >  Combine could help with
> > merging 2 loads/stores into a single instruction.
> 
> No, combine works strictly on dataflow dependencies.  Two stores cannot be
> dataflow related so won't be combined.  Loads would only be dataflow related
> if both loads fed into *exactly* one data-processing instruction after the
> split.  That's unlikely to happen so I very much doubt it would happen there
> either.

Right, so then we either need to look further when creating ldm/ldrd, or use a
parallel of 2 SI mode loads when splitting.

(In reply to Richard Earnshaw from comment #33)
> (In reply to Wilco from comment #32)
> > (In reply to Bernd Edlinger from comment #31)
> > > Furthermore, if I want to do -Os the third condition is FALSE too.
> > > But one ldrd must be shorter than two ldr ?
> > > 
> > > That seems wrong...
> > 
> > Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
> > should only be tried on Thumb-1. Emitting LDRD from a peephole when the
> > offset is in range will never increase code size so should always be 
> > enabled.
> 
> The logic is certainly strange.  Some cores run LDRD less quickly than they
> can do LDM, or even two independent loads.  I suspect the logic is meant to
> be: use LDRD if available and not (optimizing for speed on a slow
> LDRD-device).

The issue is that the behaviour is not consistent. If DI mode accesses are
split early, LDRD is not used, but if not split, LDRD is used even on cores
where LDRD is not preferred or slow.
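
For illustration only, the rule suggested in the quoted comment ("use LDRD if
available and not (optimizing for speed on a slow LDRD-device)") could be
written as one predicate that both the early-split and the non-split paths
consult. This is a sketch, not GCC code: the function name and its three
parameters are hypothetical and do not correspond to actual GCC macros or
tuning fields.

```c
#include <stdbool.h>

/* Hypothetical predicate, not taken from the GCC sources.
   have_ldrd:      the target supports LDRD/STRD at all.
   optimize_speed: optimizing for speed (false for -Os).
   ldrd_is_slow:   the selected core runs LDRD slower than two
                   independent loads. */
bool use_ldrd(bool have_ldrd, bool optimize_speed, bool ldrd_is_slow)
{
    /* Use LDRD whenever it exists, except when tuning for speed on a
       core where LDRD is slow. */
    return have_ldrd && !(optimize_speed && ldrd_is_slow);
}
```

With such a rule -Os would always pick LDRD when it is available, and the
choice would no longer depend on whether the DI mode access was split early.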

Selecting -mcpu=cortex-a57 while splitting early gives:

t0p:
        ldrd    r3, r2, [r0]
        adds    r3, r3, #1
        adc     r2, r2, #0
        strd    r3, r2, [r0]
        bx      lr

But with -mcpu=cortex-a53 (with -O2 or -Os):

t0p:
        ldr     r3, [r0]
        ldr     r2, [r0, #4]
        adds    r3, r3, #1
        str     r3, [r0]
        adc     r2, r2, #0
        str     r2, [r0, #4]
        bx      lr

GCC currently emits LDRD for both cases - so clearly LDRD was preferred...
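
The source of t0p is not shown in this comment. Judging from the assembly, it
is presumably a 64-bit increment through a pointer, so the following is a
guess at its shape rather than the actual test case. The second function is a
hypothetical helper that spells out what the split sequence does: two SI mode
halves with the carry propagated by hand, assuming the low word sits at
offset 0 as in the listings above.

```c
#include <stdint.h>

/* Assumed shape of the test function: a DI mode load, add and store. */
void t0p(uint64_t *p)
{
    *p += 1;
}

/* Hypothetical equivalent after splitting into two SI mode accesses.
   half[0] is the low word ([r0]), half[1] the high word ([r0, #4]). */
void t0p_split(uint32_t half[2])
{
    uint32_t lo = half[0];
    uint32_t hi = half[1];
    uint32_t sum = lo + 1u;        /* adds: low word plus one      */
    uint32_t carry = (sum < lo);   /* carry out of the low word    */
    half[0] = sum;
    half[1] = hi + carry;          /* adc: high word plus carry    */
}
```

Whether the two word accesses in the split form are emitted as LDRD/STRD or as
separate LDR/STR instructions is exactly the choice being discussed here.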
