[Bug c/70408] reusing the same call-preserved register would give smaller code in some cases

peter at cordes dot ca Fri, 25 Mar 2016 05:02:33 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408


--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
Should I open a separate bug for the reusing call-preserved regs thing, and
retitle this one to the call-reordering issue we ended up talking about here?

I always have a hard time limiting an optimization bug report to a single
issue, sorry.

(In reply to Andrew Pinski from comment #1)
> Note teaching this trick requires a huge amount of work as you need to teach
> GCC more about order of operands does not matter; this requires work in the
> front-end and then in the gimple level and then maybe the middle-end.  

Ok :(

> Is it worth it for the gain, most likely not, you are more likely just to
> get better code by not depending on unspecified behavior in C.

Writing the code this way intentionally leaves it up to the compiler to choose
the optimal order to evaluate foo(a+2) and foo(a).  I don't see why forcing the
compiler into one choice or the other should be considered "better" for
performance, just because gcc doesn't take advantage of its options.  (Better
for maintainability in case someone adds side-effects to foo(), sure).

I should have used  __attribute__((pure)) int foo(int);
to make it clear that the order of the function calls didn't matter.  That
would make reordering legal even if the calls were separated by a sequence point,
wouldn't it?  (Of course, it sounds like gcc still wouldn't consider doing the
reordering).
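
To make the two cases concrete, here is a minimal sketch.  The body of foo()
and the names bar_unordered / bar_ordered are hypothetical stand-ins; the
noinline attribute is only there so the calls stay real calls in a
single-file example.  In the first function the operands of + are
unsequenced relative to each other, so the compiler is free to pick either
call order.  In the second, the calls are separated by sequence points, and
any reordering would have to be justified by foo() having no observable side
effects, which is what the pure attribute is meant to promise.

```c
/* Hypothetical stand-in for foo(): any side-effect-free function works.
 * "pure" tells GCC the result depends only on the arguments and on
 * readable memory; "noinline" keeps the calls from being inlined away. */
__attribute__((pure, noinline)) int foo(int x)
{
    return x * 3 + 1;
}

/* The two calls are operands of +, so C leaves their evaluation order
 * unspecified: the compiler may call foo(a) before foo(a + 2). */
int bar_unordered(int a)
{
    return foo(a + 2) + foo(a);
}

/* Each declaration ends a full expression, so the source fixes the order.
 * Swapping the calls is only valid under the as-if rule, i.e. when the
 * compiler knows foo() has no observable side effects. */
int bar_ordered(int a)
{
    int first = foo(a + 2);
    int second = foo(a);
    return first + second;
}
```

Both functions return the same value for any input as long as foo() really
is side-effect free; the only difference is how much freedom the source
gives the compiler in ordering the calls.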

> ># why lea instead of add rdi,2?
> 
> Because lea does not clobber the flags, so this might be faster, it depends
> on the machine.

Every OOO x86 CPU renames EFLAGS, because almost every instruction writes
flags.  There aren't any CPUs where instructions that don't write flags are
faster for that reason.  (Not writing flags is useful when it lets you reuse
some already-set flags for another check with a different condition, or stuff
like that, but that's not the case here).

On Intel Haswell for example, the LEA can run on port 1 or 5, but the add can
run on port 0,1,5,6.  Otherwise they're the same (latency, total uops, and
code-size).  Using `-mtune=haswell` doesn't get it to choose  add edi,2  :(

(From http://agner.org/optimize/ instruction tables, and Agner's microarch pdf)

LEA is special on Atom.  I don't remember exactly what its effect is on latency
in Atom's in-order pipeline, but LEA happens at a different pipeline stage from
normal ALU instructions (actually running on the AGUs).  IIRC, that's an
earlier stage, so inputs need to be ready sooner.

> Also try -Os you might see a difference code.

No change with -Os
