https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
Should I open a separate bug for the reusing call-preserved regs thing, and retitle this one to the call-reordering issue we ended up talking about here? I always have a hard time limiting an optimization bug report to a single issue, sorry.

(In reply to Andrew Pinski from comment #1)
> Note teaching this trick requires a huge amount of work as you need to teach
> GCC more about order of operands does not matter; this requires work in the
> front-end and then in the gimple level and then maybe the middle-end.

Ok :(

> Is it worth it for the gain, most likely not, you are more likely just to
> get better code by not depending on unspecified behavior in C.

Writing the code this way intentionally leaves it up to the compiler to choose the optimal order to evaluate foo(a+2) and foo(a). I don't see why forcing the compiler into one choice or the other should be considered "better" for performance, just because gcc doesn't take advantage of its options. (Better for maintainability in case someone adds side-effects to foo(), sure).

I should have used __attribute__((pure)) int foo(int); to make it clear that the order of the function calls didn't matter. That would make reordering legal even if the calls were separated by a sequence point, wouldn't it? (Of course, it sounds like gcc still wouldn't consider doing the reordering).

> ># why lea instead of add rdi,2?
>
> Because lea does not clobber the flags, so this might be faster, it depends
> on the machine.

Every OOO x86 CPU renames EFLAGS, because almost every instruction writes flags. There aren't any CPUs where instructions that don't write flags are faster for that reason. (Not writing flags is useful when it lets you reuse some already-set flags for another check with a different condition, or stuff like that, but that's not the case here).

On Intel Haswell, for example, LEA can run on port 1 or 5, but ADD can run on port 0, 1, 5, or 6. Otherwise they're the same (latency, total uops, and code-size). Using `-mtune=haswell` doesn't get it to choose add edi,2 :(

(From http://agner.org/optimize/ instruction tables, and Agner's microarch pdf)

LEA is special on Atom. I don't remember exactly what its effect is on latency in Atom's in-order pipeline, but LEA happens at a different pipeline stage from normal ALU instructions (actually running on the AGUs). IIRC, that's an earlier stage, so inputs need to be ready sooner.

> Also try -Os you might see a difference code.

No change with -Os
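
To make the __attribute__((pure)) point above concrete, here is a minimal sketch, not the actual test case. The body of foo(), the expression in bar(), and the helper names bar_sequenced, bar_swapped and self_check are all assumptions for illustration. It shows the same two calls written unsequenced and in both sequenced orders. Because foo() is pure, all three versions are observably equivalent, which is the property that would let a compiler pick either call order.

```c
#include <assert.h>

/* Hypothetical stand-in for the external foo(); the body is an assumption.
   'pure' promises no observable effects other than the return value.
   'noinline' keeps the calls visible as real calls in the generated asm. */
__attribute__((pure, noinline)) int foo(int x);

int foo(int x) { return x * 3 + 1; }

/* Unsequenced: the two calls are operands of '+', so C leaves their
   relative evaluation order up to the compiler. */
int bar(int a) { return foo(a + 2) + 5 * foo(a); }

/* Sequenced: foo(a + 2) is written first, in its own full expression. */
int bar_sequenced(int a) {
    int first = foo(a + 2);
    int second = foo(a);
    return first + 5 * second;
}

/* Sequenced the other way round: foo(a) is written first. */
int bar_swapped(int a) {
    int second = foo(a);
    int first = foo(a + 2);
    return first + 5 * second;
}

/* With a pure foo(), all three versions must return the same value. */
void self_check(void) {
    for (int a = -10; a <= 10; ++a) {
        assert(bar(a) == bar_sequenced(a));
        assert(bar(a) == bar_swapped(a));
    }
}
```

Compiling something like this with `gcc -O2 -S` and comparing the three functions is one way to see which call order gcc actually picks in each case. Whether it ever swaps the sequenced calls is exactly the open question here.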