https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82582
            Bug ID: 82582
           Summary: not quite optimal code for -2*x*y - 3*z: could use one
                    less LEA for smaller code without increasing critical
                    path latency for any input
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

int foo32(int x, int y, int z) {
    return -2*x*y - 3*z;
}

gcc8.0.0 20171015 -O3 https://godbolt.org/g/tzBuHx

        imull   %esi, %edi              # x*y
        leal    0(,%rdx,4), %eax        # needs a disp32 = 0
        subl    %eax, %edx              # -3*z
        negl    %edi                    # -(x*y)
        leal    (%rdx,%rdi,2), %eax     # result

LEA runs on limited ports, and an index with no base needs a 4-byte disp32 = 0.

The critical-path latencies, assuming 2-operand imul is 3 cycles like on Intel:

x->res: imul, neg, lea = 5c
y->res: imul, neg, lea = 5c
z->res: lea, sub, lea  = 3c

This is better than gcc6.3 / gcc7.2 (which uses 3 LEA and is generally worse).
It's also different from gcc4/gcc5 (6c from x to result, but only 2c from z to
result, so it's different but not worse or better in all cases).

clang5.0 does better: same latencies, smaller code size, and trades one LEA for
an ADD:

        imull   %esi, %edi
        addl    %edi, %edi
        leal    (%rdx,%rdx,2), %eax
        negl    %eax
        subl    %edi, %eax

x->res: imul, add, sub = 5c
y->res: imul, add, sub = 5c
z->res: lea, neg, sub  = 3c

related: poor code-gen for 32-bit code with this. I haven't checked other
32-bit architectures.

long long foo64(int x, int y, int z) {
    return -2LL*x*(long long)y - 3LL*(long long)z;
}
// also on the godbolt link

gcc -m32 uses a 3-operand imul-immediate for `-2`, but some clunky shifting for
`-3`. There's also a mull in there.

clang5.0 -m32 makes very nice code, using a one-operand imul for -3 and just
shld/add + sub/sbb (plus some mov instructions). One-operand mul/imul is 3 uops
on Intel with 2 clock throughput, but ADC is 2 uops on Intel pre-Broadwell, so
it's nice to avoid that.
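A quick way to convince yourself that the gcc8 and clang5.0 sequences for foo32 above compute the same value is to model each instruction in C on uint32_t, so that wraparound matches what the registers do. This is only a sketch: the function names (foo32_ref, foo32_gcc8, foo32_clang5, check_foo32_samples) are made up for illustration, and it checks the arithmetic only, not latency or code size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: -2*x*y - 3*z evaluated modulo 2^32 (hypothetical helper). */
uint32_t foo32_ref(uint32_t x, uint32_t y, uint32_t z) {
    return 0u - 2u * x * y - 3u * z;
}

/* Models the gcc8 sequence one instruction per line. */
uint32_t foo32_gcc8(uint32_t x, uint32_t y, uint32_t z) {
    uint32_t edi = x * y;       /* imull %esi, %edi          : x*y      */
    uint32_t eax = 4u * z;      /* leal  0(,%rdx,4), %eax    : 4*z      */
    uint32_t edx = z - eax;     /* subl  %eax, %edx          : -3*z     */
    edi = 0u - edi;             /* negl  %edi                : -(x*y)   */
    return edx + 2u * edi;      /* leal  (%rdx,%rdi,2), %eax : result   */
}

/* Models the clang5.0 sequence one instruction per line. */
uint32_t foo32_clang5(uint32_t x, uint32_t y, uint32_t z) {
    uint32_t edi = x * y;       /* imull %esi, %edi          : x*y      */
    edi = edi + edi;            /* addl  %edi, %edi          : 2*x*y    */
    uint32_t eax = z + 2u * z;  /* leal  (%rdx,%rdx,2), %eax : 3*z      */
    eax = 0u - eax;             /* negl  %eax                : -3*z     */
    return eax - edi;           /* subl  %edi, %eax          : result   */
}

/* Compares both models against the reference over a few edge-case inputs. */
void check_foo32_samples(void) {
    static const uint32_t v[] = {
        0u, 1u, 2u, 3u, 12345u, 0x7fffffffu, 0x80000000u, 0xffffffffu
    };
    const size_t n = sizeof v / sizeof v[0];
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++) {
                uint32_t want = foo32_ref(v[i], v[j], v[k]);
                assert(foo32_gcc8(v[i], v[j], v[k]) == want);
                assert(foo32_clang5(v[i], v[j], v[k]) == want);
            }
}
```

Calling check_foo32_samples() from a main() exercises both models. The only difference between the two sequences is where the negation and the doubling are applied, which is why clang can use a plain ADD where gcc needs the base-less LEA.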
related: add %esi,%esi / sbb %edi,%edi is an interesting way to sign-extend a 32-bit input into a pair of registers while doubling it. However, if it starts in eax, cltd / add %eax,%eax is much better. (sbb same,same is only recognized as dep-breaking on AMD Bulldozer-family and Ryzen. On Intel it has a false dep on the old value of the register, not just CF).
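To make the add/sbb trick concrete, here is a small C model of both sequences, assuming the usual x86 semantics: add sets CF to the bit carried out of bit 31, and sbb same,same leaves 0 - CF in the register. The function names are hypothetical and only illustrate the arithmetic. The dependency-breaking behaviour mentioned above is a microarchitectural property and is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: 2*x widened to 64 bits, returned as its two's-complement bit pattern. */
uint64_t double_ref(int32_t x) {
    return (uint64_t)(2 * (int64_t)x);
}

/* Models: add %esi,%esi ; sbb %edi,%edi  (result in edi:esi). */
uint64_t double_via_add_sbb(int32_t x) {
    uint32_t esi = (uint32_t)x;
    uint32_t cf = esi >> 31;    /* carry out of the add is the old sign bit */
    esi = esi + esi;            /* add %esi,%esi : low half of 2*x          */
    uint32_t edi = 0u - cf;     /* sbb %edi,%edi : 0 or 0xffffffff          */
    return ((uint64_t)edi << 32) | esi;
}

/* Models: cltd ; add %eax,%eax  (result in edx:eax). */
uint64_t double_via_cltd_add(int32_t x) {
    uint32_t eax = (uint32_t)x;
    uint32_t edx = (x < 0) ? 0xffffffffu : 0u;  /* cltd : sign-extend eax into edx */
    eax = eax + eax;                            /* add %eax,%eax                   */
    return ((uint64_t)edx << 32) | eax;
}

/* Compares both models against the reference on a few edge-case inputs. */
void check_double_samples(void) {
    static const int32_t v[] = {
        0, 1, -1, 12345, -12345, INT32_MAX, INT32_MIN
    };
    for (size_t i = 0; i < sizeof v / sizeof v[0]; i++) {
        assert(double_via_add_sbb(v[i]) == double_ref(v[i]));
        assert(double_via_cltd_add(v[i]) == double_ref(v[i]));
    }
}
```

Both models give the same register pair. The difference is in the dependencies: in the cltd version the high half depends only on the original eax, while in the add/sbb version it depends on the flags produced by the add (and, on Intel, on the old value of the destination register as well).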