https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #2 from Peter Cordes ---
Besides code-size, uop-cache size is a factor for Intel CPUs. imul is only a
single uop, while neg/and is 2 uops. Total number of instructions is a factor
for other CPUs, too, but only locally. (Saving uop-cache space can mean
speedups for *other* code that doesn't get evicted).
If the operation isn't part of a long dependency chain, imul is a better choice
on almost all CPUs. Let OOO execution sort it out.
When latency matters some, we have to weigh the tradeoff of code-size / more
insns and uops vs. slightly (or much) higher latency.
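To make the two instruction sequences concrete, here is a minimal C sketch. It assumes the case under discussion is multiplying by a value known to be 0 or 1; the function names select_mul and select_mask are made up for illustration, and the instructions a compiler actually emits depend on its version and tuning flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Multiply form: a compiler may implement this with a single imul. */
static uint32_t select_mul(uint32_t x, bool b)
{
    return x * (uint32_t)b;
}

/* Mask form: -(uint32_t)b is 0 when b is false and 0xFFFFFFFF when b is
 * true, so the result is x or 0. This maps naturally onto neg + and. */
static uint32_t select_mask(uint32_t x, bool b)
{
    return x & -(uint32_t)b;
}
```

Both functions return x when b is true and 0 when b is false, so the choice between them is purely a code-size / uop-count / latency tradeoff.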
Agner Fog's instruction tables indicate that 32bit imul is probably ok for
tune=generic, but 64bit imul should maybe only be used with -mtune=intel, and
absolutely not with tune=atom. Maybe not with tune=silvermont either, although
it does have modest OOO capabilities to hide the latency. (It's not as wide, so
maybe saving insns matters more there?) I'm not sure if tune=intel is supposed
to put much weight on pre-Silvermont Atom.
From Agner Fog's spreadsheet, updated 2016-Jan09:
                 uops/m-ops  latency  recip-throughput  execution pipe/port
Intel: SnB-family (Sandybridge through Skylake)
  imul r32,r32:  1           3        1                 p1
  imul r64,r64:  1           3        1                 p1
AMD: bdver1-3
  imul r32,r32:  1           4        2                 EX1
  imul r64,r64:  1           6        4                 EX1
Intel: Silvermont
  imul r32,r32:  1           3        1                 IP0
  imul r64,r64:  1           5        2                 IP0
AMD: bobcat/jaguar
  imul r32,r32:  1           3        1                 I0
  imul r64,r64:  1           6        4                 I0

old HW
Intel: Nehalem
  imul r32,r32:  1           3        1                 p1
  imul r64,r64:  1           3        1                 p0
Intel: Merom/Penryn (Core2)
  imul r32,r32:  1           3        1                 p1
  imul r64,r64:  1           5        2                 p0 (same as FP mul, maybe borrows its wider multiplier?)
Intel: Atom
  imul r32,r32:  1           5        2                 Alu0,Mul
  imul r64,r64:  6           13       11                Alu0,Mul
AMD: K8/K10
  imul r32,r32:  1           3        1                 ALU0
  imul r64,r64:  1           4        2                 ALU0_1 (uses units 0 and 1)
VIA: Nano3000
  imul r32,r32:  1           2        1                 I2
  imul r64,r64:  1           5        2                 MA
If gcc keeps track of execution port pressure at all, it should also avoid imul
when surrounding code is multiply-heavy (or doing other stuff that also
contends for the same resources as imul). I didn't check on neg/and, but I
assume every microarchitecture can run them on any port with one cycle latency
each.
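As a rough worked comparison, the sketch below puts the imul latencies from the table next to a neg/and pair. It relies on the unchecked assumption stated above that neg and and are one cycle each (2 cycles back to back); the struct, the table name and the helper imul_extra_latency are hypothetical, not anything gcc uses.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* imul latencies in cycles, copied from the table above. */
struct imul_cost {
    const char *uarch;
    int lat32;   /* imul r32,r32 */
    int lat64;   /* imul r64,r64 */
};

static const struct imul_cost imul_table[] = {
    { "SnB-family",    3,  3 },
    { "bdver1-3",      4,  6 },
    { "Silvermont",    3,  5 },
    { "bobcat/jaguar", 3,  6 },
    { "Nehalem",       3,  3 },
    { "Core2",         3,  5 },
    { "Atom",          5, 13 },
    { "K8/K10",        3,  4 },
    { "Nano3000",      2,  5 },
};

/* Assumption: neg and and each have 1 cycle latency and run back to back. */
enum { NEG_AND_LATENCY = 2 };

/* Extra cycles imul adds to a dependency chain compared with neg/and.
 * Returns INT_MAX if the microarchitecture is not in the table. */
static int imul_extra_latency(const char *uarch, int is_64bit)
{
    for (size_t i = 0; i < sizeof imul_table / sizeof imul_table[0]; i++) {
        if (strcmp(imul_table[i].uarch, uarch) == 0) {
            int lat = is_64bit ? imul_table[i].lat64 : imul_table[i].lat32;
            return lat - NEG_AND_LATENCY;
        }
    }
    return INT_MAX;
}
```

With these numbers, 64bit imul costs 1 extra cycle of latency on SnB-family but 11 extra cycles on Atom, which is the gap behind the tune=atom concern. This only models latency; it ignores the uop-count and port-pressure effects described above.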
Getting off topic here:
tune=generic should account for popularity of CPUs, right? So I hope it won't
sacrifice much speed for SnB-family in order to avoid something that's slow on
Pentium4. (e.g. P4 doesn't like inc/dec, but all other CPUs rename the
carry flag separately to avoid the false dep. Not a great example, because
that only saves a couple code bytes. shrd isn't a good example, because it's
slow even on AMD Bulldozer.)
Is there a tune=no_glass_jaws that *will* give up speed (or code size) for
common CPUs in order to avoid things that are *really* bad on some rare
microarchitectures (especially old ones)? Or maybe a tune=desktop that doesn't
care what's slow on Atom/Jaguar? People distributing binaries that probably
won't be used on Atom/Silvermont netbooks might use that.
Anyway, I think it would be neat to have the option of making a binary that
will be quite good on SnB and not have major problems on recent AMD, where I
don't care if it has the occasional slow instruction on Atom or K8. Or
alternatively, to have a binary that doesn't suck badly anywhere.