[Bug target/82260] [x86] Unnecessary use of 8-bit registers with -Os. slightly slower and larger code

peter at cordes dot ca Wed, 20 Sep 2017 08:51:34 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82260


--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
> (not (match_test "TARGET_PARTIAL_REG_STALL"))))))

gcc is doing this even with -mtune=core2.

Core2 / Nehalem stall (the front-end) for 2-3 cycles to insert a merging uop
when reading a full register after writing a partial register.  Sandybridge
inserts a merging uop without stalling.  Haswell/Skylake don't rename low8 in
the first place (but insert a merging uop for high8 without stalling).

gcc should be trying to avoid partial-register shenanigans on Core2 / Nehalem,
but the penalty is low enough that it's probably not worth changing
-mtune=generic.

Related: gcc likes to do set-flags / setcc / movzx, but whenever a
zero-extended bool is needed it would be significantly better to do
xor-zero / set-flags / setcc where possible.

setcc into the low8 of a register zeroed with a recognized zeroing idiom avoids
partial-register penalties when reading the full register, and it has a shorter
critical path from test -> 32-bit result.  It also avoids a false dependency on
the old value of the register.  (Fun fact: on early P6 (PPro to Pentium III),
xor-zeroing was not dependency-breaking, but did avoid partial-register
stalls.)
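
To make the two sequences concrete, here is a minimal sketch.  The function
name is_less is a hypothetical test case, not something from this bug, and the
asm in the comment assumes the x86-64 SysV calling convention in AT&T syntax.
Actual gcc output varies with version and options.

```c
/* Hypothetical test case: a comparison whose result must be
 * zero-extended to a full 32-bit int. */
int is_less(int a, int b)
{
    return a < b;
}

/*
 * set-flags / setcc / movzx (the pattern described above):
 *     cmpl    %esi, %edi      # set flags from a - b
 *     setl    %al             # writes only the low 8 bits of %eax
 *     movzbl  %al, %eax       # extra uop on the critical path
 *     ret
 *
 * xor-zero / set-flags / setcc (the suggested pattern):
 *     xorl    %eax, %eax      # zeroing idiom; must precede cmp, it clobbers flags
 *     cmpl    %esi, %edi
 *     setl    %al             # upper bits of %eax are already zero
 *     ret
 */
```

The xor has to come before the flag-setting instruction because it writes
flags itself, which is why this is only possible when a register is free at
that point.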

Also, movzx %al, %eax defeats mov-elimination on Intel, so it's always better
to movzx to a different architectural register for zero-extension, modulo
register pressure and not costing any extra instructions total.
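
A rough illustration of the same-register vs. different-register choice.  The
function low_byte_sum is hypothetical, and the asm in the comment is
hand-written for the x86-64 SysV calling convention, not captured compiler
output.

```c
/* Hypothetical test case: an 8-bit result that must be zero-extended. */
unsigned low_byte_sum(unsigned char a, unsigned char b)
{
    return (unsigned char)(a + b);
}

/*
 * Zero-extending within the same architectural register
 * (cannot be handled by mov-elimination):
 *     leal    (%rdi,%rsi), %eax
 *     movzbl  %al, %eax
 *     ret
 *
 * Same instruction count, but zero-extending into a different register:
 *     addl    %esi, %edi
 *     movzbl  %dil, %eax
 *     ret
 */
```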

Is there already an open bug for either of these latter problems?  (Sorry, I
have a bad habit of taking bugs off-topic.)
