https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
The new asm is less bad, but still not good.  PR53133 is closed, but this
code-gen is a new instance of partial-register writing with xor %al,%al.  Also
related: PR82940 re: identifying bitfield-insert patterns in the middle-end;
hopefully Andrew Pinski's planned set of patches to improve that can help
back-ends do a better job.

If we're going to read a 32-bit reg after writing an 8-bit reg (causing a
partial-register stall on Nehalem and earlier), we should be doing

  mov  a, %al       # merge into the low byte of RAX
  ret

Haswell and newer Intel don't rename the low-byte partial register separately
from the full register, so they behave like AMD and other non-P6 /
non-Sandybridge CPUs: a dependency on the full register.  That's good for this
code; in this case the merging is necessary and we don't want the CPU to guess
that it won't be needed later.  The load + ALU-merge uops can micro-fuse into a
single uop for the front-end.

xor %al,%al still has a false dependency on the old value of RAX because it's
not a zeroing idiom; IIRC in my testing it's at least as good to do
mov $0, %al.  Both instructions are 2 bytes long.

* https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers
  - survey of the ways partial regs are handled on Intel P6-family vs. Intel
  Sandybridge vs. Haswell-and-later vs. non-Intel and Intel Silvermont, etc.
* https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
  - details of my testing on Haswell / Skylake.

----

*If* we still care about  -mtune=nehalem  and other increasingly irrelevant
CPUs, we should avoid a partial-register stall for those tuning options with
something like

   movzbl   a, %edx
   and      $-256, %eax
   or       %edx, %eax

i.e. what we're already doing, but spending a 5-byte AND-immediate instead of a
2-byte xor %al,%al or mov $0, %al.

(That's what clang always does, so it's missing the code-size optimization.
https://godbolt.org/z/jsE57EKcb shows a similar case,
return (a&0xFFFFFF00u) | (b&0xFFu); with two register args.)

-----

The penalty on Pentium-M through Nehalem is to stall for 2-3 cycles while a
merging uop is inserted.  The penalty on earlier P6 (PPro / Pentium III) is to
stall for 5-6 cycles until the partial-register write retires.

The penalty on Sandybridge (and maybe Ivy Bridge if it renames AL) is no stall,
just insert a merging uop.

On later Intel, and AMD, and Silvermont-family Intel, writing AL has a
dependency on the old RAX; it's a merge on the spot.

BTW, modern Intel does still rename AH separately, and merging does require the
front-end to issue a merging uop in a cycle by itself.  So writing AH instead
of AL would be different.
