https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
The new asm is less bad, but still not good.  PR53133 is closed, but this
code-gen is a new instance of partial-register writing with xor %al,%al.

Also related: PR82940 re: identifying bitfield-insert patterns in the
middle-end; hopefully Andrew Pinski's planned set of patches to improve that
can help back-ends do a better job?

If we're going to read a 32-bit reg after writing an 8-bit reg (causing a
partial-register stall on Nehalem and earlier), we should be doing

        mov     a, %al          # merge into the low byte of RAX
        ret

Haswell and newer Intel don't rename the low-byte partial register separately
from the full register, so they behave like AMD and other non-P6 /
non-Sandybridge CPUs: a dependency on the full register.  That's good for this
code; in this case the merging is necessary and we don't want the CPU to guess
that it won't be needed later.  The load+ALU-merge uops can micro-fuse into a
single uop for the front end.

xor %al,%al still has a false dependency on the old value of RAX because it's
not a zeroing idiom; IIRC in my testing it's at least as good to do
mov $0, %al.  Both instructions are 2 bytes long.

* https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers
  survey of the ways partial regs are handled on Intel P6-family vs. Intel
  Sandybridge vs. Haswell and later vs. non-Intel and Intel Silvermont etc.

* https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
  - details of my testing on Haswell / Skylake.

----

*If* we still care about -mtune=nehalem and other increasingly less relevant
CPUs, we should be avoiding a partial-register stall for those tuning options
with something like

        movzbl  a, %edx
        and     $-256, %eax
        or      %edx, %eax

i.e.
what we're already doing, but spending a 5-byte AND-immediate instead of a
2-byte xor %al,%al or mov $0, %al.  (That's what clang always does, so it's
missing the code-size optimization.  https://godbolt.org/z/jsE57EKcb shows a
similar case of return (a&0xFFFFFF00u) | (b&0xFFu); with two register args.)

-----

The penalty on Pentium M through Nehalem is to stall for 2-3 cycles while a
merging uop is inserted.  The penalty on earlier P6 (PPro / Pentium III) is to
stall for 5-6 cycles until the partial-register write retires.  The penalty on
Sandybridge (and maybe Ivy Bridge, if it renames AL) is no stall, just
inserting a merging uop.  On later Intel, and AMD, and Silvermont-family
Intel, writing AL has a dependency on the old RAX; it's a merge on the spot.

BTW, modern Intel does still rename AH separately, and merging does require
the front-end to issue a merging uop in a cycle by itself.  So writing AH
instead of AL would be different.