https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636
--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
The point about moves also applies to integer code, since a 64-bit mov requires an extra byte for the REX prefix (unless a REX prefix was already required for r8-r15). I just noticed a case where gcc uses a 64-bit mov to copy a just-zeroed integer register when setting up for a 16-byte atomic load. (See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 re: using a narrow load for a single member, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 for a 7.1.0 regression, and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 for the store-forwarding stalls from this code with -m32.)

    // https://godbolt.org/g/xnyI0l
    // return the first 8-byte member of a 16-byte atomic object.
    #include <atomic>
    #include <stdint.h>

    struct node;
    struct alignas(2*sizeof(void*)) counted_ptr {
        node *ptr;          // non-atomic pointer-to-atomic
        uintptr_t count;
    };

    node *load_nounion(std::atomic<counted_ptr> *p) {
        return p->load(std::memory_order_acquire).ptr;
    }

gcc6.3 -std=gnu++11 -O3 -mcx16 compiles this to:

    pushq   %rbx
    xorl    %ecx, %ecx
    xorl    %eax, %eax
    xorl    %edx, %edx
    movq    %rcx, %rbx      ### BAD: should be movl %ecx,%ebx. Or another xor
    lock cmpxchg16b (%rdi)
    popq    %rbx
    ret

MOVQ is obviously sub-optimal, unless it's done as padding to avoid NOPs later. It's debatable whether %rbx should be zeroed with xorl %ebx,%ebx or movl %ecx,%ebx:

* AMD: copying a zeroed register is always at least as good, and sometimes better.
* Intel: xor-zeroing is always best, but on IvB and later copying a zeroed register is just as good most of the time. (Not, however, in cases where mov %r10d, %ebx would cost a REX prefix and xor %ebx,%ebx wouldn't.)

Unfortunately, -march/-mtune doesn't affect the code-gen either way. OTOH, there's not much to gain here, and the current strategy of mostly using xor isn't horrible for any CPU. Just avoiding useless REX prefixes to save code size would be good enough.
But if anyone does care about optimally zeroing multiple registers:

-mtune=bdver1/2/3 should maybe use one xorl and three movl, since integer MOV can run on ports AGU01 as well as EX01, but integer xor-zeroing still takes an execution unit (AFAIK) and can only run on EX01. Copying a zeroed register is definitely good for vectors, since vector movdqa is handled at rename with no execution port or latency.

-mtune=znver1 (AMD Ryzen) needs an execution port for integer xor-zeroing (and maybe vector), but integer and vector mov run with no execution port or latency (handled in the rename stage). XOR-zeroing one register and copying it (with 32-bit integer or 128-bit vector mov) is clearly optimal. In http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen3_InstLatX64.txt, mov r32,r32 throughput is 0.2, but integer xor-zeroing throughput is only 0.25. IDK why vector movdqa throughput isn't 0.2, but the latency data tells us it's handled at rename, which Agner Fog's data confirms.

-mtune=nehalem and earlier Intel P6-family don't care much: both mov and xor-zeroing use an execution port. But mov has non-zero latency, so the mov-zeroed registers are ready at the earliest 2 cycles after the xor and mov uops issue. Also, mov may not preserve the upper-bytes-zeroed state that avoids partial-register stalls when you write AL and then read EAX. Definitely don't MOV-copy a register that was zeroed a long time ago: that contributes to register-read stalls (http://stackoverflow.com/a/41410223/224132). mov-zeroing is only ok within about 5 cycles of the xor-zeroing.

-mtune=sandybridge should definitely use four XOR-zeroing instructions, because MOV needs an execution unit (and has 1c latency) while xor-zeroing doesn't. XOR-zeroing also avoids consuming space in the physical register file: http://stackoverflow.com/a/33668295/224132.
-mtune=ivybridge and later Intel shouldn't care most of the time, but xor-zeroing is sometimes better (and never worse). They can handle integer and SSE MOV instructions in the rename stage with no execution port, the same way they and SnB handle xor-zeroing. However, mov-zeroing reads more registers, which can be a bottleneck (especially if they're cold?) on HSW/SKL: http://www.agner.org/optimize/blog/read.php?i=415#852. Apparently mov-elimination isn't perfect, and it sometimes does use an execution port; IDK when it fails. Also, a kernel save/restore might leave the zeroed source register no longer in the special zeroed state (pointing to the physical zero register, so it and its copies don't take up a register-file entry). So mov-zeroing is likely to be worse in the same cases as Nehalem and earlier: when the source was zeroed a while ago.

IDK about Silvermont/KNL or Jaguar, except that 64-bit xorq same,same isn't a dependency-breaker on Silvermont/KNL. Fortunately, gcc always uses 32-bit xor for integer registers.

-mtune=generic might take a balanced approach: zero two or three registers with XOR (starting with ones that don't need REX prefixes), and use MOVL to copy for the remaining one or two. Since MOV may help throughput on AMD (by reducing execution-port pressure), and the only significant downside for Intel is on Sandybridge (except for partial-register stuff), it's probably fine to mix in some MOV.