https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
The point about moves also applies to integer code, since a 64-bit mov requires
an extra byte for the REX prefix (unless a REX prefix was already required for
r8-r15).

I just noticed a case where gcc uses a 64-bit mov to copy a just-zeroed integer
register when setting up for a 16-byte atomic load.  (See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 re: using a narrow load for
a single member, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 for a
7.1.0 regression, and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 for
the store-forwarding stalls from this code with -m32.)

// https://godbolt.org/g/xnyI0l
// return the first 8-byte member of a 16-byte atomic object.
#include <atomic>
#include <stdint.h>
struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;    // non-atomic pointer-to-atomic
    uintptr_t count;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
  return p->load(std::memory_order_acquire).ptr;
}

gcc6.3 -std=gnu++11 -O3 -mcx16 compiles this to

        pushq   %rbx
        xorl    %ecx, %ecx
        xorl    %eax, %eax
        xorl    %edx, %edx
        movq    %rcx, %rbx    ### BAD: should be movl %ecx,%ebx, or another xor
        lock cmpxchg16b (%rdi)
        popq    %rbx
        ret

MOVQ is obviously sub-optimal, unless done for padding to avoid NOPs later.

It's debatable whether %rbx should be zeroed with xorl %ebx,%ebx or movl
%ecx,%ebx.

* AMD: copying a zeroed register is always at least as good, sometimes better.
* Intel: xor-zeroing is always best, but on IvB and later, copying a zeroed reg
is just as good most of the time.  (But not in cases where mov %r10d, %ebx would
cost a REX prefix and xor %ebx,%ebx wouldn't.)

Unfortunately, -march/-mtune doesn't affect the code-gen either way.  OTOH,
there's not much to gain here, and the current strategy of mostly using xor is
not horrible for any CPUs.  Just avoiding useless REX prefixes to save code
size would be good enough.

But if anyone does care about optimally zeroing multiple registers:

-mtune=bdver1/2/3 should maybe use one xorl and three movl (since integer MOV
can run on ports AGU01 as well as EX01, but integer xor-zeroing still takes an
execution unit, AFAIK, and can only run on EX01.)  Copying a zeroed register is
definitely good for vectors, since vector movdqa is handled at rename with no
execution port or latency.

-mtune=znver1 (AMD Ryzen) needs an execution port for integer xor-zeroing (and
maybe vector), but integer and vector mov run with no execution port or latency
(in the rename stage).  XOR-zeroing one register and copying it (with 32-bit
integer or 128-bit vector mov) is clearly optimal.  In
http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen3_InstLatX64.txt, mov
r32,r32 throughput is 0.2, but integer xor-zeroing throughput is only 0.25. 
IDK why vector movdqa throughput isn't 0.2, but the latency data tells us it's
handled at rename, which Agner Fog's data confirms.


-mtune=nehalem and earlier Intel P6-family don't care much: both mov and
xor-zeroing use an execution port.  But mov has non-zero latency, so the
mov-zeroed registers are ready at the earliest 2 cycles after the xor and mov
uops issue.  Also, mov may not preserve the upper-bytes-zeroes property that
avoids partial register stalls if you write AL and then read EAX.  Definitely
don't MOV a register that was zeroed a long time ago: that will contribute to
register-read stalls.  (http://stackoverflow.com/a/41410223/224132). 
mov-zeroing is only ok within about 5 cycles of the xor-zeroing.

-mtune=sandybridge should definitely use four XOR-zeroing instructions, because
MOV needs an execution unit (and has 1c latency), but xor-zeroing doesn't.  
XOR-zeroing also avoids consuming space in the physical register file:
http://stackoverflow.com/a/33668295/224132.

-mtune=ivybridge and later Intel shouldn't care most of the time, but
xor-zeroing is sometimes better (and never worse):  They can handle integer and
SSE MOV instructions in the rename stage with no execution port, the same way
they and SnB handle xor-zeroing.  However, mov-zeroing reads more registers,
which can be a bottleneck (especially if they're cold?) on HSW/SKL.
http://www.agner.org/optimize/blog/read.php?i=415#852.  Apparently
mov-elimination isn't perfect, and it sometimes does use an execution port. 
IDK when it fails.  Also, a kernel save/restore might leave the zeroed source
register no longer in the special zeroed state (pointing to the physical
zero-register, so it and its copies don't take up a register-file entry).  So
mov-zeroing is likely to be worse in the same cases as Nehalem and earlier:
when the source was zeroed a while ago. 


IDK about Silvermont/KNL or Jaguar, except that 64-bit xorq same,same isn't a
dependency-breaker on Silvermont/KNL.  Fortunately, gcc always uses 32-bit xor
for integer registers.


-mtune=generic might take a balanced approach and zero two or three with XOR
(starting with ones that don't need REX prefixes), and use MOVL to copy for the
remaining one or two.  Since MOV may help throughput on AMD (by reducing
execution-port pressure), and the only significant downside for Intel is on
Sandybridge (except for partial-register stuff), it's probably fine to mix in
some MOV.
