https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370

--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
Doesn't change the performance implications, but I just realized I have the
offset-load backwards.  Instead of
        vpsrlw  $8, (%rsi), %xmm1
        vpand   15(%rsi), %xmm2, %xmm0

this algorithm should use
        vpand   1(%rsi), %xmm2, %xmm0     # ideally with rsi 32B-aligned
        vpsrlw  $8, 16(%rsi), %xmm1
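
(For concreteness, here is roughly that 128-bit sequence in intrinsics, before
the masked-load alternative below.  This is only a sketch, assuming the goal is
to extract the high byte of each 16-bit element, which is what the +1 offset
and (presumably) a set1_epi16(0x00FF) constant in %xmm2 imply; the helper name
and the final pack step are mine, not from the testcase.)

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: return the high byte of each of 16 consecutive
   uint16_t elements (32 input bytes), packed into one __m128i.  src is
   ideally 32B-aligned so the +1-offset load stays inside one cache line. */
static inline __m128i high_bytes_16(const uint8_t *src)
{
    const __m128i lowmask = _mm_set1_epi16(0x00FF);

    /* vpand 1(%rsi), %xmm2, %xmm0: the +1 offset puts each element's high
       byte at an even position of the load, so the AND leaves it
       zero-extended in its 16-bit word. */
    __m128i lo = _mm_and_si128(_mm_loadu_si128((const __m128i *)(src + 1)),
                               lowmask);

    /* vpsrlw $8, 16(%rsi), %xmm1: same extraction for elements 8..15, using
       a shift instead of an offset load. */
    __m128i hi = _mm_srli_epi16(_mm_loadu_si128((const __m128i *)(src + 16)), 8);

    /* Every word of lo and hi is 0..255, so the unsigned saturation in
       packuswb never changes a value. */
    return _mm_packus_epi16(lo, hi);
}

Whether the compiler actually folds the loads into the vpand / vpsrlw memory
operands like the asm above is its own problem; the sketch just pins down the
data movement.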

Or (with k1 = 0x5555555555555555)
        vmovdqu8    1(%rsi),  %zmm0{k1}{z}   # ALU + load micro-fused
        vmovdqu8    65(%rsi), %zmm1{k1}{z}   # and probably causes CL-split penalties
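
(Same idea for the masked-load version, assuming AVX-512BW; again the helper
name is made up, and a pack / vpmovwb step would still be needed afterwards.)

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: zero-masking byte loads as in the vmovdqu8 sequence
   above.  k1 = 0x5555... keeps the even byte positions of each load, and
   with the +1 / +65 offsets those are the source's high bytes, left
   zero-extended in each 16-bit word. */
static inline void high_bytes_masked(const uint8_t *src, __m512i *lo, __m512i *hi)
{
    const __mmask64 k1 = 0x5555555555555555ULL;
    *lo = _mm512_maskz_loadu_epi8(k1, src + 1);   /* vmovdqu8 1(%rsi), %zmm0{k1}{z}  */
    *hi = _mm512_maskz_loadu_epi8(k1, src + 65);  /* vmovdqu8 65(%rsi), %zmm1{k1}{z} */
}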

Like I said, we should probably avoid vmovdqu8 for loads or stores unless we
actually use masking.  vmovdqu32 or vmovdqu64 is always at least as good.  If
some future CPU handles masked vmovdqu8 without needing an ALU uop, it could
be worth using (but probably only if it also avoids cache-line-split
penalties).
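
(In intrinsics terms that advice boils down to something like this; copy64 and
copy_even_bytes64 are made-up names and the mask value is just an example.)

#include <immintrin.h>
#include <stdint.h>

/* No masking wanted: plain full-width load/store, which compilers emit as
   vmovdqu32 / vmovdqu64. */
static inline void copy64(uint8_t *dst, const uint8_t *src)
{
    _mm512_storeu_si512(dst, _mm512_loadu_si512(src));
}

/* Masking genuinely wanted: this is where vmovdqu8 with {k} earns its keep,
   even if it costs an extra ALU uop on current CPUs. */
static inline void copy_even_bytes64(uint8_t *dst, const uint8_t *src)
{
    const __mmask64 k = 0x5555555555555555ULL;   /* even byte positions */
    _mm512_mask_storeu_epi8(dst, k, _mm512_maskz_loadu_epi8(k, src));
}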

https://godbolt.org/g/a1U7hf

See also https://github.com/InstLatx64/InstLatx64 for a spreadsheet of
Skylake-AVX512 uop->port assignments.  It doesn't cover masked loads / stores,
though, and it disagrees with IACA on vmovdqu8 zmm stores: IACA says the ZMM
version uses an ALU uop even without masking.
