https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370
--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
Doesn't change the performance implications, but I just realized I have the
offset load backwards.  Instead of

    vpsrlw  $8, (%rsi), %xmm1
    vpand   15(%rsi), %xmm2, %xmm0

this algorithm should use

    vpand   1(%rsi), %xmm2, %xmm0     # ideally with rsi 32B-aligned
    vpsrlw  $8, 16(%rsi), %xmm1

Or (with k1 = 0x5555555555555555):

    vmovdqu8  1(%rsi),  %zmm0{k1}{z}   # ALU uop + load, micro-fused
    vmovdqu8  65(%rsi), %zmm1{k1}{z}   # and probably causes CL-split penalties

Like I said, we should probably avoid vmovdqu8 for loads or stores unless we
actually use masking.  vmovdqu32 or vmovdqu64 is always at least as good.  If
some future CPU has masked vmovdqu8 without needing an ALU uop, it could be
good (but probably only if it also avoids cache-line-split penalties).

https://godbolt.org/g/a1U7hf

See also https://github.com/InstLatx64/InstLatx64 for a spreadsheet of
Skylake-AVX512 uop->port assignments.  It doesn't include masked loads/stores,
and it doesn't match IACA for vmovdqu8 zmm stores: IACA says the ZMM version
uses an ALU uop even without masking.
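For reference, here's a minimal intrinsics sketch (mine, not the code behind the
godbolt link above) of the offset-load AND / shift / pack idea, for a loop like
dst[i] = src[i] >> 8 with uint16_t src and uint8_t dst, using 128-bit vectors.
The function name and loop structure are illustrative assumptions, and it omits
the alignment and tail handling real code would need:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Extract the high byte of each 16-bit element.
 * One half of the result comes from a load offset by 1 byte + vpand
 * (putting the wanted byte in the low byte of each word); the other half
 * comes from a plain load + vpsrlw $8.  vpackuswb then merges them
 * (no saturation, since every word is already 0..255). */
void high_bytes(uint8_t *dst, const uint16_t *src, size_t n)
{
    const __m128i low_mask = _mm_set1_epi16(0x00FF);       /* the vpand mask */
    for (size_t i = 0; i + 16 <= n; i += 16) {
        const char *p = (const char *)(src + i);
        __m128i lo = _mm_and_si128(
            _mm_loadu_si128((const __m128i *)(p + 1)), low_mask);   /* vpand 1(%rsi) */
        __m128i hi = _mm_srli_epi16(
            _mm_loadu_si128((const __m128i *)(p + 16)), 8);         /* vpsrlw $8, 16(%rsi) */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }
}

The point of the 1-byte-offset load is to move half the work off the shift port:
the AND can run on any vector ALU port (or fold into a micro-fused load+vpand),
while only the other half needs the vpsrlw.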