https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to H.J. Lu from comment #1)
> But
>
>         vxorps  %xmm0, %xmm0, %xmm0
>         vcvtsd2ss       %xmm1, %xmm0, %xmm0
>
> are faster than both.

On Skylake-client (i7-6700k), I can't reproduce this result in a hand-written asm loop.  (I was using NASM to make a static executable that runs a 100M iteration loop so I could measure with perf.)  Can you show some asm where this performs better?

vcvtsd2ss src-reg,dst,dst is always 2 uops, regardless of the merge destination being an xor-zeroed register.  (Either zeroed outside the loop, or inside, or once per 4 converts with an unrolled loop.)

I can't construct a case where  vcvtsd2ss %xmm1, %xmm1, %xmm0  is worse in any way (dependencies, uops, latency, throughput) than VXORPS + vcvtsd2ss with dst = middle source.  I wasn't mixing it with any instructions other than VXORPS, but I don't think anything is going to get rid of its 2nd uop, and choosing both inputs = the same source removes any benefit from dep-breaking the output.

If adding a VXORPS helped, it's probably due to some other side-effect.

Could the effect you saw have been due to code-gen changes for memory sources, maybe  vxorps + vcvtsd2ss (mem), %xmm0, %xmm0   vs.  vmovsd + vcvtsd2ss %xmm1, %xmm1, %xmm0?  (Those should be about equal, but memory-source SS2SD is cheaper, no port5 uop.)

----

BTW, the false-dependency effect is much more obvious with SS2SD, where the latency from src1 to output is 4 cycles, vs. 1 cycle for SD2SS.

Even without dependency-breaking, repeated

        vcvtsd2ss  %xmm1, %xmm0, %xmm0

can run at 1 per clock (same as with dep breaking), because the port-5 uop that merges into the low 32 bits of xmm0 with 1 cycle latency is 2nd.  So latency from xmm0 -> xmm0 for that [v]cvtsd2ss %xmm1, %xmm0 is 1 cycle.

With dep-breaking, they both still bottleneck on the port5 uop if you're doing nothing else.
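
For anyone who wants to try this, here is a minimal sketch of the kind of NASM test loop described above. It is a reconstruction under assumptions, not the exact file used for the measurements: the file name, the register choices, the single non-unrolled convert per iteration, and the perf event list are all illustrative. Note that NASM uses Intel operand order (dst, merge source, converted source), the reverse of the AT&T syntax quoted in this bug.

```nasm
; cvt.asm -- hypothetical microbenchmark sketch (x86-64 Linux, AVX required)
; Build (static, no libc):  nasm -felf64 cvt.asm && ld -o cvt cvt.o
; Measure, for example:     perf stat -e cycles,instructions,uops_issued.any ./cvt
; (the event names are an assumption; check `perf list` on your machine)

global _start
section .text
_start:
    mov     ecx, 100000000          ; 100M iterations
    vxorps  xmm0, xmm0, xmm0        ; merge destination, zeroed once outside the loop
    vxorps  xmm1, xmm1, xmm1        ; source double (0.0)

align 32
.loop:
    ; Case A: merge into the destination (loop-carried dependency through xmm0)
    vcvtsd2ss xmm0, xmm0, xmm1      ; AT&T: vcvtsd2ss %xmm1, %xmm0, %xmm0

    ; Case B: use the source as the merge operand too (swap in for case A)
    ; vcvtsd2ss xmm0, xmm1, xmm1    ; AT&T: vcvtsd2ss %xmm1, %xmm1, %xmm0

    ; Case C: dep-breaking xor inside the loop (add in front of case A)
    ; vxorps  xmm0, xmm0, xmm0

    dec     ecx
    jnz     .loop

    xor     edi, edi                ; exit status 0
    mov     eax, 231                ; __NR_exit_group
    syscall
```

Comparing cycles and uop counts per iteration across the three cases is one way to check the claims above. For a visible false dependency, replacing vcvtsd2ss with vcvtss2sd in case A should expose the longer src1-to-output latency.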