https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53090
amker at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from amker at gcc dot gnu.org ---
Hmm, it's not mentioned at which optimization level the original bug was
reported. I suspect O2, because a vect_perm instruction is needed after
vectorization.

So the current status is: after ivopt rewriting, we generate the 8-instruction
loop below at O2:

.L14:
        movl    (%r14,%rax,4), %ecx
        movl    (%r14,%rdx,4), %esi
        movl    %esi, (%r14,%rax,4)
        movl    %ecx, (%r14,%rdx,4)
        addq    $1, %rax
        subq    $1, %rdx
        cmpl    %eax, %edx
        jg      .L14

This is better than what was reported.

At O3:

.L14:
        movdqu  (%rsi,%rdx), %xmm2
        movdqa  (%r12,%rax), %xmm0
        pshufd  $27, %xmm2, %xmm1
        pshufd  $27, %xmm0, %xmm0
        movaps  %xmm1, (%r12,%rax)
        addq    $16, %rax
        movups  %xmm0, (%rsi,%rdx)
        subq    $16, %rdx
        cmpq    %rax, %rdi
        jne     .L14

Consider this fixed.
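
For readers without the original test case at hand: the comment does not quote the source, so the following is only a sketch of a loop whose shape matches the O2 listing (two loads, two crossed stores, one index counting up, one counting down, and a signed "greater than" exit test). The function name reverse_ints and its signature are assumptions made for illustration, not taken from the bug report.

```c
/* Hypothetical reconstruction, not the reporter's test case:
 * reverse an int array in place by swapping from both ends. */
void reverse_ints(int *a, int n)
{
    int i = 0;      /* ascending index, like %rax in the O2 listing */
    int j = n - 1;  /* descending index, like %rdx in the O2 listing */

    while (j > i) {
        int tmp = a[i];  /* first load */
        a[i] = a[j];     /* second load, first store */
        a[j] = tmp;      /* second store */
        i++;
        j--;
    }
}
```

Under that assumption, the O3 listing reads as the same swap done four elements at a time: pshufd with immediate 27 (binary 00 01 10 11) reverses the four 32-bit lanes of a register, so each 16-byte block is reversed before it is stored at the opposite end.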