https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918
--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #9)
> Or the backend add combine helper insn to match
>
> Failed to match this instruction:
> (set (reg:V8HI 90)
>     (rotate:V8HI (reg:V8HI 91)
>         (const_int 8 [0x8])))

The latency of the sequence in bswap_epi16 is 3, but it is 5 for vpshufb
with a memory operand. It looks to me like GCC's version is better.

bswap_epi16(short __vector(8)):
        vpsllw  xmm1, xmm0, 8
        vpsrlw  xmm0, xmm0, 8
        vpor    xmm0, xmm1, xmm0
        ret
foo(char __vector(16)):
        vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
        ret
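
For reference, the V8HI rotate by 8 in the quoted RTL swaps the two bytes of each 16-bit lane, which is why a shift/shift/or sequence and a byte shuffle can compute the same result. Below is a minimal sketch in C using the GCC/Clang vector extension. The type name `v8hu` and the function names `bswap16_lanes` and `bswap16_scalar` are illustrative assumptions, not taken from the testcase in this bug, and unsigned lanes are assumed so that the right shift is logical (matching vpsrlw).

```c
#include <stdint.h>

/* Eight unsigned 16-bit lanes in one 128-bit vector (GCC/Clang vector
   extension). Hypothetical type name, chosen for this sketch only. */
typedef uint16_t v8hu __attribute__((vector_size(16)));

/* Rotate every 16-bit lane left by 8 bits. Element-wise shifts keep the
   lane width, so this swaps the high and low byte of each lane. */
static v8hu bswap16_lanes(v8hu x)
{
    return (x << 8) | (x >> 8);
}

/* Scalar reference for a single lane, used only to check the vector form. */
static uint16_t bswap16_scalar(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}
```

Which instructions the compiler picks for `bswap16_lanes` depends on the target, the options, and the compiler version, so the assembly quoted above should not be assumed for this sketch.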