https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866
--- Comment #6 from luoxhu at gcc dot gnu.org ---
For V4SI, it is also better to use vector splat and vector rotate operations.

revb:
.LFB0:
        .cfi_startproc
        vspltish %v1,8
        vspltisw %v0,-16
        vrlh %v2,%v2,%v1
        vrlw %v2,%v2,%v0
        blr

Performance improved from 7.322s to 2.445s on a small benchmark, because the
load instruction was replaced.

But for V2DI, we don't have "vspltisd" to splat {32,32} into a vector register
before Power9, so lvx is still required?

vector unsigned long long
revb_pwr7_l(vector unsigned long long a)
{
  return vec_rl(a, vec_splats((unsigned long long)32));
}

generates:

revb_pwr7_l:
.LFB1:
        .cfi_startproc
.LCF1:
0:      addis 2,12,.TOC.-.LCF1@ha
        addi 2,2,.TOC.-.LCF1@l
        .localentry revb_pwr7_l,.-revb_pwr7_l
        addis %r9,%r2,.LC0@toc@ha
        addi %r9,%r9,.LC0@toc@l
        lvx %v0,0,%r9
        vrld %v2,%v2,%v0
        blr
.LC0:
        .quad 32
        .quad 32
        .align 4
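
For reference, a scalar C model of why the splat-and-rotate sequence amounts to a byte reverse. This is only a sketch of the arithmetic on one element, not the vector code GCC emits. The function names are hypothetical, and the 64-bit variant assumes the V2DI case simply adds a doubleword rotate by 32 on top of the same two steps, as the vec_rl example suggests.

```c
#include <stdint.h>

/* Scalar model of "vrlh" with a splat of 8: rotate each 16-bit lane
   left by 8, which swaps the two bytes inside every halfword. */
static uint64_t rot_halfwords_by_8(uint64_t x)
{
    const uint64_t lo_bytes = 0x00FF00FF00FF00FFULL;
    return ((x & lo_bytes) << 8) | ((x >> 8) & lo_bytes);
}

/* Scalar model of "vrlw" with a splat of -16: the rotate only looks at
   the low 5 bits of the count, so -16 acts as 16. Rotating each 32-bit
   lane by 16 swaps the two halfwords inside every word. */
static uint64_t rot_words_by_16(uint64_t x)
{
    const uint64_t lo_halves = 0x0000FFFF0000FFFFULL;
    return ((x & lo_halves) << 16) | ((x >> 16) & lo_halves);
}

/* Scalar model of "vrld" with a splat of 32: swap the two words. */
static uint64_t rot_doubleword_by_32(uint64_t x)
{
    return (x << 32) | (x >> 32);
}

/* Byte reverse of one 32-bit element (the V4SI case). */
uint32_t revb_word_model(uint32_t x)
{
    return (uint32_t)rot_words_by_16(rot_halfwords_by_8(x));
}

/* Byte reverse of one 64-bit element (the V2DI case, assumed to be the
   same two steps followed by the rotate by 32). */
uint64_t revb_doubleword_model(uint64_t x)
{
    return rot_doubleword_by_32(rot_words_by_16(rot_halfwords_by_8(x)));
}
```

For example, revb_word_model(0x11223344) goes 0x11223344 -> 0x22114433 -> 0x44332211. The only constant the V2DI path needs beyond the V4SI one is the rotate count 32, which is the value that has to be loaded with lvx in the code above.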