https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #6 from luoxhu at gcc dot gnu.org ---
For V4SI, it is also better to use vector splat and vector rotate operations.

revb:
.LFB0:
        .cfi_startproc
        vspltish %v1,8
        vspltisw %v0,-16
        vrlh %v2,%v2,%v1
        vrlw %v2,%v2,%v0
        blr
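
As a rough check of why this sequence reverses the bytes of each word, here is
a scalar C model of a single 32-bit lane. It is only a sketch, not GCC's code,
and the helper names are made up for illustration. It assumes vrlh and vrlw
take the rotate count from the low 4 and low 5 bits of each element, which is
why splatting -16 acts as a rotate by 16 even though the vspltisw immediate
only covers -16..15.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one vrlh lane: rotate left by the low 4 bits of count. */
static uint16_t model_vrlh(uint16_t x, unsigned count)
{
    unsigned n = count & 15u;
    uint32_t wide = x;
    return (uint16_t)((wide << n) | (wide >> ((16u - n) & 15u)));
}

/* Hypothetical model of one vrlw lane: rotate left by the low 5 bits of count. */
static uint32_t model_vrlw(uint32_t x, unsigned count)
{
    unsigned n = count & 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* One word of the V4SI sequence: vrlh by 8, then vrlw by (-16 & 31) == 16. */
static uint32_t revb_word_model(uint32_t w)
{
    uint16_t hi = model_vrlh((uint16_t)(w >> 16), 8);
    uint16_t lo = model_vrlh((uint16_t)w, 8);
    uint32_t swapped = ((uint32_t)hi << 16) | lo;
    return model_vrlw(swapped, (unsigned)-16);
}

/* Small self-check of the model on one sample value. */
void revb_word_selfcheck(void)
{
    assert(revb_word_model(0x11223344u) == 0x44332211u);
}
```

Swapping the two bytes of each halfword and then swapping the two halfwords of
each word leaves the four bytes in reversed order.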


With a small benchmark, the run time improved from 7.322s to 2.445s because the
load instruction is replaced.

But for V2DI, we don't have "vspltisd" to splat {32,32} into a vector register
before Power9, so is lvx still required?

vector unsigned long long revb_pwr7_l(vector unsigned long long a)
{
     return vec_rl(a, vec_splats((unsigned long long)32));
} 

generates:

revb_pwr7_l:
.LFB1:
        .cfi_startproc
.LCF1:
0:      addis 2,12,.TOC.-.LCF1@ha
        addi 2,2,.TOC.-.LCF1@l
        .localentry     revb_pwr7_l,.-revb_pwr7_l
        addis %r9,%r2,.LC0@toc@ha
        addi %r9,%r9,.LC0@toc@l
        lvx %v0,0,%r9
        vrld %v2,%v2,%v0
        blr
.LC0:
        .quad   32
        .quad   32
        .align 4
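
For context, here is a scalar sketch of how the doubleword rotate would complete
a V2DI byte reversal. It assumes the halfword and word rotates shown for V4SI
are applied first, and that vrld takes its count from the low 6 bits of each
element. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-reverse one 32-bit word; stands in for the vrlh 8 + vrlw 16 pair. */
static uint32_t revb32_model(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
           ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* Hypothetical model of one vrld lane: rotate left by the low 6 bits of count. */
static uint64_t model_vrld(uint64_t x, unsigned count)
{
    unsigned n = count & 63u;
    return (x << n) | (x >> ((64u - n) & 63u));
}

/* Reverse the bytes of each word, then swap the two words with a rotate by 32. */
static uint64_t revb64_model(uint64_t d)
{
    uint64_t hi = revb32_model((uint32_t)(d >> 32));
    uint64_t lo = revb32_model((uint32_t)d);
    return model_vrld((hi << 32) | lo, 32);
}

/* Small self-check of the model on one sample value. */
void revb64_selfcheck(void)
{
    assert(revb64_model(0x1122334455667788ull) == 0x8877665544332211ull);
}
```

The rotate count of 32 is the constant that the listing above loads from .LC0
with lvx.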
