https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122152

Lin Li <lilin at masscore dot cn> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lilin at masscore dot cn

--- Comment #1 from Lin Li <lilin at masscore dot cn> ---
This is indeed a very good idea. However, have you noticed that with this
change certain scenarios generate a lot of vcompress instructions, which
makes performance extremely poor? I encountered this problem on both the
SG2044 and my own RISC-V platform.

For 462.libquantum, doing this results in a performance drop of around 60%.
The fundamental reason is that vcompress executes too slowly.

.L77:
        sub     a5,a5,a7
        vsetvli zero,a5,e64,m2,ta,ma
        vle64.v v2,0(t1)
        vsetvli zero,a7,e64,m2,ta,ma
        vle64.v v10,0(a1)
        vmv1r.v v0,v6
        vsetivli        zero,4,e64,m2,ta,ma
        mv      a5,a4
        vcompress.vm    v8,v2,v0
        addi    t1,t1,64
        vcompress.vm    v2,v10,v0
        vslideup.vi     v2,v8,2
        vand.vv v0,v2,v4
        vxor.vv v2,v2,v12
        vmseq.vv        v0,v0,v4
        bleu    a4,t4,.L78
        li      a5,4
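
To make the cost easier to see, here is a minimal scalar sketch in C of what
the vcompress/vslideup sequence in .L77 computes. It is only a model under
stated assumptions: the helper names (model_vcompress, model_vslideup,
deinterleave4) are hypothetical, and the contents of the mask in v6 are not
visible in the listing, so the mask is passed in as a parameter. With a mask
that selects every other 64-bit element, the sequence gathers four stride-2
elements out of eight loaded ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of "vcompress.vm vd, vs2, vm": pack the elements of src
   whose mask bit is set into the low elements of dst, preserving order.
   Returns the number of elements written. Elements of dst past that
   count are left untouched in this model. */
static size_t model_vcompress(uint64_t *dst, const uint64_t *src,
                              const uint8_t *mask, size_t vl)
{
    size_t out = 0;
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[out++] = src[i];
    return out;
}

/* Scalar model of "vslideup.vi vd, vs2, offset": dst[i + offset] = src[i].
   dst[0 .. offset-1] keeps its previous contents. */
static void model_vslideup(uint64_t *dst, const uint64_t *src,
                           size_t offset, size_t vl)
{
    for (size_t i = offset; i < vl; i++)
        dst[i] = src[i - offset];
}

/* Mirrors the three vector instructions in .L77 with vl = 4, e64.
   "first" stands for the data loaded into v10, "second" for the data
   loaded into v2. The mask is an assumption (v6 is not shown). */
static void deinterleave4(uint64_t out[4], const uint64_t first[4],
                          const uint64_t second[4], const uint8_t mask[4])
{
    uint64_t tmp[4] = {0};
    model_vcompress(tmp, second, mask, 4); /* vcompress.vm v8,v2,v0  */
    model_vcompress(out, first, mask, 4);  /* vcompress.vm v2,v10,v0 */
    model_vslideup(out, tmp, 2, 4);        /* vslideup.vi  v2,v8,2   */
}
```

Each result vector of four elements therefore needs two vcompress
instructions plus a vslideup, so a slow vcompress sits directly on the
critical path of the loop.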


For the example you provided, GCC trunk can indeed use strided loads without
the option '-mno-autovec-segment', but it still generates vcompress
instructions. For 462.libquantum, the options '-march=rv64gcv_zvl*b
-mrvv-vector-bits=zvl -mno-autovec-segment -mrvv-max-lmul=m2/dynamic'
reproduce the performance issue I mentioned
(https://godbolt.org/z/eWzT99n5T).

So I think this might not be entirely due to the RISC-V machines I'm using?
