https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120687
Bug ID: 120687
Summary: RISC-V: very poor vector code gen for LMbench bw_mem test case
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bergner at gcc dot gnu.org
Target Milestone: ---
We've seen some VERY poor vector code gen for the bw_mem test case in LMbench
for certain numbers of loads in the loop. I have extracted a simple test case
that shows the issue. For smallish numbers of loads, I get what I would
generally expect:
linux%~$ cat bw_mem_8.c
int
frd (int *p, int *lastone)
{
int sum = 0;
for (; p <= lastone; p += 8)
sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7];
return sum;
}
linux%~$ riscv64-unknown-linux-gnu-gcc -S -O3 -march=rv64gcv bw_mem_8.c
linux%~$ cat bw_mem_8.s
[snip] looking at just main loop body...
.L3:
vsetvli a5,a4,e32,m1,tu,ma
vlseg8e32.v v8,(a0)
slli a3,a5,5
sub a4,a4,a5
add a0,a0,a3
vadd.vv v1,v9,v8
vadd.vv v1,v1,v10
vadd.vv v1,v1,v11
vadd.vv v1,v1,v12
vadd.vv v1,v1,v13
vadd.vv v1,v1,v14
vadd.vv v1,v1,v15
vadd.vv v2,v1,v2
bne a4,zero,.L3
[snip]
If I double the number of loads and update the loop increment to match, I see:
linux%~$ cat bw_mem_16.c
int
frd (int *p, int *lastone)
{
int sum = 0;
for (; p <= lastone; p += 16)
sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7]
+ p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15];
return sum;
}
linux%~$ riscv64-unknown-linux-gnu-gcc -S -O3 -march=rv64gcv bw_mem_16.c
linux%~$ cat bw_mem_16.s
[snip] looking at just main loop body...
.L4:
add s4,a4,a3
addi s2,s4,16
vle32.v v11,0(s2)
vmv1r.v v0,v2
vle32.v v13,0(s4)
addi s1,s4,32
vle32.v v19,0(s1)
addi s0,s4,48
vcompress.vm v21,v11,v0
vmv1r.v v0,v1
vle32.v v10,0(s0)
addi s5,s4,64
vcompress.vm v20,v11,v0
vmv1r.v v0,v2
vle32.v v18,0(s5)
addi t2,s4,80
vcompress.vm v11,v13,v0
vmv1r.v v0,v1
vle32.v v9,0(t2)
vslideup.vi v11,v21,2
vcompress.vm v15,v13,v0
vmv1r.v v0,v2
addi t0,s4,96
vslideup.vi v15,v20,2
vcompress.vm v13,v19,v0
vmv1r.v v0,v1
vle32.v v14,0(t0)
addi t6,s4,112
vcompress.vm v21,v19,v0
vmv1r.v v0,v2
vle32.v v8,0(t6)
addi t5,s4,128
vcompress.vm v20,v10,v0
vmv1r.v v0,v1
vle32.v v17,0(t5)
[snip] ...this goes on for many pages!
vcompress.vm v8,v7,v0
vslideup.vi v10,v9,2
vcompress.vm v7,v6,v0
vadd.vv v3,v3,v12
vcompress.vm v6,v5,v0
vslideup.vi v8,v7,2
vadd.vv v3,v3,v10
vcompress.vm v7,v4,v0
vcompress.vm v5,v24,v0
vadd.vv v3,v3,v8
vslideup.vi v6,v7,2
vcompress.vm v7,v22,v0
vcompress.vm v4,v20,v0
vadd.vv v3,v3,v6
vslideup.vi v5,v7,2
vcompress.vm v6,v18,v0
vadd.vv v3,v3,v5
vslideup.vi v4,v6,2
vadd.vv v3,v3,v4
vle32.v v4,0(sp)
vadd.vv v3,v4,v3
vse32.v v3,0(sp)
bne a3,s3,.L4
[snip] end of main loop.
Counting the insns in the loop, I'm seeing over 20 times as many instructions
as in the 8-element test case!
The original bw_mem test case in LMbench does 128 loads within the loop, which
exacerbates the issue even further.
I'm marking this as a target bug for now until we know more...