https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123836
Bug ID: 123836
Summary: riscv: Inefficient code for simple vector reduction.
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rdapp at gcc dot gnu.org
CC: bergner at gcc dot gnu.org, kito at gcc dot gnu.org, law at gcc dot gnu.org
Target Milestone: ---
Target: riscv
Split off from the ABI PR123824.
We generate suboptimal code for:

typedef int v4si __attribute__ ((vector_size (16)));

int test (int accumulator, v4si v1, v4si v2, v4si v3)
{
  accumulator &= v3[0] & v3[1] & v3[2] & v3[3];
  return accumulator;
}
        addi    sp,sp,-16
        vsetivli        zero,4,e32,m1,ta,ma
        sd      a5,0(sp)
        sd      a6,8(sp)
        vle32.v v2,0(sp)
        li      a4,-1
        vmv.s.x v1,a4
        addi    sp,sp,16
        vredand.vs      v1,v2,v1
        vmv.x.s a5,v1
        and     a0,a0,a5
        sext.w  a0,a0
        jr      ra
There are actually three different issues here:
First, I'm not entirely happy with "partial" scalar stores followed by a vector
load of larger size from the same address. That can cause store-to-load
forwarding stalls on many uarchs. On a uarch whose forwarding can handle such
cases the stack round trip might be faster than the alternative

        vmv.v.x v2,a5
        vslide1down.vx v2,v2,a6

which pays the GPR->VR penalty twice.
Of course the proper vector ABI/calling convention solves this, but I think we
should still be able to select between the two variants.
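
For reference, a minimal sketch of what the full slide-based variant might look
like here (assuming we build the vector at e64 from the two GPR halves a5/a6
and then switch back to e32 for the reduction; the exact vsetvli configuration
is an assumption, not what GCC emits today):

        vsetivli zero,2,e64,m1,ta,ma
        vmv.v.x v2,a5              # broadcast the low 64-bit half
        vslide1down.vx v2,v2,a6    # shift in the high 64-bit half
        vsetivli zero,4,e32,m1,ta,ma
        # ... reduction on v2 as before

This avoids the stack round trip entirely; whether it wins depends on the
relative cost of GPR->VR moves versus the forwarding stall on a given uarch.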
Next, the reduction itself:
        li      a4,-1
        vmv.s.x v1,a4
        vredand.vs      v1,v2,v1
Here we move the neutral element -1 to v1, paying the GPR->VR penalty.
clang just does:
        vredand.vs v1, v1, v1
Perhaps that can just be a match or combine pattern?
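
To spell out why clang's trick is valid: AND is idempotent (x & x == x), so
the first element of the source vector can itself serve as the scalar operand
and the -1 splat is unnecessary. With the register allocation from the code
above, a sketch of the shorter sequence:

        vredand.vs v1,v2,v2    # v1[0] = v2[0] & (v2[0] & v2[1] & v2[2] & v2[3])
                               #       = v2[0] & v2[1] & v2[2] & v2[3]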
Finally, the return value:
        vmv.x.s a5,v1
        and     a0,a0,a5
        sext.w  a0,a0
We don't need the sign extension here. I believe this is a known problem but
I don't have the PR number at hand. We actually have patterns for eliding the
sign extension of a vmv.x.s result, but not with the "and" in between.
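
To illustrate why the final sext.w is redundant (relying on two facts: vmv.x.s
sign-extends an SEW < XLEN element to XLEN, and the incoming int argument a0 is
already sign-extended per the RV64 calling convention):

        vmv.x.s a5,v1      # e32 element, sign-extended to XLEN
        and     a0,a0,a5   # AND of two sign-extended values is itself
                           # sign-extended
        sext.w  a0,a0      # hence redundant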