https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123836

            Bug ID: 123836
           Summary: riscv: Inefficient code for simple vector reduction.
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
                CC: bergner at gcc dot gnu.org, kito at gcc dot gnu.org, law at gcc dot gnu.org
  Target Milestone: ---
            Target: riscv

Split off from the ABI PR123824.

We generate non-optimal code for

typedef int v4si __attribute__ ((vector_size (16)));
int test (int accumulator, v4si v1, v4si v2, v4si v3)
{
  accumulator &= v3[0] & v3[1] & v3[2] & v3[3];
  return accumulator;
}

        addi    sp,sp,-16
        vsetivli        zero,4,e32,m1,ta,ma
        sd      a5,0(sp)
        sd      a6,8(sp)
        vle32.v v2,0(sp)
        li      a4,-1
        vmv.s.x v1,a4
        addi    sp,sp,16
        vredand.vs      v1,v2,v1
        vmv.x.s a5,v1
        and     a0,a0,a5
        sext.w  a0,a0
        jr      ra

There are actually three different issues here:

First, I'm not entirely happy with "partial" scalar stores followed by a wider
vector load from the same address.  That can cause store-to-load forwarding
issues on many uarchs.  If the forwarding can handle such cases, it might be
faster than the alternative

        vmv.v.x         v2,a5
        vslide1down.vx  v2,v2,a6

which pays the GPR->VR penalty twice.
Of course the proper vector ABI/calling convention solves this but I think we
should still be able to select between both variants.
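
As a rough illustration of why the two variants are interchangeable, here is a
scalar C model of both.  It is only a sketch: the function names are made up,
and the slide variant assumes the vmv.v.x/vslide1down.vx pair runs with e64 and
vl=2, which the snippet above does not show.

```c
#include <stdint.h>
#include <string.h>

/* Variant 1: two 64-bit scalar stores, then one 128-bit vector load.  */
void via_stack (uint64_t a5, uint64_t a6, int32_t out[4])
{
  unsigned char slot[16];
  memcpy (slot, &a5, 8);      /* sd a5,0(sp) */
  memcpy (slot + 8, &a6, 8);  /* sd a6,8(sp) */
  memcpy (out, slot, 16);     /* vle32.v v2,0(sp) */
}

/* Variant 2: broadcast a5, then slide down by one and insert a6 on top.
   Assumes e64 with vl=2 (hypothetical, not shown in the report).  */
void via_slide (uint64_t a5, uint64_t a6, int32_t out[4])
{
  uint64_t v2[2] = { a5, a5 };  /* vmv.v.x v2,a5 */
  v2[0] = v2[1];                /* vslide1down.vx: element i takes element i+1 */
  v2[1] = a6;                   /* ... and the last element takes a6 */
  memcpy (out, v2, 16);
}
```

Both functions leave the same 16 bytes in out, so the choice between them is
purely a question of cost on the given uarch.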

Next, the reduction itself:

        li      a4,-1
        vmv.s.x v1,a4
        vredand.vs      v1,v2,v1

Here we move the neutral element -1 to v1, paying the GPR->VR penalty.

clang just does:

        vredand.vs      v1, v1, v1

Perhaps that can just be a match or combine pattern?
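
The reason the clang form is valid, as far as I can tell, is that AND is
idempotent: seeding the reduction with element 0 of the source vector just ANDs
that element in twice.  A scalar C model of both seeds (hypothetical function
names, for illustration only):

```c
#include <stdint.h>

/* Current GCC output: seed with the neutral element of AND (all ones).  */
int32_t reduce_and_neutral_seed (const int32_t v[4])
{
  int32_t acc = -1;
  for (int i = 0; i < 4; i++)
    acc &= v[i];
  return acc;
}

/* vredand.vs with vs1 == vs2: the seed is element 0 of the source itself.  */
int32_t reduce_and_self_seed (const int32_t v[4])
{
  int32_t acc = v[0];
  for (int i = 0; i < 4; i++)
    acc &= v[i];
  return acc;
}
```

The same reasoning should hold for the other idempotent reductions (or, min,
max) but not for sum or xor, where the extra element would change the result.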

Finally, the return value:

        vmv.x.s a5,v1
        and     a0,a0,a5
        sext.w  a0,a0

We don't need the sign extend here.  I believe this is a known problem but
I don't have the PR ready.  We actually have patterns for eliding a sign
extension of a vmv.x.s result, but not with the and in between.
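
To spell out why the sext.w is redundant: a0 arrives sign-extended as an int
argument on RV64, vmv.x.s sign-extends when SEW < XLEN, and the bitwise AND of
two sign-extended 32-bit values is itself sign-extended.  A small C sketch of
that last step (helper names are hypothetical):

```c
#include <stdint.h>

/* Model of sext.w: sign-extend the low 32 bits of a 64-bit register.  */
int64_t sext_w (int64_t x)
{
  uint64_t low = (uint64_t) x & 0xffffffffu;
  return (int64_t) (low ^ 0x80000000u) - (int64_t) 0x80000000;
}

/* Returns 1 if ANDing two sign-extended 32-bit values leaves a result
   that sext.w would not change.  */
int and_is_already_sign_extended (int32_t a, int32_t b)
{
  int64_t ra = a;      /* a0: int argument, sign-extended */
  int64_t rb = b;      /* a5: result of vmv.x.s, sign-extended */
  int64_t r = ra & rb; /* and a0,a0,a5 */
  return r == sext_w (r);
}
```

Bits 63..31 of each input are all copies of its bit 31, so the same holds for
the AND of the two.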
