https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122221

            Bug ID: 122221
           Summary: riscv: reassoc spoils widening opportunity
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
  Target Milestone: ---
            Target: riscv

Split off from PR121959.

For the following code

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

void
lul (int *restrict res, uint8_t *restrict a, uint8_t *restrict b, int n)
{
  for (int i = 0; i < n; i++)
      res[i] = (a[i] - b[i]) + 1;
}

we generate with -march=rv64gcv (eliding scalar code):
        vsetvli a5,a3,e16,mf2,ta,ma
        vle8.v  v2,0(a1)
        vle8.v  v3,0(a2)
        vzext.vf2       v1,v2
        vadd.vi v1,v1,1
        vsetvli zero,zero,e8,mf4,ta,ma
        vwsubu.wv       v2,v1,v3
        vsetvli zero,zero,e32,m1,ta,ma
        vsext.vf2       v1,v2
        vse32.v v1,0(a0)

(7 vector insns without vsetvl)

Ideally we'd have:
        vmv.v.i v4,1 # before the loop
        vsetvli a5,a3,e16,mf2,ta,ma
        vle8.v  v2,0(a1)
        vle8.v  v3,0(a2)
        vwsubu.vv v1,v2,v3
        vsetvli zero,zero,e32,m1,ta,ma
        vwadd.vv v1,v1,v4              # Could also be vwadd.vx v1,v1,t0
        vse32.v v1,0(a0)

(5 vector insns without vsetvl)

The culprit is the add that could be a widening add.  Widening adds usually
have lower throughput, though, as they still need to do the same amount of
work.  So it's not really clear cut, but IMHO worth exploring.

In .vect

  vect__3.8_129 = .MASK_LEN_LOAD (vectp_a.6_126, 8B, { -1, ... }, _128(D),
_150, 0);
  vect_patt_13.9_130 = (vector([4,4]) unsigned short) vect__3.8_129;
  vect__6.12_135 = .MASK_LEN_LOAD (vectp_b.10_132, 8B, { -1, ... }, _134(D),
_150, 0);
  vect_patt_47.13_136 = (vector([4,4]) unsigned short) vect__6.12_135;
  vect_patt_49.14_137 = vect_patt_13.9_130 - vect_patt_47.13_136;
  vect_patt_50.15_138 = VIEW_CONVERT_EXPR<vector([4,4]) signed
short>(vect_patt_49.14_137);
  vect_patt_53.17_140 = vect_patt_49.14_137 + { 1, ... };
  vect_patt_54.18_141 = VIEW_CONVERT_EXPR<vector([4,4]) signed
short>(vect_patt_53.17_140);
  vect_patt_55.19_142 = (vector([4,4]) int) vect_patt_54.18_141;

we still add 1 to the result of the subtraction.

During reassoc we break up the subtraction into a + -b and
reorder a + -b + 1 into a + 1 + -b.

This then results in
  aa = add_widen_both (a, 1)
  diff = sub_widen_single (aa, b)
  extend (diff)

which basically wastes the widening operations by using them on
two individual operands instead of two steps.

Not sure how to prevent reassoc from doing that, short of recognizing
a widening sub right away.  IIRC we don't currently support that because
the widen-arithmetic IFNs expect even/odd or lo/hi operand layout rather
than our direct scheme.

Without reassoc we have:
        vsetvli a5,a3,e8,mf4,ta,ma
        vle8.v  v2,0(a2)
        vle8.v  v3,0(a1)
        vwsubu.vv       v1,v3,v2
        vsetvli zero,zero,e16,mf2,ta,ma
        vadd.vi v1,v1,1
        vsetvli zero,zero,e32,m1,ta,ma
        vsext.vf2       v2,v1
        vse32.v v2,0(a0)

Still, we do not transform vadd + vsext/vzext into vwadd (only vsext/vzext +
vadd).  Maybe we should?  What complicates things slightly is that we don't
have an immediate variant of the widening insn, so we would need to use the
vector-vector one.
