https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68494

--- Comment #2 from Michael Collison <michael.collison at linaro dot org> ---
Sorry, here is the updated test case.

#define NTAPS 4

short taps[NTAPS];

void fir_t5(int len, short *__restrict p, short *__restrict x,
            short *__restrict taps)
{
  len = len & ~31;
  for (int i = 0; i < len; i++)
    {
      int tmp = 0;
      for (int j = 0; j < NTAPS; j++)
        {
          tmp += x[i - j] * taps[j];
        }

      p[i] = tmp;
    }
}

--------------------------------------------------------------------------------

We currently generate a vdup of the scalar taps[j] in the inner loop. Ideally
we would not emit the vdup and would instead use a vmul by lane, taking
taps[j] directly from a lane of a vector register that holds the taps.
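
To make the difference concrete, here is a hand-written sketch using
arm_neon.h intrinsics. It is not the vectorizer's actual output. The helper
names fir4_ref, fir4_dup and fir4_lane are made up for this example, each
computes four consecutive outputs p[i..i+3], and the choice of the widening
multiply-accumulate (vmlal) is an assumption about how the int accumulation
would be vectorized. On targets without NEON the two vector helpers fall back
to the scalar reference so the file still compiles.

```c
#include <stdint.h>

#define NTAPS 4

/* Scalar reference: same arithmetic as the inner loop of fir_t5,
   applied to the four outputs p[i] .. p[i+3].  */
static void fir4_ref(int16_t *p, const int16_t *x, const int16_t *taps, int i)
{
  for (int k = 0; k < 4; k++)
    {
      int tmp = 0;
      for (int j = 0; j < NTAPS; j++)
        tmp += x[i + k - j] * taps[j];
      p[i + k] = (int16_t) tmp;
    }
}

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Roughly the current shape: each tap is broadcast with a vdup
   before the multiply-accumulate.  */
static void fir4_dup(int16_t *p, const int16_t *x, const int16_t *taps, int i)
{
  int32x4_t acc = vdupq_n_s32(0);
  for (int j = 0; j < NTAPS; j++)
    acc = vmlal_s16(acc, vld1_s16(&x[i - j]), vdup_n_s16(taps[j]));
  vst1_s16(&p[i], vmovn_s32(acc));
}

/* Desired shape: load the taps once and multiply by lane, so no
   vdup is needed.  The lane index must be a compile-time constant,
   hence the manual unrolling.  */
static void fir4_lane(int16_t *p, const int16_t *x, const int16_t *taps, int i)
{
  int16x4_t t = vld1_s16(taps);
  int32x4_t acc = vdupq_n_s32(0);
  acc = vmlal_lane_s16(acc, vld1_s16(&x[i]),     t, 0);
  acc = vmlal_lane_s16(acc, vld1_s16(&x[i - 1]), t, 1);
  acc = vmlal_lane_s16(acc, vld1_s16(&x[i - 2]), t, 2);
  acc = vmlal_lane_s16(acc, vld1_s16(&x[i - 3]), t, 3);
  vst1_s16(&p[i], vmovn_s32(acc));
}
#else
/* Non-NEON fallback (assumption for portability only).  */
static void fir4_dup(int16_t *p, const int16_t *x, const int16_t *taps, int i)
{
  fir4_ref(p, x, taps, i);
}

static void fir4_lane(int16_t *p, const int16_t *x, const int16_t *taps, int i)
{
  fir4_ref(p, x, taps, i);
}
#endif
```

Both vector helpers should produce the same results as fir4_ref. The only
intended difference is that fir4_lane keeps the taps in one register and
selects each tap by lane instead of broadcasting it first.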
