https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68494
--- Comment #2 from Michael Collison <michael.collison at linaro dot org> ---
Sorry, here is the updated test case.

#define NTAPS 4
short taps[NTAPS];

void fir_t5(int len, short * __restrict p, short *__restrict x, short *__restrict taps)
{
  len = len & ~31;
  for (int i = 0; i < len; i++)
  {
    int tmp = 0;
    for (int j = 0; j < NTAPS; j++)
    {
      tmp += x[i - j] * taps[j];
    }
    p[i] = tmp;
  }
}

--------------------------------------------------------------------------------

We currently generate a vdup of the scalar taps[j] in the inner loop. Ideally we
would not use the vdup and would instead use a vmul by lane.
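
For illustration only, the by-lane form could be sketched by hand with NEON intrinsics roughly as follows. This is not GCC's output, just one way the desired code shape might look: the taps are loaded once into a vector register and each multiply-accumulate selects its tap by lane, so no vdup is needed. The names fir_ref, fir_lane_sketch and sketch_matches_reference are hypothetical, the unrolled body assumes NTAPS is 4, the widening vmull/vmlal by-lane forms are my choice to match the int accumulator, and on targets without NEON the sketch simply falls back to the scalar loop.

```c
#include <assert.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#define NTAPS 4

/* Scalar reference: the same loop nest as the test case. */
void fir_ref(int len, short *__restrict p, const short *__restrict x,
             const short *__restrict taps)
{
  len = len & ~31;
  for (int i = 0; i < len; i++)
  {
    int tmp = 0;
    for (int j = 0; j < NTAPS; j++)
      tmp += x[i - j] * taps[j];
    p[i] = (short)tmp;
  }
}

/* Hypothetical hand-written version: multiply by a lane of the tap vector. */
void fir_lane_sketch(int len, short *__restrict p, const short *__restrict x,
                     const short *__restrict taps)
{
  assert(NTAPS == 4); /* the four unrolled lanes below assume exactly 4 taps */
  len = len & ~31;
#if defined(__ARM_NEON)
  int16x4_t t = vld1_s16(taps); /* all taps loaded once, outside the loop */
  for (int i = 0; i < len; i += 4)
  {
    /* Widening multiply(-accumulate) of four samples by one tap lane each. */
    int32x4_t acc = vmull_lane_s16(vld1_s16(&x[i]), t, 0);
    acc = vmlal_lane_s16(acc, vld1_s16(&x[i - 1]), t, 1);
    acc = vmlal_lane_s16(acc, vld1_s16(&x[i - 2]), t, 2);
    acc = vmlal_lane_s16(acc, vld1_s16(&x[i - 3]), t, 3);
    vst1_s16(&p[i], vmovn_s32(acc)); /* truncate back to 16 bits */
  }
#else
  fir_ref(len, p, x, taps); /* no NEON available: scalar fallback */
#endif
}

/* Runs both versions on small made-up data; returns 1 if outputs agree. */
int sketch_matches_reference(void)
{
  enum { LEN = 64, PAD = NTAPS - 1 };
  static short buf[PAD + LEN]; /* PAD leading samples cover x[i - j], i < j */
  static short out_ref[LEN], out_lane[LEN];
  const short t[NTAPS] = { 3, -1, 4, 2 };

  for (int i = 0; i < PAD + LEN; i++)
    buf[i] = (short)((i * 7) % 23 - 11);

  fir_ref(LEN, out_ref, buf + PAD, t);
  fir_lane_sketch(LEN, out_lane, buf + PAD, t);

  for (int i = 0; i < LEN; i++)
    if (out_ref[i] != out_lane[i])
      return 0;
  return 1;
}
```

Note that x must point at least NTAPS - 1 elements into its buffer, since the loop reads x[i - j] for i < j; the check above pads the input accordingly.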