https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122277
--- Comment #6 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Yes, it looks like the issue is that the single-lane SLP for a 4-element VLS
mode is worse than what we had before, when we couldn't handle the conversion.
Doesn't that also affect other targets, i.e. the VLS ones?
A slightly reduced example:
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;
#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}
inline uint32_t abs2( uint32_t a )
{
    return (a+a)^a;
}
int tmp[4][4];
int
x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t a0, a1, a2, a3;
    int sum = 0;
    for( int i = 0; i < 4; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }
    return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}
Build options: -O3 -march=rv64gcv bla.c -mno-vector-strict-align --param=riscv-autovec-mode=V4QI
With "int tmp[4][4]" the "new" SLP reduction succeeds and we get an SLP tree
that starts with the duplicated sum:
bla.c:28:23: note: SLP size 13 vs. limit 36.
bla.c:28:23: note: Final SLP tree for instance 0x2de1eb60:
bla.c:28:23: note: node 0x2de8cb30 (max_nunits=1, refcnt=2) vector(4) int
cycle 0, link 0
bla.c:28:23: note: op template: sum_35 = (int) _12;
bla.c:28:23: note: [l] stmt 0 sum_35 = (int) _12;
bla.c:28:23: note: [l] stmt 1 sum_35 = (int) _12;
bla.c:28:23: note: [l] stmt 2 sum_35 = (int) _12;
bla.c:28:23: note: [l] stmt 3 sum_35 = (int) _12;
[...]
bla.c:28:23: note: Decided to SLP 1 instances. Unrolling factor 1
With "uint32_t tmp[4][4]" (or before the patch) we get:
bla.c:28:23: note: SLP size 35 vs. limit 40.
bla.c:28:23: note: Final SLP tree for instance 0x36fa7c10:
bla.c:28:23: note: node 0x37015460 (max_nunits=4, refcnt=3) vector(4) int
cycle 1, link 0
bla.c:28:23: note: op template: sum_39 = (int) _16;
bla.c:28:23: note: [l] stmt 0 sum_39 = (int) _16;
bla.c:28:23: note: children 0x37015510
[...]
bla.c:28:23: note: Decided to SLP 1 instances. Unrolling factor 4
Of course, it's difficult to anticipate at the start of discovery that the
result will get worse, but here the tree looks inferior.
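To compare the two cases without editing the file by hand, the element type of
tmp could be made a compile-time switch. The following is only a sketch of the
reduced example above: TMP_T is a hypothetical macro name that is not part of
the original test case, and abs2 is made static here purely so the file also
links without optimization (assumed not to matter for the vectorizer at -O3).

```c
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

/* TMP_T is a hypothetical knob, not part of the original test case.
   The default is the "int" variant; -DTMP_T=uint32_t selects the
   unsigned variant.  */
#ifndef TMP_T
#define TMP_T int
#endif

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

/* Same body as in the reduced example; static so that an out-of-line
   copy exists when the call is not inlined.  */
static inline uint32_t abs2( uint32_t a )
{
    return (a+a)^a;
}

/* The only line that differs between the two cases.  */
TMP_T tmp[4][4];

int
x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t a0, a1, a2, a3;
    int sum = 0;
    for( int i = 0; i < 4; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }
    return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}
```

With the build options given above, the default build would correspond to the
"int" case, and appending -DTMP_T=uint32_t to the same command line would give
the unsigned case. Adding something like -fdump-tree-vect-details should then
allow the two SLP trees to be diffed directly.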