https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122277
--- Comment #7 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 15 Oct 2025, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122277
>
> --- Comment #6 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> Yes, it looks like the issue is that the single-lane SLP for a 4-element VLS
> mode is worse than before when we couldn't handle the conversion.  Doesn't that
> also affect other, VLS, targets?
>
> A slightly reduced example:
>
> typedef unsigned char uint8_t;
> typedef unsigned short uint16_t;
> typedef unsigned int uint32_t;
>
> #define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
>     int t0 = s0 + s1;\
>     int t1 = s0 - s1;\
>     int t2 = s2 + s3;\
>     int t3 = s2 - s3;\
>     d0 = t0 + t2;\
>     d2 = t0 - t2;\
>     d1 = t1 + t3;\
>     d3 = t1 - t3;\
> }
>
> inline uint32_t abs2( uint32_t a )
> {
>     return (a+a)^a;
> }
>
> int tmp[4][4];
>
> int
> x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
> {
>     uint32_t a0, a1, a2, a3;
>     int sum = 0;
>     for( int i = 0; i < 4; i++ )
>     {
>         HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
>         sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
>     }
>
>     return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
> }
>
> Build options: -O3 -march=rv64gcv bla.c -mno-vector-strict-align --param=riscv-autovec-mode=V4QI
>
> With "int tmp[4][4]" the "new" SLP reduction succeeds and we get an SLP tree
> that starts with the duplicated sum:
>
> bla.c:28:23: note: SLP size 13 vs. limit 36.
> bla.c:28:23: note: Final SLP tree for instance 0x2de1eb60:
> bla.c:28:23: note: node 0x2de8cb30 (max_nunits=1, refcnt=2) vector(4) int cycle 0, link 0
> bla.c:28:23: note: op template: sum_35 = (int) _12;
> bla.c:28:23: note: [l] stmt 0 sum_35 = (int) _12;
> bla.c:28:23: note: [l] stmt 1 sum_35 = (int) _12;
> bla.c:28:23: note: [l] stmt 2 sum_35 = (int) _12;
> bla.c:28:23: note: [l] stmt 3 sum_35 = (int) _12;
> [...]
> bla.c:28:23: note: Decided to SLP 1 instances. Unrolling factor 1
>
> With "uint tmp[4][4]" (or before the patch) we get:
> bla.c:28:23: note: SLP size 35 vs. limit 40.
> bla.c:28:23: note: Final SLP tree for instance 0x36fa7c10:
> bla.c:28:23: note: node 0x37015460 (max_nunits=4, refcnt=3) vector(4) int cycle 1, link 0
> bla.c:28:23: note: op template: sum_39 = (int) _16;
> bla.c:28:23: note: [l] stmt 0 sum_39 = (int) _16;
> bla.c:28:23: note: children 0x37015510
> [...]
> bla.c:28:23: note: Decided to SLP 1 instances. Unrolling factor 4
>
> Of course it's difficult to anticipate that it'll get worse when starting the
> discovery but here the tree looks inferior.

It's an artifact of how we represent this.  The advantage of reduction
chains is a reduction in unroll factor and in the effective chain length
and, as with anything SLP-like, it can avoid the cost of interleaving for
loads.
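
To make the unroll-factor and chain-length trade-off concrete, here is a rough scalar model of the two shapes discussed above, using abs2 and the butterfly from the reduced example.  It is only a sketch under assumptions: the function names (sum_scalar, sum_chain_lanes, sum_single_lane_vf4, check_all_agree) are made up for illustration, the "vectors" are plain 4-element arrays, the accumulator is kept unsigned so the reassociation is well defined, and none of this is meant to reflect what GCC actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* abs2 and the butterfly follow the reduced example quoted in the thread. */
static uint32_t abs2(uint32_t a) { return (a + a) ^ a; }

static void hadamard4(uint32_t d[4], int s0, int s1, int s2, int s3)
{
    int t0 = s0 + s1;
    int t1 = s0 - s1;
    int t2 = s2 + s3;
    int t3 = s2 - s3;
    d[0] = (uint32_t)(t0 + t2);
    d[1] = (uint32_t)(t1 + t3);
    d[2] = (uint32_t)(t0 - t2);
    d[3] = (uint32_t)(t1 - t3);
}

/* Reference: the scalar loop with a chain of four adds per iteration. */
uint32_t sum_scalar(int t[4][4])
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t a[4];
        hadamard4(a, t[0][i], t[1][i], t[2][i], t[3][i]);
        sum += abs2(a[0]) + abs2(a[1]) + abs2(a[2]) + abs2(a[3]);
    }
    return sum;
}

/* Model of a reduction chain as lanes: the four chain links of ONE scalar
   iteration fill the four lanes, so the unroll factor stays 1 and the
   chain shrinks to a single (modeled) vector add per iteration. */
uint32_t sum_chain_lanes(int t[4][4])
{
    uint32_t acc[4] = {0, 0, 0, 0};
    for (int i = 0; i < 4; i++) {
        uint32_t a[4];
        hadamard4(a, t[0][i], t[1][i], t[2][i], t[3][i]);
        for (int lane = 0; lane < 4; lane++)
            acc[lane] += abs2(a[lane]);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Model of single-lane SLP with unroll factor 4: the lanes are four
   consecutive scalar iterations and the chain of four adds remains. */
uint32_t sum_single_lane_vf4(int t[4][4])
{
    uint32_t a[4][4]; /* a[k][lane]: chain link k, scalar iteration lane */
    for (int lane = 0; lane < 4; lane++) {
        uint32_t d[4];
        hadamard4(d, t[0][lane], t[1][lane], t[2][lane], t[3][lane]);
        for (int k = 0; k < 4; k++)
            a[k][lane] = d[k];
    }
    uint32_t acc[4] = {0, 0, 0, 0};
    for (int k = 0; k < 4; k++)            /* four dependent vector adds */
        for (int lane = 0; lane < 4; lane++)
            acc[lane] += abs2(a[k][lane]);
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* All three formulations must agree (unsigned adds can be reassociated). */
void check_all_agree(int t[4][4])
{
    uint32_t ref = sum_scalar(t);
    assert(sum_chain_lanes(t) == ref);
    assert(sum_single_lane_vf4(t) == ref);
}
```

In this model the chain-as-lanes variant needs one accumulator update per scalar iteration, while the unroll-factor-4 variant covers the whole 4-iteration loop in one step but keeps four dependent updates.  Which shape wins for a loop this short is exactly the question raised in the quoted comment.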
