[Bug target/122277] [16 regression] Costing VF 1 instead of 4 since r16-4411-gb6e802fd55d37e

rguenther at suse dot de via Gcc-bugs Wed, 15 Oct 2025 04:18:59 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122277


--- Comment #7 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 15 Oct 2025, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122277
> 
> --- Comment #6 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> Yes, it looks like the issue is that the single-lane SLP for a 4-element VLS
> mode is worse than before when we couldn't handle the conversion.  Doesn't that
> also affect other, VLS, targets?
> 
> A slightly reduced example:
> 
> typedef unsigned char uint8_t;
> typedef unsigned short uint16_t;
> typedef unsigned int uint32_t;
> 
> #define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
>     int t0 = s0 + s1;\
>     int t1 = s0 - s1;\
>     int t2 = s2 + s3;\
>     int t3 = s2 - s3;\
>     d0 = t0 + t2;\
>     d2 = t0 - t2;\
>     d1 = t1 + t3;\
>     d3 = t1 - t3;\
> }
> 
> inline uint32_t abs2( uint32_t a )
> {
>   return (a+a)^a;
> }
> 
> int tmp[4][4];
> 
> int
> x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
> {
>     uint32_t a0, a1, a2, a3;
>     int sum = 0;
>     for( int i = 0; i < 4; i++ )
>     {
>         HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
>         sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
>     }
> 
>     return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
> }
> 
> Build options:  -O3 -march=rv64gcv bla.c -mno-vector-strict-align
> --param=riscv-autovec-mode=V4QI
> 
> With "int tmp[4][4]" the "new" SLP reduction succeeds and we get an SLP tree
> that starts with the duplicated sum:
> 
> bla.c:28:23: note:   SLP size 13 vs. limit 36.
> bla.c:28:23: note:   Final SLP tree for instance 0x2de1eb60:
> bla.c:28:23: note:   node 0x2de8cb30 (max_nunits=1, refcnt=2) vector(4) int cycle 0, link 0
> bla.c:28:23: note:   op template: sum_35 = (int) _12;
> bla.c:28:23: note:      [l] stmt 0 sum_35 = (int) _12;
> bla.c:28:23: note:      [l] stmt 1 sum_35 = (int) _12;
> bla.c:28:23: note:      [l] stmt 2 sum_35 = (int) _12;
> bla.c:28:23: note:      [l] stmt 3 sum_35 = (int) _12;
> [...]
> bla.c:28:23: note:   Decided to SLP 1 instances. Unrolling factor 1
> 
> With "uint tmp[4][4]" (or before the patch) we get:
> bla.c:28:23: note:   SLP size 35 vs. limit 40.
> bla.c:28:23: note:   Final SLP tree for instance 0x36fa7c10:
> bla.c:28:23: note:   node 0x37015460 (max_nunits=4, refcnt=3) vector(4) int cycle 1, link 0
> bla.c:28:23: note:   op template: sum_39 = (int) _16;
> bla.c:28:23: note:      [l] stmt 0 sum_39 = (int) _16;
> bla.c:28:23: note:      children 0x37015510
> [...]
> bla.c:28:23: note:   Decided to SLP 1 instances. Unrolling factor 4
> 
> Of course it's difficult to anticipate that it'll get worse when starting the
> discovery but here the tree looks inferior.

It's an artifact of how we represent this.  Reduction chains have two
advantages: they reduce the unroll factor and the effective chain
length, and, as with anything SLP-like, they can avoid the cost of
interleaving for loads.
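
To make the unroll-factor point concrete, here is a minimal scalar sketch.
It is only a model of the idea, not what the vectorizer emits.  The names
sum_chain_scalar, sum_chain_lanes and self_check are hypothetical, and the
input layout x[n][4] is an assumption chosen so that the four chain
operands of one iteration are contiguous.  Unsigned arithmetic is used so
that wraparound is well defined and both forms agree exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar form: four additions into one accumulator per iteration,
   i.e. a reduction chain of length 4.  */
static uint32_t sum_chain_scalar(const uint32_t (*x)[4], size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i][0];
        sum += x[i][1];
        sum += x[i][2];
        sum += x[i][3];
    }
    return sum;
}

/* Lane-wise model: each link of the chain gets its own lane, so one
   scalar iteration already fills a 4-element accumulator (unroll
   factor 1) and x[i][0..3] is a single contiguous load.  The lanes
   are only added together once, after the loop.  */
static uint32_t sum_chain_lanes(const uint32_t (*x)[4], size_t n)
{
    uint32_t acc[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i++)
        for (int l = 0; l < 4; l++)
            acc[l] += x[i][l];
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Both forms must agree, including when the 32-bit sum wraps.  */
static void self_check(void)
{
    const uint32_t x[3][4] = {
        {1u, 2u, 3u, 4u},
        {0xffffffffu, 0x80000000u, 7u, 9u},
        {10u, 20u, 30u, 40u},
    };
    assert(sum_chain_scalar(x, 3) == sum_chain_lanes(x, 3));
}
```

In the lane-wise form each accumulator lane sees one addition per
iteration instead of four, which is the shorter effective chain.  A
single-lane form would instead need four scalar iterations to fill the
same vector.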
