https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122277

--- Comment #6 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Yes, it looks like the issue is that the single-lane SLP for a 4-element VLS
mode is worse than before, when we couldn't handle the conversion.  Doesn't that
also affect other VLS targets?

A slightly reduced example:

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

inline uint32_t abs2( uint32_t a )
{
  return (a+a)^a;
}

int tmp[4][4];

int
x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t a0, a1, a2, a3;
    int sum = 0;
    for( int i = 0; i < 4; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }

    return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}

Build options:  -O3 -march=rv64gcv bla.c -mno-vector-strict-align --param=riscv-autovec-mode=V4QI

With "int tmp[4][4]" the "new" SLP reduction succeeds and we get an SLP tree
that starts with the duplicated sum:

bla.c:28:23: note:   SLP size 13 vs. limit 36.
bla.c:28:23: note:   Final SLP tree for instance 0x2de1eb60:
bla.c:28:23: note:   node 0x2de8cb30 (max_nunits=1, refcnt=2) vector(4) int cycle 0, link 0
bla.c:28:23: note:   op template: sum_35 = (int) _12;
bla.c:28:23: note:      [l] stmt 0 sum_35 = (int) _12;
bla.c:28:23: note:      [l] stmt 1 sum_35 = (int) _12;
bla.c:28:23: note:      [l] stmt 2 sum_35 = (int) _12;
bla.c:28:23: note:      [l] stmt 3 sum_35 = (int) _12;
[...]
bla.c:28:23: note:   Decided to SLP 1 instances. Unrolling factor 1

With "uint tmp[4][4]" (or before the patch) we get:
bla.c:28:23: note:   SLP size 35 vs. limit 40.
bla.c:28:23: note:   Final SLP tree for instance 0x36fa7c10:
bla.c:28:23: note:   node 0x37015460 (max_nunits=4, refcnt=3) vector(4) int cycle 1, link 0
bla.c:28:23: note:   op template: sum_39 = (int) _16;
bla.c:28:23: note:      [l] stmt 0 sum_39 = (int) _16;
bla.c:28:23: note:      children 0x37015510
[...]
bla.c:28:23: note:   Decided to SLP 1 instances. Unrolling factor 4

Of course it's difficult to anticipate that it'll get worse when starting the
discovery, but here the tree looks inferior.
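
For comparing the two cases without editing the source, the element type of
tmp could be made a compile-time knob.  The following is only a sketch under
that assumption: TMP_T and the function name satd_reduced are hypothetical and
not part of the testcase above, the unused pointer arguments are dropped, and
<stdint.h> replaces the hand-written typedefs.  Whether the resulting dumps
match the ones quoted above will depend on the GCC revision.

```c
#include <stdint.h>

/* TMP_T is a hypothetical knob, not part of the original testcase.
   The default corresponds to the "int tmp[4][4]" case; building with
   -DTMP_T=uint32_t gives an unsigned variant instead.  */
#ifndef TMP_T
#define TMP_T int
#endif

/* Same butterfly as in the reduced example.  */
#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

/* The reduced stand-in for abs2 from the example, not a real
   absolute value.  */
static inline uint32_t abs2 (uint32_t a)
{
  return (a + a) ^ a;
}

TMP_T tmp[4][4];

/* Hypothetical name; the loop body and the final reduction are the
   same as in x264_pixel_satd_8x4 above.  */
int
satd_reduced (void)
{
  uint32_t a0, a1, a2, a3;
  int sum = 0;
  for (int i = 0; i < 4; i++)
    {
      HADAMARD4 (a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]);
      sum += abs2 (a0) + abs2 (a1) + abs2 (a2) + abs2 (a3);
    }

  return (((uint16_t) sum) + ((uint32_t) sum >> 16)) >> 1;
}
```

Built with the options above, once as is and once with -DTMP_T=uint32_t
added, this should allow the two SLP trees to be compared from a single file.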
