https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91094

            Bug ID: 91094
           Summary: BB vectorization is too quick to disable itself
                    because of possible unrolling needed
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

long long a[1024];
int b[1024];

void foo()
{
  a[0] = b[0] + b[2];
  a[1] = b[1] + b[3];
#if WORKS
  a[2] = b[4] + b[6];
  a[3] = b[5] + b[7];
#endif
}


The above is not fully vectorized (we vectorize only the store) when WORKS is
not defined, because

t2.c:12:1: missed:   Build SLP failed: unrolling required in basic block SLP

which checks group_size (2) against the number of elements of vector(4) int;
since a group of 2 does not fill the 4-element vector, unrolling would be
required, which is not possible in basic-block SLP, so it gives up.
It works fine with WORKS defined because group_size is then 4.
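
For comparison, a minimal hand-vectorized sketch of what the !WORKS case could
look like with GNU C vector extensions (the foo_by_hand name and the v2si/v2di
typedefs are illustrative only, not part of the testcase); a group of two int
lanes is enough once a vector type smaller than vector(4) int is allowed:

typedef int v2si __attribute__((vector_size (8)));
typedef long long v2di __attribute__((vector_size (16)));

void foo_by_hand (void)
{
  v2si lo, hi, sum;
  __builtin_memcpy (&lo, &b[0], sizeof lo);   /* { b[0], b[1] } */
  __builtin_memcpy (&hi, &b[2], sizeof hi);   /* { b[2], b[3] } */
  sum = lo + hi;                              /* two int additions */
  v2di wide = { sum[0], sum[1] };             /* widen int -> long long */
  __builtin_memcpy (&a[0], &wide, sizeof wide);
}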

A similar issue prevents the SPEC x264 benchmark from being vectorized
optimally.  A testcase reduced from it:

typedef unsigned int uint32_t;
typedef unsigned char uint8_t;
#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

uint32_t tmp[4][4];
__attribute__ ((noinline,noclone))
void x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t a0, a1, a2, a3;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
    }
}
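
Both testcases can be checked with something like the following (standard GCC
options; the exact wording of the missed-optimization messages varies between
versions):

gcc -O3 -fopt-info-vec-missed -c t2.c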
