https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91094
Bug ID: 91094
Summary: BB vectorization is too quick to disable itself because of possible unrolling needed
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---

long long a[1024];
int b[1024];

void foo()
{
  a[0] = b[0] + b[2];
  a[1] = b[1] + b[3];
#if WORKS
  a[2] = b[4] + b[6];
  a[3] = b[5] + b[7];
#endif
}

The above is not vectorized fully (we vectorize the store) if !WORKS because

t2.c:12:1: missed: Build SLP failed: unrolling required in basic block SLP

which checks group_size (2) against the number of elements of vector(4) int. It works fine with WORKS because then group_size is 4.

A similar issue prevents SPEC x264 from being vectorized optimally. Testcase from that:

typedef unsigned int uint32_t;
typedef unsigned char uint8_t;

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

uint32_t tmp[4][4];

__attribute__ ((noinline,noclone))
void x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1,
                          uint8_t *pix2, int i_pix2 )
{
    uint32_t a0, a1, a2, a3;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
    }
}