Hi Biener, Thanks for your help!
I have already opened a bug report here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115252

Thanks
Hanke Zhang

Richard Biener <richard.guent...@gmail.com> wrote on Mon, May 27, 2024 at 21:14:
>
> On Sat, May 25, 2024 at 3:08 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> >
> > Hi,
> > I'm trying to study the automatic vectorization optimization in GCC,
> > but I found one case that the SLP vectorizer failed to vectorize.
> >
> > Here is the sample code: (also a simplified version of a function
> > from the 625/525.x264 source code in SPEC CPU 2017)
> >
> > void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
> >   for (int y = 0; y < 4; y++) {
> >     for (int x = 0; x < 4; x++)
> >       diff[x + y * 4] = pix1[x] - pix2[x];
> >     pix1 += 16;
> >     pix2 += 32;
>
> The issue is these increments: with only four uint8_t elements accessed
> we still want to fill up a vector's worth of them.
>
> In the end we succeed with v4hi / v8qi but also peel for gaps even though
> we handle the half-load case fine.
>
> >   }
> > }
> >
> > When I compiled with `-O3 -mavx2/-msse4.2`, the SLP vectorizer failed to
> > vectorize it, and I got the following message when adding
> > `-fopt-info-vec-all`. (The inner loop will be unrolled)
> >
> > <source>:6:21: optimized: loop vectorized using 8 byte vectors
> > <source>:6:21: optimized: loop versioned for vectorization because of
> > possible aliasing
> > <source>:5:6: note: vectorized 1 loops in function.
>
> ^^^
>
> so you do see the vectorization as outlined above.
>
> > <source>:5:6: note: ***** Analysis failed with vector mode V8SI
> > <source>:5:6: note: ***** The result for vector mode V32QI would be the same
> > <source>:5:6: note: ***** Re-trying analysis with vector mode V16QI
> > <source>:5:6: note: ***** Analysis failed with vector mode V16QI
> > <source>:5:6: note: ***** Re-trying analysis with vector mode V8QI
> > <source>:5:6: note: ***** Analysis failed with vector mode V8QI
> > <source>:5:6: note: ***** Re-trying analysis with vector mode V4QI
> > <source>:5:6: note: ***** Analysis failed with vector mode V4QI
> >
> > If I manually use the type declarations provided by `immintrin.h` to
> > rewrite the code, the code is as follows (which I hope the SLP
> > vectorizer would be able to do)
> >
> > void pixel_sub_wxh_vec(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
> >   for (int y = 0; y < 4; y++) {
> >     __v4hi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3]};
> >     __v4hi pix2_v = {pix2[0], pix2[1], pix2[2], pix2[3]};
> >     __v4hi diff_v = pix1_v - pix2_v;
> >     *(long long *)(diff + y * 4) = (long long)diff_v;
>
> We kind-of do it this way, just
>
>   __v8qi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3], 0, 0, 0, 0};
>   ...
>
> and then unpack __v8qi low to v4hi.
>
> And unfortunately the last two outer iterations are scalar because of the
> gap issue. There are some PRs about this, and I did start to work on
> improving this. I'm not sure this exact case is covered, so can you open
> a new bug report?
>
> >     pix1 += 16;
> >     pix2 += 32;
> >   }
> > }
> >
> > What I want to know is why the SLP vectorizer can't vectorize the code
> > here, and what changes I need to make to the SLP vectorizer or the
> > source code if I want it to do so.
> >
> > Thanks
> > Hanke Zhang
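
For readers following the thread, here is a minimal hand-written sketch of the "half-load into v8qi, then unpack the low half to v4hi" scheme described in the reply above. It is only an illustration, not what the vectorizer actually emits. It assumes GCC (9 or later) or Clang, since it relies on the generic vector extension and `__builtin_convertvector`. The names `u8x8`, `s16x8`, `pixel_sub_wxh_sketch` and `check_pixel_sub` are hypothetical and do not appear in the original post; the `const` qualifiers are also an addition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Generic vector types (GCC/Clang extension); names are illustrative. */
typedef uint8_t u8x8  __attribute__((vector_size(8)));   /* 8 x uint8_t, like V8QI */
typedef int16_t s16x8 __attribute__((vector_size(16)));  /* 8 x int16_t, like V8HI */

/* Scalar reference: the loop from the original post. */
void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    for (int x = 0; x < 4; x++)
      diff[x + y * 4] = pix1[x] - pix2[x];
    pix1 += 16;
    pix2 += 32;
  }
}

/* Sketch: per row, load 4 bytes into the low half of an 8-byte vector,
   zero-extend each lane to 16 bits, subtract, and store the low 4 lanes. */
void pixel_sub_wxh_sketch(int16_t *diff, const uint8_t *pix1,
                          const uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    u8x8 a = {0}, b = {0};
    memcpy(&a, pix1, 4);  /* half-load: only 4 bytes are read, upper lanes stay 0 */
    memcpy(&b, pix2, 4);
    s16x8 d = __builtin_convertvector(a, s16x8)
            - __builtin_convertvector(b, s16x8);
    memcpy(diff + y * 4, &d, 4 * sizeof(int16_t));  /* keep lanes 0..3 only */
    pix1 += 16;
    pix2 += 32;
  }
}

/* Compares the sketch against the scalar loop on an arbitrary input. */
void check_pixel_sub(void) {
  uint8_t p1[52], p2[100];  /* 3*16+4 and 3*32+4 bytes are touched */
  int16_t ref[16], got[16];
  for (int i = 0; i < 52; i++)  p1[i] = (uint8_t)(i * 7 + 3);
  for (int i = 0; i < 100; i++) p2[i] = (uint8_t)(i * 13 + 1);
  pixel_sub_wxh(ref, p1, p2);
  pixel_sub_wxh_sketch(got, p1, p2);
  for (int i = 0; i < 16; i++)
    assert(ref[i] == got[i]);
}
```

Because each row reads only 4 of every 16 (or 32) bytes, a full-width load would touch bytes in the gap between rows; the 4-byte `memcpy` above stands in for the half-load that avoids this.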