[Bug tree-optimization/115252] New: The SLP vectorizer failed to perform automatic vectorization on pixel_sub_wxh of x264
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115252

            Bug ID: 115252
           Summary: The SLP vectorizer failed to perform automatic vectorization on pixel_sub_wxh of x264
           Product: gcc
           Version: 14.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hkzhang455 at gmail dot com
  Target Milestone: ---

Test case (from https://github.com/mirror/x264/blob/master/common/dct.c):

    void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
      for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
          diff[x + y * 4] = pix1[x] - pix2[x];
        pix1 += 16;
        pix2 += 32;
      }
    }

This is a simplified version, as the original code will be inlined and some of the parameters are constant.

When compiling the function with `-O3 -mavx2`, . But after that, the code in it should be vectorized.

When I compile with `-O3 -mavx2` or `-O3 -msse4.2`, the inner loop is unrolled and the SLP vectorizer fails to vectorize it. Adding `-fopt-info-vec-all` gives the following messages:

    :6:21: optimized: loop vectorized using 8 byte vectors
    :6:21: optimized: loop versioned for vectorization because of possible aliasing
    :5:6: note: vectorized 1 loops in function.
    :5:6: note: * Analysis failed with vector mode V8SI
    :5:6: note: * The result for vector mode V32QI would be the same
    :5:6: note: * Re-trying analysis with vector mode V16QI
    :5:6: note: * Analysis failed with vector mode V16QI
    :5:6: note: * Re-trying analysis with vector mode V8QI
    :5:6: note: * Analysis failed with vector mode V8QI
    :5:6: note: * Re-trying analysis with vector mode V4QI
    :5:6: note: * Analysis failed with vector mode V4QI

If I rewrite the code by hand using the vector types declared in `immintrin.h`, it looks as follows (this is what I hope the SLP vectorizer can do):

    void pixel_sub_wxh_vec(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
      for (int y = 0; y < 4; y++) {
        __v4hi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3]};
        __v4hi pix2_v = {pix2[0], pix2[1], pix2[2], pix2[3]};
        __v4hi diff_v = pix1_v - pix2_v;
        *(long long *)(diff + y * 4) = (long long)diff_v;
        pix1 += 16;
        pix2 += 32;
      }
    }

I have already raised this issue on the GCC mailing list, and Biener gave some analysis: pix1 and pix2 are both of type uint8_t and their iterations are scalar, so this issue will exist. I am still filing a bug here in the hope of following up.
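To check that the hand-vectorized form really computes the same values as the scalar loop, a small harness along the following lines could be used. This is only a sketch and not part of the report: it assumes GCC's generic `vector_size` extension in place of the `__v4hi` typedef from `immintrin.h`, stores through `memcpy` rather than the `long long` cast, and the helper name `check_pixel_sub` and the input data are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Four 16-bit lanes (8 bytes) via the GCC/Clang vector extension.
   Assumption: stands in for __v4hi from immintrin.h. */
typedef int16_t v4hi __attribute__((vector_size(8)));

/* Scalar reference, same loop structure as in the report. */
static void pixel_sub_wxh(int16_t *diff, const uint8_t *pix1, const uint8_t *pix2)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            diff[x + y * 4] = pix1[x] - pix2[x];
        pix1 += 16;
        pix2 += 32;
    }
}

/* Hand-vectorized variant: one 4-lane subtraction per row. */
static void pixel_sub_wxh_vec(int16_t *diff, const uint8_t *pix1, const uint8_t *pix2)
{
    for (int y = 0; y < 4; y++) {
        v4hi a = {pix1[0], pix1[1], pix1[2], pix1[3]};
        v4hi b = {pix2[0], pix2[1], pix2[2], pix2[3]};
        v4hi d = a - b;
        /* memcpy avoids the type-punning store used in the report. */
        memcpy(diff + y * 4, &d, sizeof d);
        pix1 += 16;
        pix2 += 32;
    }
}

/* Hypothetical helper: returns 1 if both variants agree on sample data. */
int check_pixel_sub(void)
{
    uint8_t pix1[64], pix2[128];   /* 4 rows with strides 16 and 32 */
    int16_t ref[16], vec[16];

    assert(sizeof(v4hi) == 4 * sizeof(int16_t));

    for (int i = 0; i < 64; i++)
        pix1[i] = (uint8_t)(i * 7 + 3);
    for (int i = 0; i < 128; i++)
        pix2[i] = (uint8_t)(255 - i * 5);

    pixel_sub_wxh(ref, pix1, pix2);
    pixel_sub_wxh_vec(vec, pix1, pix2);
    return memcmp(ref, vec, sizeof ref) == 0;
}
```

Calling `check_pixel_sub()` from a test `main` and comparing the generated assembly of the two functions is one way to see what the vectorizer would need to produce.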
[Bug ipa/111672] Inappropriate function splitting during pass_split_functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #13 from Hanke Zhang ---
(In reply to Andrew Pinski from comment #12)
> (In reply to Hanke Zhang from comment #11)
> > But I have never seen this '_FORTIFY_SOURCE' before. So I'm a confused as
> > well. And when I try gcc@11.4 built in the default ubuntu 22.04, it's the
> > same. So I don't know how to describe now. Thanks for your help anyway.
>
> Well Ubuntu's compiler defaults to defining _FORTIFY_SOURCE while the
> upstream GCC does not. Ubuntu's compiler also defaults to building PIE
> applications too.

Thanks a lot. So _FORTIFY_SOURCE may be the problem then.
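One way to see which of these defaults a given compiler applies is to test the predefined macros from inside a translation unit. The snippet below is a minimal sketch: the function names `fortify_level` and `pie_enabled` are hypothetical, and it assumes the compiler defines `__PIE__` when building position-independent executables, as GCC does.

```c
#include <stdio.h>

/* Returns the _FORTIFY_SOURCE level visible to this translation unit,
   or 0 if the macro is not defined (as with an unpatched upstream GCC
   and no -D_FORTIFY_SOURCE on the command line). */
int fortify_level(void)
{
#ifdef _FORTIFY_SOURCE
    return _FORTIFY_SOURCE;
#else
    return 0;
#endif
}

/* Returns 1 if this translation unit is being compiled as PIE. */
int pie_enabled(void)
{
#ifdef __PIE__
    return 1;
#else
    return 0;
#endif
}

/* Prints both settings so two compilers can be compared side by side. */
void report_defaults(void)
{
    printf("_FORTIFY_SOURCE level: %d\n", fortify_level());
    printf("PIE: %s\n", pie_enabled() ? "yes" : "no");
}
```

Building this with the same options as the test case under both compilers and calling `report_defaults()` should show whether they differ in these defaults.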
[Bug ipa/111672] Inappropriate function splitting during pass_split_functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #11 from Hanke Zhang ---
(In reply to Andrew Pinski from comment #10)
> The difference between the 2 is the costing of the __printf_chk/puts:
> _FORTIFY_SOURCE case:
> freq:0.20 size: 3 time:2.43 __printf_chk (1, "Object code generation not
> active! Forgot to call quantum_objcode_start?\n");
>
> vs without:
> freq:0.20 size: 2 time:2.23 puts (&"Object code generation not active!
> Forgot to call quantum_objcode_start?"[0]);

But I have never seen '_FORTIFY_SOURCE' before, so I'm confused as well. When I try the GCC 11.4 that ships by default with Ubuntu 22.04, the result is the same. So I don't know how to describe it now. Thanks for your help anyway.
[Bug ipa/111672] Inappropriate function splitting during pass_split_functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #8 from Hanke Zhang ---
(In reply to Andrew Pinski from comment #5)
> Add -save-temps and attach the resulting .i (or .ii) file.

Thank you. I have attached it.
[Bug ipa/111672] Inappropriate function splitting during pass_split_functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #7 from Hanke Zhang ---
Created attachment 56046
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56046&action=edit
preprocessed source
[Bug ipa/111672] Inappropriate function splitting during pass_split_functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #4 from Hanke Zhang ---
(In reply to Andrew Pinski from comment #3)
> Oh I see the compiler you are testing with defaults with fortify turned on.
> That is the difference.
> Maybe also with pie turned on by default tlalso.
>
> Can you provide the full output of gcc -v and also the preprocessed source?

The full output of gcc -v is already in my description, where you can check it. I don't understand what "the preprocessed source" means here. The original C source file has already been provided.
[Bug ipa/111672] Inappropriate function splitting during pass_split_functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #2 from Hanke Zhang ---
(In reply to Andrew Pinski from comment #1)
> I cannot reproduce this on the trunk (or even in 12.3.0):
>
> Split point at BB 3
> header time: 1393.311190 header size: 33
> split time: 2.226400 split size: 2
> bbs: 3
> SSA names to pass:
> Refused: split size is smaller than call overhead
> found articulation at bb 7
> Split point at BB 7
> header time: 1395.537590 header size: 35
> split time: 0.00 split size: 0
> bbs: 7
> SSA names to pass:
> Refused: split size is smaller than call overhead

I still see the same behavior on my side. I also tried compiling on another computer, and the same thing happens. Note that my host is x86_64-linux-gnu.

Here is part of my output information about the function splitting optimization.

    gcc -O3 -flto -fdump-tree-fnsplit test.c
    cat a-test.c.050t.fnsplit

;; Function printf (printf, funcdef_no=15, decl_uid=964, cgraph_uid=16, symbol_order=15)

Not splitting: disregarding inline limits.
__attribute__((artificial, gnu_inline, always_inline))
__attribute__((nonnull (1), format (printf, 1, 2)))
int printf (const char * restrict __fmt)
{
  int _4;

  [local count: 1073741824]:
  _4 = __printf_chk (1, __fmt_2(D), __builtin_va_arg_pack ());
  return _4;

}

;; Function test_split_write (test_split_write, funcdef_no=39, decl_uid=3184, cgraph_uid=40, symbol_order=43)

Splitting function at:
Split point at BB 3
  header time: 1393.311190 header size: 33
  split time: 2.428800 split size: 3
  bbs: 3
  SSA names to pass:

;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1
Introduced new external node (puts/53).

Symbols to be put in SSA form
{ D.3222 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 5
Number of blocks to update: 4 ( 80%)

;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 4 2 3
;; 4 succs { 2 }
;; 2 succs { 3 }
;; 3 succs { 1 }
int test_split_write.part.0 ()
[Bug c/111672] New: Inappropriate function splitting during pass_split_functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

            Bug ID: 111672
           Summary: Inappropriate function splitting during pass_split_functions
           Product: gcc
           Version: 12.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hkzhang455 at gmail dot com
  Target Milestone: ---

Created attachment 56034
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56034&action=edit
example C file that can trigger inappropriate function splitting

When GCC performs the function splitting optimization, it splits the shorter, nearer path into a new function and keeps the remaining, more complex and expensive code in the original function. As a result, the original function becomes more complex after the split, while the new split function only performs simple operations (such as 'printf()').

You can compile the attached source file with the following command and look at the GIMPLE in the generated fnsplit dump to see the behavior I described.

    gcc test.c -O3 -flto -fdump-tree-fnsplit -Wall -Wextra

Of course, this is only sample code, so because of inlining the resulting executable does not show an efficiency gap. But in more complex code, efficiency decreases.

Hardware: 12th Gen Intel(R) Core(TM) i9-12900KF
System: Ubuntu 22.04

Output of `gcc -v`:

Using built-in specs.
COLLECT_GCC=xxx/install/bin/gcc
COLLECT_LTO_WRAPPER=xxx/install/libexec/gcc/x86_64-pc-linux-gnu/12.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --prefix=xxx/install --enable-threads=posix --disable-checking --disable-multilib --disable-bootstrap --enable-languages=c,c++,lto
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.3.0 (GCC)
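For readers without the attachment, the kind of function described above might look roughly like the following. This is a hypothetical sketch and not the attached test.c: the names `sketch_set_active` and `sketch_split_write`, the flag, and the loop body are all invented for illustration. Whether GCC actually splits it depends on the version, the flags, and defaults such as _FORTIFY_SOURCE.

```c
#include <stdio.h>

/* Hypothetical state flag guarding the cheap early-exit path. */
static int sketch_active = 0;

void sketch_set_active(int on)
{
    sketch_active = on;
}

/* A cheap path that only prints a message and returns, followed by a
   comparatively expensive loop. The report is about cases where the
   cheap printf block becomes the split part while the expensive code
   stays in the original function. */
int sketch_split_write(const unsigned char *data, int len, unsigned *checksum_out)
{
    if (!sketch_active) {
        printf("Object code generation not active!\n");
        return 1;
    }

    /* Stand-in for the "more complex and expensive" remainder. */
    unsigned sum = 0;
    for (int round = 0; round < 64; round++)
        for (int i = 0; i < len; i++)
            sum = sum * 31u + data[i] + (unsigned)round;

    *checksum_out = sum;
    return 0;
}
```

Compiling such a file with `-O3 -fdump-tree-fnsplit` and reading the resulting dump shows which split points the pass considered and why each was accepted or refused.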