[Bug tree-optimization/115252] New: The SLP vectorizer failed to perform automatic vectorization on pixel_sub_wxh of x264

2024-05-27 Thread hkzhang455 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115252

Bug ID: 115252
   Summary: The SLP vectorizer failed to perform automatic
vectorization on pixel_sub_wxh of x264
   Product: gcc
   Version: 14.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hkzhang455 at gmail dot com
  Target Milestone: ---

Test case: (from https://github.com/mirror/x264/blob/master/common/dct.c) 

void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    for (int x = 0; x < 4; x++)
      diff[x + y * 4] = pix1[x] - pix2[x];
    pix1 += 16;
    pix2 += 32;
  }
}

This is a simplified version, as the original code will be inlined and some of
the parameters are constant.

When I compile with `-O3 -mavx2` or `-O3 -msse4.2`, the inner loop is unrolled,
and after that the resulting code should be vectorizable. However, the SLP
vectorizer fails to vectorize it, and I get the following messages when adding
`-fopt-info-vec-all`:

:6:21: optimized: loop vectorized using 8 byte vectors
:6:21: optimized:  loop versioned for vectorization because of possible aliasing
:5:6: note: vectorized 1 loops in function.
:5:6: note: * Analysis failed with vector mode V8SI
:5:6: note: * The result for vector mode V32QI would be the same
:5:6: note: * Re-trying analysis with vector mode V16QI
:5:6: note: * Analysis failed with vector mode V16QI
:5:6: note: * Re-trying analysis with vector mode V8QI
:5:6: note: * Analysis failed with vector mode V8QI
:5:6: note: * Re-trying analysis with vector mode V4QI
:5:6: note: * Analysis failed with vector mode V4QI

If I manually rewrite the code using the vector type declarations provided by
`immintrin.h`, I get the following (which is what I hope the SLP vectorizer
would be able to do):

void pixel_sub_wxh_vec(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    __v4hi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3]};
    __v4hi pix2_v = {pix2[0], pix2[1], pix2[2], pix2[3]};
    __v4hi diff_v = pix1_v - pix2_v;
    *(long long *)(diff + y * 4) = (long long)diff_v;
    pix1 += 16;
    pix2 += 32;
  }
}
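
To check that the hand-vectorized form computes the same result as the scalar
loop, here is a minimal sketch. It assumes the GCC/Clang `vector_size`
extension; the type name `v4i16` and the helpers `pixel_sub_ref`,
`pixel_sub_vec` and `pixel_sub_versions_agree` are hypothetical names
introduced only for this illustration (they are not part of x264), and the
store uses `memcpy` rather than the pointer cast above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 4 x int16_t vector type (GCC/Clang extension); it stands in
   for __v4hi so the sketch does not depend on immintrin.h. */
typedef int16_t v4i16 __attribute__((vector_size(8)));

/* Scalar reference: same computation as pixel_sub_wxh in the report. */
static void pixel_sub_ref(int16_t *diff, const uint8_t *pix1,
                          const uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    for (int x = 0; x < 4; x++)
      diff[x + y * 4] = pix1[x] - pix2[x];
    pix1 += 16;
    pix2 += 32;
  }
}

/* Hand-vectorized variant: one 4-lane subtraction per row. */
static void pixel_sub_vec(int16_t *diff, const uint8_t *pix1,
                          const uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    v4i16 a = {pix1[0], pix1[1], pix1[2], pix1[3]};
    v4i16 b = {pix2[0], pix2[1], pix2[2], pix2[3]};
    v4i16 d = a - b;
    memcpy(diff + y * 4, &d, sizeof d); /* store one row of 4 results */
    pix1 += 16;
    pix2 += 32;
  }
}

/* Returns 1 if both versions agree on a fixed, deterministic input. */
static int pixel_sub_versions_agree(void) {
  uint8_t p1[64], p2[128];
  int16_t r[16], v[16];
  for (int i = 0; i < 64; i++)
    p1[i] = (uint8_t)(i * 7 + 3);
  for (int i = 0; i < 128; i++)
    p2[i] = (uint8_t)(i * 13 + 5);
  pixel_sub_ref(r, p1, p2);
  pixel_sub_vec(v, p1, p2);
  return memcmp(r, v, sizeof r) == 0;
}
```

The buffers are sized so that the strides of 16 and 32 from the test case stay
in bounds for all four rows.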


I already raised this issue on the GCC mailing list, and Biener gave some
analysis: pix1 and pix2 are both of type uint8_t and they are stepped through
as scalars, so this issue is expected. I am still filing a bug here and hope to
follow up.

[Bug ipa/111672] Inappropriate function splitting during pass_split_functions

2023-10-06 Thread hkzhang455 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #13 from Hanke Zhang  ---
(In reply to Andrew Pinski from comment #12)
> (In reply to Hanke Zhang from comment #11)
> > But I have never seen this '_FORTIFY_SOURCE' before, so I'm confused as
> > well. When I try gcc@11.4 as shipped by default in Ubuntu 22.04, the result
> > is the same, so I don't know how to describe it now. Thanks for your help
> > anyway.
> 
> Well Ubuntu's compiler defaults to defining _FORTIFY_SOURCE while the
> upstream GCC does not. Ubuntu's compiler also defaults to building PIE
> applications too.

Thanks a lot. _FORTIFY_SOURCE may be the problem, then.
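
One way to see which of these defaults a given compiler applies is to test the
predefined macros from inside a translation unit. The sketch below assumes
GCC-compatible behavior: `_FORTIFY_SOURCE` is defined only when the driver's
defaults or the command line define it (on distributions that enable it by
default, typically only when optimization is on), and `__PIE__` is defined
when compiling with -fpie or -fPIE. The helper names `fortify_level` and
`pie_level` are hypothetical.

```c
/* Returns the _FORTIFY_SOURCE level this file was compiled with, or 0 if the
   macro is not defined. "+ 0" keeps the expression valid even if the macro
   was defined with an empty value. */
static int fortify_level(void) {
#ifdef _FORTIFY_SOURCE
  return _FORTIFY_SOURCE + 0;
#else
  return 0;
#endif
}

/* Returns the value of __PIE__ (1 for -fpie, 2 for -fPIE on GCC), or 0 if
   the file was not compiled as position-independent executable code. */
static int pie_level(void) {
#ifdef __PIE__
  return __PIE__;
#else
  return 0;
#endif
}
```

Comparing the two values between the distribution compiler and a self-built
upstream GCC, with the same command line, shows whether the defaults differ.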

[Bug ipa/111672] Inappropriate function splitting during pass_split_functions

2023-10-05 Thread hkzhang455 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #11 from Hanke Zhang  ---
(In reply to Andrew Pinski from comment #10)
> The difference between the 2 is the costing of the __printf_chk/puts:
> _FORTIFY_SOURCE case:
>   freq:0.20 size:  3 time:2.43 __printf_chk (1, "Object code generation not
> active! Forgot to call quantum_objcode_start?\n");
> 
> vs without:
>   freq:0.20 size:  2 time:2.23 puts (&"Object code generation not active!
> Forgot to call quantum_objcode_start?"[0]);

But I have never seen this '_FORTIFY_SOURCE' before, so I'm confused as well.
When I try gcc@11.4 as shipped by default in Ubuntu 22.04, the result is the
same, so I don't know how to describe it now. Thanks for your help anyway.

[Bug ipa/111672] Inappropriate function splitting during pass_split_functions

2023-10-04 Thread hkzhang455 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #8 from Hanke Zhang  ---
(In reply to Andrew Pinski from comment #5)
> Add -save-temps and attach the resulting .i (or .ii) file.

Thank you. I have attached it.

[Bug ipa/111672] Inappropriate function splitting during pass_split_functions

2023-10-04 Thread hkzhang455 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #7 from Hanke Zhang  ---
Created attachment 56046
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56046&action=edit
preprocessed source

[Bug ipa/111672] Inappropriate function splitting during pass_split_functions

2023-10-03 Thread hkzhang455 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #4 from Hanke Zhang  ---
(In reply to Andrew Pinski from comment #3)
> Oh I see the compiler you are testing with defaults with fortify turned on.
> That is the difference.
> Maybe also with pie turned on by default.
> 
> Can you provide the full output of gcc -v and also the preprocessed source?

The full output of gcc -v is shown in my description, where you can check it.
I don't understand what "the preprocessed source" means here. The original C
source file has already been provided.

[Bug ipa/111672] Inappropriate function splitting during pass_split_functions

2023-10-03 Thread hkzhang455 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

--- Comment #2 from Hanke Zhang  ---
(In reply to Andrew Pinski from comment #1)
> I cannot reproduce this on the trunk (or even in 12.3.0):
> 
> Split point at BB 3
>   header time: 1393.311190 header size: 33
>   split time: 2.226400 split size: 2
>   bbs: 3
>   SSA names to pass:
>   Refused: split size is smaller than call overhead
> found articulation at bb 7
> Split point at BB 7
>   header time: 1395.537590 header size: 35
>   split time: 0.00 split size: 0
>   bbs: 7
>   SSA names to pass:
>   Refused: split size is smaller than call overhead

I still see the same bug on my side, and when I compile on another computer the
same thing happens. Note that my host is x86_64-linux-gnu. Here is part of my
output information about the function splitting optimization.

  gcc -O3 -flto -fdump-tree-fnsplit test.c
  cat a-test.c.050t.fnsplit

;; Function printf (printf, funcdef_no=15, decl_uid=964, cgraph_uid=16,
symbol_order=15)

Not splitting: disregarding inline limits.
__attribute__((artificial, gnu_inline, always_inline))
__attribute__((nonnull (1), format (printf, 1, 2)))
int printf (const char * restrict __fmt)
{
  int _4;

   [local count: 1073741824]:
  _4 = __printf_chk (1, __fmt_2(D), __builtin_va_arg_pack ());
  return _4;

}



;; Function test_split_write (test_split_write, funcdef_no=39, decl_uid=3184,
cgraph_uid=40, symbol_order=43)



Splitting function at:
Split point at BB 3
  header time: 1393.311190 header size: 33
  split time: 2.428800 split size: 3
  bbs: 3
  SSA names to pass:
;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1
Introduced new external node (puts/53).

Symbols to be put in SSA form
{ D.3222 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 5
Number of blocks to update: 4 ( 80%)


;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 4 2 3
;; 4 succs { 2 }
;; 2 succs { 3 }
;; 3 succs { 1 }
int test_split_write.part.0 ()

[Bug c/111672] New: Inappropriate function splitting during pass_split_functions

2023-10-03 Thread hkzhang455 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

Bug ID: 111672
   Summary: Inappropriate function splitting during
pass_split_functions
   Product: gcc
   Version: 12.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hkzhang455 at gmail dot com
  Target Milestone: ---

Created attachment 56034
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56034&action=edit
example C file that can trigger inappropriate function splitting

When GCC performs the function splitting optimization, the short, cheap path is
split out into a new function, while the remaining, more complex and expensive
code is retained. As a result, the complexity of the original function
increases after the split, and the new split-out function only performs a
simple operation (such as a 'printf()' call).

You can compile the attached source file with the following command and look at
the GIMPLE in the generated fnsplit dump to see the behavior I described.

  gcc test.c -O3 -flto -fdump-tree-fnsplit -Wall -Wextra

Of course, this is only sample code, so the resulting executable does not show
the performance gap caused by the inlining problem. In more complex code,
however, performance decreases.
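
For readers without the attachment, the shape of function in question can be
sketched as follows. This is a hypothetical illustration, not the attached
test.c: the names `objcode_active` and `split_candidate` are made up, and
whether fnsplit actually splits this function depends on the compiler version
and options. It only shows a cheap early-exit path that prints a message next
to a more expensive main body.

```c
#include <stdio.h>

/* Hypothetical flag: 0 means "object code generation not started". */
int objcode_active;

/* Hypothetical function with a cheap path and an expensive path. */
int split_candidate(const unsigned char *buf, int n) {
  if (!objcode_active) {
    /* Cheap path: a single call that only reports the problem. */
    printf("Object code generation not active!\n");
    return -1;
  }
  /* Expensive path: loop over the whole buffer. */
  unsigned sum = 0;
  for (int i = 0; i < n; i++)
    sum = sum * 31u + buf[i];
  return (int)(sum & 0x7fffffffu);
}
```

Compiling a file like this with -fdump-tree-fnsplit and reading the dump shows
which of the two paths, if any, is considered as a split point.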

Hardware: 12th Gen Intel(R) Core(TM) i9-12900KF
System: Ubuntu 22.04
Output of `gcc -v`:

Using built-in specs.
COLLECT_GCC=xxx/install/bin/gcc
COLLECT_LTO_WRAPPER=xxx/install/libexec/gcc/x86_64-pc-linux-gnu/12.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --prefix=xxx/install --enable-threads=posix
--disable-checking --disable-multilib --disable-bootstrap
--enable-languages=c,c++,lto
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.3.0 (GCC)