[Bug target/116582] gather is a win in some cases on zen CPUs

2024-09-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

--- Comment #6 from Jan Hubicka  ---
Here is a variant of the benchmark that needs masking:

#include <stdlib.h>
#define M 1024*1024
T a[M], b[M];
int indices[M];
char c[M];
__attribute__ ((noipa))
void
test ()
{
  for (int i = 0; i < 1024* 16; i++)
if (c[i])
  a[i] += b[indices[i]];
}
int
main()
{
  for (int i = 0 ; i < M; i++)
{
  indices[i] = rand () % M;
  c[i] = rand () % 2;
}
  for (int i = 0 ; i < 1; i++)
test ();
  return 0;
}



jh@shroud:~> ~/trunk-install-znver5/bin/g++ -DT=float  -march=native cnd.c 
-Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts
-mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details  ; objdump -d
a.out | grep gather ; perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

            281.03 msec task-clock:u              #    0.999 CPUs utilized            ( +-  0.62% )
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               659      page-faults:u             #    2.345 K/sec                    ( +-  0.06% )
     1,156,011,975      cycles:u                  #    4.113 GHz                      ( +-  0.65% )
       757,216,769      stalled-cycles-frontend:u #   65.50% frontend cycles idle     ( +-  1.59% )
     1,292,982,312      instructions:u            #    1.12  insn per cycle
                                                  #    0.59  stalled cycles per insn  ( +-  0.00% )
       360,669,069      branches:u                #    1.283 G/sec                    ( +-  0.00% )
           118,731      branch-misses:u           #    0.03% of all branches          ( +-  8.51% )

           0.28126 +- 0.00173 seconds time elapsed  ( +-  0.62% )

jh@shroud:~> ~/trunk-install-znver5/bin/g++ -DT=float  -march=native cnd.c 
-Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts
-mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details  ; objdump -d
a.out | grep gather ; perf stat -r 10 ./a.out
  401241:   62 f2 7d 4d 92 1c 8d    vgatherdps 0x904080(,%zmm1,4),%zmm3{%k5}
  40125b:   62 f2 7d 4e 92 14 8d    vgatherdps 0x904080(,%zmm1,4),%zmm2{%k6}
  40126a:   62 f2 7d 4f 92 0c a5    vgatherdps 0x904080(,%zmm4,4),%zmm1{%k7}
  401280:   62 f2 7d 4d 92 2c a5    vgatherdps 0x904080(,%zmm4,4),%zmm5{%k5}

 Performance counter stats for './a.out' (10 runs):

            266.73 msec task-clock:u              #    0.999 CPUs utilized            ( +-  4.31% )
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               659      page-faults:u             #    2.471 K/sec                    ( +-  0.05% )
     1,097,343,324      cycles:u                  #    4.114 GHz                      ( +-  4.33% )
         4,009,606      stalled-cycles-frontend:u #    0.37% frontend cycles idle     ( +-  6.91% )
       241,592,306      instructions:u            #    0.22  insn per cycle
                                                  #    0.02  stalled cycles per insn  ( +-  0.00% )
        35,549,063      branches:u                #  133.279 M/sec                    ( +-  0.00% )
            92,191      branch-misses:u           #    0.26% of all branches          ( +-  0.06% )

            0.2670 +- 0.0115 seconds time elapsed  ( +-  4.30% )
so the difference in the number of cycles is quite small, while the frontend
works much harder without gather.  If the c array is constant 1:

jh@shroud:~> ~/trunk-install-znver5/bin/g++ -DT=float  -march=native cnd.c 
-Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts
-mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details  ; objdump -d
a.out | grep gather ; perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

            520.92 msec task-clock:u              #    1.000 CPUs utilized            ( +-  5.29% )
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               659      page-faults:u             #    1.265 K/sec                    ( +-  0.04% )
     2,142,512,947      cycles:u                  #    4.113 GHz                      ( +-  5.31% )
       137,707,449      stalled-cycles-frontend:u #    6.43% frontend cycles idle     ( +- 94.67% )
     1,553,801,640      instructions:u            #    0.73  insn per cycle
                                                  #    0.09  stalled cycles per insn  ( +-  0.00% )
       344,940,506      branches:u                #  662.177 M/sec                    ( +

[Bug other/85716] No easy way for end-user to tell what GCC is doing when compilation is slow

2024-09-17 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85716

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #16 from Jan Hubicka  ---
For LTO linking we do have some idea about progress during ltrans, since we
compute estimated sizes of functions and we know the size of the whole unit we
build.  The WPA stage can at least be divided into a few steps (i.e. streaming
in, where we know the size of the input files, inlining, and stream out).

[Bug target/116582] gather is a win in some cases on zen CPUs

2024-09-03 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

--- Comment #3 from Jan Hubicka  ---
Just for completeness, the codegen for the parest sparse matrix multiply is:

  0.31 │320:   kmovb         %k1,%k4
  0.25 │       kmovb         %k1,%k5
  0.28 │       vmovdqu32     (%rcx,%rax,1),%zmm0
  0.32 │       vpmovzxdq     %ymm0,%zmm4
  0.31 │       vextracti32x8 $0x1,%zmm0,%ymm0
  0.48 │       vpmovzxdq     %ymm0,%zmm0
 10.32 │       vgatherqpd    (%r14,%zmm4,8),%zmm2{%k4}
  1.90 │       vfmadd231pd   (%rdx,%rax,2),%zmm2,%zmm1
 14.86 │       vgatherqpd    (%r14,%zmm0,8),%zmm5{%k5}
  0.27 │       vfmadd231pd   0x40(%rdx,%rax,2),%zmm5,%zmm1
  0.26 │       add           $0x40,%rax
  0.23 │       cmp           %rax,%rdi
       │     ↑ jne           320

which looks OK to me.

[Bug target/116582] gather is a win in some cases on zen CPUs

2024-09-03 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

--- Comment #2 from Jan Hubicka  ---
It is mysterious.  I was looking into why in some cases the gather is a win in
the micro-benchmark and a loss in the real benchmark.  Indeed, the distribution
of the indices makes the difference.

If I make the indices random then the performance effect is neutral:

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float  -march=native
gather.c  -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts
-mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d
a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

            454.77 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               663      page-faults:u             #    1.458 K/sec
     1,854,500,227      cycles:u                  #    4.078 GHz
         4,788,337      stalled-cycles-frontend:u #    0.26% frontend cycles idle
       651,597,070      instructions:u            #    0.35  insn per cycle
                                                  #    0.01  stalled cycles per insn
        58,222,408      branches:u                #  128.027 M/sec
            60,269      branch-misses:u           #    0.10% of all branches

       0.455155383 seconds time elapsed

       0.455154000 seconds user
       0.0 seconds sys


jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float  -march=native
gather.c  -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts
-mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out
| grep gather ; perf stat ./a.out
  401212:   62 f2 7d 4a 92 04 8d    vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out':

            448.84 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               663      page-faults:u             #    1.477 K/sec
     1,834,437,666      cycles:u                  #    4.087 GHz
         4,522,424      stalled-cycles-frontend:u #    0.25% frontend cycles idle
       160,137,040      instructions:u            #    0.09  insn per cycle
                                                  #    0.03  stalled cycles per insn
        27,502,394      branches:u                #   61.274 M/sec
            60,328      branch-misses:u           #    0.22% of all branches

       0.449240415 seconds time elapsed

       0.449224000 seconds user
       0.0 seconds sys


If I make the stride 8 then it is a win:
#include 
#define M 1024*1024
int indices[M];
T a[M], b[M];
__attribute__ ((noipa))
void
test ()
{
  for (int i = 0; i < 1024* 16; i++)
a[i] += b[indices[i]];
}
int
main()
{
  for (int i = 0 ; i < M; i++)
indices[i] = (i * 8)%M;
  for (int i = 0 ; i < 1; i++)
test ();
  return 0;
}

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float  -march=native
gather.c  -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts
-mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d
a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

          5,827.78 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               222      page-faults:u             #   38.093 /sec
    23,975,482,386      cycles:u                  #    4.114 GHz
       784,362,546      stalled-cycles-frontend:u #    3.27% frontend cycles idle
       576,680,806      instructions:u            #    0.02  insn per cycle
                                                  #    1.36  stalled cycles per insn
        41,523,290      branches:u                #    7.125 M/sec
            53,461      branch-misses:u           #    0.13% of all branches

       5.828522527 seconds time elapsed

       5.828224000 seconds user
       0.0 seconds sys


jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float  -march=native
gather.c  -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts
-mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out
| grep gather ; perf stat ./a.out
  401252:   62 f2 7d 4a 92 04 8d    vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out'

[Bug middle-end/116582] New: gather is a win in some cases on zen CPUs

2024-09-03 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

Bug ID: 116582
   Summary: gather is a win in some cases on zen CPUs
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

While the sparse matrix multiply in parest and tsvc does not seem to work well
with gather, the following benchmark likes it:

#define M 1024*1024
int indices[M];
T a[M], b[M];
__attribute__ ((noipa))
void
test ()
{
  for (int i = 0; i < 1024* 16; i++)
a[i] += b[indices[i]];
}
int
main()
{
  for (int i = 0 ; i < M; i++)
indices[i] = (i * 8) % M;
  for (int i = 0 ; i < 1; i++)
test ();
  return 0;
}

jan@localhost:/tmp> g++ -DT=float  -march=native gather.c  -Ofast
-mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts
-mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d
a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

          3,499.60 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               221      page-faults:u             #   63.150 /sec
    14,526,193,995      cycles:u                  #    4.151 GHz
       467,072,127      stalled-cycles-frontend:u #    3.22% frontend cycles idle
       577,324,069      instructions:u            #    0.04  insn per cycle
                                                  #    0.81  stalled cycles per insn
        41,578,204      branches:u                #   11.881 M/sec
            50,517      branch-misses:u           #    0.12% of all branches

       3.500660600 seconds time elapsed

       3.49715 seconds user
       0.00000 seconds sys


jan@localhost:/tmp> g++ -DT=float  -march=native gather.c  -Ofast
-mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts
-mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out
| grep gather ; perf stat ./a.out
  401250:   c4 e2 65 92 04 8d 40    vgatherdps %ymm3,0x404040(,%ymm1,4),%ymm0

 Performance counter stats for './a.out':

          1,263.87 msec task-clock:u              #    0.922 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               222      page-faults:u             #  175.651 /sec
     5,172,067,789      cycles:u                  #    4.092 GHz
        93,135,962      stalled-cycles-frontend:u #    1.80% frontend cycles idle
       167,783,419      instructions:u            #    0.03  insn per cycle
                                                  #    0.56  stalled cycles per insn
        21,097,560      branches:u                #   16.693 M/sec
            24,253      branch-misses:u           #    0.11% of all branches

       1.370533592 seconds time elapsed

       1.265143000 seconds user
       0.0 seconds sys


Non-gather loop is:
.L2:
        movslq  indices(%rax), %rcx
        movslq  indices+8(%rax), %rdi
        addq    $16, %rax
        movslq  indices-12(%rax), %rdx
        movslq  indices-4(%rax), %rsi
        vmovss  b(,%rdi,4), %xmm1
        vmovss  b(,%rcx,4), %xmm0
        vinsertps       $0x10, b(,%rsi,4), %xmm1, %xmm1
        vinsertps       $0x10, b(,%rdx,4), %xmm0, %xmm0
        vmovlhps        %xmm1, %xmm0, %xmm0
        vaddps  a-16(%rax), %xmm0, %xmm0
        vmovaps %xmm0, a-16(%rax)
        cmpq    $65536, %rax
        jne     .L2

while gather loop:

.L2:
        vmovdqa indices(%rax), %ymm1
        vmovaps %ymm2, %ymm3
        addq    $32, %rax
        vgatherdps      %ymm3, b(,%ymm1,4), %ymm0
        vaddps  a-32(%rax), %ymm0, %ymm0
        vmovaps %ymm0, a-32(%rax)
        cmpq    $65536, %rax
        jne     .L2

[Bug ipa/116296] [13/14/15 Regression] internal compiler error: in merge, at ipa-modref-tree.cc:176 at -O3

2024-08-12 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116296

--- Comment #2 from Jan Hubicka  ---
It is most likely some problem with computing bit offsets for the alias oracle.
I guess multiplying that number by sizeof (long) * 11 * 11 * 8 triggers an
overflow.

Probably harmless for -fdisable-checking generated code since that access
should be undefined behaviour then.
I will take a look.

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-08-01 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

Jan Hubicka  changed:

   What|Removed |Added

   Last reconfirmed||2024-08-01
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

--- Comment #2 from Jan Hubicka  ---
Looking at the change, I do not see how it could disable inlining.  It should
only reduce the function size estimates in the heuristics.

I think it is more likely loop optimization doing something crazy.  But we need
to figure out what really changed in the codegen.

[Bug ipa/109914] --suggest-attribute=pure misdiagnoses static functions

2024-07-29 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109914

--- Comment #7 from Jan Hubicka  ---
The idea is to help developers annotate e.g. a binary tree search function,
which the developer clearly knows is always finite, but the compiler cannot
prove it.  Infinite loops with no side effects written in convoluted ways are
almost never intentional, so the developer can almost always add the pure
attribute based on his/her understanding of what the code really does.

[Bug ipa/116055] [14/15 Regression] ICE from gcc.c-torture/unsorted/dump-noaddr.c after "Fix modref's iteraction with store merging"

2024-07-29 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116055

Jan Hubicka  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #9 from Jan Hubicka  ---
Fixed.

[Bug ipa/116055] [14/15 Regression] ICE from gcc.c-torture/unsorted/dump-noaddr.c after "Fix modref's iteraction with store merging"

2024-07-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116055

--- Comment #4 from Jan Hubicka  ---
This does not reproduce for me (with trunk nor a gcc14 build with
--target=powerpc64le-linux-gnu).

However, the problem is almost surely the sanity check in the dumping code that
flags do not get worse (which they now can, thanks to store merging):

gcc/ChangeLog:

* ipa-modref.cc (analyze_function): Do not ICE when flags regress.

diff --git a/gcc/ipa-modref.cc b/gcc/ipa-modref.cc
index f6a758b5f42..59cfe91f987 100644
--- a/gcc/ipa-modref.cc
+++ b/gcc/ipa-modref.cc
@@ -3297,7 +3297,8 @@ analyze_function (bool ipa)
fprintf (dump_file, "  Flags for param %i improved:",
 (int)i);
  else
-   gcc_unreachable ();
+   fprintf (dump_file, "  Flags for param %i changed:",
+(int)i);
  dump_eaf_flags (dump_file, old_flags, false);
  fprintf (dump_file, " -> ");
  dump_eaf_flags (dump_file, new_flags, true);
@@ -3313,7 +3314,7 @@ analyze_function (bool ipa)
  || (summary->retslot_flags & EAF_UNUSED))
fprintf (dump_file, "  Flags for retslot improved:");
  else
-   gcc_unreachable ();
+   fprintf (dump_file, "  Flags for retslot changed:");
  dump_eaf_flags (dump_file, past_retslot_flags, false);
  fprintf (dump_file, " -> ");
  dump_eaf_flags (dump_file, summary->retslot_flags, true);
@@ -3328,7 +3329,7 @@ analyze_function (bool ipa)
  || (summary->static_chain_flags & EAF_UNUSED))
fprintf (dump_file, "  Flags for static chain improved:");
  else
-   gcc_unreachable ();
+   fprintf (dump_file, "  Flags for static chain changed:");
  dump_eaf_flags (dump_file, past_static_chain_flags, false);
  fprintf (dump_file, " -> ");
  dump_eaf_flags (dump_file, summary->static_chain_flags, true);

Does it help?

[Bug ipa/106783] [12/13/14/15 Regression] ICE in ipa-modref.cc:analyze_function since r12-5247-ga34edf9a3e907de2

2024-07-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106783

--- Comment #6 from Jan Hubicka  ---
The problem is that n/=0 is undefined behavior (so we can optimize out a call
to a function doing a divide by zero), while __builtin_trap is observable and
we do not optimize out code paths that may trip it.

So isolate-paths is de-facto pessimizing code from this POV.  If it used
__builtin_unreachable, things would work.  I think some parts of the compiler
use __builtin_unreachable (such as loop unrolling), others __builtin_trap.  It
would be nice to have a consistent solution to this.

[Bug tree-optimization/109985] [12/13/14 Regression] __builtin_prefetch ignored by GCC 12/13

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109985

Jan Hubicka  changed:

   What|Removed |Added

Summary|[12/13/14/15 Regression]|[12/13/14 Regression]
   |__builtin_prefetch ignored  |__builtin_prefetch ignored
   |by GCC 12/13|by GCC 12/13

--- Comment #10 from Jan Hubicka  ---
Fixed on trunk.

[Bug ipa/113907] [12/13 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

Jan Hubicka  changed:

   What|Removed |Added

Summary|[12/13/14/15 regression]|[12/13 regression] ICU
   |ICU miscompiled on x86  |miscompiled on x86 since
   |since   |r14-5109-ga291237b628f41
   |r14-5109-ga291237b628f41|

--- Comment #82 from Jan Hubicka  ---
All wrong-code issues I know of are now fixed on 14/15.

[Bug ipa/111613] [12/13 Regression] Bit field stores can be incorrectly optimized away when -fstore-merging is in effect since r12-5383-g22c242342e38eb

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111613

Jan Hubicka  changed:

   What|Removed |Added

Summary|[12/13/14/15 Regression]|[12/13 Regression] Bit
   |Bit field stores can be |field stores can be
   |incorrectly optimized away  |incorrectly optimized away
   |when -fstore-merging is in  |when -fstore-merging is in
   |effect since|effect since
   |r12-5383-g22c242342e38eb|r12-5383-g22c242342e38eb

--- Comment #9 from Jan Hubicka  ---
Fixed on 14/15

[Bug ipa/114207] [12/13 Regression] modref gets confused by vectorized code `-O3 -fno-tree-forwprop` since r12-5439

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114207

Jan Hubicka  changed:

   What|Removed |Added

Summary|[12/13/14/15 Regression]|[12/13 Regression] modref
   |modref gets confused by |gets confused by vectorized
   |vectorized code `-O3|code `-O3
   |-fno-tree-forwprop` since   |-fno-tree-forwprop` since
   |r12-5439|r12-5439

--- Comment #8 from Jan Hubicka  ---
Fixed on 14/15 so far

[Bug ipa/115033] [12/13 Regression] Incorrect optimization of by-reference closure fields by fre1 pass since r12-5113-gd70ef65692fced

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115033

Jan Hubicka  changed:

   What|Removed |Added

Summary|[12/13/14/15 Regression]|[12/13 Regression]
   |Incorrect optimization of   |Incorrect optimization of
   |by-reference closure fields |by-reference closure fields
   |by fre1 pass since  |by fre1 pass since
   |r12-5113-gd70ef65692fced|r12-5113-gd70ef65692fced

--- Comment #22 from Jan Hubicka  ---
Fixed on 14/15 so far

[Bug ipa/113291] [14/15 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291

Jan Hubicka  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Jan Hubicka  ---
Fixed.

[Bug middle-end/115277] [13 regression] ICF needs to match loop bound estimates

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115277

Jan Hubicka  changed:

   What|Removed |Added

Summary|[13/14/15 regression] ICF   |[13 regression] ICF needs
   |needs to match loop bound   |to match loop bound
   |estimates   |estimates

--- Comment #7 from Jan Hubicka  ---
Fixed on 14/15 so far

[Bug ipa/111613] [12/13/14/15 Regression] Bit field stores can be incorrectly optimized away when -fstore-merging is in effect since r12-5383-g22c242342e38eb

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111613

--- Comment #7 from Jan Hubicka  ---
I suppose there is not much to do about past noread flags. I do not see how
optimization can invalidate other properties, so I am testing the following:

diff --git a/gcc/ipa-modref.cc b/gcc/ipa-modref.cc
index f994388a96a..53a2e35133d 100644
--- a/gcc/ipa-modref.cc
+++ b/gcc/ipa-modref.cc
@@ -3004,6 +3004,9 @@ analyze_parms (modref_summary *summary,
modref_summary_lto *summary_lto,
 (past, ecf_flags,
  VOID_TYPE_P (TREE_TYPE
  (TREE_TYPE (current_function_decl;
+ /* Store merging can produce reads when combining together multiple
+bitfields.  See PR111613.  */
+ past &= ~(EAF_NO_DIRECT_READ | EAF_NO_INDIRECT_READ);
  if (dump_file && (flags | past) != flags && !(flags & EAF_UNUSED))
{
  fprintf (dump_file,

[Bug tree-optimization/114207] [12/13/14/15 Regression] modref gets confused by vectorized code ` -O3 -fno-tree-forwprop` since r12-5439

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114207

--- Comment #5 from Jan Hubicka  ---
The offset gets lost in ipa-prop.cc

diff --git a/gcc/ipa-prop.cc b/gcc/ipa-prop.cc
index 7d7cb3835d2..99ebd6229ec 100644
--- a/gcc/ipa-prop.cc
+++ b/gcc/ipa-prop.cc
@@ -1370,9 +1370,9 @@ unadjusted_ptr_and_unit_offset (tree op, tree *ret,
poly_int64 *offset_ret)
 {
   if (TREE_CODE (op) == ADDR_EXPR)
{
- poly_int64 extra_offset = 0;
+ poly_int64 extra_offset;
  tree base = get_addr_base_and_unit_offset (TREE_OPERAND (op, 0),
-&offset);
+&extra_offset);
  if (!base)
{
  base = get_base_address (TREE_OPERAND (op, 0));

here offset is the offset being tracked, and get_addr_base_and_unit_offset is
intended to initialize extra_offset, which is later added to offset.

In the testcase the pointer is first offset by +4 and later by -4, which
combine to 0.
[Bug ipa/115033] [12/13/14/15 Regression] Incorrect optimization of by-reference closure fields by fre1 pass since r12-5113-gd70ef65692fced

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115033

--- Comment #18 from Jan Hubicka  ---
modref_eaf_analysis::analyze_ssa_name misinterprets EAF flags.  If dereferenced
parameter is passed (to map_iterator in the testcase) it can be returned
indirectly which in turn makes it to escape into the next function call.

I am testing:

diff --git a/gcc/ipa-modref.cc b/gcc/ipa-modref.cc
index a5adce8ea39..a4e3cc34b4d 100644
--- a/gcc/ipa-modref.cc
+++ b/gcc/ipa-modref.cc
@@ -2571,8 +2571,7 @@ modref_eaf_analysis::analyze_ssa_name (tree name, bool
deferred)
int call_flags = deref_flags
(gimple_call_arg_flags (call, i), ignore_stores);
if (!ignore_retval && !(call_flags & EAF_UNUSED)
-   && !(call_flags & EAF_NOT_RETURNED_DIRECTLY)
-   && !(call_flags & EAF_NOT_RETURNED_INDIRECTLY))
+   && !(call_flags & (EAF_NOT_RETURNED_DIRECTLY | EAF_NOT_RETURNED_INDIRECTLY)))
  merge_call_lhs_flags (call, i, name, false, true);
if (ecf_flags & (ECF_CONST | ECF_NOVOPS))
  m_lattice[index].merge_direct_load ();

[Bug lto/114501] [12/13/14/15 Regression] ICE during lto streaming

2024-07-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114501

Jan Hubicka  changed:

   What|Removed |Added

Summary|[12/13/14/15 Regression]|[12/13/14/15 Regression]
   |ICE during modref with LTO  |ICE during lto streaming
 CC||hubicka at gcc dot gnu.org
  Component|ipa |lto

--- Comment #11 from Jan Hubicka  ---
Note that this is not modref related; it is just the last pass run before
streaming.  We miss some free_lang_data I guess.  Will take a look.

[Bug ipa/67051] symtab_node::equal_address_to too conservative?

2024-06-04 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67051

--- Comment #2 from Jan Hubicka  ---
I believe that there was some discussion on this in the past.  I would be quite
happy to change the predicate to be more aggressive.  The current code
basically duplicates what the original fold-const.c did.

One problem is that we have no way to declare in a header that one symbol is an
alias of another while being defined in a different translation unit.

jan@localhost:/tmp> cat t.c
extern int a;
extern int b __attribute ((alias("a")));
jan@localhost:/tmp> gcc t.c
t.c:2:12: error: ‘b’ aliased to undefined symbol ‘a’
2 | extern int b __attribute ((alias("a")));
  |^
jan@localhost:/tmp> clang t.c
t.c:2:28: error: alias must point to a defined variable or function
2 | extern int b __attribute ((alias("a")));
  |^
t.c:2:28: note: the function or variable specified in an alias must refer to
its mangled name
1 error generated.

So if one wants to use aliases intentionally (to do something smart about
superposing), then basically the only valid testcases would be ones where
translation units never use both names together.

Also, folding is done early, when the alias may not be declared yet, but that
can be solved by a check of the symtab state.

[Bug middle-end/115277] [13/14/15 regression] ICF needs to match loop bound estimates

2024-05-29 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115277

Jan Hubicka  changed:

   What|Removed |Added

Summary|ICF needs to match loop |[13/14/15 regression] ICF
   |bound estimates |needs to match loop bound
   ||estimates

--- Comment #1 from Jan Hubicka  ---
Reproduces on 14 and trunk. GCC 12 is not able to determine the loop bound
during early optimizations

[Bug middle-end/115277] New: ICF needs to match loop bound estimates

2024-05-29 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115277

Bug ID: 115277
   Summary: ICF needs to match loop bound estimates
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

jan@localhost:/tmp> cat tt.c
int array[1000];
void
test (int a)
{
if (__builtin_expect (a > 3, 1))
return;
for (int i = 0; i < a; i++)
array[i]=i;
}
void
test2 (int a)
{
if (__builtin_expect (a > 10, 1))
return;
for (int i = 0; i < a; i++)
array[i]=i;
}
int
main()
{
test(1);
test(2);
test(3);
test2(10);
if (array[9] != 9)
__builtin_abort ();
return 0;
}
jan@localhost:/tmp> gcc -O2 tt.c ; ./a.out
jan@localhost:/tmp> gcc -O3 tt.c ; ./a.out
Aborted (core dumped)


The problem here is that we do not match value ranges, and thus we can end up
with different estimates of the number of iterations.

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-05-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

Jan Hubicka  changed:

   What|Removed |Added

Summary|[12/13/14/15 Regression]|[12/13/14 Regression] Wrong
   |Wrong code at -O with   |code at -O with ipa-modref
   |ipa-modref on aarch64   |on aarch64

--- Comment #22 from Jan Hubicka  ---
Fixed on trunk so far

[Bug libstdc++/109442] Dead local copy of std::vector not removed from function

2024-05-11 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109442

--- Comment #19 from Jan Hubicka  ---
Note that the testcase from PR115037 also shows that we are not able to
optimize out dead stores to the vector, which is another quite noticeable
problem.

void
test()
{
std::vector<int> test;
test.push_back (1);
}

We allocate the block, store 1 into it, and immediately delete it.
void test ()
{
  int * test$D25839$_M_impl$D25146$_M_start;
  struct vector test;
  int * _61;

   [local count: 1073741824]:
  _61 = operator new (4);

   [local count: 1063439392]:
  *_61 = 1;
  operator delete (_61, 4);
  test ={v} {CLOBBER};
  test ={v} {CLOBBER(eol)};
  return;

   [count: 0]:
:
  test ={v} {CLOBBER};
  resx 2

}

So my understanding is that we decided not to optimize away the dead stores
since this particular operator delete does not pass the test:

  /* If the call is to a replaceable operator delete and results
 from a delete expression as opposed to a direct call to
 such operator, then we can treat it as free.  */
  if (fndecl
  && DECL_IS_OPERATOR_DELETE_P (fndecl)
  && DECL_IS_REPLACEABLE_OPERATOR (fndecl)
  && gimple_call_from_new_or_delete (stmt))
return ". o ";

This is because we believe that operator delete may be implemented in an insane
way that inspects the values stored in the block being freed.

I can sort of see that one can write standard-conforming code that allocates
some data that is POD and inspects it in the destructor.
However, for std::vector this argument is not really applicable.  The standard
does specify that new/delete is used to allocate/deallocate the memory, but
does not say how the memory is organized or what happens before deallocation
(i.e. it is probably valid for std::vector to memset the block just before
deallocating it).

A similar argument can IMO be used for eliding unused memory allocations.  It
is kind of up to the std::vector implementation how many
allocations/deallocations it does, right?

So we need a way to annotate the new/delete calls in the standard library as
safe for such optimizations (i.e. implement clang's
__builtin_operator_new/delete?)

How does clang manage to optimize this out without additional hinting?

[Bug middle-end/115037] Unused std::vector is not optimized away.

2024-05-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115037

Jan Hubicka  changed:

   What|Removed |Added

 CC||jason at redhat dot com,
   ||jwakely at redhat dot com

--- Comment #2 from Jan Hubicka  ---
I tried to look for duplicates but did not find one.
However, I think the first problem is that we do not optimize away the store of
1 to the vector while clang does.  I think this is because we do not believe we
can trust that the delete operator is safe?

We get:
void test ()
{
  int * test$D25839$_M_impl$D25146$_M_start;
  struct vector test;
  int * _61;

   [local count: 1073741824]:
  _61 = operator new (4);

   [local count: 1063439392]:
  *_61 = 1;
  operator delete (_61, 4);
  test ={v} {CLOBBER};
  test ={v} {CLOBBER(eol)};
  return;

   [count: 0]:
:
  test ={v} {CLOBBER};
  resx 2

}
If we cannot trust that operator delete is well behaved, perhaps we can arrange
an explicit clobber before calling it?  I think it is up to std::vector to
decide what it will do with the stored array, so in this case even an insane
operator delete has no right to expect that the data in the vector will be
sane :)

[Bug middle-end/115037] New: Unused std::vector is not optimized away.

2024-05-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115037

Bug ID: 115037
   Summary: Unused std::vector is not optimized away.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Compiling 
#include <vector>
void
test()
{
std::vector<int> test;
test.push_back (1);
}

leads to

_Z4testv:
.LFB1253:
        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        movl    $4, %edi
        call    _Znwm
        movl    $4, %esi
        movl    $1, (%rax)
        movq    %rax, %rdi
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        jmp     _ZdlPvm

while clang optimizes to:

_Z4testv:   # @_Z4testv
.cfi_startproc
# %bb.0:
retq

[Bug middle-end/115036] New: division is not shortened based on value range

2024-05-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115036

Bug ID: 115036
   Summary: division is not shortened based on value range
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

For
long test(long a, long b)
{
if (a > 65535 || a < 0)
__builtin_unreachable ();
if (b > 65535 || b < 0)
__builtin_unreachable ();
return a/b;
}

we produce
test:
.LFB0:
.cfi_startproc
movq%rdi, %rax
cqto
idivq   %rsi
ret

while clang does:

test:   # @test
.cfi_startproc
# %bb.0:
movq%rdi, %rax
# kill: def $ax killed $ax killed $rax
xorl%edx, %edx
divw%si
movzwl  %ax, %eax
retq

clang also by default adds a 32-bit divide path even when the value range is not
known:

long test(long a, long b)
{
return a/b;
}

compiles as

test:   # @test
.cfi_startproc
# %bb.0:
movq%rdi, %rax
movq%rdi, %rcx
orq %rsi, %rcx
shrq$32, %rcx
je  .LBB0_1
# %bb.2:
cqto
idivq   %rsi
retq

[Bug ipa/114985] [15 regression] internal compiler error: in discriminator_fail during stage2

2024-05-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114985

--- Comment #14 from Jan Hubicka  ---
So this is a problem in ipa_value_range_from_jfunc?
It is Martin's code; I hope he will know why the types are wrong here.
One can get type-compatibility problems with mismatched declarations and LTO,
but this testcase seems to be a single file. So indeed this looks like a bug
either in jump-function construction or even earlier...

[Bug middle-end/114852] New: jpegxl 10.0.1 is faster with clang18 than with gcc14

2024-04-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114852

Bug ID: 114852
   Summary: jpegxl 10.0.1 is faster with clang18 than with gcc14
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

https://www.phoronix.com/review/gcc14-clang18-amd-zen4/3
reports about an 8% difference; I can measure 13% on zen3.  The code has changed
and is no longer bound by push_back, but runs the AVX2 version of the inner
loops.

The hottest loops look comparable.  GCC:

  0.00 │266:┌─→vmovaps  (%r14,%rax,4),%ymm0
  0.11 ││  vmulps   (%rcx,%rax,4),%ymm7,%ymm2
  1.18 ││  vfnmadd213ps (%rsi,%rax,4),%ymm11,%ymm0
  0.25 ││  vmulps   %ymm2,%ymm0,%ymm0
  5.94 ││  vroundps $0x8,%ymm0,%ymm2
  0.35 ││  vsubps   %ymm2,%ymm0,%ymm0
  1.05 ││  vmulps   (%rdx,%rax,4),%ymm0,%ymm0
  3.19 ││  vmovaps  %ymm0,0x0(%r13,%rax,4)
  0.15 ││  vandps   %ymm10,%ymm2,%ymm0
  0.03 ││  add  $0x8,%rax
  0.03 ││  vcmpeqps %ymm8,%ymm0,%ymm2
  0.09 ││  vsqrtps  %ymm0,%ymm0
 27.25 ││  vaddps   %ymm0,%ymm6,%ymm6
  0.35 ││  vandnps  %ymm9,%ymm2,%ymm0
  0.12 ││  vaddps   %ymm0,%ymm5,%ymm5
  0.05 │├──cmp  %r12,%rax
  0.02 │└──jb   266

and clang

  0.00 │ c90:┌─→vmulps   (%r9,%rdx,4),%ymm0,%ymm2
  0.97 │ │  vmovaps  (%r15,%rdx,4),%ymm1
  0.36 │ │  vsubps   %ymm2,%ymm1,%ymm1
  4.24 │ │  vmulps   (%rcx,%rdx,4),%ymm4,%ymm2
  1.92 │ │  vmulps   %ymm2,%ymm1,%ymm1
  0.65 │ │  vroundps $0x8,%ymm1,%ymm2
  0.06 │ │  vsubps   %ymm2,%ymm1,%ymm1
  1.11 │ │  vmulps   (%rax,%rdx,4),%ymm1,%ymm1
  3.53 │ │  vmovaps  %ymm1,(%rsi,%rdx,4)
  0.68 │ │  vandps   %ymm6,%ymm2,%ymm1
  0.23 │ │  vcmpneqps%ymm5,%ymm2,%ymm2
  3.64 │ │  add  $0x8,%rdx
  0.24 │ │  vsqrtps  %ymm1,%ymm1
 22.16 │ │  vaddps   %ymm1,%ymm8,%ymm8
  0.25 │ │  vbroadcastss 0x31eba5(%rip),%ymm1# 34f840
  0.05 │ │  vandps   %ymm1,%ymm2,%ymm1
  0.04 │ │  vaddps   %ymm1,%ymm7,%ymm7
  0.11 │ ├──cmp  %rdi,%rdx
  0.07 │ └──jb   c90

GCC profile:
  10.78%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::EstimateEntropy(jxl::AcStrategy const&, float, unsigned long,
unsigned long, jxl::ACSConfig const&, float con
   7.02%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::FindBestMultiplier(float const*, float const*, unsigned long,
float, float, bool) [clone .part.0]
   4.50%  cjxl libjxl.so.0.10.1   [.] void
jxl::N_AVX2::Symmetric5Row(jxl::Plane const&,
jxl::RectT const&, long, jxl:
   4.47%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous namespace)::TransformFromPixels(jxl::AcStrategy::Type,
float const*, unsigned long, float*, float*
   4.31%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous namespace)::TransformToPixels(jxl::AcStrategy::Type,
float*, float*, unsigned long, float*)
   4.00%  cjxl libjxl.so.0.10.1   [.]
jxl::ThreadPool::RunCallState const&, int const* restrict*, jxl::AcStra
   3.56%  cjxl libm.so.6  [.] __ieee754_pow_fma
   3.49%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous namespace)::IDCT1DImpl<8ul, 8ul>::operator()(float
const*, unsigned long, float*, unsigned long, f
   3.43%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous
namespace)::AdaptiveQuantizationImpl::ComputeTile(float, float,
jxl::Image3 const&, jxl::Re
   3.27%  cjxl libjxl.so.0.10.1   [.] void
jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<32ul, 0ul,
jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2:
   3.16%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<8ul, 8ul>::operator()(float*,
float*) [clone .isra.0]
   2.87%  cjxl libjxl.so.0.10.1   [.] void
jxl::N_AVX2::(anonymous namespace)::ComputeScaledIDCT<4ul,
8ul>::operator()::operator()::operator() const&, jxl::RectT
const&, jxl::DequantMatrices const&, jxl::AcStrategyImage const*,
jxl::Plane const*, jxl::Quantizer const*, jxl::Rect
   5.03%  cjxl libjxl.so.0.10.1   [.]
jxl::ThreadPool::RunCallState const&, jxl::RectT
const&, jxl::WeightsSymmetric5 const&, jxl::ThreadPool*, jxl::Pla
   4.66%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<16ul, 8ul>::operator()(float*,
float*)
   4.56%  cjxl libjxl.so.0.10.1   [.]
jxl::Th

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #9 from Jan Hubicka  ---
Phoronix still claims the difference
https://www.phoronix.com/review/gcc14-clang18-amd-zen4/2

[Bug target/113236] WebP benchmark is 20% slower vs. Clang on AMD Zen 4

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113236

--- Comment #3 from Jan Hubicka  ---
Seems this performance difference is still there on zen4:
https://www.phoronix.com/review/gcc14-clang18-amd-zen4/3

[Bug tree-optimization/114787] [13 Regression] wrong code at -O1 on x86_64-linux-gnu (the generated code hangs)

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114787

--- Comment #18 from Jan Hubicka  ---
predict.cc queries the number of iterations using number_of_iterations_exit
and loop_niter_by_eval, and finally using estimated_stmt_executions.

The first two queries do not update the upper-bound data structure, which is
why we get away without computing the bounds in some cases.

I guess we can just drop the dumping here. We now dump the recorded estimates
elsewhere, so this is somewhat redundant.

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #13 from Jan Hubicka  ---
Thanks a lot, looks great!
Do we still auto-detect memmove when the copy constructor turns out to be
memcpy equivalent after optimization?

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #9 from Jan Hubicka  ---
Your patch gives me an error compiling the testcase:

jh@ryzen3:/tmp> ~/trunk-install/bin/g++ -O3 ~/t.C 
In file included from /home/jh/trunk-install/include/c++/14.0.1/vector:65,
 from /home/jh/t.C:1:
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h: In
instantiation of ‘_ForwardIterator std::__relocate_a(_InputIterator,
_InputIterator, _ForwardIterator, _Allocator&) [with _InputIterator = const
pair*; _ForwardIterator = pair*; _Allocator = allocator >;
_Traits = allocator_traits > >]’:
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1127:31:  
required from ‘_Tp* std::__relocate_a(_Tp*, _Tp*, _Tp*, allocator<_T2>&) [with
_Tp = pair; _Up = pair]’
 1127 |   return std::__relocate_a(__cfirst, __clast, __result, __alloc);
  |  ~^~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:509:26:   required
from ‘static std::vector<_Tp, _Alloc>::pointer std::vector<_Tp,
_Alloc>::_S_relocate(pointer, pointer, pointer, _Tp_alloc_type&) [with _Tp =
std::pair; _Alloc =
std::allocator >; pointer =
std::pair*; _Tp_alloc_type =
std::vector >::_Tp_alloc_type]’
  509 | return std::__relocate_a(__first, __last, __result, __alloc);
  |~^~~~
/home/jh/trunk-install/include/c++/14.0.1/bits/vector.tcc:647:32:   required
from ‘void std::vector<_Tp, _Alloc>::_M_realloc_append(_Args&& ...) [with _Args
= {const std::pair&}; _Tp = std::pair; _Alloc = std::allocator
>]’
  647 | __new_finish = _S_relocate(__old_start, __old_finish,
  |~~~^~~
  648 |__new_start,
_M_get_Tp_allocator());
  |   
~~~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:1294:21:   required
from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp =
std::pair; _Alloc =
std::allocator >; value_type =
std::pair]’
 1294 |   _M_realloc_append(__x);
  |   ~^
/home/jh/t.C:8:25:   required from here
8 | stack.push_back (pair);
  | ^~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56:
error: use of deleted function ‘const _Tp* std::addressof(const _Tp&&) [with
_Tp = pair]’
 1084 | 
std::addressof(std::move(*__first
  | 
~~^
In file included from
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_pair.h:61,
 from
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_algobase.h:64,
 from /home/jh/trunk-install/include/c++/14.0.1/vector:62:
/home/jh/trunk-install/include/c++/14.0.1/bits/move.h:168:16: note: declared
here
  168 | const _Tp* addressof(const _Tp&&) = delete;
  |^
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56:
note: use ‘-fdiagnostics-all-candidates’ to display considered candidates
 1084 | 
std::addressof(std::move(*__first
  | 
~~^


It is easy to check whether the conversion happens: just compile the testcase
and see whether there is a memcpy or memmove in the optimized dump file (or the
final assembly).

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #8 from Jan Hubicka  ---
I had a wrong noexcept specifier.  This version works, but I still need to
inline __relocate_object_a into the loop:

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..f02d4fb878f 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1100,8 +1100,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  "relocation is only possible for values of the same type");
   _ForwardIterator __cur = __result;
   for (; __first != __last; ++__first, (void)++__cur)
-   std::__relocate_object_a(std::__addressof(*__cur),
-std::__addressof(*__first), __alloc);
+   {
+ typedef std::allocator_traits<_Allocator> __traits;
+ __traits::construct(__alloc, std::__addressof(*__cur),
std::move(*std::__addressof(*__first)));
+ __traits::destroy(__alloc,
std::__addressof(*std::__addressof(*__first)));
+   }
   return __cur;
 }

@@ -1109,8 +1112,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template 
 _GLIBCXX20_CONSTEXPR
 inline __enable_if_t::value, _Tp*>
-__relocate_a_1(_Tp* __first, _Tp* __last,
-  _Tp* __result,
+__relocate_a_1(_Tp* __restrict __first, _Tp* __last,
+  _Tp* __restrict __result,
   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
 {
   ptrdiff_t __count = __last - __first;
@@ -1147,6 +1150,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 std::__niter_base(__result), __alloc);
 }

+  template 
+_GLIBCXX20_CONSTEXPR
+inline _Tp*
+__relocate_a(_Tp* __restrict __first, _Tp* __last,
+_Tp* __restrict __result,
+allocator<_Up>& __alloc)
+noexcept(noexcept(__relocate_a_1(__first, __last, __result, __alloc)))
+{
+  return std::__relocate_a_1(__first, __last, __result, __alloc);
+}
+
   /// @endcond
 #endif // C++11

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #6 from Jan Hubicka  ---
Thanks. I thought __relocate_a only cares about whether the pointed-to type can
be bitwise copied.  It would be nice to produce memcpy early from libstdc++ for
std::pair, so the second patch makes sense to me (I did not test whether it
works).

I think it would still be nice to tell GCC that the copy loop never gets
overlapping memory locations, so the cases that are not optimized to memcpy
early can still be optimized later (or vectorized if the loop really does
something non-trivial).

So i tried your second patch fixed so it compiles:
diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..0d2e588ae5e 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1109,8 +1109,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template 
 _GLIBCXX20_CONSTEXPR
 inline __enable_if_t::value, _Tp*>
-__relocate_a_1(_Tp* __first, _Tp* __last,
-  _Tp* __result,
+__relocate_a_1(_Tp* __restrict __first, _Tp* __last,
+  _Tp* __restrict __result,
   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
 {
   ptrdiff_t __count = __last - __first;
@@ -1147,6 +1147,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 std::__niter_base(__result), __alloc);
 }

+  template 
+_GLIBCXX20_CONSTEXPR
+inline _Tp*
+__relocate_a(_Tp* __restrict __first, _Tp* __last,
+_Tp* __restrict __result,
+allocator<_Up>& __alloc)
+noexcept(std::__is_bitwise_relocatable<_Tp>::value)
+{
+  return std::__relocate_a_1(__first, __last, __result, __alloc);
+}
+
   /// @endcond
 #endif // C++11

it does not make ldist trigger, so the restrict info is still lost.  I think
the problem is that if you call __relocate_object_a, the restrict shrinks in
scope, so we only know that the elements are pairwise disjoint, not that the
vectors are.  This is because restrict is interpreted early, pre-inlining, but
that is really Richard's area.

It seems that the patch makes us go through __uninitialized_copy_a instead of
__uninit_copy. I am not even sure how these differ, so I need to stare at the
code a bit more to make sense of it :)

[Bug middle-end/114822] New: ldist should produce memcpy/memset/memmove histograms based on loop information converted

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114822

Bug ID: 114822
   Summary: ldist should produce memcpy/memset/memmove histograms
based on loop information converted
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

When a loop is converted to a string builtin we lose information about its
size.  This means that we won't expand the builtin inline when the block size
is expected to be small.  This causes a performance problem, e.g. on
std::vector and the testcase from PR114821, which at least with profile
feedback runs significantly slower than the variant where memcpy is produced
early:


#include <vector>
typedef unsigned int uint32_t;
int pair;
void
test()
{
std::vector<int> stack;
stack.push_back (pair);
while (!stack.empty()) {
int cur = stack.back();
stack.pop_back();
if (true)
{
cur++;
stack.push_back (cur);
stack.push_back (cur);
}
if (cur > 1)
break;
}
}
int
main()
{
for (int i = 0; i < 1; i++)
  test();
}

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #2 from Jan Hubicka  ---
What I am shooting for is to optimize it later in loop distribution. We can
recognize a memcpy loop if we can figure out that the source and destination
memory are different.

We can help here with restrict, but I was a bit lost in how to get that done.

This seems to do the trick, but for some reason I get memmove:

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..1a6223ea892 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1130,7 +1130,58 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
}
   return __result + __count;
 }
+
+  template 
+_GLIBCXX20_CONSTEXPR
+inline __enable_if_t::value, _Tp*>
+__relocate_a(_Tp * __restrict __first, _Tp *__last,
+_Tp * __restrict __result, _Allocator& __alloc) noexcept
+{
+  ptrdiff_t __count = __last - __first;
+  if (__count > 0)
+   {
+#ifdef __cpp_lib_is_constant_evaluated
+ if (std::is_constant_evaluated())
+   {
+ for (; __first != __last; ++__first, (void)++__result)
+   {
+ // manually inline relocate_object_a to not lose restrict
qualifiers
+ typedef std::allocator_traits<_Allocator> __traits;
+ __traits::construct(__alloc, __result, std::move(*__first));
+ __traits::destroy(__alloc, std::__addressof(*__first));
+   }
+ return __result;
+   }
 #endif
+ __builtin_memcpy(__result, __first, __count * sizeof(_Tp));
+   }
+  return __result + __count;
+}
+#endif
+
+  template 
+_GLIBCXX20_CONSTEXPR
+#if _GLIBCXX_HOSTED
+inline __enable_if_t::value, _Tp*>
+#else
+inline _Tp *
+#endif
+__relocate_a(_Tp * __restrict __first, _Tp *__last,
+_Tp * __restrict __result, _Allocator& __alloc)
+noexcept(noexcept(std::allocator_traits<_Allocator>::construct(__alloc,
+__result, std::move(*__first)))
+&& noexcept(std::allocator_traits<_Allocator>::destroy(
+   __alloc, std::__addressof(*__first
+{
+  for (; __first != __last; ++__first, (void)++__result)
+   {
+ // manually inline relocate_object_a to not lose restrict qualifiers
+ typedef std::allocator_traits<_Allocator> __traits;
+ __traits::construct(__alloc, __result, std::move(*__first));
+ __traits::destroy(__alloc, std::__addressof(*__first));
+   }
+  return __result;
+}

   template 

[Bug libstdc++/114821] New: _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

Bug ID: 114821
   Summary: _M_realloc_append should use memcpy instead of loop to
copy data when possible
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

In the testcase

#include <vector>
typedef unsigned int uint32_t;
std::pair<uint32_t, uint32_t> pair;
void
test()
{
std::vector<std::pair<uint32_t, uint32_t>> stack;
stack.push_back (pair);
while (!stack.empty()) {
std::pair<uint32_t, uint32_t> cur = stack.back();
stack.pop_back();
if (!cur.first)
{
cur.second++;
stack.push_back (cur);
stack.push_back (cur);
}
if (cur.second > 1)
break;
}
}
int
main()
{
for (int i = 0; i < 1; i++)
  test();
}

We produce _M_realloc_append, which uses a loop to copy the data instead of
memcpy.  This is bigger and slower.  The reason why __relocate_a does not use
memcpy seems to be the fact that pair has a copy constructor. The loop can
still be pattern-matched by ldist, but that fails with:

(compute_affine_dependence
  ref_a: *__first_1, stmt_a: *__cur_37 = *__first_1;
  ref_b: *__cur_37, stmt_b: *__cur_37 = *__first_1;
) -> dependence analysis failed

So we cannot disambiguate the old and new vector memory and prove that the
loop is indeed a memcpy loop. I think this failure is valid, since operator new
is not required to return fresh memory, but adding __restrict should solve it.

The problem is that I got lost on where to add the qualifiers, since
__relocate_a uses iterators instead of pointers.

[Bug tree-optimization/114787] [13/14 Regression] wrong code at -O1 on x86_64-linux-gnu (the generated code hangs)

2024-04-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114787

--- Comment #13 from Jan Hubicka  ---
-fdump-tree-all-all changing the generated code is also bad.  We probably
should avoid dumping loop bounds when they are not recorded. I added the
dumping of loop bounds, and this may be an unexpected side effect. Will take a
look.

[Bug c++/93008] Need a way to make inlining heuristics ignore whether a function is inline

2024-04-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008

--- Comment #8 from Jan Hubicka  ---
Note that the cold attribute is also quite strong, since it turns on
optimize_size codegen, which is often a lot slower.

Reading the discussion again, I don't think we have a way to make the inline
keyword ignored by the inliner.  We could add a not_really_inline attribute (a
better name would be welcome).

[Bug tree-optimization/114779] __builtin_constant_p does not work in inline functions

2024-04-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114779

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #7 from Jan Hubicka  ---
Note that the test for side effects also makes it impossible to check for
constantness of values passed to a function by reference, which could also be
useful. A workaround is to load the value into a temporary so the side effect
is not seen.  That early folding to 0 therefore never made much sense to me.

I agree that it is a can of worms and it is not clear whether changing the
behaviour would break things...

[Bug middle-end/114774] Missed DSE in simple code due to interleaving stores

2024-04-18 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774

Jan Hubicka  changed:

   What|Removed |Added

Summary|Missed DSE in simple code   |Missed DSE in simple code
   |due to other stores being   |due to interleaving stores
   |conditional |

--- Comment #1 from Jan Hubicka  ---
The other store being conditional is not the core issue. Here we miss DSE too:

#include <stdio.h>
int a;
short p,q;
void
test (int b)
{
a=1;
if (b)
  p++;
else
  q++;
a=2;
}

The problem in DSE seems to be that instead of recursively walking the
memory-SSA graph, it insists that the graph form a chain. Now, SRA leaves
stores to scalarized variables and even removes the corresponding clobbers, so
this is a relatively common scenario in non-trivial C++ code.

[Bug middle-end/114774] New: Missed DSE in simple code

2024-04-18 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774

Bug ID: 114774
   Summary: Missed DSE in simple code
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

In the following

#include <stdio.h>
int a;
short *p;
void
test (int b)
{
a=1;
if (b)
{
(*p)++;
a=2;
printf ("1\n");
}
else 
{
(*p)++;
a=3;
printf ("2\n");
}
}

We are not able to optimize out "a=1". This is a simplified real-world
scenario in which SRA does not remove the definitions of SRAed variables.

Note that clang uses a conditional move here:
test:   # @test
.cfi_startproc
# %bb.0:
movqp(%rip), %rax
incw(%rax)
xorl%eax, %eax
testl   %edi, %edi
leaq.Lstr(%rip), %rcx
leaq.Lstr.2(%rip), %rdi
cmoveq  %rcx, %rdi
sete%al
orl $2, %eax
movl%eax, a(%rip)
jmp puts@PLT# TAILCALL

[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba

2024-04-15 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596

--- Comment #19 from Jan Hubicka  ---
I looked into the remaining exit/nonexit rename discussed here earlier, before
the PR was closed. The following patch would restore the code to make the same
calls as before my patch:
PR tree-optimization/109596
* tree-ssa-loop-ch.c (ch_base::copy_headers): Fix use of exit/nonexit
edges.
diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index b7ef485c4cc..cd5f6bc3c2a 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -952,13 +952,13 @@ ch_base::copy_headers (function *fun)
   if (!single_pred_p (nonexit->dest))
{
  header = split_edge (nonexit);
- exit = single_pred_edge (header);
+ nonexit = single_pred_edge (header);
}

   edge entry = loop_preheader_edge (loop);

   propagate_threaded_block_debug_into (nonexit->dest, entry->dest);
-  if (!gimple_duplicate_seme_region (entry, exit, bbs, n_bbs, copied_bbs,
+  if (!gimple_duplicate_seme_region (entry, nonexit, bbs, n_bbs,
copied_bbs,
 true))
{
  delete candidate.static_exits;

I however convinced myself this is a noop: both the exit and nonexit edges
have the same source basic block.

propagate_threaded_block_debug_into walks the predecessors of its first
parameter and moves debug statements to the second parameter, so it does the
same job, since the split BB is empty.

gimple_duplicate_seme_region uses the parameter to update the loop header, but
it does not do that correctly for loop-header copying, and we redo it in
tree-ssa-loop-ch.

Still, the code as it is now in trunk is very confusing, so perhaps we should
update it?

[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0

2024-04-15 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208

--- Comment #28 from Jan Hubicka  ---
So the main problem is that in t2 we have

_ZN6vectorI12QualityValueEC1ERKS1_/7 (vector<_Tp>::vector(const vector<_Tp>&)
[with _Tp = QualityValue])
  Type: function definition analyzed alias cpp_implicit_alias
  Visibility: semantic_interposition public weak comdat
comdat_group:_ZN6vectorI12QualityValueEC5ERKS1_ one_only
  Same comdat group as: _ZN6vectorI12QualityValueEC2ERKS1_/6
  References: _ZN6vectorI12QualityValueEC2ERKS1_/6 (alias) 
  Referring: 
  Function flags:
  Called by: _Z41__static_initialization_and_destruction_0v/8 (can throw
external)
  Calls: 

and in t1 we have

_ZN6vectorI12QualityValueEC1ERKS1_/2 (constexpr vector<_Tp>::vector(const
vector<_Tp>&) [with _Tp = QualityValue])
  Type: function definition
  Visibility: semantic_interposition external public weak comdat
comdat_group:_ZN6vectorI12QualityValueEC1ERKS1_ one_only
  References: 
  Referring:
  Function flags:
  Called by: 
  Calls: 

This is the same symbol name, but in two different comdat groups (C1 compared
to C5).  With -O0 both seem to get the C5 group.

I can silence the ICE by making aliases undefined during symbol merging (which
is a bit of a hack but should keep the sanity checks happy), but I am still at
a loss as to how this is supposed to work in valid code.

[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0

2024-04-15 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208

--- Comment #27 from Jan Hubicka  ---
OK, but the problem is the same. Having comdats with the same key define
different sets of public symbols is IMO not a good situation for either non-LTO
or LTO builds.
Unless the additional alias is never used by valid code (which would make it
useless, and we probably should not generate it), it should be possible to
produce a scenario where the linker picks the wrong version of the comdat and
we get an undefined symbol in non-LTO builds...

[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0

2024-04-15 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208

--- Comment #25 from Jan Hubicka  ---
So we have a comdat group that diverges between t1.o and t2.o.  In one object
it contains an alias, while in the other it does not:
Merging nodes for _ZN6vectorI12QualityValueEC2ERKS1_. Candidates:
_ZN6vectorI12QualityValueEC2ERKS1_/1 (__ct_base )
  Type: function definition analyzed
  Visibility: externally_visible semantic_interposition prevailing_def_ironly
public weak comdat comdat_group:_ZN6vectorI12QualityValueEC2ERKS1_ one_only
  next sharing asm name: 19
  References: 
  Referring:  
  Read from file: t1.o
  Unit id: 1
  Function flags: count:1073741824 (estimated locally)
  Called by: _Z1n1k/6 (1073741824 (estimated locally),1.00 per call) (can throw
external)
  Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/10 (1073741824 (estimated
locally),1.00 per call) (can throw external)
_ZNK12_Vector_baseI12QualityValueE1gEv/9 (1073741824 (estimated locally),1.00
per call) (can throw external)
_ZN6vectorI12QualityValueEC2ERKS1_/19 (__ct_base )
  Type: function definition analyzed
  Visibility: externally_visible semantic_interposition preempted_ir public
weak comdat comdat_group:_ZN6vectorI12QualityValueEC5ERKS1_ one_only
  Same comdat group as: _ZN6vectorI12QualityValueEC1ERKS1_/20
  previous sharing asm name: 1
  References: 
  Referring: _ZN6vectorI12QualityValueEC1ERKS1_/20 (alias)
  Read from file: t2.o
  Unit id: 2
  Function flags: count:1073741824 (estimated locally)
  Called by:
  Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/23 (1073741824 (estimated
locally),1.00 per call) (can throw external)
_ZNK12_Vector_baseI12QualityValueE1gEv/24 (1073741824 (estimated locally),1.00
per call) (can throw external)
After resolution:
_ZN6vectorI12QualityValueEC2ERKS1_/1 (__ct_base )
  Type: function definition analyzed
  Visibility: externally_visible semantic_interposition prevailing_def_ironly
public weak comdat comdat_group:_ZN6vectorI12QualityValueEC2ERKS1_ one_only
  next sharing asm name: 19
  References: 
  Referring: 
  Read from file: t1.o
  Unit id: 1
  Function flags: count:1073741824 (estimated locally)
  Called by: _Z1n1k/6 (1073741824 (estimated locally),1.00 per call) (can throw
external)
  Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/10 (1073741824 (estimated
locally),1.00 per call) (can throw external)
_ZNK12_Vector_baseI12QualityValueE1gEv/9 (1073741824 (estimated locally),1.00
per call) (can throw external)

We opt for the version without the alias and later ICE in the sanity check
verifying that aliases have the same comdat group as their targets.

I wonder how this is ice-on-valid-code, since with normal linking the aliased
symbol may or may not appear in the winning comdat group, so using the alias
has to break.

If constexpr changes how the constructor is generated, isn't this a violation
of the ODR?

We probably can go and reset every node in the losing comdat group to silence
the ICE and get an undefined symbol instead.

[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172

2024-04-09 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291

--- Comment #8 from Jan Hubicka  ---
I am not sure this ought to be P1:
 - the compilation technically is finite, just not in reasonable time
 - it is possible to adjust the testcase (do the early inlining manually) and
get the same endless build on release branches
 - if you ask for an inline bomb, you get one.

But after some more testing, I do not see a reasonably easy way to get better
diagnostics. So I will retest the patch from comment #6 and go ahead with it.

[Bug ipa/113359] [13/14 Regression] LTO miscompilation of ceph on aarch64 and x86_64

2024-04-04 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359

--- Comment #23 from Jan Hubicka  ---
The patch looks reasonable.  We probably could hash the padding vectors at
summary generation time to reduce WPA overhead, but that can be done
incrementally next stage1.
I however wonder if we really guarantee to copy the paddings everywhere else
than in the total scalarization part
(i.e. on all paths through the RTL expansion)?

[Bug ipa/109817] internal error in ICF pass on Ada interfaces

2024-04-02 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109817

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #5 from Jan Hubicka  ---
That check was added to verify that we do not lose the thunk annotations.  Now
that the datastructure is stable, I think we can simply drop it if that makes
Ada work.

[Bug gcov-profile/113765] [14 Regression] ICE: autofdo: val-profiler-threads-1.c compilation, error: probability of edge from entry block not initialized

2024-03-26 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113765

--- Comment #6 from Jan Hubicka  ---
Running auto-fdo without guessing branch probabilities is a somewhat odd idea
in general.  I suppose we can indeed just avoid setting the full_profile flag.
Though the optimization passes are not that well tested to work with non-full
profiles, so there is some risk that the resulting code will be worse than
without auto-FDO.

[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba

2024-03-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot 
gnu.org

--- Comment #7 from Jan Hubicka  ---
Found it, probably. I renamed exit to nonexit (since the name was misleading)
and then forgot to update
 propagate_threaded_block_debug_into (exit->dest, entry->dest);

I will check this after teaching (which I have in 10 mins)

[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba

2024-03-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596

--- Comment #6 from Jan Hubicka  ---
On this testcase trunk gets the same dump as gcc13 for the pass just before ch2

with ch2 we get:
@@ -192,9 +236,8 @@
   # DEBUG BEGIN_STMT
   goto ; [100.00%]

-   [local count: 954449105]:
+   [local count: 954449104]:
   # j_15 = PHI 
-  # DEBUG j => j_15
   # DEBUG BEGIN_STMT
   a[b_14][j_15] = 0;
   # DEBUG BEGIN_STMT
@@ -203,29 +246,30 @@
   # DEBUG j => j_9
   # DEBUG BEGIN_STMT
   if (j_9 <= 7)
-goto ; [88.89%]
+goto ; [87.50%]
   else
-goto ; [11.11%]
+goto ; [12.50%]

[local count: 119292720]:
+  # DEBUG j => 0
   # DEBUG BEGIN_STMT
   b_7 = b_14 + 1;
   # DEBUG b => b_7
   # DEBUG b => b_7
   # DEBUG BEGIN_STMT
   if (b_7 <= 6)
-goto ; [87.50%]
+goto ; [85.71%]
   else
-goto ; [12.50%]
+goto ; [14.29%]

[local count: 119292720]:
   # b_14 = PHI 
-  # DEBUG b => b_14
   # DEBUG j => 0
   # DEBUG BEGIN_STMT
   goto ; [100.00%]

[local count: 17041817]:
+  # DEBUG b => 0
   # DEBUG BEGIN_STMT
   optimize_me_not ();
   # DEBUG BEGIN_STMT


So in addition to updating the BB profile, we indeed end up moving debug
statements around.

The change of dump is:
+  Analyzing: if (b_1 <= 6)
+Will eliminate peeled conditional in bb 6.
+May duplicate bb 6
+  Not duplicating bb 8: it is single succ.
+  Analyzing: if (j_2 <= 7)
+Will eliminate peeled conditional in bb 4.
+May duplicate bb 4
+  Not duplicating bb 3: it is single succ.
 Loop 2 is not do-while loop: latch is not empty.
+Duplicating header BB to obtain do-while loop
 Copying headers of loop 1
 Will duplicate bb 6
-  Not duplicating bb 8: it is single succ.
-Duplicating header of the loop 1 up to edge 6->8, 2 insns.
+Duplicating header of the loop 1 up to edge 6->7
 Loop 1 is do-while loop
 Loop 1 is now do-while loop.
+Exit count: 17041817 (estimated locally)
+Entry count: 17041817 (estimated locally)
+Peeled all exits: decreased number of iterations of loop 1 by 1.
 Copying headers of loop 2
 Will duplicate bb 4
-  Not duplicating bb 3: it is single succ.
-Duplicating header of the loop 2 up to edge 4->3, 2 insns.
+Duplicating header of the loop 2 up to edge 4->5
 Loop 2 is do-while loop
 Loop 2 is now do-while loop.
+Exit count: 119292720 (estimated locally)
+Entry count: 119292720 (estimated locally)
+Peeled all exits: decreased number of iterations of loop 2 by 1.

Dumps moved around, but we do the same duplications as before (BB6 and BB4 to
eliminate the conditionals).

   [local count: 1073741824]:
  # j_2 = PHI <0(8), j_9(3)>
  # DEBUG j => j_2
  # DEBUG BEGIN_STMT
  if (j_2 <= 7)
goto ; [88.89%]
  else
goto ; [11.11%]

   [local count: 136334537]:
  # b_1 = PHI <0(2), b_7(5)>
  # DEBUG b => b_1
  # DEBUG BEGIN_STMT
  if (b_1 <= 6)
goto ; [87.50%]
  else
goto ; [12.50%]

[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba

2024-03-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596

--- Comment #4 from Jan Hubicka  ---
The change makes loop iteration estimates more realistic, but does not
introduce any new code that actually changes the IL, so it seems this makes an
existing problem more visible.  I will try to debug what happens.

[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-03-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #59 from Jan Hubicka  ---
just to explain what happens in the testcase.  There are test and testb. They
are almost the same:

int
testb(void)
{
  struct bar *fp;
  test2 ((void *)&fp);
  fp = NULL;
  (*ptr)++;
  test3 ((void *)&fp);
}
The difference is in the alias set of fp. In one case it aliases with the
(*ptr)++ while in the other it does not.  This makes one function have a jump
function specifying an aggregate value of 0 for *fp, while the other does not.

Now with LTO both struct bar and struct foo become compatible for TBAA, so the
functions get merged and the winning variant has the jump function specifying
aggregate 0, which is wrong in the context in which the code is invoked.

[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-03-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #58 from Jan Hubicka  ---
Created attachment 57702
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57702&action=edit
Compare value ranges in jump functions

This patch implements the jump function compare; however, it is not good
enough.  Here is another wrong-code example:

jh@ryzen3:~/gcc/build/stage1-gcc> cat a.c
#include 
#include 

__attribute__((used)) int val,val2 = 1;

struct foo {int a;};

struct foo **ptr;

__attribute__ ((noipa))
int
test2 (void *a)
{ 
  ptr = (struct foo **)a;
}
int test3 (void *a);

int
test(void)
{ 
  struct foo *fp;
  test2 ((void *)&fp);
  fp = NULL;
  (*ptr)++;
  test3 ((void *)&fp);
}

int testb (void);

int
main()
{ 
  for (int i = 0; i < val2; i++)
  if (val)
testb ();
  else
test();
}
jh@ryzen3:~/gcc/build/stage1-gcc> cat b.c
#include 
struct bar {int a;};
struct foo {int a;};
struct barp {struct bar *f; struct bar *g;};
extern struct foo **ptr;
int test2 (void *);
int test3 (void *);
int
testb(void)
{
  struct bar *fp;
  test2 ((void *)&fp);
  fp = NULL;
  (*ptr)++;
  test3 ((void *)&fp);
}
jh@ryzen3:~/gcc/build/stage1-gcc> cat c.c
#include 
__attribute__ ((noinline))
int
test3 (void *a)
{
  if (!*(void **)a)
  abort ();
  return 0;
}
jh@ryzen3:~/gcc/build/stage1-gcc> ./xgcc -B ./ -O3 a.c b.c -flto -c ; ./xgcc -B
./ -O3 c.c -flto -fno-strict-aliasing -c ; ./xgcc  -B ./ b.o a.o c.o ; ./a.out
Aborted (core dumped)
jh@ryzen3:~/gcc/build/stage1-gcc> ./xgcc -B ./ -O3 a.c b.c -flto -c ; ./xgcc -B
./ -O3 c.c -flto -fno-strict-aliasing -c ; ./xgcc  -B ./ b.o a.o c.o
--disable-ipa-icf ; ./a.out 
lto1: note: disable pass ipa-icf for functions in the range of [0, 4294967295]
lto1: note: disable pass ipa-icf for functions in the range of [0, 4294967295]

[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-03-13 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #55 from Jan Hubicka  ---
> Anyway, can we in the spot my patch changed just walk all 
> source->node->callees > cgraph_edges, for each of them find the corresponding 
> cgraph_edge in the alias > and for each walk all the jump_functions recorded 
> and union their m_vr?
> Or is that something that can't be done in LTO for some reason?

That was my first idea too, but the problem is that icf has (very limited)
support for matching functions which differ in the order of their basic
blocks: it computes a hash of every basic block and orders them by their hash
prior to comparing. This seems half-finished since e.g. the order of edges in
PHIs has to match exactly.

Callee lists are officially randomly ordered, but in practice they follow the
order of basic blocks (as they are built this way).  However since BB orders
can differ, just walking both callee sequences and comparing pairwise does not
work. This also makes merging the information harder, since we no longer have
the BB map at the time we decide to merge.

It is however not hard to match the jump functions while walking the gimple
bodies and comparing statements, which is backportable and localized. I am
still waiting for my statistics to converge and will send it soon.

[Bug ipa/106716] Identical Code Folding (-fipa-icf) confuses between functions with different [[likely]] attributes

2024-03-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106716

--- Comment #6 from Jan Hubicka  ---
The reason why GIMPLE_PREDICT is ignored is that it is never used after
ipa-icf and gets removed at the very beginning of late optimizations.

GIMPLE_PREDICT is consumed by the profile_generate pass which is run before
ipa-icf.  The reason why GIMPLE_PREDICT statements are not stripped before ICF
is early inlining.  If we early inline, we throw away the callee's profile and
estimate it again (in the context of the function it was inlined into), and
for that it is a good idea to keep predicts.

There is no convenient place to remove them after early inlining is done and
before the IPA passes, and that is the only reason why they are around.  We
may revisit that, since streaming them to LTO bytecode is probably more
harmful than adding an extra pass after early opts to strip them.

ICF doesn't have code to compare edge profiles and stmt histograms.  It knows
how to merge them (so the resulting BB profile is consistent with merging),
but I suppose we may want to have some threshold so that we do not merge
functions with very different branch probabilities in the hot parts of their
bodies...

[Bug lto/114241] False-positive -Wodr warning when using -flto and -fno-semantic-interposition

2024-03-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114241

Jan Hubicka  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #3 from Jan Hubicka  ---
Mine. Will debug why the tables diverge.

[Bug debug/92387] [11/12/13 Regression] gcc generates wrong debug information at -O1 since r10-1907-ga20f263ba1a76a

2024-03-04 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92387

--- Comment #5 from Jan Hubicka  ---
The revision changes inlining decisions, so it would probably be possible to
reproduce the problem without that change with the right always_inline and
noinline attributes.

[Bug tree-optimization/114207] [12/13/14 Regression] modref gets confused by vecotorized code ` -O3 -fno-tree-forwprop` since r12-5439

2024-03-03 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114207

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot 
gnu.org

--- Comment #3 from Jan Hubicka  ---
mine.

The summary is:
  loads:
  Base 0: alias set 1
Ref 0: alias set 1
  access: Parm 0 param offset:4 offset:0 size:64 max_size:64
  stores:
  Base 0: alias set 1
Ref 0: alias set 1
  access: Parm 0 param offset:0 offset:0 size:64 max_size:64

while with fwprop we get:
  loads:
  Base 0: alias set 1
Ref 0: alias set 1
  access: Parm 0 param offset:0 offset:0 size:64 max_size:64
  stores:
  Base 0: alias set 1
Ref 0: alias set 1
  access: Parm 0 param offset:0 offset:0 size:64 max_size:64

So it seems that the offset is accounted incorrectly.

[Bug lto/85432] Wodr can be more verbose for C code

2024-03-03 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85432

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |WORKSFORME

--- Comment #1 from Jan Hubicka  ---
This has been solved for a long time.  We recognize ODR types by mangled names
produced only by the C++ frontend.  I checked that GCC 12, 13 and trunk do not
produce the warning.

[Bug tree-optimization/114052] [11/12/13/14 Regression] Wrong code at -O2 for well-defined infinite loop

2024-02-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114052

--- Comment #5 from Jan Hubicka  ---
So if I understand it right, you want to determine the property that if the
loop header is executed, then the BB containing undefined behavior at that
iteration will be executed, too.

modref tracks whether a function will always return, and if it cannot
determine that, it sets the side_effect flag, so you can check for that in the
modref summary.  It uses finite_function_p, which was originally done for
pure/const detection and is implemented by checking the loop nest for whether
all loops are known to be finite, and also by checking for irreducible loops.

In your setup you probably also want to check for volatile asms, which are
also possibly infinite. In modref we get around this by considering them to be
side-effects anyway.


There is also determine_unlikely_bbs, which tries to set the profile_count to
zero for as many basic blocks as possible by propagating backward and forward
from basic blocks containing undefined behaviour or cold noreturn calls.

The backward walk can be used to determine the property that executing the
header implies UB.  It stops on all loops, though. In this case it would be
nice to walk through loops known to be finite...

[Bug ipa/108802] [11/12/13/14 Regression] missed inlining of call via pointer to member function

2024-02-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108802

--- Comment #5 from Jan Hubicka  ---
I don't think we can reasonably expect every caller of a lambda function to be
early inlined, so we need to extend ipa-prop to understand the obfuscated
code.  I discussed that with Martin some time ago - I think this is quite a
common problem with modern C++, so we will need to pattern match this, which
is quite unfortunate.

[Bug ipa/111960] [14 Regression] ICE: during GIMPLE pass: rebuild_frequencies: SIGSEGV (Invalid read of size 4) with -fdump-tree-rebuild_frequencies-all

2024-02-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111960

--- Comment #5 from Jan Hubicka  ---
hmm. cfg.cc:815 for me is:
fputs (", maybe hot", outf);
which seems quite safe.

The problem does not seem to reproduce for me:
jh@ryzen3:~/gcc/build/gcc> ./xgcc -B ./  tt.c -O
--param=max-inline-recursive-depth=100 -fdump-tree-rebuild_frequencies-all
-wrapper valgrind
==25618== Memcheck, a memory error detector
==25618== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==25618== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==25618== Command: ./cc1 -quiet -iprefix
/home/jh/gcc/build/gcc/../lib64/gcc/x86_64-pc-linux-gnu/14.0.1/ -isystem
./include -isystem ./include-fixed tt.c -quiet -dumpdir a- -dumpbase tt.c
-dumpbase-ext .c -mtune=generic -march=x86-64 -O
-fdump-tree-rebuild_frequencies-all --param=max-inline-recursive-depth=100
-o /tmp/ccpkfjdK.s
==25618== 
==25618== 
==25618== HEAP SUMMARY:
==25618== in use at exit: 1,818,714 bytes in 1,175 blocks
==25618==   total heap usage: 39,645 allocs, 38,470 frees, 12,699,874 bytes
allocated
==25618== 
==25618== LEAK SUMMARY:
==25618==definitely lost: 0 bytes in 0 blocks
==25618==indirectly lost: 0 bytes in 0 blocks
==25618==  possibly lost: 8,032 bytes in 1 blocks
==25618==still reachable: 1,810,682 bytes in 1,174 blocks
==25618== suppressed: 0 bytes in 0 blocks
==25618== Rerun with --leak-check=full to see details of leaked memory
==25618== 
==25618== For lists of detected and suppressed errors, rerun with: -s
==25618== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==25627== Memcheck, a memory error detector
==25627== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==25627== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==25627== Command: ./as --64 -o /tmp/ccp5TNme.o /tmp/ccpkfjdK.s
==25627== 
==25637== Memcheck, a memory error detector
==25637== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==25637== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==25637== Command: ./collect2 -plugin ./liblto_plugin.so
-plugin-opt=./lto-wrapper -plugin-opt=-fresolution=/tmp/cclWZD7F.res
-plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s
-plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc
-plugin-opt=-pass-through=-lgcc_s --eh-frame-hdr -m elf_x86_64 -dynamic-linker
/lib64/ld-linux-x86-64.so.2 /lib/../lib64/crt1.o /lib/../lib64/crti.o
./crtbegin.o -L. -L/lib/../lib64 -L/usr/lib/../lib64 /tmp/ccp5TNme.o -lgcc
--push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed
-lgcc_s --pop-state ./crtend.o /lib/../lib64/crtn.o
==25637== 
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld:
/lib/../lib64/crt1.o: in function `_start':
/home/abuild/rpmbuild/BUILD/glibc-2.38/csu/../sysdeps/x86_64/start.S:103:(.text+0x2b):
undefined reference to `main'
collect2: error: ld returned 1 exit status
==25637== 
==25637== HEAP SUMMARY:
==25637== in use at exit: 89,760 bytes in 39 blocks
==25637==   total heap usage: 175 allocs, 136 frees, 106,565 bytes allocated
==25637== 
==25637== LEAK SUMMARY:
==25637==definitely lost: 0 bytes in 0 blocks
==25637==indirectly lost: 0 bytes in 0 blocks
==25637==  possibly lost: 0 bytes in 0 blocks
==25637==still reachable: 89,760 bytes in 39 blocks
==25637==   of which reachable via heuristic:
==25637== newarray   : 1,544 bytes in 1 blocks
==25637== suppressed: 0 bytes in 0 blocks
==25637== Rerun with --leak-check=full to see details of leaked memory
==25637== 
==25637== For lists of detected and suppressed errors, rerun with: -s
==25637== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

[Bug middle-end/113907] [12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

Jan Hubicka  changed:

   What|Removed |Added

Summary|[14 regression] ICU |[12/13/14 regression] ICU
   |miscompiled since on x86|miscompiled since on x86
   |since   |since
   |r14-5109-ga291237b628f41|r14-5109-ga291237b628f41

--- Comment #41 from Jan Hubicka  ---
OK, the reason why this does not work is that ranger ignores earlier value
ranges on everything but default defs and phis.

// This is where the ranger picks up global info to seed initial
// requests.  It is a slightly restricted version of
// get_range_global() above.
//
// The reason for the difference is that we can always pick the
// default definition of an SSA with no adverse effects, but for other
// SSAs, if we pick things up to early, we may prematurely eliminate
// builtin_unreachables.
//
// Without this restriction, the test in g++.dg/tree-ssa/pr61034.C has
// all of its unreachable calls removed too early.
//
// See discussion here:
// https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571709.html

void
gimple_range_global (vrange &r, tree name, struct function *fun)
{
  tree type = TREE_TYPE (name);
  gcc_checking_assert (TREE_CODE (name) == SSA_NAME);

  if (SSA_NAME_IS_DEFAULT_DEF (name) || (fun && fun->after_inlining)
  || is_a (SSA_NAME_DEF_STMT (name)))
{ 
  get_range_global (r, name, fun);
  return;
}
  r.set_varying (type);
}


This makes ipa-prop ignore the earlier known value range and masks the bug.
However, adding a PHI makes the problem reproduce:
#include 
#include 
int data[100];
int c;

static __attribute__((noinline))
int bar (int d, unsigned int d2)
{
  if (d2 > 30)
  c++;
  return d + d2;
}
static int
test2 (unsigned int i)
{
  if (i > 100)
__builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
return data[i];
  for (int j = 0; j < 100; j++)
data[i] += bar (data[j], i&1 ? i+17 : i + 16);
  return data[i];
}

static int
test (unsigned int i)
{
  if (i > 10)
__builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
return data[i];
  for (int j = 0; j < 100; j++)
data[i] += bar (data[j], i&1 ? i+17 : i + 16);
  return data[i];
}
int
main ()
{
  int ret = test (1) + test (2) + test (3) + test2 (4) + test2 (30);
  if (!c)
  abort ();
  return ret;
}

This fails with trunk, gcc12 and gcc13 and also with Jakub's patch.

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #39 from Jan Hubicka  ---
This testcase
#include 
int data[100];

__attribute__((noinline))
int bar (int d, unsigned int d2)
{
  if (d2 > 10)
printf ("Bingo\n");
  return d + d2;
}

int
test2 (unsigned int i)
{
  if (i > 10)
__builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
return data[i];
  printf ("%i\n",i);
  for (int j = 0; j < 100; j++)
data[i] += bar (data[j], i+17);
  return data[i];
}
int
test (unsigned int i)
{
  if (i > 100)
__builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
return data[i];
  printf ("%i\n",i);
  for (int j = 0; j < 100; j++)
data[i] += bar (data[j], i+17);
  return data[i];
}
int
main ()
{
  test (1);
  test (2);
  test (3);
  test2 (4);
  test2 (100);
  return 0;
}

gets me most of what I want to reproduce the ipa-prop problem.  Functions test
and test2 are split with different value ranges visible in the fnsplit dump.
However, curiously enough, ipa-prop analysis seems to ignore the value ranges
and does not attach them to the jump function, which is odd...

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-15 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #31 from Jan Hubicka  ---
Having a testcase is great. I was just playing with crafting one.
I am still concerned about value ranges in ipa-prop's jump functions.
Let me see if I can modify the testcase to also trigger the problem with value
ranges in ipa-prop jump functions.

Not streaming value ranges is an omission on my side (I mistakenly assumed we
do stream them).  We ought to stream them, since otherwise we will lose
propagated return value ranges in partitioned programs, which is a pity.

[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172

2024-02-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291

--- Comment #6 from Jan Hubicka  ---
Created attachment 57427
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57427&action=edit
patch

The patch makes compilation finish in reasonable time.
I ended up needing to drop DISREGARD_INLINE_LIMITS in late inlining for
functions with self-recursive always_inlines, since these grow large quickly
and even non-recursive inlining is too slow.  We also end up with quite ugly
diagnostics of the form:

tt.c:13:1: error: inlining failed in call to ‘always_inline’ ‘f1’: --param
max-inline-insns-auto limit reached
   13 | f1 (void)
  | ^~
tt.c:17:3: note: called from here
   17 |   f1 ();
  |   ^
tt.c:6:1: error: inlining failed in call to ‘always_inline’ ‘f0’: --param
max-inline-insns-auto limit reached
6 | f0 (void)
  | ^~
tt.c:16:3: note: called from here
   16 |   f0 ();
  |   ^
tt.c:13:1: error: inlining failed in call to ‘always_inline’ ‘f1’: --param
max-inline-insns-auto limit reached
   13 | f1 (void)
  | ^~
tt.c:15:3: note: called from here
   15 |   f1 ();
  |   ^
In function ‘f1’,
inlined from ‘f0’ at tt.c:8:3,


which is quite large, so I cannot add it to the testsuite.  I will see if I
can reduce this even more.

[Bug middle-end/111054] [14 Regression] ICE: in to_sreal, at profile-count.cc:472 with -O3 -fno-guess-branch-probability since r14-2967

2024-02-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111054

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Jan Hubicka  ---
Fixed.

[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172

2024-02-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291

--- Comment #5 from Jan Hubicka  ---
There is a cap in want_inline_self_recursive_call_p which gives up on inlining
after reaching the max recursive inlining depth of 8. The problem is that the
tree here is too wide: after early inlining, f0 contains 4 calls to f1 and 3
calls to f0, and similarly for f1, so we have something like (9+3*9)^8 as a
cap on the number of inlines, which takes a while to converge.

One may want to limit the number of copies of function A within function B
rather than the depth, but that number can be large even for sane code.

I am making a patch to make the inliner ignore always_inline on all
self-recursive inline decisions.

[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172

2024-02-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291

--- Comment #4 from Jan Hubicka  ---
There is a cap in want_inline_self_recursive_call_p which gives up on inlining
after reaching max recursive inlining depth of 8. Problem is that the tree here
is too wide. After early inlining f0 contains 4 calls to f1 and

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #29 from Jan Hubicka  ---
The safest fix is to make equals_p reject merging functions with different
value ranges assigned to corresponding SSA names.  I would hope that, since
early opts are still mostly local, this does not lead to a very large
degradation. This is lame, of course.

If we go for smarter merging, we need to also handle ipa-prop jump functions.
In that case I think equals_p needs to check whether the value ranges in
SSA_NAMEs and jump functions differ and, if so, keep that noted so the merging
code can do the corresponding update.  I will check how hard it is to
implement this.
(Equality handling is Martin Liska's code, but if I recall right, each
equivalence class has a leader, and we can keep track of whether there are
some differences with respect to that leader, but I do not recall how
subdivision of equivalence classes is handled.)

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-13 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #13 from Jan Hubicka  ---
So my understanding is that ivopts does something like

 offset = &base2 - &base1

and then translate
 val = base2[i]
to
 val = *((base1+i)+offset)

Where (base1+i) is then an iv variable.

I wonder if we consider doing a memory reference with the base changed via an
offset a valid transformation. Is there a way to tell when this happens?
A quick fix would be to run IPA modref before ivopts, but I do not see how
such a transformation can work with the rest of the alias analysis (PTA etc.)

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #8 from Jan Hubicka  ---
I will take a look.  Mod-ref only reuses the code detecting errneous paths in
ssa-split-paths, so that code will get confused, too. It makes sense for ivopts
to compute difference of two memory allocations, but I wonder if that won't
also confuse PTA and other stuff, so perhaps we need way to exlicitely tag
memory location where such optimization happen? (to make it clear that original
base is lost, or keep track of it)

[Bug ipa/113359] [13 Regression] LTO miscompilation of ceph on aarch64

2024-02-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359

--- Comment #11 from Jan Hubicka  ---
If there are two ODR types with the same ODR name, one with an integer and the
other with a pointer type as the third field, then indeed we should get an ODR
warning and give up on handling them as ODR types for type merging.

So dumping their assembler names would be a useful starting point.

Of course, if you have two ODR types with different names but you mix them up
in a COMDAT function of the same name, then the warning will not trigger, so
this might be some missing type compatibility check in ipa-sra or the ipa-prop
summary, too.

[Bug ipa/97119] Top level option to disable creation of IPA symbols such as .localalias is desired

2024-02-02 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97119

--- Comment #7 from Jan Hubicka  ---
Local aliases are created by the ipa-visibility pass.  The most common case is
that a function is declared inline but ELF interposition rules say that the
symbol can be overwritten by a different library.  Since GCC knows that all
implementations must be equivalent, it can force calls within the DSO to be
direct.

I am not quite sure how this confuses stack unwinding on Solaris?

For live patching, if you want to patch an inline function, one definitely
needs to look for the places it has been inlined to. However, in the situation
where the function got offlined, I think live patching should just work, since
it will place a jump at the beginning of the function body.

The logic for creating local aliases is in ipa-visibility.cc.  Adding a
command-line option to control it is not hard. There are other transformations
we do there - like breaking up comdat groups and other things.

part aliases are controlled by -fno-partial-inlining, isra by -fno-ipa-sra.
There is also ipa-cp, controlled by -fno-ipa-cp.
We also emit aliases as part of OpenMP offloading and LTO partitioning; those
are kind of mandatory (there is no way to produce correct code without them).

[Bug ipa/113422] Missed optimizations in the presence of pointer chains

2024-01-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113422

--- Comment #2 from Jan Hubicka  ---
Cycling read-only var discovery would be quite expensive, since you need to
interleave it with early opts each round.  I wonder how llvm handles this?

I think there is more hope with IPA-PTA getting scalable version at -O2 and
possibly being able to solve this.

[Bug ipa/113520] ICE with mismatched types with LTO (tree check: expected array_type, have integer_type in array_ref_low_bound)

2024-01-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113520

--- Comment #8 from Jan Hubicka  ---
I think the ipa-cp summaries should be used only when the types match. At least
Martin added type streaming for all the jump functions.  So we are missing a
check?

[Bug tree-optimization/110852] [14 Regression] ICE: in get_predictor_value, at predict.cc:2695 with -O -fno-tree-fre and __builtin_expect() since r14-2219-geab57b825bcc35

2024-01-17 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110852

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Jan Hubicka  ---
Fixed.

[Bug c++/109753] [13/14 Regression] pragma GCC target causes std::vector not to compile (always_inline on constructor)

2024-01-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109753

--- Comment #12 from Jan Hubicka  ---
I think this is a problem with two meanings of always_inline. One is "it must
be inlined or otherwise we will not be able to generate code" other is
"disregard inline limits".

I guess practical solution here would be to ingore always inline for functions
called from static construction wrappers (since they only optimize around array
of function pointers). Question is how to communicate this down from FE to
ipa-inline...

[Bug middle-end/79704] [meta-bug] Phoronix Test Suite compiler performance issues

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79704
Bug 79704 depends on bug 109811, which changed state.

Bug 109811 Summary: libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #19 from Jan Hubicka  ---
I think we can declare this one fixed.

[Bug target/113236] WebP benchmark is 20% slower vs. Clang on AMD Zen 4

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113236

Jan Hubicka  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2024-01-05
 CC||hubicka at gcc dot gnu.org
 Status|UNCONFIRMED |NEW

--- Comment #2 from Jan Hubicka  ---
On zen3 I get 0.75MP/s for GCC and 0.80MP/s for clang, so only a 6.6%
difference, but it seems reproducible.

Profile looks comparable:

gcc:

  30.96%  cwebp  libwebp.so.7.1.5  [.] GetCombinedEntropyUnre
  26.19%  cwebp  libwebp.so.7.1.5  [.] VP8LHashChainFill
   3.34%  cwebp  libwebp.so.7.1.5  [.] CalculateBestCacheSize
   3.30%  cwebp  libwebp.so.7.1.5  [.] CombinedShannonEntropy
   3.21%  cwebp  libwebp.so.7.1.5  [.] CollectColorBlueTransf

clang:

  34.06%  cwebp  libwebp.so.7.1.5  [.] GetCombinedEntropy
  28.95%  cwebp  libwebp.so.7.1.5  [.] VP8LHashChainFill
   5.37%  cwebp  libwebp.so.7.1.5  [.] VP8LGetBackwardReferences
   4.39%  cwebp  libwebp.so.7.1.5  [.] CombinedShannonEntropy_SS
   4.28%  cwebp  libwebp.so.7.1.5  [.] CollectColorBlueTransform


In the first loop clang seems to ifconvert while GCC doesn't:
  0.59 │   lea  kSLog2Table,%rdi
  3.69 │   vmovss   (%rdi,%rax,4),%xmm0
  0.98 │ 6f:   vcvtsi2ss%edx,%xmm2,%xmm1
  0.63 │   vfnmadd213ss 0x0(%r13),%xmm0,%xmm1
 38.16 │   vmovss   %xmm1,0x0(%r13)
  5.48 │   cmp  %r12d,0xc(%r13)
  0.06 │ ↓ jae  89 
   │   mov  %r12d,0xc(%r13)
  0.99 │ 89:   mov  0x4(%r13),%edi 
  0.96 │ 8d:   xor  %eax,%eax  
  0.40 │   test %r12d,%r12d
  0.60 │   setne%al 



   │   vcvtsd2ss%xmm0,%xmm0,%xmm1   
  0.02 │362:   mov  %r15d,%eax  
  0.57 │   imul %r12d,%eax  
  0.00 │   cmp  %r12d,%r9d  
  0.03 │   cmovbe   %r12d,%r9d  
  0.02 │   vmovd%eax,%xmm0  
  0.08 │   vpinsrd  $0x1,%r15d,%xmm0,%xmm0  
  1.50 │   vpaddd   %xmm0,%xmm4,%xmm4   
  1.08 │   vcvtsi2ss%r15d,%xmm5,%xmm0   
  0.87 │   vfnmadd231ss %xmm0,%xmm1,%xmm3   
  5.40 │   vmovaps  %xmm3,%xmm0 
  0.02 │38c:   xor  %eax,%eax   
  0.16 │   cmp  $0x4,%r15d

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #6 from Jan Hubicka  ---
The internal loops are:

static const unsigned keccakf_rotc[24] = {
   1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, 27, 41, 56, 8, 25, 43, 62, 18,
39, 61, 20, 44
}; 

static const unsigned keccakf_piln[24] = {
   10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4, 15, 23, 19, 13, 12, 2, 20, 14,
22, 9, 6, 1
};

static void keccakf(ulong64 s[25])
{  
   int i, j, round;
   ulong64 t, bc[5];

   for(round = 0; round < SHA3_KECCAK_ROUNDS; round++) {
  /* Theta */
  for(i = 0; i < 5; i++)
 bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20];

  for(i = 0; i < 5; i++) { 
 t = bc[(i + 4) % 5] ^ ROL64(bc[(i + 1) % 5], 1);
 for(j = 0; j < 25; j += 5)
s[j + i] ^= t;
  }
  /* Rho Pi */
  t = s[1];
  for(i = 0; i < 24; i++) {
 j = keccakf_piln[i];
 bc[0] = s[j];
 s[j] = ROL64(t, keccakf_rotc[i]);
 t = bc[0];
  }
  /* Chi */
  for(j = 0; j < 25; j += 5) {
 for(i = 0; i < 5; i++)
bc[i] = s[j + i];
 for(i = 0; i < 5; i++)
s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5];
  }
  s[0] ^= keccakf_rndc[round];
   }
}

I suppose with complete unrolling this will propagate, partly stay in registers,
and fold. I think increasing the default limits, especially at -O3, may make
sense. The value of 16 has been there for a very long time (I think since the
initial implementation).

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Jan Hubicka  changed:

   What|Removed |Added

Summary|SMHasher SHA3-256 benchmark |SMHasher SHA3-256 benchmark
   |is almost 40% slower vs.|is almost 40% slower vs.
   |Clang   |Clang (not enough complete
   ||loop peeling)

--- Comment #5 from Jan Hubicka  ---
On my zen3 machine the default build gets me 180MB/s,
-O3 -flto -funroll-all-loops gets me 193MB/s, and
-O3 -flto --param max-completely-peel-times=30 gets me 382MB/s. The speedup is
gone with --param max-completely-peel-times=20; the default is 16.

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #4 from Jan Hubicka  ---
I keep mentioning to Larabel that he should use -fno-semantic-interposition,
but he doesn't.

Profile is very simple:

 96.75%  SMHasher  [.] keccakf.lto_priv.0

All goes to simple loop. On Zen3 gcc 13 -march=native -Ofast -flto I get:

  3.85 │330:   mov%r8,%rdi  
  7.68 │   movslq (%rsi,%r9,1),%rcx 
  3.85 │   lea(%rax,%rcx,8),%r10
  3.86 │   mov(%rdx,%r9,1),%ecx 
  3.83 │   add$0x4,%r9  
  3.86 │   mov(%r10),%r8
  7.37 │   rol%cl,%rdi  
  7.37 │   mov%rdi,(%r10)   
  4.76 │   cmp$0x60,%r9 
  0.00 │ ↑ jne330   


Clang seems to unroll it:

  0.25 │ d0:   mov  -0x48(%rsp),%rdx
  0.25 │   xor  %r12,%rcx
  0.25 │   mov  %r13,%r12
  0.25 │   mov  %r13,0x10(%rsp)
  0.25 │   mov  %rax,%r13
  0.26 │   xor  %r15,%r13
  0.23 │   mov  %r11,-0x70(%rsp)
  0.25 │   mov  %r8,0x8(%rsp)
  0.25 │   mov  %r15,-0x40(%rsp)
  0.25 │   mov  %r10,%r15
  0.26 │   mov  %r10,(%rsp)
  0.26 │   mov  %r14,%r10
  0.25 │   xor  %r12,%r10
  0.26 │   xor  %rsi,%r15
  0.24 │   mov  %rbp,-0x80(%rsp)
  0.25 │   xor  %rcx,%r15
  0.26 │   mov  -0x60(%rsp),%rcx
  0.25 │   xor  -0x68(%rsp),%r15
  0.26 │   xor  %rbp,%rdx
  0.25 │   mov  -0x30(%rsp),%rbp
  0.25 │   xor  %rdx,%r13
  0.24 │   mov  -0x10(%rsp),%rdx
  0.25 │   mov  %rcx,%r12
  0.24 │   xor  %rcx,%r13
  0.25 │   mov  $0x1,%ecx
  0.25 │   xor  %r11,%rdx
  0.24 │   mov  %r8,%r11
  0.25 │   mov  -0x28(%rsp),%r8
  0.26 │   xor  -0x58(%rsp),%r8
  0.24 │   xor  %rdx,%r8
  0.26 │   mov  -0x8(%rsp),%rdx
  0.25 │   xor  %rbp,%r8
  0.26 │   xor  %r11,%rdx
  0.25 │   mov  -0x20(%rsp),%r11
  0.25 │   xor  %rdx,%r10

[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line

2024-01-01 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

--- Comment #23 from Jan Hubicka  ---
Created attachment 56970
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56970&action=edit
Patch I am testing

Hi,
this adds the -falign-all-functions parameter.  It still looks like the more
reasonable (and backward compatible) thing to do.  I also poked at Richi's
suggestion of extending the syntax of -falign-functions, but I think it is less
readable.

[Bug ipa/92606] [11/12/13 Regression][avr] invalid merge of symbols in progmem and data sections

2023-12-12 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92606

--- Comment #31 from Jan Hubicka  ---
This is Martin's code, but I agree that equals_wpa should reject pairs with
"dangerous" attributes on them (ideally we should hash them).
I think we could add a test for identical attributes to equals_wpa and
eventually whitelist the attributes we consider mergeable?
There are attributes that carry no meaning once we enter the backend, so it may
also be a good option to strip them, so they do not confuse passes like ICF.

[Bug ipa/81323] IPA-VRP doesn't handle return values

2023-12-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81323

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #9 from Jan Hubicka  ---
Note that r14-5628-g53ba8d669550d3 does just the easy part, propagating
within a single translation unit. We will need to add the actual IPA bits into
WPA next stage1.

[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line

2023-12-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot 
gnu.org

--- Comment #18 from Jan Hubicka  ---
Reading all the discussion again, I am leaning towards -falign-all-functions
plus a documentation update explaining that -falign-functions/-falign-loops are
optimizations and are ignored for -Os.

I do use -falign-functions/-falign-loops when tuning for new generations of
CPUs, and I definitely want to have a way to specify alignment that is ignored
for cold functions (as a performance optimization); we have had this behavior
since profile code was introduced in 2002.

As an optimization, we also want hot functions aligned to more than the 8-byte
boundary needed for patching.

I will prepare a patch for this and send it for discussion.  Perhaps we want
-flive-patching to also imply a FUNCTION_BOUNDARY increase on x86-64? Or is live
patching useful even if function entries are not aligned?

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-11-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #11 from Jan Hubicka  ---
trunk -O3 -flto -march=native -fopenmp
Operation: Sharpen:
257
256
256

Average: 256 Iterations Per Minute
GCC13 -O3 -flto -march=native -fopenmp
257
256
256

Average: 256 Iterations Per Minute
clang17 O3 -flto -march=native -fopenmp
   Operation: Sharpen:
257
256
256
Average: 256 Iterations Per Minute

So I guess I will need to try on zen3 to see if there is any difference.

the internal loop is:
  0.00 │460:┌─→movzbl  0x2(%rdx,%rax,4),%esi
  0.02 │   │  vmovss  (%r8,%rax,4),%xmm2
  0.95 │   │  vcvtsi2ss   %esi,%xmm0,%xmm1
 20.22 │   │  movzbl  0x1(%rdx,%rax,4),%esi
  0.01 │   │  vfmadd231ss %xmm1,%xmm2,%xmm3
 11.97 │   │  vcvtsi2ss   %esi,%xmm0,%xmm1
 18.76 │   │  movzbl  (%rdx,%rax,4),%esi
  0.00 │   │  inc %rax
  0.72 │   │  vfmadd231ss %xmm1,%xmm2,%xmm4
 12.55 │   │  vcvtsi2ss   %esi,%xmm0,%xmm1
 14.95 │   │  vfmadd231ss %xmm1,%xmm2,%xmm5
 15.93 │   ├──cmp %rax,%r13
  0.35 │   └──jne 460

so it still does not get vectorized.

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-11-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #18 from Jan Hubicka  ---
I made a typo:

Mainline with -O2 -flto  -march=native run manually since build machinery patch
is needed
23.03
22.85
23.04

Should be 
Mainline with -O3 -flto  -march=native run manually since build machinery patch
is needed
23.03
22.85
23.04

So with -O2 we still get a slightly lower score than clang; with -O3 we are
slightly better. push_back inlining does not seem to be a problem (as tested by
increasing the limits), so perhaps it is the more aggressive
unrolling/vectorization settings clang has at -O2.

I think upstream jpegxl should use -O3 or -Ofast instead of -O2.  It is quite a
typical kind of task that benefits from higher optimization levels.

I filled in https://github.com/libjxl/libjxl/issues/2970

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #20 from Jan Hubicka  ---
On zen4 hardware I now get

GCC13 with -O3 -flto -march=native -fopenmp
2163
2161
2153

Average: 2159 Iterations Per Minute

clang 17 with -O3 -flto -march=native -fopenmp
2004
1988
1991

Average: 1994 Iterations Per Minute

trunk -O3 -flto -march=native -fopenmp
Operation: Resizing:
2126
2135
2123

Average: 2128 Iterations Per Minute

So no big changes here...

[Bug middle-end/112653] PTA should handle correctly escape information of values returned by a function

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

--- Comment #8 from Jan Hubicka  ---
On ARM32 and other targets, constructors return the this pointer.  Together
with making the return value escape, this probably completely disables any
chance of IPA tracking of C++ data types...

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #10 from Jan Hubicka  ---
runtimes on zen4 hardware.

trunk -O3 -flto -march=native
42171
42964
42106
clang -O3 -flto -march=native
37393
37423
37508
gcc 13 -O3 -flto -march=native
42380
42314
43285

So it seems the performance did not change.
