[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-05-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

Jan Hubicka  changed:

   What|Removed |Added

Summary|[12/13/14/15 Regression]|[12/13/14 Regression] Wrong
   |Wrong code at -O with   |code at -O with ipa-modref
   |ipa-modref on aarch64   |on aarch64

--- Comment #22 from Jan Hubicka  ---
Fixed on trunk so far

[Bug libstdc++/109442] Dead local copy of std::vector not removed from function

2024-05-11 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109442

--- Comment #19 from Jan Hubicka  ---
Note that the testcase from PR115037 also shows that we are not able to
optimize out dead stores to the vector, which is another quite noticeable
problem.

void
test()
{
std::vector<int> test;
test.push_back (1);
}

We allocate the block, store 1, and immediately delete it.
void test ()
{
  int * test$D25839$_M_impl$D25146$_M_start;
  struct vector test;
  int * _61;

   [local count: 1073741824]:
  _61 = operator new (4);

   [local count: 1063439392]:
  *_61 = 1;
  operator delete (_61, 4);
  test ={v} {CLOBBER};
  test ={v} {CLOBBER(eol)};
  return;

   [count: 0]:
:
  test ={v} {CLOBBER};
  resx 2

}

So my understanding is that we decided not to optimize away the dead stores
since the particular operator delete does not pass this test:

  /* If the call is to a replaceable operator delete and results
 from a delete expression as opposed to a direct call to
 such operator, then we can treat it as free.  */
  if (fndecl
  && DECL_IS_OPERATOR_DELETE_P (fndecl)
  && DECL_IS_REPLACEABLE_OPERATOR (fndecl)
  && gimple_call_from_new_or_delete (stmt))
return ". o ";

This is because we believe that operator delete may be implemented in an insane
way that inspects the values stored in the block being freed.
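To make that concern concrete, here is a minimal sketch (my own illustration, not taken from the PR) of a conforming replacement sized operator delete that inspects the bytes of the block being freed; as long as such a definition is possible, DSE cannot treat the store as dead:

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Global observable so the effect of the "insane" delete can be checked.
static unsigned last_checksum;

// Replacement allocation functions.
void* operator new(std::size_t n)
{
  if (void* p = std::malloc(n ? n : 1))
    return p;
  throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }

// A conforming (if unusual) replacement sized delete that reads the bytes
// of the block being freed -- exactly what DSE has to assume an unknown
// replaceable operator delete might do.
void operator delete(void* p, std::size_t n) noexcept
{
  unsigned sum = 0;
  for (std::size_t i = 0; i < n; i++)
    sum += static_cast<unsigned char*>(p)[i];
  last_checksum = sum;
  std::free(p);
}

unsigned demo()
{
  int* p = static_cast<int*>(::operator new(sizeof(int)));
  *p = 1;                            // observed by the sized delete above
  ::operator delete(p, sizeof(int));
  return last_checksum;
}
```

With this definition in the program, removing the store `*p = 1` would change observable behaviour, which is why the `". o "` summary above is only used for calls known to come from a delete expression of a replaceable operator.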

I can sort of see that one can write standard-conforming code that allocates
some POD data and inspects it in the destructor.
However, for std::vector this argument is not really applicable. The standard
does specify that new/delete is used to allocate/deallocate the memory, but it
does not say how the memory is organized or what happens before deallocation
(i.e. it is probably valid for std::vector to memset the block just before
deallocating it).

A similar argument can IMO be used for eliding unused memory allocations. It is
kind of up to the std::vector implementation how many allocations/deallocations
it does, right?

So we need a way to annotate the new/delete calls in the standard library as
safe for such optimizations (i.e. implement clang's
__builtin_operator_new/delete?)

How does clang manage to optimize this out without additional hinting?

[Bug middle-end/115037] Unused std::vector is not optimized away.

2024-05-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115037

Jan Hubicka  changed:

   What|Removed |Added

 CC||jason at redhat dot com,
   ||jwakely at redhat dot com

--- Comment #2 from Jan Hubicka  ---
I tried to look for duplicates but did not find one.
However, I think the first problem is that we do not optimize away the store of
1 to the vector while clang does.  I think this is because we do not believe we
can trust that the delete operator is safe?

We get:
void test ()
{
  int * test$D25839$_M_impl$D25146$_M_start;
  struct vector test;
  int * _61;

   [local count: 1073741824]:
  _61 = operator new (4);

   [local count: 1063439392]:
  *_61 = 1;
  operator delete (_61, 4);
  test ={v} {CLOBBER};
  test ={v} {CLOBBER(eol)};
  return;

   [count: 0]:
:
  test ={v} {CLOBBER};
  resx 2

}
If we cannot trust that operator delete is well behaved, perhaps we can arrange
an explicit clobber before calling it? I think it is up to std::vector to
decide what it does with the stored array, so in this case even an insane
operator delete has no right to expect that the data in the vector will be sane :)

[Bug middle-end/115037] New: Unused std::vector is not optimized away.

2024-05-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115037

Bug ID: 115037
   Summary: Unused std::vector is not optimized away.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Compiling 
#include <vector>
void
test()
{
std::vector<int> test;
test.push_back (1);
}

leads to

_Z4testv:
.LFB1253:
.cfi_startproc
subq$8, %rsp
.cfi_def_cfa_offset 16
movl$4, %edi
call_Znwm
movl$4, %esi
movl$1, (%rax)
movq%rax, %rdi
addq$8, %rsp
.cfi_def_cfa_offset 8
jmp _ZdlPvm

while clang optimizes to:

_Z4testv:   # @_Z4testv
.cfi_startproc
# %bb.0:
retq

[Bug middle-end/115036] New: division is not shortened based on value range

2024-05-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115036

Bug ID: 115036
   Summary: division is not shortened based on value range
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

For
long test(long a, long b)
{
if (a > 65535 || a < 0)
__builtin_unreachable ();
if (b > 65535 || b < 0)
__builtin_unreachable ();
return a/b;
}

we produce
test:
.LFB0:
.cfi_startproc
movq%rdi, %rax
cqto
idivq   %rsi
ret

while clang does:

test:   # @test
.cfi_startproc
# %bb.0:
movq%rdi, %rax
# kill: def $ax killed $ax killed $rax
xorl%edx, %edx
divw%si
movzwl  %ax, %eax
retq
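At the source level, the transformation clang performs here amounts to roughly the following sketch (my own illustration; the function name is made up):

```cpp
#include <cstdint>

// Valid only under the value-range assumption a, b in [0, 65535]:
// then both operands fit in 16 bits and the 64-bit signed divide
// agrees with the 16-bit unsigned one (divw in the asm above).
long div_shortened(long a, long b)
{
  return static_cast<std::uint16_t>(a) / static_cast<std::uint16_t>(b);
}
```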

clang also by default adds a 32-bit divide path even when the value range is not
known

long test(long a, long b)
{
return a/b;
}

compiles as

test:   # @test
.cfi_startproc
# %bb.0:
movq%rdi, %rax
movq%rdi, %rcx
orq %rsi, %rcx
shrq$32, %rcx
je  .LBB0_1
# %bb.2:
cqto
idivq   %rsi
retq
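The runtime dispatch clang emits corresponds roughly to this source-level sketch (my own illustration, not clang's actual output):

```cpp
#include <cstdint>

long div_dispatch(long a, long b)
{
  // If the high 32 bits of both operands are zero, both values are
  // nonnegative and fit in 32 bits, so a cheaper unsigned 32-bit divide
  // gives the same result as the signed 64-bit one.
  if (((static_cast<std::uint64_t>(a)
        | static_cast<std::uint64_t>(b)) >> 32) == 0)
    return static_cast<std::uint32_t>(a) / static_cast<std::uint32_t>(b);
  return a / b;   // slow path: full 64-bit idiv
}
```

Negative operands have high bits set, so they always take the slow path and keep the signed-division semantics.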

[Bug ipa/114985] [15 regression] internal compiler error: in discriminator_fail during stage2

2024-05-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114985

--- Comment #14 from Jan Hubicka  ---
So this is a problem in ipa_value_range_from_jfunc?
It is Martin's code; I hope he will know why the types are wrong here.
One can get type compatibility problems with mismatched declarations and LTO, but
this testcase seems to be single-file. So indeed this looks like a bug
either in jump function construction or even earlier...

[Bug middle-end/114852] New: jpegxl 10.0.1 is faster with clang18 than with gcc14

2024-04-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114852

Bug ID: 114852
   Summary: jpegxl 10.0.1 is faster with clang18 than with gcc14
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

https://www.phoronix.com/review/gcc14-clang18-amd-zen4/3
reports about an 8% difference.  I can measure 13% on zen3.  The code has changed
and is no longer bound by push_back but runs the AVX2 version of the inner loops.

The hottest loops look comparable

  0.00 │266:┌─→vmovaps  (%r14,%rax,4),%ymm0
  0.11 ││  vmulps   (%rcx,%rax,4),%ymm7,%ymm2
  1.18 ││  vfnmadd213ps (%rsi,%rax,4),%ymm11,%ymm0
  0.25 ││  vmulps   %ymm2,%ymm0,%ymm0
  5.94 ││  vroundps $0x8,%ymm0,%ymm2
  0.35 ││  vsubps   %ymm2,%ymm0,%ymm0
  1.05 ││  vmulps   (%rdx,%rax,4),%ymm0,%ymm0
  3.19 ││  vmovaps  %ymm0,0x0(%r13,%rax,4)
  0.15 ││  vandps   %ymm10,%ymm2,%ymm0
  0.03 ││  add  $0x8,%rax
  0.03 ││  vcmpeqps %ymm8,%ymm0,%ymm2
  0.09 ││  vsqrtps  %ymm0,%ymm0
 27.25 ││  vaddps   %ymm0,%ymm6,%ymm6
  0.35 ││  vandnps  %ymm9,%ymm2,%ymm0
  0.12 ││  vaddps   %ymm0,%ymm5,%ymm5
  0.05 │├──cmp  %r12,%rax
  0.02 │└──jb   266

and clang

  0.00 │ c90:┌─→vmulps   (%r9,%rdx,4),%ymm0,%ymm2
  0.97 │ │  vmovaps  (%r15,%rdx,4),%ymm1
  0.36 │ │  vsubps   %ymm2,%ymm1,%ymm1
  4.24 │ │  vmulps   (%rcx,%rdx,4),%ymm4,%ymm2
  1.92 │ │  vmulps   %ymm2,%ymm1,%ymm1
  0.65 │ │  vroundps $0x8,%ymm1,%ymm2
  0.06 │ │  vsubps   %ymm2,%ymm1,%ymm1
  1.11 │ │  vmulps   (%rax,%rdx,4),%ymm1,%ymm1
  3.53 │ │  vmovaps  %ymm1,(%rsi,%rdx,4)
  0.68 │ │  vandps   %ymm6,%ymm2,%ymm1
  0.23 │ │  vcmpneqps%ymm5,%ymm2,%ymm2
  3.64 │ │  add  $0x8,%rdx
  0.24 │ │  vsqrtps  %ymm1,%ymm1
 22.16 │ │  vaddps   %ymm1,%ymm8,%ymm8
  0.25 │ │  vbroadcastss 0x31eba5(%rip),%ymm1# 34f840

  0.05 │ │  vandps   %ymm1,%ymm2,%ymm1
  0.04 │ │  vaddps   %ymm1,%ymm7,%ymm7
  0.11 │ ├──cmp  %rdi,%rdx
  0.07 │ └──jb   c90

GCC profile:
  10.78%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::EstimateEntropy(jxl::AcStrategy const&, float, unsigned long,
unsigned long, jxl::ACSConfig const&, float con
   7.02%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::FindBestMultiplier(float const*, float const*, unsigned long,
float, float, bool) [clone .part.0]
   4.50%  cjxl libjxl.so.0.10.1   [.] void
jxl::N_AVX2::Symmetric5Row(jxl::Plane const&,
jxl::RectT const&, long, jxl:
   4.47%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous namespace)::TransformFromPixels(jxl::AcStrategy::Type,
float const*, unsigned long, float*, float*
   4.31%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous namespace)::TransformToPixels(jxl::AcStrategy::Type,
float*, float*, unsigned long, float*)
   4.00%  cjxl libjxl.so.0.10.1   [.]
jxl::ThreadPool::RunCallState const&, int const* restrict*, jxl::AcStra
   3.56%  cjxl libm.so.6  [.] __ieee754_pow_fma
   3.49%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous namespace)::IDCT1DImpl<8ul, 8ul>::operator()(float
const*, unsigned long, float*, unsigned long, f
   3.43%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous
namespace)::AdaptiveQuantizationImpl::ComputeTile(float, float,
jxl::Image3 const&, jxl::Re
   3.27%  cjxl libjxl.so.0.10.1   [.] void
jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<32ul, 0ul,
jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2:
   3.16%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<8ul, 8ul>::operator()(float*,
float*) [clone .isra.0]
   2.87%  cjxl libjxl.so.0.10.1   [.] void
jxl::N_AVX2::(anonymous namespace)::ComputeScaledIDCT<4ul,
8ul>::operator()::operator()::operator() const&, jxl::RectT
const&, jxl::DequantMatrices const&, jxl::AcStrategyImage const*,
jxl::Plane const*, jxl::Quantizer const*, jxl::Rect
   5.03%  cjxl libjxl.so.0.10.1   [.]
jxl::ThreadPool::RunCallState const&, jxl::RectT
const&, jxl::WeightsSymmetric5 const&, jxl::ThreadPool*, jxl::Pla▒
   4.66%  cjxl libjxl.so.0.10.1   [.]
jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<16ul, 8ul>::operator()(float*,
float*)
   4.56%  cjxl libjxl.so.0.10.1   [.]

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #9 from Jan Hubicka  ---
Phoronix still claims the difference
https://www.phoronix.com/review/gcc14-clang18-amd-zen4/2

[Bug target/113236] WebP benchmark is 20% slower vs. Clang on AMD Zen 4

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113236

--- Comment #3 from Jan Hubicka  ---
It seems this performance difference is still there on zen4
https://www.phoronix.com/review/gcc14-clang18-amd-zen4/3

[Bug tree-optimization/114787] [13 Regression] wrong code at -O1 on x86_64-linux-gnu (the generated code hangs)

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114787

--- Comment #18 from Jan Hubicka  ---
predict.cc queries number of iterations using number_of_iterations_exit and
loop_niter_by_eval and finally using estimated_stmt_executions.

The first two queries do not update the upper bounds data structure, which is
why we get away without computing them in some cases.

I guess we can just drop the dumping here. We now dump the recorded estimates
elsewhere, so this is somewhat redundant.

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #13 from Jan Hubicka  ---
Thanks a lot, looks great!
Do we still auto-detect memmove when the copy constructor turns out to be
memcpy-equivalent after optimization?

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #9 from Jan Hubicka  ---
Your patch gives me an error compiling the testcase

jh@ryzen3:/tmp> ~/trunk-install/bin/g++ -O3 ~/t.C 
In file included from /home/jh/trunk-install/include/c++/14.0.1/vector:65,
 from /home/jh/t.C:1:
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h: In
instantiation of ‘_ForwardIterator std::__relocate_a(_InputIterator,
_InputIterator, _ForwardIterator, _Allocator&) [with _InputIterator = const
pair*; _ForwardIterator = pair*; _Allocator = allocator >;
_Traits = allocator_traits > >]’:
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1127:31:  
required from ‘_Tp* std::__relocate_a(_Tp*, _Tp*, _Tp*, allocator<_T2>&) [with
_Tp = pair; _Up = pair]’
 1127 |   return std::__relocate_a(__cfirst, __clast, __result, __alloc);
  |  ~^~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:509:26:   required
from ‘static std::vector<_Tp, _Alloc>::pointer std::vector<_Tp,
_Alloc>::_S_relocate(pointer, pointer, pointer, _Tp_alloc_type&) [with _Tp =
std::pair; _Alloc =
std::allocator >; pointer =
std::pair*; _Tp_alloc_type =
std::vector >::_Tp_alloc_type]’
  509 | return std::__relocate_a(__first, __last, __result, __alloc);
  |~^~~~
/home/jh/trunk-install/include/c++/14.0.1/bits/vector.tcc:647:32:   required
from ‘void std::vector<_Tp, _Alloc>::_M_realloc_append(_Args&& ...) [with _Args
= {const std::pair&}; _Tp = std::pair; _Alloc = std::allocator
>]’
  647 | __new_finish = _S_relocate(__old_start, __old_finish,
  |~~~^~~
  648 |__new_start,
_M_get_Tp_allocator());
  |   
~~~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:1294:21:   required
from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp =
std::pair; _Alloc =
std::allocator >; value_type =
std::pair]’
 1294 |   _M_realloc_append(__x);
  |   ~^
/home/jh/t.C:8:25:   required from here
8 | stack.push_back (pair);
  | ^~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56:
error: use of deleted function ‘const _Tp* std::addressof(const _Tp&&) [with
_Tp = pair]’
 1084 | 
std::addressof(std::move(*__first
  | 
~~^
In file included from
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_pair.h:61,
 from
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_algobase.h:64,
 from /home/jh/trunk-install/include/c++/14.0.1/vector:62:
/home/jh/trunk-install/include/c++/14.0.1/bits/move.h:168:16: note: declared
here
  168 | const _Tp* addressof(const _Tp&&) = delete;
  |^
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56:
note: use ‘-fdiagnostics-all-candidates’ to display considered candidates
 1084 | 
std::addressof(std::move(*__first
  | 
~~^


It is easy to check whether the conversion happens - just compile it and see if
there is a memcpy or memmove in the optimized dump file (or the final assembly)

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #8 from Jan Hubicka  ---
I had a wrong noexcept specifier.  This version works, but I still need to
inline relocate_object_a into the loop

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..f02d4fb878f 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1100,8 +1100,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  "relocation is only possible for values of the same type");
   _ForwardIterator __cur = __result;
   for (; __first != __last; ++__first, (void)++__cur)
-   std::__relocate_object_a(std::__addressof(*__cur),
-std::__addressof(*__first), __alloc);
+   {
+ typedef std::allocator_traits<_Allocator> __traits;
+ __traits::construct(__alloc, std::__addressof(*__cur),
std::move(*std::__addressof(*__first)));
+ __traits::destroy(__alloc,
std::__addressof(*std::__addressof(*__first)));
+   }
   return __cur;
 }

@@ -1109,8 +1112,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template 
 _GLIBCXX20_CONSTEXPR
 inline __enable_if_t::value, _Tp*>
-__relocate_a_1(_Tp* __first, _Tp* __last,
-  _Tp* __result,
+__relocate_a_1(_Tp* __restrict __first, _Tp* __last,
+  _Tp* __restrict __result,
   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
 {
   ptrdiff_t __count = __last - __first;
@@ -1147,6 +1150,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 std::__niter_base(__result), __alloc);
 }

+  template 
+_GLIBCXX20_CONSTEXPR
+inline _Tp*
+__relocate_a(_Tp* __restrict __first, _Tp* __last,
+_Tp* __restrict __result,
+allocator<_Up>& __alloc)
+noexcept(noexcept(__relocate_a_1(__first, __last, __result, __alloc)))
+{
+  return std::__relocate_a_1(__first, __last, __result, __alloc);
+}
+
   /// @endcond
 #endif // C++11

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #6 from Jan Hubicka  ---
Thanks. I thought __relocate_a only cares about whether the pointed-to
type can be bitwise copied.  It would be nice to produce memcpy early from
libstdc++ for std::pair, so the second patch makes sense to me (I did not test
if it works)

I think it would still be nice to tell GCC that the copy loop never gets
overlapping memory locations, so the cases which are not optimized to memcpy
early can still be optimized later (or vectorized if the loop really does
something non-trivial).

So I tried your second patch, fixed so it compiles:
diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..0d2e588ae5e 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1109,8 +1109,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template 
 _GLIBCXX20_CONSTEXPR
 inline __enable_if_t::value, _Tp*>
-__relocate_a_1(_Tp* __first, _Tp* __last,
-  _Tp* __result,
+__relocate_a_1(_Tp* __restrict __first, _Tp* __last,
+  _Tp* __restrict __result,
   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
 {
   ptrdiff_t __count = __last - __first;
@@ -1147,6 +1147,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 std::__niter_base(__result), __alloc);
 }

+  template 
+_GLIBCXX20_CONSTEXPR
+inline _Tp*
+__relocate_a(_Tp* __restrict __first, _Tp* __last,
+_Tp* __restrict __result,
+allocator<_Up>& __alloc)
+noexcept(std::__is_bitwise_relocatable<_Tp>::value)
+{
+  return std::__relocate_a_1(__first, __last, __result, __alloc);
+}
+
   /// @endcond
 #endif // C++11

it does not make ldist trigger, so the restrict info is still lost.  I think the
problem is that if you call relocate_object the restrict reduces scope, so we
only know that the elements are pairwise disjoint, not that the vectors are.
This is because restrict is interpreted early, pre-inlining, but this is really
Richard's area.

It seems that the patch makes us go through __uninitialized_copy_a instead
of __uninit_copy. I am not even sure how these differ, so I need to
stare at the code a bit more to make sense of it :)

[Bug middle-end/114822] New: ldist should produce memcpy/memset/memmove histograms based on loop information converted

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114822

Bug ID: 114822
   Summary: ldist should produce memcpy/memset/memmove histograms
based on loop information converted
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

When a loop is converted to a string builtin we lose information about its size.
This means that we won't expand it inline when the block size is expected to be
small.  This causes a performance problem, e.g. on std::vector and the testcase
from PR114821, which at least with profile feedback runs significantly slower
than the variant where memcpy is produced early


#include <vector>
typedef unsigned int uint32_t;
int pair;
void
test()
{
std::vector<int> stack;
stack.push_back (pair);
while (!stack.empty()) {
int cur = stack.back();
stack.pop_back();
if (true)
{
cur++;
stack.push_back (cur);
stack.push_back (cur);
}
if (cur > 1)
break;
}
}
int
main()
{
for (int i = 0; i < 1; i++)
  test();
}

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #2 from Jan Hubicka  ---
What I am shooting for is to optimize it later in loop distribution. We can
recognize a memcpy loop if we can figure out that the source and destination
memory are different.

We can help here with restrict, but I was a bit lost on how to get that done.

This seems to do the trick, but for some reason I get memmove

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..1a6223ea892 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1130,7 +1130,58 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
}
   return __result + __count;
 }
+
+  template 
+_GLIBCXX20_CONSTEXPR
+inline __enable_if_t::value, _Tp*>
+__relocate_a(_Tp * __restrict __first, _Tp *__last,
+_Tp * __restrict __result, _Allocator& __alloc) noexcept
+{
+  ptrdiff_t __count = __last - __first;
+  if (__count > 0)
+   {
+#ifdef __cpp_lib_is_constant_evaluated
+ if (std::is_constant_evaluated())
+   {
+ for (; __first != __last; ++__first, (void)++__result)
+   {
+ // manually inline relocate_object_a to not lose restrict
qualifiers
+ typedef std::allocator_traits<_Allocator> __traits;
+ __traits::construct(__alloc, __result, std::move(*__first));
+ __traits::destroy(__alloc, std::__addressof(*__first));
+   }
+ return __result;
+   }
 #endif
+ __builtin_memcpy(__result, __first, __count * sizeof(_Tp));
+   }
+  return __result + __count;
+}
+#endif
+
+  template 
+_GLIBCXX20_CONSTEXPR
+#if _GLIBCXX_HOSTED
+inline __enable_if_t::value, _Tp*>
+#else
+inline _Tp *
+#endif
+__relocate_a(_Tp * __restrict __first, _Tp *__last,
+_Tp * __restrict __result, _Allocator& __alloc)
+noexcept(noexcept(std::allocator_traits<_Allocator>::construct(__alloc,
+__result, std::move(*__first)))
+&& noexcept(std::allocator_traits<_Allocator>::destroy(
+   __alloc, std::__addressof(*__first
+{
+  for (; __first != __last; ++__first, (void)++__result)
+   {
+ // manually inline relocate_object_a to not lose restrict qualifiers
+ typedef std::allocator_traits<_Allocator> __traits;
+ __traits::construct(__alloc, __result, std::move(*__first));
+ __traits::destroy(__alloc, std::__addressof(*__first));
+   }
+  return __result;
+}

   template 

[Bug libstdc++/114821] New: _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

Bug ID: 114821
   Summary: _M_realloc_append should use memcpy instead of loop to
copy data when possible
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

In the testcase

#include <vector>
typedef unsigned int uint32_t;
std::pair pair;
void
test()
{
std::vector> stack;
stack.push_back (pair);
while (!stack.empty()) {
std::pair cur = stack.back();
stack.pop_back();
if (!cur.first)
{
cur.second++;
stack.push_back (cur);
stack.push_back (cur);
}
if (cur.second > 1)
break;
}
}
int
main()
{
for (int i = 0; i < 1; i++)
  test();
}

We produce _M_realloc_append which uses a loop to copy the data instead of memcpy.
This is bigger and slower.  The reason why __relocate_a does not use memcpy
seems to be the fact that pair has a copy constructor. It still can be pattern
matched by ldist, but that fails with:

(compute_affine_dependence
  ref_a: *__first_1, stmt_a: *__cur_37 = *__first_1;
  ref_b: *__cur_37, stmt_b: *__cur_37 = *__first_1;
) -> dependence analysis failed

So we cannot disambiguate the old and new vector memory and prove that the loop
is indeed a memcpy loop. I think this is valid since operator new is not required
to return fresh memory, but I think adding __restrict should solve this.

The problem is that I got lost on where to add them, since __relocate_a uses
iterators instead of pointers
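A reduced illustration of the aliasing problem (my own sketch, using the GNU `__restrict__` extension): with restrict-qualified pointers the dependence analysis can prove the copy loop moves non-overlapping memory, so loop distribution may turn it into memcpy:

```cpp
#include <cstddef>

// With __restrict__, src and dst are promised not to alias, so the loop
// is provably a memcpy; without it, the compute_affine_dependence test
// quoted above fails and the loop is kept as-is.
void relocate_ints(int* __restrict__ dst, const int* __restrict__ src,
                   std::size_t n)
{
  for (std::size_t i = 0; i < n; i++)
    dst[i] = src[i];
}
```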

[Bug tree-optimization/114787] [13/14 Regression] wrong code at -O1 on x86_64-linux-gnu (the generated code hangs)

2024-04-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114787

--- Comment #13 from Jan Hubicka  ---
-fdump-tree-all-all changing the generated code is also bad.  We should probably
avoid dumping loop bounds when they are not recorded. I added dumping of loop
bounds and this may be an unexpected side effect. Will take a look.

[Bug c++/93008] Need a way to make inlining heuristics ignore whether a function is inline

2024-04-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008

--- Comment #8 from Jan Hubicka  ---
Note that the cold attribute is also quite strong, since it turns on
optimize_size codegen, which is often a lot slower.

Reading the discussion again, I don't think we have a way to make the inline
keyword ignored by the inliner.  We could add a not_really_inline attribute (a
better name would be welcome).

[Bug tree-optimization/114779] __builtin_constant_p does not work in inline functions

2024-04-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114779

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #7 from Jan Hubicka  ---
Note that the test for side effects also makes it impossible to test for
constantness of values passed to a function by reference, which could also be
useful. The workaround is to load the value into a temporary so the side effect
is not seen.  So the early folding to 0 never made much sense to me.

I agree that it is a can of worms and it is not clear whether changing the
behaviour would break things...
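A sketch of the workaround mentioned above (my own example; the names are made up): loading the referenced value into a temporary strips the side effect, so __builtin_constant_p tests a plain value rather than a dereference:

```cpp
// Copying *p into a local first means __builtin_constant_p sees a
// side-effect-free operand; whether it folds to 1 then depends only on
// whether the value is known at compile time after inlining.
static inline int scaled(const int* p)
{
  int v = *p;                     // temporary hides the memory access
  if (__builtin_constant_p(v) && v == 0)
    return 0;                     // special-cased when v is a known zero
  return v * 3;
}
```

Note the function returns the same result whether or not the builtin evaluates to 1, which is the usual pattern for using it safely.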

[Bug middle-end/114774] Missed DSE in simple code due to interleaving stores

2024-04-18 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774

Jan Hubicka  changed:

   What|Removed |Added

Summary|Missed DSE in simple code   |Missed DSE in simple code
   |due to other stores being   |due to interleaving stores
   |conditional |

--- Comment #1 from Jan Hubicka  ---
The other store being conditional is not the core issue. Here we miss the DSE too:

#include <stdio.h>
int a;
short p,q;
void
test (int b)
{
a=1;
if (b)
  p++;
else
  q++;
a=2;
}

The problem in DSE seems to be that instead of recursively walking the
memory-SSA graph it insists that the graph forms a chain. Now SRA leaves stores
to scalarized variables and even removes the corresponding clobbers, so this is
a relatively common scenario in non-trivial C++ code.

[Bug middle-end/114774] New: Missed DSE in simple code

2024-04-18 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774

Bug ID: 114774
   Summary: Missed DSE in simple code
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

In the following

#include <stdio.h>
int a;
short *p;
void
test (int b)
{
a=1;
if (b)
{
(*p)++;
a=2;
printf ("1\n");
}
else 
{
(*p)++;
a=3;
printf ("2\n");
}
}

We are not able to optimize out "a=1". This is a simplified real-world scenario
where SRA does not remove the definition of SRAed variables.

Note that clang does a conditional move here
test:   # @test
.cfi_startproc
# %bb.0:
movqp(%rip), %rax
incw(%rax)
xorl%eax, %eax
testl   %edi, %edi
leaq.Lstr(%rip), %rcx
leaq.Lstr.2(%rip), %rdi
cmoveq  %rcx, %rdi
sete%al
orl $2, %eax
movl%eax, a(%rip)
jmp puts@PLT# TAILCALL
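For reference, clang's branchless sequence corresponds roughly to this source-level rewrite (my own sketch):

```cpp
#include <cstdio>

int a;
short* p;

// Branchless equivalent of what clang emits: the (*p)++ and the store to
// 'a' are unconditional; only the stored constant and the string depend
// on b, so the first store to 'a' is trivially dead.
void test_branchless(int b)
{
  (*p)++;
  a = 2 | (b == 0);              // 2 when b != 0, 3 when b == 0
  std::puts(b ? "1" : "2");
}
```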

[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba

2024-04-15 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596

--- Comment #19 from Jan Hubicka  ---
I looked into the remaining exit/nonexit rename discussed here earlier, before
the PR was closed. The following patch would restore the code to make the same
calls as before my patch:
PR tree-optimization/109596
* tree-ssa-loop-ch.c (ch_base::copy_headers): Fix use of exit/nonexit
edges.
diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index b7ef485c4cc..cd5f6bc3c2a 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -952,13 +952,13 @@ ch_base::copy_headers (function *fun)
   if (!single_pred_p (nonexit->dest))
{
  header = split_edge (nonexit);
- exit = single_pred_edge (header);
+ nonexit = single_pred_edge (header);
}

   edge entry = loop_preheader_edge (loop);

   propagate_threaded_block_debug_into (nonexit->dest, entry->dest);
-  if (!gimple_duplicate_seme_region (entry, exit, bbs, n_bbs, copied_bbs,
+  if (!gimple_duplicate_seme_region (entry, nonexit, bbs, n_bbs,
copied_bbs,
 true))
{
  delete candidate.static_exits;

I however convinced myself this is a noop: both the exit and nonexit edges have
the same source basic block.

propagate_threaded_block_debug_into walks predecessors of its first parameter
and moves debug statements to the second parameter, so it does the same job,
since the split BB is empty.

gimple_duplicate_seme_region uses the parameter to update the loop header, but
it does not do that correctly for loop header copying and we re-do it in
tree-ssa-loop-ch.

Still, the code as it is now on trunk is very confusing, so perhaps we should
update it?

[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0

2024-04-15 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208

--- Comment #28 from Jan Hubicka  ---
So the main problem is that in t2 we have

_ZN6vectorI12QualityValueEC1ERKS1_/7 (vector<_Tp>::vector(const vector<_Tp>&)
[with _Tp = QualityValue])
  Type: function definition analyzed alias cpp_implicit_alias
  Visibility: semantic_interposition public weak comdat
comdat_group:_ZN6vectorI12QualityValueEC5ERKS1_ one_only
  Same comdat group as: _ZN6vectorI12QualityValueEC2ERKS1_/6
  References: _ZN6vectorI12QualityValueEC2ERKS1_/6 (alias) 
  Referring: 
  Function flags:
  Called by: _Z41__static_initialization_and_destruction_0v/8 (can throw
external)
  Calls: 

and in t1 we have

_ZN6vectorI12QualityValueEC1ERKS1_/2 (constexpr vector<_Tp>::vector(const
vector<_Tp>&) [with _Tp = QualityValue])
  Type: function definition
  Visibility: semantic_interposition external public weak comdat
comdat_group:_ZN6vectorI12QualityValueEC1ERKS1_ one_only
  References: 
  Referring:
  Function flags:
  Called by: 
  Calls: 

This is the same symbol name but in two different comdat groups (C1 compared to
C5).  With -O0 both seem to get the C5 group.

I can silence the ICE by making aliases undefined during symbol merging (which
is kind of a hack but should make the sanity checks happy), but I am still lost
as to how this is supposed to work in valid code.

[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0

2024-04-15 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208

--- Comment #27 from Jan Hubicka  ---
OK, but the problem is the same. Having comdats with the same key defining
different sets of public symbols is IMO not a good situation for either non-LTO
or LTO builds.
Unless the additional alias is never used by valid code (which would make it
useless, and we probably should not generate it), it should be possible to
produce a scenario where the linker picks the wrong version of the comdat and
we get an undefined symbol in non-LTO builds...

[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0

2024-04-15 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208

--- Comment #25 from Jan Hubicka  ---
So we have a comdat group that diverges between t1.o and t2.o.  In one object
it has an alias in it while in the other it does not:

Merging nodes for _ZN6vectorI12QualityValueEC2ERKS1_. Candidates:
_ZN6vectorI12QualityValueEC2ERKS1_/1 (__ct_base )
  Type: function definition analyzed
  Visibility: externally_visible semantic_interposition prevailing_def_ironly
public weak comdat comdat_group:_ZN6vectorI12QualityValueEC2ERKS1_ one_only
  next sharing asm name: 19
  References: 
  Referring:  
  Read from file: t1.o
  Unit id: 1
  Function flags: count:1073741824 (estimated locally)
  Called by: _Z1n1k/6 (1073741824 (estimated locally),1.00 per call) (can throw
external)
  Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/10 (1073741824 (estimated
locally),1.00 per call) (can throw external)
_ZNK12_Vector_baseI12QualityValueE1gEv/9 (1073741824 (estimated locally),1.00
per call) (can throw external)
_ZN6vectorI12QualityValueEC2ERKS1_/19 (__ct_base )
  Type: function definition analyzed
  Visibility: externally_visible semantic_interposition preempted_ir public
weak comdat comdat_group:_ZN6vectorI12QualityValueEC5ERKS1_ one_only
  Same comdat group as: _ZN6vectorI12QualityValueEC1ERKS1_/20
  previous sharing asm name: 1
  References: 
  Referring: _ZN6vectorI12QualityValueEC1ERKS1_/20 (alias)
  Read from file: t2.o
  Unit id: 2
  Function flags: count:1073741824 (estimated locally)
  Called by:
  Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/23 (1073741824 (estimated
locally),1.00 per call) (can throw external)
_ZNK12_Vector_baseI12QualityValueE1gEv/24 (1073741824 (estimated locally),1.00
per call) (can throw external)
After resolution:
_ZN6vectorI12QualityValueEC2ERKS1_/1 (__ct_base )
  Type: function definition analyzed
  Visibility: externally_visible semantic_interposition prevailing_def_ironly
public weak comdat comdat_group:_ZN6vectorI12QualityValueEC2ERKS1_ one_only
  next sharing asm name: 19
  References: 
  Referring: 
  Read from file: t1.o
  Unit id: 1
  Function flags: count:1073741824 (estimated locally)
  Called by: _Z1n1k/6 (1073741824 (estimated locally),1.00 per call) (can throw
external)
  Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/10 (1073741824 (estimated
locally),1.00 per call) (can throw external)
_ZNK12_Vector_baseI12QualityValueE1gEv/9 (1073741824 (estimated locally),1.00
per call) (can throw external)

We opt for the version without the alias and later ICE in the sanity check
verifying that aliases have the same comdat group as their targets.

I wonder how this is ice-on-valid code, since with normal linking the aliased
symbol may or may not appear in the winning comdat group, so using the alias
has to break.

If constexpr changes how the constructor is generated, isn't this a violation
of the ODR?

We probably can go and reset every node in the losing comdat group to silence
the ICE and get an undefined symbol instead.

[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172

2024-04-09 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291

--- Comment #8 from Jan Hubicka  ---
I am not sure this ought to be P1:
 - the compilation technically is finite, but not in reasonable time
 - it is possible to adjust the testcase (do early inlining manually) and get
the same infinite build on release branches
 - if you ask for an inline bomb, you get it.

But after some more testing, I do not see a reasonably easy way to get better
diagnostics. So I will retest the patch from comment #6 and go ahead with it.

[Bug ipa/113359] [13/14 Regression] LTO miscompilation of ceph on aarch64 and x86_64

2024-04-04 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359

--- Comment #23 from Jan Hubicka  ---
The patch looks reasonable.  We probably could hash the padding vectors at
summary generation time to reduce WPA overhead, but that can be done
incrementally next stage1.
I however wonder if we really guarantee to copy the paddings everywhere other
than in the total scalarization part?
(i.e. in all paths through the RTL expansion)

[Bug ipa/109817] internal error in ICF pass on Ada interfaces

2024-04-02 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109817

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #5 from Jan Hubicka  ---
That check was added to verify that we do not lose the thunk annotations.  Now
that the datastructure is stable, I think we can simply drop it, if that makes
Ada work.

[Bug gcov-profile/113765] [14 Regression] ICE: autofdo: val-profiler-threads-1.c compilation, error: probability of edge from entry block not initialized

2024-03-26 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113765

--- Comment #6 from Jan Hubicka  ---
Running auto-fdo without guessing branch probabilities is a somewhat odd idea
in general.  I suppose we can indeed just avoid setting the full_profile flag.
Though the optimization passes are not much tested with non-full profiles,
so there is some risk that the resulting code will be worse than without
auto-FDO.

[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba

2024-03-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot 
gnu.org

--- Comment #7 from Jan Hubicka  ---
Found it, probably. I renamed exit to nonexit (since the name was misleading)
and then forgot to update
 propagate_threaded_block_debug_into (exit->dest, entry->dest);

I will check this after teaching (which I have in 10 mins)

[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba

2024-03-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596

--- Comment #6 from Jan Hubicka  ---
On this testcase trunk gets the same dump as gcc13 for the pass just before
ch2.

With ch2 we get:
@@ -192,9 +236,8 @@
   # DEBUG BEGIN_STMT
   goto ; [100.00%]

-   [local count: 954449105]:
+   [local count: 954449104]:
   # j_15 = PHI 
-  # DEBUG j => j_15
   # DEBUG BEGIN_STMT
   a[b_14][j_15] = 0;
   # DEBUG BEGIN_STMT
@@ -203,29 +246,30 @@
   # DEBUG j => j_9
   # DEBUG BEGIN_STMT
   if (j_9 <= 7)
-goto ; [88.89%]
+goto ; [87.50%]
   else
-goto ; [11.11%]
+goto ; [12.50%]

[local count: 119292720]:
+  # DEBUG j => 0
   # DEBUG BEGIN_STMT
   b_7 = b_14 + 1;
   # DEBUG b => b_7
   # DEBUG b => b_7
   # DEBUG BEGIN_STMT
   if (b_7 <= 6)
-goto ; [87.50%]
+goto ; [85.71%]
   else
-goto ; [12.50%]
+goto ; [14.29%]

[local count: 119292720]:
   # b_14 = PHI 
-  # DEBUG b => b_14
   # DEBUG j => 0
   # DEBUG BEGIN_STMT
   goto ; [100.00%]

[local count: 17041817]:
+  # DEBUG b => 0
   # DEBUG BEGIN_STMT
   optimize_me_not ();
   # DEBUG BEGIN_STMT


So in addition to updating the BB profile, we indeed end up moving debug
statements around.

The change of dump is:
+  Analyzing: if (b_1 <= 6)
+Will eliminate peeled conditional in bb 6.
+May duplicate bb 6
+  Not duplicating bb 8: it is single succ.
+  Analyzing: if (j_2 <= 7)
+Will eliminate peeled conditional in bb 4.
+May duplicate bb 4
+  Not duplicating bb 3: it is single succ.
 Loop 2 is not do-while loop: latch is not empty.
+Duplicating header BB to obtain do-while loop
 Copying headers of loop 1
 Will duplicate bb 6
-  Not duplicating bb 8: it is single succ.
-Duplicating header of the loop 1 up to edge 6->8, 2 insns.
+Duplicating header of the loop 1 up to edge 6->7
 Loop 1 is do-while loop
 Loop 1 is now do-while loop.
+Exit count: 17041817 (estimated locally)
+Entry count: 17041817 (estimated locally)
+Peeled all exits: decreased number of iterations of loop 1 by 1.
 Copying headers of loop 2
 Will duplicate bb 4
-  Not duplicating bb 3: it is single succ.
-Duplicating header of the loop 2 up to edge 4->3, 2 insns.
+Duplicating header of the loop 2 up to edge 4->5
 Loop 2 is do-while loop
 Loop 2 is now do-while loop.
+Exit count: 119292720 (estimated locally)
+Entry count: 119292720 (estimated locally)
+Peeled all exits: decreased number of iterations of loop 2 by 1.

The dumps moved around, but we do the same duplications as before (BB6 and BB4
to eliminate the conditionals).

   [local count: 1073741824]:
  # j_2 = PHI <0(8), j_9(3)>
  # DEBUG j => j_2
  # DEBUG BEGIN_STMT
  if (j_2 <= 7)
goto ; [88.89%]
  else
goto ; [11.11%]

   [local count: 136334537]:
  # b_1 = PHI <0(2), b_7(5)>
  # DEBUG b => b_1
  # DEBUG BEGIN_STMT
  if (b_1 <= 6)
goto ; [87.50%]
  else
goto ; [12.50%]

[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba

2024-03-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596

--- Comment #4 from Jan Hubicka  ---
The change makes loop iteration estimates more realistic but does not
introduce any new code that actually changes the IL, so it seems this makes an
existing problem more visible.  I will try to debug what happens.

[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-03-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #59 from Jan Hubicka  ---
Just to explain what happens in the testcase: there is test and testb. They
are almost the same:

int
testb(void)
{
  struct bar *fp;
  test2 ((void *));
  fp = NULL;
  (*ptr)++;
  test3 ((void *));
}
The difference is in the alias set of fp: in one case it aliases with the
(*ptr)++ while in the other it does not.  This makes one function have a jump
function specifying an aggregate value of 0 for *fp, while the other does not.

Now with LTO both struct bar and struct foo become compatible for TBAA, so the
functions get merged and the winning variant has the jump function specifying
aggregate 0, which is wrong in the context the code is invoked in.

[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-03-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #58 from Jan Hubicka  ---
Created attachment 57702
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57702=edit
Compare value ranges in jump functions

This patch implements the jump function comparison; however, it is not good
enough.  Here is another wrong-code example:

jh@ryzen3:~/gcc/build/stage1-gcc> cat a.c
#include 
#include 

__attribute__((used)) int val,val2 = 1;

struct foo {int a;};

struct foo **ptr;

__attribute__ ((noipa))
int
test2 (void *a)
{ 
  ptr = (struct foo **)a;
}
int test3 (void *a);

int
test(void)
{ 
  struct foo *fp;
  test2 ((void *));
  fp = NULL;
  (*ptr)++;
  test3 ((void *));
}

int testb (void);

int
main()
{ 
  for (int i = 0; i < val2; i++)
  if (val)
testb ();
  else
test();
}
jh@ryzen3:~/gcc/build/stage1-gcc> cat b.c
#include 
struct bar {int a;};
struct foo {int a;};
struct barp {struct bar *f; struct bar *g;};
extern struct foo **ptr;
int test2 (void *);
int test3 (void *);
int
testb(void)
{
  struct bar *fp;
  test2 ((void *));
  fp = NULL;
  (*ptr)++;
  test3 ((void *));
}
jh@ryzen3:~/gcc/build/stage1-gcc> cat c.c
#include 
__attribute__ ((noinline))
int
test3 (void *a)
{
  if (!*(void **)a)
  abort ();
  return 0;
}
jh@ryzen3:~/gcc/build/stage1-gcc> ./xgcc -B ./ -O3 a.c b.c -flto -c ; ./xgcc -B
./ -O3 c.c -flto -fno-strict-aliasing -c ; ./xgcc  -B ./ b.o a.o c.o ; ./a.out
Aborted (core dumped)
jh@ryzen3:~/gcc/build/stage1-gcc> ./xgcc -B ./ -O3 a.c b.c -flto -c ; ./xgcc -B
./ -O3 c.c -flto -fno-strict-aliasing -c ; ./xgcc  -B ./ b.o a.o c.o
--disable-ipa-icf ; ./a.out 
lto1: note: disable pass ipa-icf for functions in the range of [0, 4294967295]
lto1: note: disable pass ipa-icf for functions in the range of [0, 4294967295]

[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-03-13 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #55 from Jan Hubicka  ---
> Anyway, can we in the spot my patch changed just walk all 
> source->node->callees > cgraph_edges, for each of them find the corresponding 
> cgraph_edge in the alias > and for each walk all the jump_functions recorded 
> and union their m_vr?
> Or is that something that can't be done in LTO for some reason?

That was my first idea too, but the problem is that ICF has (very limited)
support for matching functions which differ in the order of their basic blocks:
it computes a hash of every basic block and orders the blocks by their hash
prior to comparing. This seems half-finished, since e.g. the order of edges in
PHIs has to match exactly.

Callee lists are officially randomly ordered, but in practice they follow the
order of basic blocks (as they are built this way).  However, since BB orders
can differ, just walking both callee sequences and comparing pairwise does not
work. This also makes merging the information harder, since we no longer have
the BB map at the time we decide to merge.

It is however not hard to match the jump function while walking gimple bodies
and comparing statements, which is backportable and localized. I am still
waiting for my statistics to converge and will send it soon.
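The block-ordering scheme described above can be sketched as follows (a
simplified illustration with invented names, not the actual ICF code, which
hashes gimple statements rather than plain integers):

```c
#include <stdlib.h>

/* Compare two block hashes for qsort.  */
static int cmp_hash (const void *a, const void *b)
{
  unsigned x = *(const unsigned *) a;
  unsigned y = *(const unsigned *) b;
  return x < y ? -1 : x > y;
}

/* Hash every basic block of both functions, order the blocks by their
   hash, and compare pairwise.  Equal sorted sequences mean the bodies
   may match even if the blocks are laid out in different orders.  */
static int bodies_may_match (unsigned *hashes1, unsigned *hashes2, int n)
{
  qsort (hashes1, n, sizeof *hashes1, cmp_hash);
  qsort (hashes2, n, sizeof *hashes2, cmp_hash);
  for (int i = 0; i < n; i++)
    if (hashes1[i] != hashes2[i])
      return 0;
  return 1;
}
```

Because only the multiset of block hashes is compared, details such as the
order of edges in PHIs still have to match exactly afterwards, which is the
half-finished part mentioned above.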

[Bug ipa/106716] Identical Code Folding (-fipa-icf) confuses between functions with different [[likely]] attributes

2024-03-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106716

--- Comment #6 from Jan Hubicka  ---
The reason why GIMPLE_PREDICT is ignored is that it is never used after ipa-icf
and gets removed at the very beginning of late optimizations.

GIMPLE_PREDICT is consumed by the profile_generate pass, which is run before
ipa-icf.  The reason why GIMPLE_PREDICT statements are not stripped during ICF
is early inlining.  If we early inline a function, we throw away its profile
and estimate it again (in the context of the function it was inlined into),
and for that it is a good idea to keep the predicts.

There is no convenient place to remove them after early inlining was done and
before the IPA passes, and that is the only reason why they are around.  We may
revisit that, since streaming them to LTO bytecode is probably more harmful
than adding an extra pass after early opts to strip them.

ICF doesn't have code to compare edge profiles and stmt histograms.  It knows
how to merge them (so the resulting BB profile is consistent with the merge),
but I suppose we may want to have some threshold so that we do not merge
functions with very different branch probabilities in the hot parts of their
bodies...
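A minimal sketch of the kind of pair involved (hypothetical names, using
__builtin_expect, which is the usual source of GIMPLE_PREDICT statements): two
functions with identical bodies but opposite branch-prediction hints, which ICF
can fold even though their expected profiles differ.

```c
/* Identical logic, opposite prediction hints.  Once the hints are
   lowered to GIMPLE_PREDICT and later stripped, ICF sees equal bodies
   and may fold the two functions into one.  */
static int fast_path (int x)
{
  if (__builtin_expect (x > 0, 1))   /* hinted: usually taken */
    return 2 * x;
  return -x;
}

static int slow_path (int x)
{
  if (__builtin_expect (x > 0, 0))   /* hinted: usually not taken */
    return 2 * x;
  return -x;
}
```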

[Bug lto/114241] False-positive -Wodr warning when using -flto and -fno-semantic-interposition

2024-03-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114241

Jan Hubicka  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #3 from Jan Hubicka  ---
Mine. Will debug why the tables diverge.

[Bug debug/92387] [11/12/13 Regression] gcc generates wrong debug information at -O1 since r10-1907-ga20f263ba1a76a

2024-03-04 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92387

--- Comment #5 from Jan Hubicka  ---
The revision changes inlining decisions, so it would probably be possible to
reproduce the problem without that change with the right always_inline and
noinline attributes.

[Bug tree-optimization/114207] [12/13/14 Regression] modref gets confused by vecotorized code ` -O3 -fno-tree-forwprop` since r12-5439

2024-03-03 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114207

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot 
gnu.org

--- Comment #3 from Jan Hubicka  ---
mine.

The summary is:
  loads:
  Base 0: alias set 1
Ref 0: alias set 1
  access: Parm 0 param offset:4 offset:0 size:64 max_size:64
  stores:
  Base 0: alias set 1
Ref 0: alias set 1
  access: Parm 0 param offset:0 offset:0 size:64 max_size:64

while with fwprop we get:
  loads:
  Base 0: alias set 1
Ref 0: alias set 1
  access: Parm 0 param offset:0 offset:0 size:64 max_size:64
  stores:
  Base 0: alias set 1
Ref 0: alias set 1
  access: Parm 0 param offset:0 offset:0 size:64 max_size:64

So it seems that the offset is misaccounted.
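What the two summaries describe can be sketched like this (hypothetical
functions; "param offset" is the byte offset of the access relative to the
pointer parameter, and size 64 is the access size in bits):

```c
#include <string.h>

/* Corresponds to "Parm 0 param offset:4 ... size:64": a 64-bit load
   4 bytes past the pointer parameter.  */
long long load_at_offset_4 (const char *p)
{
  long long v;
  memcpy (&v, p + 4, sizeof v);
  return v;
}

/* Corresponds to "Parm 0 param offset:0 ... size:64": a 64-bit load
   at the parameter itself.  */
long long load_at_offset_0 (const char *p)
{
  long long v;
  memcpy (&v, p, sizeof v);
  return v;
}
```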

[Bug lto/85432] Wodr can be more verbose for C code

2024-03-03 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85432

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |WORKSFORME

--- Comment #1 from Jan Hubicka  ---
This has been solved for a long time.  We recognize ODR types by mangled names,
which are produced only by the C++ frontend.  I checked that GCC 12, 13 and
trunk do not produce the warning.

[Bug tree-optimization/114052] [11/12/13/14 Regression] Wrong code at -O2 for well-defined infinite loop

2024-02-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114052

--- Comment #5 from Jan Hubicka  ---
So if I understand it right, you want to determine the property that if the
loop header is executed, then the BB containing undefined behavior will be
executed in that iteration, too.

modref tracks whether a function will always return, and if it cannot
determine that, it sets the side_effect flag, so you can check for that in the
modref summary.
It uses finite_function_p, which was originally done for pure/const detection
and is implemented by checking in the loop nest whether all loops are known to
be finite, and also by checking for irreducible loops.

In your setup you probably also want to check for volatile asms, which are
also possibly infinite. In modref we get around that by considering them to be
side effects anyway.


There is also determine_unlikely_bbs, which tries to set profile_count to zero
for as many basic blocks as possible by propagating backward & forward from
basic blocks containing undefined behaviour or cold noreturn calls.

The backward walk can be used to determine the property that executing the
header implies UB.  It stops on all loops though. In this case it would be nice
to walk through loops known to be finite...

[Bug ipa/108802] [11/12/13/14 Regression] missed inlining of call via pointer to member function

2024-02-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108802

--- Comment #5 from Jan Hubicka  ---
I don't think we can reasonably expect every caller of a lambda function to be
early inlined, so we need to extend ipa-prop to understand the obfuscated code.
I discussed that with Martin some time ago - I think this is a quite common
problem with modern C++, so we will need to pattern match this, which is quite
unfortunate.

[Bug ipa/111960] [14 Regression] ICE: during GIMPLE pass: rebuild_frequencies: SIGSEGV (Invalid read of size 4) with -fdump-tree-rebuild_frequencies-all

2024-02-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111960

--- Comment #5 from Jan Hubicka  ---
Hmm, cfg.cc:815 for me is:
fputs (", maybe hot", outf);
which seems quite safe.

The problem does not seem to reproduce for me:
jh@ryzen3:~/gcc/build/gcc> ./xgcc -B ./  tt.c -O
--param=max-inline-recursive-depth=100 -fdump-tree-rebuild_frequencies-all
-wrapper valgrind
==25618== Memcheck, a memory error detector
==25618== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==25618== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==25618== Command: ./cc1 -quiet -iprefix
/home/jh/gcc/build/gcc/../lib64/gcc/x86_64-pc-linux-gnu/14.0.1/ -isystem
./include -isystem ./include-fixed tt.c -quiet -dumpdir a- -dumpbase tt.c
-dumpbase-ext .c -mtune=generic -march=x86-64 -O
-fdump-tree-rebuild_frequencies-all --param=max-inline-recursive-depth=100
-o /tmp/ccpkfjdK.s
==25618== 
==25618== 
==25618== HEAP SUMMARY:
==25618== in use at exit: 1,818,714 bytes in 1,175 blocks
==25618==   total heap usage: 39,645 allocs, 38,470 frees, 12,699,874 bytes
allocated
==25618== 
==25618== LEAK SUMMARY:
==25618==definitely lost: 0 bytes in 0 blocks
==25618==indirectly lost: 0 bytes in 0 blocks
==25618==  possibly lost: 8,032 bytes in 1 blocks
==25618==still reachable: 1,810,682 bytes in 1,174 blocks
==25618== suppressed: 0 bytes in 0 blocks
==25618== Rerun with --leak-check=full to see details of leaked memory
==25618== 
==25618== For lists of detected and suppressed errors, rerun with: -s
==25618== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==25627== Memcheck, a memory error detector
==25627== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==25627== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==25627== Command: ./as --64 -o /tmp/ccp5TNme.o /tmp/ccpkfjdK.s
==25627== 
==25637== Memcheck, a memory error detector
==25637== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==25637== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==25637== Command: ./collect2 -plugin ./liblto_plugin.so
-plugin-opt=./lto-wrapper -plugin-opt=-fresolution=/tmp/cclWZD7F.res
-plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s
-plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc
-plugin-opt=-pass-through=-lgcc_s --eh-frame-hdr -m elf_x86_64 -dynamic-linker
/lib64/ld-linux-x86-64.so.2 /lib/../lib64/crt1.o /lib/../lib64/crti.o
./crtbegin.o -L. -L/lib/../lib64 -L/usr/lib/../lib64 /tmp/ccp5TNme.o -lgcc
--push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed
-lgcc_s --pop-state ./crtend.o /lib/../lib64/crtn.o
==25637== 
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld:
/lib/../lib64/crt1.o: in function `_start':
/home/abuild/rpmbuild/BUILD/glibc-2.38/csu/../sysdeps/x86_64/start.S:103:(.text+0x2b):
undefined reference to `main'
collect2: error: ld returned 1 exit status
==25637== 
==25637== HEAP SUMMARY:
==25637== in use at exit: 89,760 bytes in 39 blocks
==25637==   total heap usage: 175 allocs, 136 frees, 106,565 bytes allocated
==25637== 
==25637== LEAK SUMMARY:
==25637==definitely lost: 0 bytes in 0 blocks
==25637==indirectly lost: 0 bytes in 0 blocks
==25637==  possibly lost: 0 bytes in 0 blocks
==25637==still reachable: 89,760 bytes in 39 blocks
==25637==   of which reachable via heuristic:
==25637== newarray   : 1,544 bytes in 1 blocks
==25637== suppressed: 0 bytes in 0 blocks
==25637== Rerun with --leak-check=full to see details of leaked memory
==25637== 
==25637== For lists of detected and suppressed errors, rerun with: -s
==25637== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

[Bug middle-end/113907] [12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

Jan Hubicka  changed:

   What|Removed |Added

Summary|[14 regression] ICU |[12/13/14 regression] ICU
   |miscompiled since on x86|miscompiled since on x86
   |since   |since
   |r14-5109-ga291237b628f41|r14-5109-ga291237b628f41

--- Comment #41 from Jan Hubicka  ---
OK, the reason why this does not work is that ranger ignores earlier value
ranges on everything but default defs and phis.

// This is where the ranger picks up global info to seed initial
// requests.  It is a slightly restricted version of
// get_range_global() above.
//
// The reason for the difference is that we can always pick the
// default definition of an SSA with no adverse effects, but for other
// SSAs, if we pick things up to early, we may prematurely eliminate
// builtin_unreachables.
//
// Without this restriction, the test in g++.dg/tree-ssa/pr61034.C has
// all of its unreachable calls removed too early.
//
// See discussion here:
// https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571709.html

void
gimple_range_global (vrange , tree name, struct function *fun)
{
  tree type = TREE_TYPE (name);
  gcc_checking_assert (TREE_CODE (name) == SSA_NAME);

  if (SSA_NAME_IS_DEFAULT_DEF (name) || (fun && fun->after_inlining)
  || is_a (SSA_NAME_DEF_STMT (name)))
{ 
  get_range_global (r, name, fun);
  return;
}
  r.set_varying (type);
}


This makes ipa-prop ignore the earlier known value range and masks the bug.
However, adding a PHI makes the problem reproduce:
#include 
#include 
int data[100];
int c;

static __attribute__((noinline))
int bar (int d, unsigned int d2)
{
  if (d2 > 30)
  c++;
  return d + d2;
}
static int
test2 (unsigned int i)
{
  if (i > 100)
__builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
return data[i];
  for (int j = 0; j < 100; j++)
data[i] += bar (data[j], i&1 ? i+17 : i + 16);
  return data[i];
}

static int
test (unsigned int i)
{
  if (i > 10)
__builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
return data[i];
  for (int j = 0; j < 100; j++)
data[i] += bar (data[j], i&1 ? i+17 : i + 16);
  return data[i];
}
int
main ()
{
  int ret = test (1) + test (2) + test (3) + test2 (4) + test2 (30);
  if (!c)
  abort ();
  return ret;
}

This fails with trunk, gcc12 and gcc13 and also with Jakub's patch.

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #39 from Jan Hubicka  ---
This testcase
#include 
int data[100];

__attribute__((noinline))
int bar (int d, unsigned int d2)
{
  if (d2 > 10)
printf ("Bingo\n");
  return d + d2;
}

int
test2 (unsigned int i)
{
  if (i > 10)
__builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
return data[i];
  printf ("%i\n",i);
  for (int j = 0; j < 100; j++)
data[i] += bar (data[j], i+17);
  return data[i];
}
int
test (unsigned int i)
{
  if (i > 100)
__builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
return data[i];
  printf ("%i\n",i);
  for (int j = 0; j < 100; j++)
data[i] += bar (data[j], i+17);
  return data[i];
}
int
main ()
{
  test (1);
  test (2);
  test (3);
  test2 (4);
  test2 (100);
  return 0;
}

gets me most of what I want to reproduce the ipa-prop problem. Functions test
and test2 are split with different value ranges visible in the fnsplit dump.
However, curiously enough, the ipa-prop analysis seems to ignore the value
ranges and does not attach them to the jump functions, which is odd...

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-15 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #31 from Jan Hubicka  ---
Having a testcase is great. I was just playing with crafting one.
I am still concerned about value ranges in ipa-prop's jump functions.
Let me see if I can modify the testcase to also trigger the problem with value
ranges in ipa-prop jump functions.

Not streaming value ranges is an omission on my side (I mistakenly assumed we
do stream them).  We ought to stream them, since otherwise we will lose
propagated return value ranges in partitioned programs, which is a pity.

[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172

2024-02-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291

--- Comment #6 from Jan Hubicka  ---
Created attachment 57427
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57427=edit
patch

The patch makes compilation finish in reasonable time.
I ended up needing to drop DISREGARD_INLINE_LIMITS in late inlining for
functions with self-recursive always_inlines, since these grow large quickly
and even non-recursive inlining is too slow.  We also end up with quite ugly
diagnostics of the form:

tt.c:13:1: error: inlining failed in call to ‘always_inline’ ‘f1’: --param
max-inline-insns-auto limit reached
   13 | f1 (void)
  | ^~
tt.c:17:3: note: called from here
   17 |   f1 ();
  |   ^
tt.c:6:1: error: inlining failed in call to ‘always_inline’ ‘f0’: --param
max-inline-insns-auto limit reached
6 | f0 (void)
  | ^~
tt.c:16:3: note: called from here
   16 |   f0 ();
  |   ^
tt.c:13:1: error: inlining failed in call to ‘always_inline’ ‘f1’: --param
max-inline-insns-auto limit reached
   13 | f1 (void)
  | ^~
tt.c:15:3: note: called from here
   15 |   f1 ();
  |   ^
In function ‘f1’,
inlined from ‘f0’ at tt.c:8:3,


which is quite large, so I cannot add it to the testsuite.  I will see if I can
reduce this even more.

[Bug middle-end/111054] [14 Regression] ICE: in to_sreal, at profile-count.cc:472 with -O3 -fno-guess-branch-probability since r14-2967

2024-02-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111054

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Jan Hubicka  ---
Fixed.

[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172

2024-02-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291

--- Comment #5 from Jan Hubicka  ---
There is a cap in want_inline_self_recursive_call_p which gives up on inlining
after reaching the max recursive inlining depth of 8. The problem is that the
tree here is too wide. After early inlining, f0 contains 4 calls to f1 and 3
calls to f0, and similarly for f1, so we have something like (9+3*9)^8 as a
cap on the number of inlines, which takes a while to converge.

One may want to limit the number of copies of function A within function B
rather than the depth, but that number can be large even for sane code.

I am making a patch to make the inliner ignore always_inline on all
self-recursive inline decisions.
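The shape of the inline tree can be sketched like this (a hypothetical
reconstruction; the always_inline attributes and the real recursion cap are
left out so the sketch compiles and terminates, with a depth parameter standing
in for the inliner's depth limit of 8). Each function makes seven calls, so the
number of call sites grows like 7^depth:

```c
static int f1 (int depth);

/* f0 contains 4 calls to f1 and 3 calls to f0, mirroring the shape
   described above; f1 is symmetric.  */
static int f0 (int depth)
{
  if (depth == 0)
    return 1;
  return f1 (depth - 1) + f1 (depth - 1) + f1 (depth - 1) + f1 (depth - 1)
       + f0 (depth - 1) + f0 (depth - 1) + f0 (depth - 1);
}

static int f1 (int depth)
{
  if (depth == 0)
    return 1;
  return f0 (depth - 1) + f0 (depth - 1) + f0 (depth - 1) + f0 (depth - 1)
       + f1 (depth - 1) + f1 (depth - 1) + f1 (depth - 1);
}
```

With a depth cap of 8, a fully inlined body would contain on the order of
7^8 ≈ 5.8 million call copies, matching the blow-up described.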

[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172

2024-02-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291

--- Comment #4 from Jan Hubicka  ---
There is a cap in want_inline_self_recursive_call_p which gives up on inlining
after reaching max recursive inlining depth of 8. Problem is that the tree here
is too wide. After early inlining f0 contains 4 calls to f1 and

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-14 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #29 from Jan Hubicka  ---
The safest fix is to make equals_p reject merging of functions with different
value ranges assigned to corresponding SSA names.  I would hope that, since
early opts are still mostly local, this does not lead to a very large
degradation. This is lame of course.

If we go for smarter merging, we need to also handle ipa-prop jump functions.
In that case I think equals_p needs to check if the value ranges in SSA_NAMEs
and jump functions differ and, if so, keep that noted so the merging code can
do the corresponding update.  I will check how hard it is to implement this.
(Equality handling is Martin Liska's code, but if I recall right, each
equivalence class has a leader, and we can keep track of whether there are
some differences WRT that leader; I do not recall how subdivision of
equivalence classes is handled.)

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-13 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #13 from Jan Hubicka  ---
So my understanding is that ivopts does something like

 offset = base2 - base1

and then translates
 val = base2[i]
to
 val = *((base1 + i) + offset)

where (base1 + i) is then an iv variable.

I wonder if we consider a memory reference whose base is changed via an
offset like this to be a valid transformation. Is there a way to tell when
this happens?
A quick fix would be to run IPA modref before ivopts, but I do not see how
such a transformation can work with the rest of alias analysis (PTA etc.)
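To make the shape of the transform concrete, here is a hand-written C sketch of that rewrite (my own reduction, not code from the bug; the cross-object pointer subtraction is exactly the part that is not valid ISO C, which is why alias analysis loses the original base):

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written version of the ivopts rewrite described above: instead of
   indexing base2 directly, compute a byte offset between the two bases once
   and reuse the base1 induction variable for both accesses.  The offset
   computation crosses two distinct objects, so the original base of the
   second access is no longer visible to alias analysis.  */
int
sum_pairs (int *base1, int *base2, int n)
{
  ptrdiff_t offset = (char *) base2 - (char *) base1;
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      /* val = base2[i] rewritten as val = *((base1 + i) + offset).  */
      int val = *(int *) ((char *) (base1 + i) + offset);
      sum += base1[i] + val;
    }
  return sum;
}
```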

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #8 from Jan Hubicka  ---
I will take a look.  Modref only reuses the code detecting erroneous paths
from ssa-split-paths, so that code will get confused, too. It makes sense for
ivopts to compute the difference of two memory allocations, but I wonder if
that won't also confuse PTA and other stuff, so perhaps we need a way to
explicitly tag memory locations where such an optimization happens? (To make
it clear that the original base is lost, or to keep track of it.)

[Bug ipa/113359] [13 Regression] LTO miscompilation of ceph on aarch64

2024-02-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359

--- Comment #11 from Jan Hubicka  ---
If there are two ODR types with the same ODR name, one with an integer and
the other with a pointer type as the third field, then indeed we should get
an ODR warning and give up on handling them as ODR types for type merging.

So dumping their assembler names would be a useful starting point.

Of course if you have two ODR types with different names but you mix them up
in a COMDAT function of the same name, then the warning will not trigger, so
this might be a missing type compatibility check in the ipa-sra or ipa-prop
summary, too.

[Bug ipa/97119] Top level option to disable creation of IPA symbols such as .localalias is desired

2024-02-02 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97119

--- Comment #7 from Jan Hubicka  ---
Local aliases are created by the ipa-visibility pass.  The most common case
is that a function is declared inline but ELF interposition rules say that
the symbol can be overwritten by a different library.  Since GCC knows that
all implementations must be equivalent, it can force calls within the DSO to
be direct.

I am not quite sure how this confuses stack unwinding on Solaris?

For live patching, if you want to patch an inline function, one definitely
needs to look for the places it has been inlined to. However, in the
situation where the function got offlined, I think live patching should just
work, since it will place a jump at the beginning of the function body.

The logic for creating local aliases is in ipa-visibility.cc.  Adding a
command line option to control it is not hard. There are other
transformations we do there, like breaking up comdat groups and other things.

.part aliases are controlled by -fno-partial-inlining, .isra clones by
-fno-ipa-sra. There is also ipa-cp, controlled by -fno-ipa-cp.
We also create aliases as part of OpenMP offloading and LTO partitioning;
those are kind of mandatory (there is no way to produce correct code without
them).

[Bug ipa/113422] Missed optimizations in the presence of pointer chains

2024-01-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113422

--- Comment #2 from Jan Hubicka  ---
Cycling read-only var discovery would be quite expensive, since you need to
interleave it with early opts each round.  I wonder how llvm handles this?

I think there is more hope with IPA-PTA getting scalable version at -O2 and
possibly being able to solve this.

[Bug ipa/113520] ICE with mismatched types with LTO (tree check: expected array_type, have integer_type in array_ref_low_bound)

2024-01-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113520

--- Comment #8 from Jan Hubicka  ---
I think the ipa-cp summaries should be used only when types match. At least
Martin added type streaming for all the jump functions.  So we are missing some
check?

[Bug tree-optimization/110852] [14 Regression] ICE: in get_predictor_value, at predict.cc:2695 with -O -fno-tree-fre and __builtin_expect() since r14-2219-geab57b825bcc35

2024-01-17 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110852

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Jan Hubicka  ---
Fixed.

[Bug c++/109753] [13/14 Regression] pragma GCC target causes std::vector not to compile (always_inline on constructor)

2024-01-10 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109753

--- Comment #12 from Jan Hubicka  ---
I think this is a problem with the two meanings of always_inline. One is "it
must be inlined or otherwise we will not be able to generate code"; the other
is "disregard inline limits".

I guess a practical solution here would be to ignore always_inline for
functions called from static construction wrappers (since those only
optimize away the array of function pointers). The question is how to
communicate this down from the FE to ipa-inline...

[Bug middle-end/79704] [meta-bug] Phoronix Test Suite compiler performance issues

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79704
Bug 79704 depends on bug 109811, which changed state.

Bug 109811 Summary: libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #19 from Jan Hubicka  ---
I think we can declare this one fixed.

[Bug target/113236] WebP benchmark is 20% slower vs. Clang on AMD Zen 4

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113236

Jan Hubicka  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2024-01-05
 CC||hubicka at gcc dot gnu.org
 Status|UNCONFIRMED |NEW

--- Comment #2 from Jan Hubicka  ---
On zen3 I get 0.75MP/s for GCC and 0.80MP/s for clang, so only 6.6%, but seems
reproducible.

Profile looks comparable:

gcc:
  30.96%  cwebp  libwebp.so.7.1.5  [.] GetCombinedEntropyUnre
  26.19%  cwebp  libwebp.so.7.1.5  [.] VP8LHashChainFill
   3.34%  cwebp  libwebp.so.7.1.5  [.] CalculateBestCacheSize
   3.30%  cwebp  libwebp.so.7.1.5  [.] CombinedShannonEntropy
   3.21%  cwebp  libwebp.so.7.1.5  [.] CollectColorBlueTransf

clang:

  34.06%  cwebp  libwebp.so.7.1.5  [.] GetCombinedEntropy
  28.95%  cwebp  libwebp.so.7.1.5  [.] VP8LHashChainFill
   5.37%  cwebp  libwebp.so.7.1.5  [.] VP8LGetBackwardReferences
   4.39%  cwebp  libwebp.so.7.1.5  [.] CombinedShannonEntropy_SS
   4.28%  cwebp  libwebp.so.7.1.5  [.] CollectColorBlueTransform


In the first loop clang seems to if-convert while GCC doesn't:
  0.59 │   lea  kSLog2Table,%rdi
  3.69 │   vmovss   (%rdi,%rax,4),%xmm0
  0.98 │ 6f:   vcvtsi2ss%edx,%xmm2,%xmm1
  0.63 │   vfnmadd213ss 0x0(%r13),%xmm0,%xmm1
 38.16 │   vmovss   %xmm1,0x0(%r13)
  5.48 │   cmp  %r12d,0xc(%r13)
  0.06 │ ↓ jae  89 
   │   mov  %r12d,0xc(%r13)
  0.99 │ 89:   mov  0x4(%r13),%edi 
  0.96 │ 8d:   xor  %eax,%eax  
  0.40 │   test %r12d,%r12d
  0.60 │   setne%al 



   │   vcvtsd2ss%xmm0,%xmm0,%xmm1   
  0.02 │362:   mov  %r15d,%eax  
  0.57 │   imul %r12d,%eax  
  0.00 │   cmp  %r12d,%r9d  
  0.03 │   cmovbe   %r12d,%r9d  
  0.02 │   vmovd%eax,%xmm0  
  0.08 │   vpinsrd  $0x1,%r15d,%xmm0,%xmm0  
  1.50 │   vpaddd   %xmm0,%xmm4,%xmm4   
  1.08 │   vcvtsi2ss%r15d,%xmm5,%xmm0   
  0.87 │   vfnmadd231ss %xmm0,%xmm1,%xmm3   
  5.40 │   vmovaps  %xmm3,%xmm0 
  0.02 │38c:   xor  %eax,%eax   
  0.16 │   cmp  $0x4,%r15d

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #6 from Jan Hubicka  ---
The internal loops are:

static const unsigned keccakf_rotc[24] = {
   1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, 27, 41, 56, 8, 25, 43, 62, 18,
39, 61, 20, 44
}; 

static const unsigned keccakf_piln[24] = {
   10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4, 15, 23, 19, 13, 12, 2, 20, 14,
22, 9, 6, 1
};

static void keccakf(ulong64 s[25])
{  
   int i, j, round;
   ulong64 t, bc[5];

   for(round = 0; round < SHA3_KECCAK_ROUNDS; round++) {
  /* Theta */
  for(i = 0; i < 5; i++)
 bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20];

  for(i = 0; i < 5; i++) { 
 t = bc[(i + 4) % 5] ^ ROL64(bc[(i + 1) % 5], 1);
 for(j = 0; j < 25; j += 5)
s[j + i] ^= t;
  }
  /* Rho Pi */
  t = s[1];
  for(i = 0; i < 24; i++) {
 j = keccakf_piln[i];
 bc[0] = s[j];
 s[j] = ROL64(t, keccakf_rotc[i]);
 t = bc[0];
  }
  /* Chi */
  for(j = 0; j < 25; j += 5) {
 for(i = 0; i < 5; i++)
bc[i] = s[j + i];
 for(i = 0; i < 5; i++)
s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5];
  }
  s[0] ^= keccakf_rndc[round];
   }
}

I suppose with complete unrolling this will propagate, partly stay in
registers, and fold. I think increasing the default limits, especially at
-O3, may make sense. The value of 16 has been there for a very long time (I
think since the initial implementation).
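As a reduced illustration (my own example, not taken from SMHasher): once a short fixed-count loop over a small local array is completely peeled, the temporaries can be promoted to registers and the bit operations folded, which is the effect the larger --param max-completely-peel-times enables for keccakf:

```c
#include <assert.h>

/* Reduced example of the keccakf pattern: both 5-iteration loops can be
   completely peeled, after which bc[] is promoted to registers and the
   loads/stores disappear.  */
unsigned long
theta_fold (const unsigned long *s)
{
  unsigned long bc[5];
  for (int i = 0; i < 5; i++)
    bc[i] = s[i] ^ s[i + 5];
  unsigned long r = 0;
  for (int i = 0; i < 5; i++)
    r ^= bc[i];
  return r;
}
```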

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Jan Hubicka  changed:

   What|Removed |Added

Summary|SMHasher SHA3-256 benchmark |SMHasher SHA3-256 benchmark
   |is almost 40% slower vs.|is almost 40% slower vs.
   |Clang   |Clang (not enough complete
   ||loop peeling)

--- Comment #5 from Jan Hubicka  ---
On my zen3 machine a default build gets me 180MB/s,
-O3 -flto -funroll-all-loops gets me 193MB/s, and
-O3 -flto --param max-completely-peel-times=30 gets me 382MB/s. The speedup
is gone with --param max-completely-peel-times=20; the default is 16.

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #4 from Jan Hubicka  ---
I keep mentioning to Larabel that he should use -fno-semantic-interposition,
but he doesn't.

Profile is very simple:

 96.75%  SMHasher  [.] keccakf.lto_priv.0

All goes to simple loop. On Zen3 gcc 13 -march=native -Ofast -flto I get:

  3.85 │330:   mov%r8,%rdi  
  7.68 │   movslq (%rsi,%r9,1),%rcx 
  3.85 │   lea(%rax,%rcx,8),%r10
  3.86 │   mov(%rdx,%r9,1),%ecx 
  3.83 │   add$0x4,%r9  
  3.86 │   mov(%r10),%r8
  7.37 │   rol%cl,%rdi  
  7.37 │   mov%rdi,(%r10)   
  4.76 │   cmp$0x60,%r9 
  0.00 │ ↑ jne330   


Clang seems to unroll it:

  0.25 │ d0:   mov  -0x48(%rsp),%rdx
  0.25 │   xor  %r12,%rcx
  0.25 │   mov  %r13,%r12
  0.25 │   mov  %r13,0x10(%rsp)
  0.25 │   mov  %rax,%r13
  0.26 │   xor  %r15,%r13
  0.23 │   mov  %r11,-0x70(%rsp)
  0.25 │   mov  %r8,0x8(%rsp)
  0.25 │   mov  %r15,-0x40(%rsp)
  0.25 │   mov  %r10,%r15
  0.26 │   mov  %r10,(%rsp)
  0.26 │   mov  %r14,%r10
  0.25 │   xor  %r12,%r10
  0.26 │   xor  %rsi,%r15
  0.24 │   mov  %rbp,-0x80(%rsp)
  0.25 │   xor  %rcx,%r15
  0.26 │   mov  -0x60(%rsp),%rcx
  0.25 │   xor  -0x68(%rsp),%r15
  0.26 │   xor  %rbp,%rdx
  0.25 │   mov  -0x30(%rsp),%rbp
  0.25 │   xor  %rdx,%r13
  0.24 │   mov  -0x10(%rsp),%rdx
  0.25 │   mov  %rcx,%r12
  0.24 │   xor  %rcx,%r13
  0.25 │   mov  $0x1,%ecx
  0.25 │   xor  %r11,%rdx
  0.24 │   mov  %r8,%r11
  0.25 │   mov  -0x28(%rsp),%r8
  0.26 │   xor  -0x58(%rsp),%r8
  0.24 │   xor  %rdx,%r8
  0.26 │   mov  -0x8(%rsp),%rdx
  0.25 │   xor  %rbp,%r8
  0.26 │   xor  %r11,%rdx
  0.25 │   mov  -0x20(%rsp),%r11
  0.25 │   xor  %rdx,%r10

[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line

2024-01-01 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

--- Comment #23 from Jan Hubicka  ---
Created attachment 56970
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56970&action=edit
Patch I am testing

Hi,
this adds a -falign-all-functions parameter.  It still looks like the more
reasonable (and backward compatible) thing to do.  I also poked at Richi's
suggestion of extending the syntax of -falign-functions, but I think it is
less readable.

[Bug ipa/92606] [11/12/13 Regression][avr] invalid merge of symbols in progmem and data sections

2023-12-12 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92606

--- Comment #31 from Jan Hubicka  ---
This is Maritn's code, but I agree that equals_wpa should reject pairs with
"dangerous" attributes on them (ideally we should hash them). 
I think we could add test for same attributes to equals_wpa and eventually
white list attributes we consider mergeable?
There are attributes that serves no meaning once we enter backend, so it may be
also good option to strip them, so they are not confusing passes like ICF.

[Bug ipa/81323] IPA-VRP doesn't handle return values

2023-12-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81323

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #9 from Jan Hubicka  ---
Note that r14-5628-g53ba8d669550d3 does just the easy part: propagating
within a single translation unit. We will need to add the actual IPA bits
into WPA next stage1.

[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line

2023-12-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot 
gnu.org

--- Comment #18 from Jan Hubicka  ---
Reading all the discussion again, I am leaning towards -falign-all-functions
plus a documentation update explaining that -falign-functions/-falign-loops
are optimizations and are ignored for -Os.

I do use -falign-functions/-falign-loops when tuning for new generations of
CPUs and I definitely want a way to specify an alignment that is ignored for
cold functions (as a performance optimization); we have had this behavior
since profile code was introduced in 2002.

As an optimization, we also want to have hot functions aligned to more than
the 8-byte boundary needed for patching.

I will prepare a patch for this and send it for discussion.  Perhaps we want
-flive-patching to also imply a FUNCTION_BOUNDARY increase on x86-64? Or is
live patching useful if function entries are not aligned?

[Bug tree-optimization/110062] missed vectorization in graphicsmagick

2023-11-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #11 from Jan Hubicka  ---
trunk -O3 -flto -march=native -fopenmp
Operation: Sharpen:
257
256
256

Average: 256 Iterations Per Minute
GCC13 -O3 -flto -march=native -fopenmp
257
256
256

Average: 256 Iterations Per Minute
clang17 O3 -flto -march=native -fopenmp
   Operation: Sharpen:
257
256
256
Average: 256 Iterations Per Minute

So I guess I will need to try on zen3 to see if there is any difference.

the internal loop is:
  0.00 │460:┌─→movzbl  0x2(%rdx,%rax,4),%esi
  0.02 │    │  vmovss  (%r8,%rax,4),%xmm2
  0.95 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 20.22 │    │  movzbl  0x1(%rdx,%rax,4),%esi
  0.01 │    │  vfmadd231ss %xmm1,%xmm2,%xmm3
 11.97 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 18.76 │    │  movzbl  (%rdx,%rax,4),%esi
  0.00 │    │  inc %rax
  0.72 │    │  vfmadd231ss %xmm1,%xmm2,%xmm4
 12.55 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 14.95 │    │  vfmadd231ss %xmm1,%xmm2,%xmm5
 15.93 │    ├──cmp %rax,%r13
  0.35 │    └──jne 460

so it still does not get

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-11-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #18 from Jan Hubicka  ---
I made a typo:

Mainline with -O2 -flto  -march=native run manually since build machinery patch
is needed
23.03
22.85
23.04

Should be 
Mainline with -O3 -flto  -march=native run manually since build machinery patch
is needed
23.03
22.85
23.04

So with -O2 we still get a slightly lower score than clang; with -O3 we are
slightly better. push_back inlining does not seem to be a problem (as tested
by increasing limits), so perhaps it is the more aggressive
unrolling/vectorization settings clang has at -O2.

I think upstream jpegxl should use -O3 or -Ofast instead of -O2.  It is quite
a typical kind of task that benefits from higher optimization levels.

I filled in https://github.com/libjxl/libjxl/issues/2970

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #20 from Jan Hubicka  ---
On zen4 hardware I now get

GCC13 with -O3 -flto -march=native -fopenmp
2163
2161
2153

Average: 2159 Iterations Per Minute

clang 17 with -O3 -flto -march=native -fopenmp
2004
1988
1991

Average: 1994 Iterations Per Minute

trunk -O3 -flto -march=native -fopenmp
Operation: Resizing:
2126
2135
2123

Average: 2128 Iterations Per Minute

So no big changes here...

[Bug middle-end/112653] PTA should handle correctly escape information of values returned by a function

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

--- Comment #8 from Jan Hubicka  ---
On ARM32 and other targets, methods return the this pointer.  Together with
making the return value escape, this probably completely disables any chance
of IPA tracking of C++ data types...

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #10 from Jan Hubicka  ---
runtimes on zen4 hardware.

trunk -O3 -flto -march=native
42171
42964
42106
clang -O3 -flto -march=native
37393
37423
37508
gcc 13 -O3 -flto -march=native
42380
42314
43285

So it seems the performance did not change.

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #15 from Jan Hubicka  ---
With the SRA improvements in r:aae723d360ca26cd9fd0b039fb0a616bd0eae363 we
finally get good performance at -O2. Improvements to the push_back
implementation also help a bit.

Mainline with default flags (-O2):
Input: JPEG - Quality: 90:
19.76
19.75
19.68
Mainline with -O2 -march=native:
Input: JPEG - Quality: 90:
20.01
20
19.98
Mainline with -O2 -march=native -flto
Input: JPEG - Quality: 90:
19.95
19.98
19.81
Mainline with -O2 -march=native -flto --param max-inline-insns-auto=80 (this
makes push_back inlined)
Input: JPEG - Quality: 90:
19.98
20.05
20.03
Mainline with -O2 -flto  -march=native -I/usr/include/c++/v1 -nostdinc++ -lc++
(so clang's libc++)
21.38
21.37
21.32
Mainline with -O2 -flto  -march=native, run manually since a build machinery
patch is needed
23.03
22.85
23.04
Clang 17 with -O2 -march=native -flto and also -fno-tree-vectorize
-fno-tree-slp-vectorize added by cmake. This is with system libstdc++ from
GCC13 so before push_back improvements.
21.16
20.95
21.06
Clang 17 with -O2 -march=native -flto and also -fno-tree-vectorize
-fno-tree-slp-vectorize added by cmake. This is with trunk libstdc++ with
push_back improvements.
21.2
20.93
20.98
Clang 17 with -O2 -march=native -flto -stdlib=libc++ and also
-fno-tree-vectorize -fno-tree-slp-vectorize added by cmake. This is with
clang's libc++.
Input: JPEG - Quality: 90:
22.08
21.88
21.78
Clang 17 with -O3 -march=native -flto
23.08
22.90
22.84


libc++ declares push_back always_inline and splits out the slow copying
path. I think the inlined part is still a bit too large for inlining at -O2.

We could still try to get the remaining roughly 10% without increasing code
size at -O2.
However, the major part of the problem is solved.

[Bug middle-end/112706] New: missed simplification in FRE

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112706

Bug ID: 112706
   Summary: missed simplification in FRE
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Compiling the following testcase (simplified from repeated
std::vector::push_back expansion):

int *ptr;
void link_error ();
void
test ()
{
  int *ptr1 = ptr + 10;
  int *ptr2 = ptr + 20;
  if (ptr1 == ptr2)
    link_error ();
}

with gcc -O2 t.C -fdump-tree-all-details
one can check that link_error is optimized away really late:

jh@ryzen4:/tmp> grep link_error a-t.C*

a-t.C.106t.cunrolli:  link_error ();
a-t.C.107t.backprop:  link_error ();
a-t.C.108t.phiprop:  link_error ();
a-t.C.109t.forwprop2:link_error ();

this is too late for some optimization to catch up (in the case of std::vector
we end up missing DSE since the transform is delayed to forwprop3)

I think this is something value numbering should catch.
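A reduced sketch of the fold value numbering could perform here (function name is mine): both pointers share the same base and differ only by distinct constant offsets, so the comparison result is known without any information about ptr:

```c
#include <assert.h>

/* ptr + 10 and ptr + 20 share the same base and differ by a non-zero
   constant, so the equality can be folded to false regardless of the
   value of ptr.  */
int
same_base_equal (int *ptr)
{
  int *ptr1 = ptr + 10;
  int *ptr2 = ptr + 20;
  return ptr1 == ptr2;
}
```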

[Bug middle-end/112653] PTA should handle correctly escape information of values returned by a function

2023-11-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

--- Comment #7 from Jan Hubicka  ---
Thanks for the explanation.  I think it is quite a common pattern that a new
object is constructed, worked on, and later returned, so I think we ought to
handle this correctly.

Another example just came up in
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637878.html

We should generate the same code for the following two functions:

#include <vector>

auto
f()
{
  std::vector<int> x;
  x.reserve(10);
  for (int i = 0; i < 10; ++i)
    x.push_back(0);
  return x;
}

auto
g()
{ return std::vector<int>(10, 0); }


but we don't since we lose track of values stored in x after every call to new.

[Bug rtl-optimization/112657] [13/14 Regression] missed optimization: cmove not used with multiple returns

2023-11-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112657

--- Comment #8 from Jan Hubicka  ---
The negative return value branch predictor is set to have 98% hitrate (measured
on SPEC2k17 some time ago).  There is --param predictable-branch-outcome that
is also set to 2% so indeed we consider the branch as well predictable by this
heuristics.

Reducing --param should make cmov to happen.

With profile_probability data type we could try something smarter on guessing
if given branch is predictable (such as ignoring guessed values and let
predictor to optionally mark branches as (un)predictable). But it is not quite
clear to me what desired behavior would be...

Guessing predictability of data branches is generally quite hard problem.
Predictablity of loop branches is easier, but we hardly apply BRANCH_COST on
branch closing loop since those are not if-conversion candidates.
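For context, a minimal example (my reduction, not the testcase of this PR) of the kind of data-dependent branch these heuristics have to judge; whether it is better compiled as a cmove or a branch depends on a predictability the compiler can only guess:

```c
#include <assert.h>

/* A data-dependent select: profitable as a cmov when the comparison is
   unpredictable, but as a branch when it is well predicted.  */
int
select_max (int a, int b)
{
  return a > b ? a : b;
}
```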

[Bug ipa/98925] Extend ipa-prop to handle return functions for slot optimization

2023-11-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98925

--- Comment #3 from Jan Hubicka  ---
Return value range propagation was added in
r:53ba8d669550d3a1f809048428b97ca607f95cf5;

however, it works on scalar return values only for now. Extending it to
aggregates is a logical next step and should not be terribly hard.

The code also misses logic for IPA streaming, so it works only in early and
late opts.
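A scalar sketch (hypothetical reduction, names are mine) of what the committed scalar propagation enables: the callee's return value is known to lie in [0, 100], so the caller's comparison can be folded without inlining:

```c
#include <assert.h>

/* clamp always returns a value in [0, 100]; with return value range
   propagation the caller can fold the comparison in never_big to 0 even
   though clamp is not inlined.  */
__attribute__ ((noinline)) int
clamp (int v)
{
  if (v < 0)
    return 0;
  if (v > 100)
    return 100;
  return v;
}

int
never_big (int v)
{
  return clamp (v) > 100; /* always false given the return range */
}
```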

[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line

2023-11-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #17 from Jan Hubicka  ---
-falign-functions/-falign-jumps/-falign-labels/-falign-loops are originally
intended for performance tuning.  Starting a function entry close to the end
of a code cache page may lead to wasted code cache space as well as higher
overhead when calling the function, as the CPU fetches a page which contains
just little useful information.

As such I would like to keep them affecting only hot code (we should update
the documentation to say that).  Internally we have FUNCTION_BOUNDARY, which
specifies the minimal alignment needed by the ABI and is set to 8 bits for
i386.  My understanding is that -fpatchable-function-entry requires the
alignment to be 64 bits in order to make it possible to atomically change the
instruction.

So perhaps we want to make FUNCTION_BOUNDARY 64 for functions where we output
the patchable entry?
I am also OK with extending the flag syntax or adding -fmin-function-alignment
to specify an optional user-defined minimum (increasing FUNCTION_BOUNDARY) if
that seems useful, but I think the first option is the most consistent way to
go with live patching?

[Bug middle-end/112653] We should optimize memmove to memcpy using alias oracle

2023-11-21 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

--- Comment #3 from Jan Hubicka  ---
The PR82898 testcases seem to be about type-based alias analysis. However,
PTA should be usable here.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-21 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849
Bug 109849 depends on bug 110377, which changed state.

Bug 110377 Summary: Early VRP and IPA-PROP should work out value ranges from 
__builtin_unreachable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110377

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug libstdc++/110287] _M_check_len is expensive

2023-11-21 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287
Bug 110287 depends on bug 110377, which changed state.

Bug 110377 Summary: Early VRP and IPA-PROP should work out value ranges from 
__builtin_unreachable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110377

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug middle-end/110377] Early VRP and IPA-PROP should work out value ranges from __builtin_unreachable

2023-11-21 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110377

Jan Hubicka  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #7 from Jan Hubicka  ---
Fixed.

[Bug middle-end/112653] New: We should optimize memmove to memcpy using alias oracle

2023-11-21 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

Bug ID: 112653
   Summary: We should optimize memmove to memcpy using alias
oracle
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

In this testcase (loosely based on the libstdc++ implementation of vectors)
we should be able to turn the memmove into memcpy, because we know that the
two arguments cannot alias:
#include <stdlib.h>
#include <string.h>
char *test;
char *
copy_test ()
{
char *test2 = malloc (1000);
memmove (test2, test, 1000);
return test2;
}
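For illustration, the transformed variant we would like the alias oracle to justify (a sketch, not actual compiler output): the pointer returned by malloc is fresh storage that cannot alias the global, so the overlap-safe memmove can become memcpy:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

char *test;

/* What copy_test could be optimized into: malloc returns storage disjoint
   from anything test can point to, so no overlap is possible and memcpy is
   equivalent to memmove here.  */
char *
copy_test_optimized (void)
{
  char *test2 = malloc (1000);
  memcpy (test2, test, 1000);
  return test2;
}
```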

[Bug libstdc++/110287] _M_check_len is expensive

2023-11-21 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2023-11-21

--- Comment #10 from Jan Hubicka  ---
We now produce reasonable code for _M_check_len and propagate value range of
the return value. This helps us to notice that later allocator call will not
throw exception on invalid size, so we are down from 3 throw calls to one.

Current code is:

size_type std::vector::_M_check_len (const struct vector * const this,
size_type __n, const char * __s)
{
  const size_type __len;
  long unsigned int _1;
  long unsigned int __n.3_2;
  size_type iftmp.4_3;
  long unsigned int _4;
  long unsigned int _7;
  long unsigned int _8;
  long int _9;
  long int _11;
  struct pair_t * _12;
  struct pair_t * _13;

   [local count: 1073741824]:
  _13 = this_6(D)->D.26060._M_impl.D.25361._M_finish;
  _12 = this_6(D)->D.26060._M_impl.D.25361._M_start;
  _11 = _13 - _12;
  _9 = _11 /[ex] 8;
  _7 = (long unsigned int) _9;
  _1 = 1152921504606846975 - _7;
  __n.3_2 = __n;
  if (_1 < __n.3_2)
goto ; [0.00%]
  else
goto ; [100.00%]

   [count: 0]:
  std::__throw_length_error (__s_14(D));

   [local count: 1073741824]:
  _8 = MAX_EXPR <__n.3_2, _7>;
  __len_10 = _7 + _8;
  if (_7 > __len_10)
goto ; [35.00%]
  else
goto ; [65.00%]

   [local count: 697932184]:
  _4 = MIN_EXPR <__len_10, 1152921504606846975>;

   [local count: 1073741824]:
  # iftmp.4_3 = PHI <1152921504606846975(4), _4(5)>
  return iftmp.4_3;

}
I still think we could play games with 2^63 being too large for the standard
allocator and turn __throw_length_error into __builtin_unreachable for that
case. This would help the early inliner inline this function and save some
throw calls in real code.
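The control flow in the dump above corresponds to roughly the following growth logic (a standalone C sketch of what the dump computes, not the libstdc++ source; MAX_ELEMS stands for the 1152921504606846975 constant, i.e. the maximum element count for 8-byte elements):

```c
#include <assert.h>
#include <stddef.h>

/* Standalone model of the _M_check_len dump above.  MAX_ELEMS plays the
   role of the 1152921504606846975 constant (max_size () for 8-byte
   elements).  */
#define MAX_ELEMS 1152921504606846975ULL

size_t
check_len (size_t cur, size_t n)
{
  if (MAX_ELEMS - cur < n)  /* the real code calls __throw_length_error */
    return (size_t) -1;
  size_t len = cur + (n > cur ? n : cur); /* grow by max (n, size ()) */
  if (cur > len)                          /* overflow: clamp to max */
    return MAX_ELEMS;
  return len < MAX_ELEMS ? len : MAX_ELEMS;
}
```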

[Bug libstdc++/110287] _M_check_len is expensive

2023-11-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #9 from Jan Hubicka  ---
This is _M_realloc_insert at release_ssa time:

Released 63 names, 165.79%, removed 63 holes
void std::vector::_M_realloc_insert (struct vector *
const this, struct iterator __position, const struct pair_t & __args#0)
{ 
  struct pair_t * const __position;
  struct pair_t * __new_finish;
  struct pair_t * __old_finish;
  struct pair_t * __old_start;
  long unsigned int _1; 
  struct pair_t * _2;
  struct pair_t * _3;
  long int _4;
  long unsigned int _5;
  struct pair_t * _6;
  const size_type _10;
  long int _13;
  struct pair_t * iftmp.5_15;
  struct pair_t * _17; 
  struct _Vector_impl * _18;
  long unsigned int _22;
  long int _23;
  long unsigned int _24;
  long unsigned int _25;
  struct pair_t * _26;
  long unsigned int _36;

   [local count: 1073741824]:
  __position_27 = MEM[(struct __normal_iterator *)&__position];
  _10 = std::vector::_M_check_len (this_8(D), 1,
"vector::_M_realloc_insert");
  __old_start_11 = this_8(D)->D.25975._M_impl.D.25282._M_start;
  __old_finish_12 = this_8(D)->D.25975._M_impl.D.25282._M_finish;
  _13 = __position_27 - __old_start_11;
  if (_10 != 0)
goto ; [54.67%]
  else
goto ; [45.33%]

   [local count: 587014656]:
  _18 = [(struct _Vector_base *)this_8(D)]._M_impl;
  _17 = std::__new_allocator::allocate (_18, _10, 0B);

   [local count: 1073741824]:
  # iftmp.5_15 = PHI <0B(2), _17(3)>
  _1 = (long unsigned int) _13;
  _2 = iftmp.5_15 + _1;
  *_2 = *__args#0_14(D);
  if (_13 > 0)
goto ; [41.48%]
  else
goto ; [58.52%]

   [local count: 445388112]:
  __builtin_memmove (iftmp.5_15, __old_start_11, _1);

   [local count: 1073741824]:
  _36 = _1 + 8;
  __new_finish_16 = iftmp.5_15 + _36;
  _23 = __old_finish_12 - __position_27;
  if (_23 > 0)
goto ; [41.48%]
  else
goto ; [58.52%]

   [local count: 445388112]:
  _24 = (long unsigned int) _23;
  __builtin_memcpy (__new_finish_16, __position_27, _24);

   [local count: 1073741824]:
  _25 = (long unsigned int) _23;
  _26 = __new_finish_16 + _25;
  _3 = this_8(D)->D.25975._M_impl.D.25282._M_end_of_storage;
  _4 = _3 - __old_start_11;
  if (__old_start_11 != 0B)
goto ; [53.47%]
  else
goto ; [46.53%]

   [local count: 574129752]:
  _22 = (long unsigned int) _4;
  operator delete (__old_start_11, _22);

   [local count: 1073741824]:
  this_8(D)->D.25975._M_impl.D.25282._M_start = iftmp.5_15;
  this_8(D)->D.25975._M_impl.D.25282._M_finish = _26;
  _5 = _10 * 8;
  _6 = iftmp.5_15 + _5;
  this_8(D)->D.25975._M_impl.D.25282._M_end_of_storage = _6;
  return;

}

First, it is not clear to me why we need the memmove at all.

So first issue is:
   [local count: 1073741824]:
  __position_27 = MEM[(struct __normal_iterator *)&__position];
  _10 = std::vector::_M_check_len (this_8(D), 1,
"vector::_M_realloc_insert");
  __old_start_11 = this_8(D)->D.25975._M_impl.D.25282._M_start;
  __old_finish_12 = this_8(D)->D.25975._M_impl.D.25282._M_finish;
  _13 = __position_27 - __old_start_11;
  if (_10 != 0)
goto ; [54.67%]
  else
goto ; [45.33%]

Without inlining _M_check_len early we cannot work out the return value range,
since we need to know that parameter 2 is 1 and not 0.
Adding a __builtin_unreachable check afterwards helps to reduce

if (_10 != 0)

but I need to do something about the inliner accounting the conditional toward
the function body size.

   [local count: 1073741824]:
  # iftmp.5_15 = PHI <0B(2), _17(3)>
  _1 = (long unsigned int) _13;
  _2 = iftmp.5_15 + _1;
  *_2 = *__args#0_14(D);
  if (_13 > 0)
goto ; [41.48%]
  else
goto ; [58.52%]

   [local count: 445388112]:
  __builtin_memmove (iftmp.5_15, __old_start_11, _1);

Is this code about inserting a value into the middle?  Since push_back always
initializes the iterator to point to the end, this seems quite silly to do.
Can't we do something like _M_realloc_append?
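The point can be sketched as follows: because push_back always appends at the end, only the old elements need copying, and the tail memmove/memcpy pair that _M_realloc_insert emits for mid-vector insertion disappears. This is a simplified, int-only illustration of a hypothetical _M_realloc_append (growth policy and error handling reduced to the bare minimum, not the real libstdc++ code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical _M_realloc_append, specialized to int and ignoring
// exception safety and allocation failure: the new element always goes
// at the end, so a single prefix memcpy suffices.
struct vec {
  int *start, *finish, *end_of_storage;

  void realloc_append(int value) {
    std::size_t old_len = finish - start;
    std::size_t new_cap = old_len ? 2 * old_len : 1;  // growth policy sketch
    int *p = static_cast<int *>(std::malloc(new_cap * sizeof(int)));
    if (old_len)
      std::memcpy(p, start, old_len * sizeof(int));   // prefix only, no tail copy
    p[old_len] = value;                               // append at the end
    std::free(start);
    start = p;
    finish = p + old_len + 1;
    end_of_storage = p + new_cap;
  }
};
```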

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #21 from Jan Hubicka  ---
Patch
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637265.html
gets us closer to inlining _M_realloc_insert at -O3 (3 insns away)

Patch
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636935.html
reduces the expense when _M_realloc_insert is not inlined at -O2 (where I think
we should not inline it, unlike for clang)

[Bug libstdc++/110287] _M_check_len is expensive

2023-11-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #8 from Jan Hubicka  ---
With return value range propagation
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637265.html
reduces --param max-inline-insns-auto needed for _M_realloc_insert to be
inlined on my testcase from 39 to 35.

This is done by eliminating two unnecessary throw calls by propagating the fact
that check_len does not return incredibly large values.

Default inline limit at -O3 is 30, so we are not that far and I think we really
ought to solve this for next release since push_back is such a common case.

Is it known that check_len cannot return 0 in this situation? Adding
if (ret <= 0)
  __builtin_unreachable ();
saves another 2 instructions, because _M_realloc_insert otherwise contains a
code path for the case that the vector gets increased to 0 elements.
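The generic form of that hint, outside the _M_check_len context, looks like this (the function name is made up for illustration; __builtin_unreachable is a GCC/clang builtin):

```cpp
#include <cstddef>

// A __builtin_unreachable() guarding an impossible range tells the
// optimizer that range never occurs, so it can delete the code paths
// in callers that would handle it (here: the n == 0 case).
static std::size_t bytes_needed(std::size_t n)
{
  if (n == 0)
    __builtin_unreachable();   // promise to the optimizer: n is never 0
  return n * sizeof(int);      // result is now known to be >= sizeof(int)
}
```

The cost is that the promise must actually hold: passing 0 here is undefined behavior.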

[Bug tree-optimization/112618] New: internal compiler error: in expand_MASK_CALL, at internal-fn.cc:4529

2023-11-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112618

Bug ID: 112618
   Summary: internal compiler error: in expand_MASK_CALL, at
internal-fn.cc:4529
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

jh@ryzen4:~/gcc/build4/stage1-gcc> cat b.c
/* PR tree-optimization/106433 */

int m, *p;

__attribute__ ((simd)) int
bar (int x)
{
  if (x)
{
  if (m < 1)
for (m = 0; m < 1; ++m)
  ++x;
  p = 
  for (;;)
++m;
}
  return 0;
}

__attribute__ ((simd)) int
foo (int x)
{
 bar (x);
 return 0;
}
jh@ryzen4:~/gcc/build4/stage1-gcc> ./xgcc -B ./ -O2 b.c -fno-tree-vrp
during RTL pass: expand
b.c: In function ‘foo.simdclone.3’:
b.c:23:2: internal compiler error: in expand_MASK_CALL, at internal-fn.cc:5013
   23 |  bar (x);
  |  ^~~
0x12db307 expand_MASK_CALL(internal_fn, gcall*)
../../gcc/internal-fn.cc:5013
0x12daa47 expand_internal_call(internal_fn, gcall*)
../../gcc/internal-fn.cc:4920
0x12daa72 expand_internal_call(gcall*)
../../gcc/internal-fn.cc:4928
0xf7637e expand_call_stmt
../../gcc/cfgexpand.cc:2737
0xf7a5a8 expand_gimple_stmt_1
../../gcc/cfgexpand.cc:3880
0xf7ac2c expand_gimple_stmt
../../gcc/cfgexpand.cc:4044
0xf82d6f expand_gimple_basic_block
../../gcc/cfgexpand.cc:6100
0xf85322 execute
../../gcc/cfgexpand.cc:6835
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

[Bug tree-optimization/110641] [14 Regression] ICE in adjust_loop_info_after_peeling, at tree-ssa-loop-ivcanon.cc:1023 since r14-2230-g7e904d6c7f2

2023-11-06 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110641

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #3 from Jan Hubicka  ---
mine.

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-11-02 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #13 from Jan Hubicka  ---
So I re-tested it with current mainline and clang 16/17

For mainline I get (megapixels per second, bigger is better):
13.39
13.38
13.42
clang 16:
20.06
20.06
19.87
clang 17:
19.7
19.68
19.69


mainline with Martin's patch to enable SRA across calls where the parameter
doesn't escape (improvement for PR109849) I get:
19.37
19.35
19.31
this is without inlining _M_realloc_insert, which we do at -O3 but don't at
-O2 since it is large (clang inlines it at both -O2 and -O3).

[Bug ipa/59948] Optimize std::function

2023-09-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59948

--- Comment #8 from Jan Hubicka  ---
Trunk optimizes this to return 0, but fails to optimize out functions which
become unused after indirect inlining.
With -fno-early-inlining we end up with:

int m ()
{
  void * D.48296;
  int __args#0;
  struct function h;
  int _12;
  bool (*) (union _Any_data & {ref-all}, const union _Any_data &
{ref-all}, _Manager_operation) _24;
  bool (*) (union _Any_data & {ref-all}, const union _Any_data &
{ref-all}, _Manager_operation) _27;
  long unsigned int _29;
  long unsigned int _35;
  vector(2) long unsigned int _37;
  void * _42;

   [local count: 1073741824]:
  _29 = (long unsigned int) _M_invoke;
  _35 = (long unsigned int) _M_manager;
  _37 = {_35, _29};
  h ={v} {CLOBBER};
  MEM  [(struct _Function_base *) + 8B] = {};
  MEM[(int (*) (int) *)] = f;
  MEM  [(void *) + 16B] = _37;
  __args#0 = 1;
  _12 = std::_Function_handler::_M_invoke
(_M_functor, &__args#0);

   [local count: 1073312329]:
  __args#0 ={v} {CLOBBER(eol)};
  _24 = MEM[(struct _Function_base *)]._M_manager;
  if (_24 != 0B)
goto ; [70.00%]
  else
goto ; [30.00%]

   [local count: 751318634]:
  _24 ([(struct _Function_base *)]._M_functor, [(struct
_Function_base *)]._M_functor, 3);

   [local count: 1073312329]:
  h ={v} {CLOBBER};
  h ={v} {CLOBBER(eol)};
  return _12;

   [count: 0]:
:
  _27 = MEM[(struct _Function_base *)]._M_manager;
  if (_27 != 0B)
goto ; [0.00%]
  else
goto ; [0.00%]

   [count: 0]:
  _27 ([(struct _Function_base *)]._M_functor, [(struct
_Function_base *)]._M_functor, 3);

   [count: 0]:
  h ={v} {CLOBBER};
  _42 = __builtin_eh_pointer (2);
  __builtin_unwind_resume (_42);

}

ipa-prop fails to track the pointer passed around:

IPA function summary for int m()/288 inlinable
  global time: 41.256800
  self size:   16
  global size: 41
  min size:   38
  self stack:  32
  global stack:32
size:19.00, time:8.66
size:3.00, time:2.00,  executed if:(not inlined)
  calls:
std::function::~function()/286 inlined
  freq:0.00
  Stack frame offset 32, callee self size 0
  std::_Function_base::~_Function_base()/71 inlined
freq:0.00
Stack frame offset 32, callee self size 0
indirect call loop depth: 0 freq:0.00 size: 6 time: 18
std::function::~function()/404 inlined
  freq:1.00
  Stack frame offset 32, callee self size 0
  std::_Function_base::~_Function_base()/405 inlined
freq:1.00
Stack frame offset 32, callee self size 0
indirect call loop depth: 0 freq:0.70 size: 6 time: 18
_Res std::function<_Res(_ArgTypes ...)>::operator()(_ArgTypes ...) const
[with _Res = int; _ArgTypes = {int}]/304 inlined
  freq:1.00
  Stack frame offset 32, callee self size 0
  void std::__throw_bad_function_call()/374 function body not available
freq:0.00 loop depth: 0 size: 1 time: 10
  _M_empty.isra/384 inlined 
freq:1.00
Stack frame offset 32, callee self size 0
  indirect call loop depth: 0 freq:1.00 size: 6 time: 18
std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor =
int (&)(int); _Constraints = void; _Res = int; _ArgTypes = {int}]/302 inlined 
  freq:1.00
  Stack frame offset 32, callee self size 0
  std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor =
int (&)(int); _Constraints = void; _Res = int; _ArgTypes = {int}]/375 inlined
freq:0.33
Stack frame offset 32, callee self size 0
static void
std::_Function_base::_Base_manager<_Functor>::_M_init_functor(std::_Any_data&,
_Fn&&) [with _Fn = int (&)(int); _Functor = int (*)(int)]/310 inlined
  freq:0.33
  Stack frame offset 32, callee self size 0
  _M_create.isra/383 inlined
freq:0.33
Stack frame offset 32, callee self size 0
void* std::_Any_data::_M_access()/388 inlined
  freq:0.33
  Stack frame offset 32, callee self size 0
operator new.isra/386 inlined
  freq:0.33
  Stack frame offset 32, callee self size 0
  static bool
std::_Function_base::_Base_manager<_Functor>::_M_not_empty_function(_Tp*) [with
_Tp = int(int); _Functor = int (*)(int)]/308 inlined
freq:1.00
Stack frame offset 32, callee self size 0
  constexpr std::_Function_base::_Function_base()/299 inlined
freq:1.00
Stack frame offset 32, callee self size 0

[Bug middle-end/111573] New: lambda functions often not inlined and optimized out

2023-09-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111573

Bug ID: 111573
   Summary: lambda functions often not inlined and optimized out
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

#include <functional>
using namespace std;
static int dosum(std::function<int(int)> fn)
{
   return fn(5,6);
}
int test()
{
  auto sum = [](int a, int b) {
return a + b;
  };
  int s;
  for (s = 0; s < 10; s++)
s+= dosum(sum);
  return s;
}
This gets optimized well only with early inlining; compiling with -fno-early-inlining
yields:

_Z4testv:
.LFB2166:
.cfi_startproc
subq$56, %rsp
.cfi_def_cfa_offset 64
xorl%ecx, %ecx
.p2align 4,,10
.p2align 3
.L8:
leaq12(%rsp), %rdx
leaq8(%rsp), %rsi
movl$5, 8(%rsp)
leaq16(%rsp), %rdi
movl$6, 12(%rsp)
call   
_ZNSt17_Function_handlerIFiiiEZ4testvEUliiE_E9_M_invokeERKSt9_Any_dataOiS6_
leal1(%rcx,%rax), %ecx
cmpl$9, %ecx
jle .L8
movl%ecx, %eax
addq$56, %rsp
.cfi_def_cfa_offset 8
ret

So we fail to inline since ipa-prop fails to track the constant function
address.  I think this is really common in typical lambda function usage.
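For comparison, when the callable is a template parameter instead of a type-erased std::function, the call is direct and no ipa-prop tracking of a function pointer through _Any_data is needed for inlining. A sketch of the same testcase in that style:

```cpp
// Same test as above, but with the callable as a template parameter:
// dosum's call to fn is direct, so inlining does not depend on
// interprocedural tracking of a stored function pointer.
template <typename F>
static int dosum(F fn) { return fn(5, 6); }

int test()
{
  auto sum = [](int a, int b) { return a + b; };
  int s;
  for (s = 0; s < 10; s++)
    s += dosum(sum);
  return s;
}
```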

[Bug middle-end/111552] New: 549.fotonik3d_r regression with -O2 -flto -march=native on zen between g:85d613da341b7630 (2022-06-21 15:51) and g:ecd11acacd6be57a (2022-07-01 16:07)

2023-09-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111552

Bug ID: 111552
   Summary: 549.fotonik3d_r regression with -O2 -flto
-march=native on zen between g:85d613da341b7630
(2022-06-21 15:51) and g:ecd11acacd6be57a (2022-07-01
16:07)
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=297.527.0=296.527.0;

[Bug middle-end/111551] New: Fix for PR106081 is not working with profile feedback on imagemagick

2023-09-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111551

Bug ID: 111551
   Summary: Fix for PR106081 is not working with profile feedback
on imagemagick
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

As seen in
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=471.507.0=473.507.0=475.507.0=477.507.0;
Fix for PR106081 improved imagemagick significantly without FDO but not with
FDO.

[Bug tree-optimization/111498] New: 951% profile quality regression between g:93996cfb308ffc63 (2023-09-18 03:40) and g:95d2ce05fb32e663 (2023-09-19 03:22)

2023-09-20 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111498

Bug ID: 111498
   Summary: 951% profile quality regression between
g:93996cfb308ffc63 (2023-09-18 03:40) and
g:95d2ce05fb32e663 (2023-09-19 03:22)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

This is seen here on tramp3d -fprofile-use -fprofile-report
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=463.976.7

Looking at the patches in the range, it may be:

commit d45ddc2c04e471d0dcee016b6edacc00b8341b16
Author: Richard Biener 
Date:   Thu Sep 14 13:06:51 2023 +0200

tree-optimization/111294 - backwards threader PHI costing

[Bug middle-end/110973] 9% 444.namd regression between g:c2a447d840476dbd (2023-08-03 18:47) and g:73da34a538ddc2ad (2023-08-09 20:17)

2023-08-29 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110973

--- Comment #5 from Jan Hubicka  ---
Note that some (not all?) namd scores seem to be back to pre-regression
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=798.120.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=791.120.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=299.120.0
between 2a0b19f52596d75b (2023-08-07 00:16) and b0894a12e9e04dea (2023-08-10
13:29)

[Bug ipa/111157] [14 Regression] 416.gamess fails with a run-time abort when compiled with -O2 -flto after r14-3226-gd073e2d75d9ed4

2023-08-26 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111157

--- Comment #4 from Jan Hubicka  ---
So here ipa-modref declares the field dead, while ipa-prop determines its value
even though it is unused, and makes it used later?

I think a dead argument is probably better than optimizing out one store, so I
think ipa-prop is the one to fix; however, the question is how to detect this reliably.

ipa-modref has update_signature, which updates summaries after ipa-sra's work, so
it is also the place to erase the info about the parameter being dead from the summary.
Another option would be to ask ipa-modref from FRE when considering propagation
of a known value.
