[Bug ipa/96059] ICE: in remove_unreachable_nodes, at ipa.c:575 with -fdevirtualize-at-ltrans

2024-05-15 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96059

--- Comment #7 from Jan Hubicka  ---
> Actually, let me drop the PR59859 blocker, as IIRC we've had reports of this
> downstream w/o graphite.
I think you edited the wrong PR here.

[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type

2024-05-15 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

--- Comment #7 from Jan Hubicka  ---
> and then we inline them back, introducing the extra copy.  Why do we use
> tail-calls here instead of aliases?  Why do we lack cost modeling here?
Because the function is exported and we must keep the addresses different.
Cost modeling is somewhat hard here, since it is not clear what will
help the inliner and whether the inliner will inline things back.

For example, if we ICF two functions calling a third function, the third
one may become called once, and we then get better code by inlining it and
eliding the offline copy rather than by keeping the caller duplicated.

So the idea is to get the code into as canonical a form as possible (with
no obvious duplicates) and then let the inliner heuristics decide what
should and what should not be duplicated in the final code.
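As a hedged illustration (hypothetical names, not from the PR), the canonicalization effect can be seen on a pair of identical callers:

```cpp
// Hypothetical sketch: f1 and f2 have identical bodies, so ICF can fold
// them into one.  After folding, helper effectively has a single caller
// left, so the inliner may inline it and elide the offline copy.
static int helper (int x)
{
  return x * x + 1;
}

int f1 (int a) { return helper (a) + 2; }
int f2 (int a) { return helper (a) + 2; }
```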

[Bug libstdc++/109442] Dead local copy of std::vector not removed from function

2024-05-14 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109442

--- Comment #21 from Jan Hubicka  ---
This patch attempts to add __builtin_operator_new/delete. So far they
are not optimized; that will need to be done by an extra flag on the
BUILT_IN_ code.  Also, the decl.cc code can be refactored to be less of
a cut-and-paste, and I guess the has_builtin hack to return the proper
value needs to be moved to the C++ FE.

However, the immediate problem I ran into is that the libstdc++
testsuite fails due to the lack of std::nothrow overloads.  I wonder how
to get that working?

[Bug tree-optimization/114959] incorrect TBAA for derived types involving function types

2024-05-07 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114959

--- Comment #4 from Jan Hubicka  ---
> 
> I think function types are somewhat special in that they do not denote
> objects in the classical sense.  They are also most complex and probably
> target-dependent to handle.
> 
> Note there's LTO where we glob all pointers to a single equivalence class
> because of Fortran where C_PTR inter-operates with all pointer types.  But

We special-case void * to alias with all other pointer types and look
through pointers in alias.cc, so accesses through pointers are not
necessarily fully globbed.

I plan to look into unglobbing pointers when Fortran's C_PTR cannot
clash, since this is a relatively important piece of TBAA information.
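A minimal sketch of what the special-casing means in practice (hypothetical example, not from alias.cc):

```cpp
// Because void * is special-cased to alias all other pointer types, the
// store through *vp below must be assumed to possibly clobber *ip when
// the two slots can overlap, so the compiler has to reload *ip.
static int value = 42;

int f (void **vp, int **ip)
{
  *ip = &value;          // store through int **
  *vp = nullptr;         // store through void ** - may alias *ip
  return *ip == &value;  // must be reloaded after the void ** store
}
```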

Honza

[Bug tree-optimization/114774] Missed DSE in simple code due to interleaving stores

2024-04-19 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774

--- Comment #5 from Jan Hubicka  ---
> > Looking into it: instead of having a simple outer loop, it needs to
> > maintain a worklist of defs to process, each annotated with a live bitmap,
> > right?
> 
> Yeah, I have some patch on some branch somewhere ... IIRC it was broken
> and miscompiled things and I got distracted ...
> 
> I will eventually get back to DSE for stage1 also because of some other
> PRs.

Great, thanks a lot!  I think next stage1 I can try to look more into
libstdc++-related performance issues - it kind of surprises me how many
things get noticed on push_back...

[Bug tree-optimization/114774] Missed DSE in simple code due to interleaving stores

2024-04-19 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774

--- Comment #3 from Jan Hubicka  ---
> Yes, DSE walking doesn't "branch" but goes to some length handling some 
> trivial
> branches only.  Mainly to avoid compile-time issues.  It needs larger
> re-structuring to fix that, but in principle it shouldn't be difficult. 

Looking into it: instead of having a simple outer loop, it needs to
maintain a worklist of defs to process, each annotated with a live bitmap,
right?

With Martin we noticed it while looking into push_back codegen.  When an
aggregate is passed to a function call but does not escape, we now SRA
it, removing all uses and the corresponding clobbers, but we do not remove
the stores (Martin was expecting DSE to do it).  As a result the store
stays around, which causes a partial memory stall (storing in pieces,
loading as a whole).  It seems easy to remove this specific store during
SRA, but improving DSE here seems more desirable.
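A hedged reduction of the shape described above (hypothetical names; the real case comes from std::vector::push_back):

```cpp
// The aggregate is stored field by field and passed to a call from which
// it does not escape.  Once IPA-SRA rewrites use() to take the fields
// directly, the stores to tmp become dead, but DSE currently leaves
// them, causing the store-forwarding stall described above.
struct vec { long begin, end, cap; };

static long use (vec v)  // v does not escape
{
  return v.end - v.begin;
}

long f (long b, long e)
{
  vec tmp;
  tmp.begin = b;  // piecewise stores that should become dead...
  tmp.end = e;
  tmp.cap = e;
  return use (tmp);  // ...once SRA passes the fields directly
}
```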

[Bug ipa/114703] Missed devirtualization in rather simple case

2024-04-15 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114703

--- Comment #3 from Jan Hubicka  ---
> Yep, 'new' memory escapes.
Yep, this is blocking a lot of propagation in common C++ code.
Here it may help to do speculative devirtualization during the IPA stage
and let late optimization get rid of the speculation (since after
inlining we will know that the virtual call does not overwrite the
virtual table pointer).  This is technically not too hard to add.  We
can optimistically rule out (some) may-aliases while walking the alias
oracle.   I will take a look next stage1.
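A hedged sketch of the kind of call site this targets (hypothetical class):

```cpp
// Memory returned by `new` escapes, so the compiler cannot prove the
// vptr of *p is unchanged at the call.  Instead of devirtualizing
// directly, speculative devirtualization would emit roughly
//   if (vptr of *p == vtable of A) { inlined A::foo } else p->foo ();
// and late optimization can drop the guard once inlining shows the
// vptr cannot change.
struct A
{
  virtual int foo () { return 1; }
  virtual ~A () {}
};

int test ()
{
  A *p = new A;
  int r = p->foo ();  // candidate for speculative devirtualization
  delete p;
  return r;
}
```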

[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41

2024-04-09 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #76 from Jan Hubicka  ---
There is still a problem with loop bounds.  I am testing a patch for that
and then we should be (finally) safe.

[Bug gcov-profile/114115] xz-utils segfaults when built with -fprofile-generate (bad interaction between IFUNC and binding?)

2024-04-03 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114115

--- Comment #15 from Jan Hubicka  ---
> Fixed for GCC 14 so far
It is a simple patch, so backporting is OK after a week on mainline.

[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41

2024-04-02 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #70 from Jan Hubicka  ---
Hello,
over Easter I did some analysis of the cases where ICF is now disabled
due to a jump function miscompare.  The most common case (seen also in
GCC itself) is the situation where a function is originally static inline
and in some units its callee is known, which enables us to propagate the
return value range.  It happens in the wide-int implementation, but it is
not that important.

Another case is when a function takes the address of a local label; we
ICF two functions taking addresses of local labels and passing them to a
callee (maybe we should not, but the C standard says nothing).

I thought there was also a case where a jump function carries other
function-local stuff, but that does not happen, since we avoid putting
such values into the jump function tables (to avoid the need to stream
function-local BLOCKs).

The last case is that the function takes as a parameter the address of a
static symbol that can be merged.  We currently don't merge those, since
we lack any logic tracking whether the address is eventually used in a
comparison.
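A hedged sketch of that last case (hypothetical names):

```cpp
// f1 and f2 pass addresses of two identical static constants to the
// same callee.  Merging s1 with s2 (and then f1 with f2) is only safe
// if no caller relies on &s1 != &s2, which is the missing tracking
// mentioned above.
static const int s1 = 7;
static const int s2 = 7;

static int get (const int *p) { return *p; }

int f1 (void) { return get (&s1); }
int f2 (void) { return get (&s2); }
```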

I think for the release branches and trunk Martin's patch is fine.  For
next stage1 we will need to work on using ICF's compare_op in the
ipa-prop code and also add merging of value ranges.

Honza

[Bug tree-optimization/112303] [14 Regression] ICE on valid code at -O3 on x86_64-linux-gnu: verify_flow_info failed since r14-3459-g0c78240fd7d519

2024-03-27 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112303

--- Comment #14 from Jan Hubicka  ---
> This patch fixes the ICE for me.
> Seems we already did something like that in other spots (e.g. in apply_scale).

In general, if the overflow happens, some pass must have misbehaved and
done something crazy when updating the profile.  But indeed we probably
ought to cap here instead of randomly reading uninitialized values.  It
may make sense to make these checking-only ICEs.   I will look into why
the overflow happens.

Honza

[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41

2024-03-19 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #64 from Jan Hubicka  ---
> Are you going to apply this patch, even if it just helps partially with some
> tests and not others?
I think we should fix this completely, since it is a source of very
surprising bugs.  I discussed it with Martin (since he is maintaining the
jump functions) and he will add a comparator for them, so we can plug this
bug completely.  If we have an operator== on jump functions, we can use it
at the spot where I am comparing value ranges.

There is also a need to compare loop structures.   It seems I have a
testcase now too (triggering peeling on a miscomputed upper bound on the
iteration count).  I will attach it after cleaning it up.

[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41

2024-03-13 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #57 from Jan Hubicka  ---
> So, we can punt on differences there (that is desirable for backporting and
> maybe GCC 14 too), or we could at that point populate an int vector, which 
> maps
Yep, that is what I do.
I had a bug in that, so I am re-running (I forgot to check that the
caller and callee argument counts match, and this caused an ICE during
the LLVM LTO link).  It seems these extra checks make no difference in
practice.
During bootstrap there are no pairs of functions where the new checks
punt on a value range or jump function difference that would otherwise
be merged.

The most common cases where we could merge but don't are those triggered
by TBAA.
> the callee
> vector indexes to indexes in the callee vector in the other candidate 
> function.
> If unsuccessful, we just free the vector, if successful, we first walk all the
> callees and union stuff in there using that vector.
This is the plan for metadata merging.  A small complication here is that
ICF works by comparing bodies to a leader of the equivalence class, but
this leader is not necessarily the surviving function body.  So if we
compared A to L (the leader) and B to L and then decided to replace A by
B, we need to be able to combine the permutations so we know how to map
call sites in A to ones in B.  The same is true of SSA names and basic
blocks.  I have a patch for that for next stage1.
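The permutation combination can be sketched as follows (hypothetical helper, not the actual ICF code):

```cpp
// a_to_l[i] gives the index in leader L of A's i-th call site, and
// b_to_l does the same for B.  Composing a_to_l with the inverse of
// b_to_l maps A's call sites directly to B's.
void compose_maps (const int *a_to_l, const int *b_to_l,
                   int *a_to_b, int n)
{
  int l_to_b[64];  // inverse of b_to_l; assumes n <= 64 for the sketch
  for (int i = 0; i < n; i++)
    l_to_b[b_to_l[i]] = i;
  for (int i = 0; i < n; i++)
    a_to_b[i] = l_to_b[a_to_l[i]];
}
```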

[Bug ipa/114317] Missing optimization for multiple condition statements

2024-03-12 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114317

--- Comment #2 from Jan Hubicka  ---
> (it would need to elide the stores of course).

We do have a way to elide stores, since we can optimize out write-only
values.  What we do not have readily available is the value written
through a reference (ipa-refs only connects variables/functions to
statements that write/load/take the address).

I think the plan is to add a summary template for references, like we have
for callgraph edges and symbols.  Then we can keep track of values and
do optimizations like this.  It has been on my todo list for quite some
time ;)

[Bug ipa/114262] Over-inlining when optimizing for size with gnu_inline function

2024-03-07 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114262

--- Comment #6 from Jan Hubicka  ---
> Note GCC has not re-tuned its -Os heuristics for a long time because it has
> been decent enough for most folks and corner cases like this almost never
> come up.
There were quite a few changes to -Os heuristics :)
One of the bigger challenges is that we see more and more C++ code built
with -Os which relies on certain functions being inlined and optimized
in context, so we had to get more optimistic, in the hope that inlined
code will optimize well.

COMDAT functions are more likely to be inlined because statistics show
that many of them are not really shared between translation units
(see the -param=comdat-sharing-probability parameter).  This was necessary
to get reasonable code for Firefox approximately 15 years ago.

[Bug lto/114241] False-positive -Wodr warning when using -flto and -fno-semantic-interposition

2024-03-06 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114241

--- Comment #2 from Jan Hubicka  ---
This indeed looks like a bug caused by the fact that the class is keyed
to one of the two units.

Outputting translation unit names is unfortunately hard, since they are
object files, often coming from a .a archive, where we know only the
offset but not the original file name.

I had a patch which passed through the original filename attached to the
TRANSLATION_UNIT_DECL, but it also had problems, e.g. printing relative
filenames from the wrong directory.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #26 from Jan Hubicka  ---
> I think optimize_function_for_size_p (cfun) isn't always true if
> optimize_size is since it looks at the function-specific setting
> of that flag, so you'd have to use opt_for_fn (cfun, optimize_size).

When we initialize optabs, we do it for a given optimization_node
setting, so opt_for_fn (cfun, optimize_size) should be equal to
optimize_size.

Honza

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #21 from Jan Hubicka  ---
Looking at the prototype patch, why do we also need to change the splitters?

My original goal was to use splitters to expand to faster code sequences
while having patterns necessary for both variants.  This makes it
possible to use optimize_insn_for_size/speed and make decisions using the
BB profile, since we will not ICE if the hotness of the BB changes later.
[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #18 from Jan Hubicka  ---
optimize_function_for_size_p is not really affected by LTO vs. non-LTO.
It does take into account node->count and node->frequency, which are
updated during IPA, so it may change between early opts and late opts.

I do not think we change it between the vectorizer and RTL (or during
RTL), and we can add this to a verifier if it is practical to rely on it
(I can see that the vectorizer makes instruction selection choices).

But the problem here is more that optab initialization happens only when
the optimization_node changes, and not when we switch from a hot function
to a cold one?

[Bug tree-optimization/114052] [11/12/13/14 Regression] Wrong code at -O2 for well-defined infinite loop

2024-02-22 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114052

--- Comment #7 from Jan Hubicka  ---
> I see it doesn't do anything if mark_dfs_back_edges returns false, so it
> will claim the function is finite even when it calls a non-finite function?
> So I assume this is local analysis only and call edges will be handled
> elsewhere?

Yes, the side effect flags are transitively closed over the callgraph at
the IPA propagation stage.
> 
> I found another similar compute, fill_always_executed_in in LIM
> (that considers all non-PURE calls as eventually looping, relying
> solely on gimple_has_side_effects here).
There are stmts with side effects that are always finite.
There are stmt_may_terminate_bb_p and stmt_may_terminate_function_p,
which do more careful checking (and most of it should be unified, I
guess).

I also had patches (never pushed) to track whether a given callgraph edge
is always executed.  This is useful, for example, to bypass the inliner
heuristics' bounds on stack frame size.  It is also useful for profile
propagation.

So having this analysis factored out would be useful here, too.
> 
> In the end I want to have this on a stmt granularity (stmts before
> calls are OK, stmts after not).
Yep, if you have info on whether a BB is always executed, you still need
to walk its statements.
> 
> I guess greedily walking loop blocks along edges or walking in RPO oder
> and tracking whether there's no possible exit on the way to X would be
> the way to go.

That is what I would do.  Maybe it would be nice to see through loops
known to be finite; I guess if you check that a loop contains no stmts
that can terminate execution, then one can consider it safe.
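A hedged sketch of the statement-granularity point above (hypothetical maybe_exit):

```cpp
#include <cstdlib>

// maybe_exit may terminate the program, so within the loop body only
// the statements before the call are always executed once the block is
// entered; the accumulation after it is not.
static void maybe_exit (int v)
{
  if (v < 0)
    std::exit (1);
}

int sum (const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    {
      maybe_exit (a[i]);  // can terminate execution
      s += a[i];          // executed only if the call returned
    }
  return s;
}
```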

[Bug ipa/111960] [14 Regression] ICE: during GIMPLE pass: rebuild_frequencies: SIGSEGV (Invalid read of size 4) with -fdump-tree-rebuild_frequencies-all

2024-02-22 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111960

--- Comment #13 from Jan Hubicka  ---
> Should be fixed now.
Thanks! I was testing with the stage3 compiler, so that is the reason.
Indeed, dropping the buffer is a good idea.

[Bug middle-end/113907] [12/13/14 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41

2024-02-16 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #45 from Jan Hubicka  ---
> > "Once legacy evrp is removed, this won't be an issue, as ranges in the IL 
> > will tell the truth.  However, this will mean that we will no longer 
> > remove the first __builtin_unreachable combo.  But ISTM, that would be 
> > correct behavior ??."
> ...
> 
> But that doesn't cause these problems, just perhaps losing some info, right?
> If so, trying to change that feels like stage1 material to me.

Yep. It would be nice not to forget about that - it kept me confused for
over an hour :)

[Bug middle-end/113907] [12/13/14 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41

2024-02-16 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #43 from Jan Hubicka  ---
> // See discussion here:
> // https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571709.html
Discussion says:

"Once legacy evrp is removed, this won't be an issue, as ranges in the IL 
will tell the truth.  However, this will mean that we will no longer 
remove the first __builtin_unreachable combo.  But ISTM, that would be 
correct behavior ??."

So perhaps we could remove that special case for default defs and PHIs?
It is an odd thing and we clearly lose info here.
The problem is that the value of parameter i itself does not have the
global value range [0...10], so I need to compute a new SSA name to get it
preserved through ipa-split.  Maybe ipa-split can be extended to fire up
the ranger and try to get better value ranges on function parameters from
the split function header.  Not sure that is worth the effort though.
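A minimal sketch of the situation (hypothetical function):

```cpp
// The unreachable combo gives i the range [0, 10], but that range lives
// on an SSA name inside the body, not on the parameter itself, so
// splitting the function can lose it unless a new SSA name is computed
// in the split function header.
int g (int i)
{
  if (i < 0 || i > 10)
    __builtin_unreachable ();
  return i + 1;
}
```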


If we go with merging functions with different ranges, we indeed need to
update the ranges and bits both in the SSA_NAME infos and in ipa-prop's
jump functions.  At the time sem_function::merge calls ipa_merge_profiles
we do have both function bodies in memory and thus we can do the job.

If we just drop the info instead of merging, we do limited harm to
SSA_NAME infos, since they will mostly be recomputed again.  For ipa-prop
this may cause a more interesting precision loss.

So perhaps for backportability we may want to just limit ICF to
functions with the same ranges in the SSA_NAME infos.  Let me cook up a
patch and see if there is a significant loss in merged functions.  I
think it should be quite small, given that ranges seem to diverge only
through ipa-split in very specific cases for now (and given how much time
I spent on producing the testcase).

[Bug middle-end/113907] [14 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41

2024-02-15 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #37 from Jan Hubicka  ---
> Also remember we like to have a fix that's easily backportable, and
> that's probably going to be resetting the info.  We can do something
> more fancy for GCC 15

Refusing to merge functions with different value ranges is also easy,
but so is resetting.  I only need to check ipa-prop.
Jakub, I can also take over the patch - and sorry for being slow.  I had
a visitor doing some research over the last two weeks and I am trying to
catch up now.

[Bug middle-end/113907] [14 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41

2024-02-15 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #36 from Jan Hubicka  ---
> > Having a testcase is great. I was just playing with crafting one.
> > I am still concerned about value ranges in ipa-prop's jump functions.
> 
> Maybe my imagination is too limited, but if the ipa-prop's jump functions are
> computed before ICF is performed, then if you call function a which makes some
> assumptions about its arguments and SSA_NAMEs used within the function and
> results in some return range
> and another one which makes different assumptions and results in a different
> range,
> then even if the two functions are merged and either the range info is thrown
> away
> or unioned, I think if some functions call one or the other functions then
> still the original range assumptions still apply, depending on which one
> actually called.

I think the testcase should have functions A1 and A2 calling function B.
We need to arrange for A1 to ICF into A2 and for A2 to have a narrower
value range for the parameter of B visible to ipa-prop.  Since ICF
happens first, ipa-prop will not see A1's value range and will specialize
B for A2's range, which should make it possible to trigger wrong code.
> 
> > Let me see if I can modify the testcase to also trigger problem with value
> > ranges in ipa-prop jump functions.
> > 
> > Not streaming value ranges is an omission on my side (I mistakenly assumed we
> > do stream them).  We ought to stream them, since otherwise we will lose
> > propagated return value ranges in partitioned programs, which is pity.
> 
> Not sure if it is a good idea to start streaming them now in stage4, but sure,
> if we
> stream them (and I think we should mostly have code to be able to stream that

It probably makes sense to postpone VR streaming to stage1.
(Even though it is not hard to do.)  I will add that to my TODO.

> because we can stream the ranges in the jump functions, just unsure about
> points-to stuff),
> then I still think best approach would be to merge regardless of range info,
> but in that case preferrably union instead of throw away.  The only question 
> is
> where to do the merging for the LTO case (where to stream the union of the
> ranges and where to read it in and update for the SSA_NAMEs).

With LTO, ICF streams in all function bodies at WPA for comparison and
keeps the winning one, so it is a matter of extending the comparator to
keep track of the value range differences for all compared pairs.

There is code to merge profiles, so at that place one can also do other
kinds of metadata merging.  ICF needs a lot of love here - value range
merging is just one example.  We also want to merge TBAA metadata (so we
can merge different template instantiations) and other stuff.
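For the TBAA case, a hedged sketch of what could become mergeable (hypothetical template):

```cpp
// store<int> and store<unsigned> compile to identical code, but their
// stores carry different alias sets, so ICF conservatively refuses to
// merge them today; merging the TBAA metadata would allow folding the
// two instantiations into one body.
template <typename T>
void store (T *p, T v)
{
  *p = v;
}

template void store<int> (int *, int);
template void store<unsigned> (unsigned *, unsigned);
```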

At the moment we handle all other metadata conservatively (i.e. do not
attempt to merge, and just refuse merging if it differs), so value ranges
are the first to be handled aggressively with your patch.

I think it is fine, since most value ranges will be recomputed in late
optimization - the only exceptions are the return value ranges for now.

I will try to work on making this more systematic next stage1.

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-14 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #19 from Jan Hubicka  ---
> Note I didn't check if it helps the testcase ..

I will check.
> 
> > > 
> > > A "nicer" solution might be to add a informational operand
> > > to TARGET_MEM_REF, representing the base pointer to be used for
> > > alias/points-to purposes.  But if that's not invariant it might
> > > keep some otherwise unnecessary definition stmts live.
> > 
> > Yep, I see that forcing extra IV to track original semantics would be
> > trouble here.  I think that after iv-opts we should be done with more
> > fancy propagation across loops.
> > 
> > However, to avoid ipa-modref summary degradation, perhaps scheduling the
> > pass before ivopts would make sense...
> 
> We also have that other bug where store-merging breaks the IPA summary
> which gets prevailed in the late modref, so moving it around doesn't
> solve all of the IL related issues ...

I did not mean to paper over the fact that we produce wrong code with
TARGET_MEM_REFs (we ought to fix that).  If ivopts makes it
difficult to track the bases of memory accesses, we may get better
summaries if we build them earlier.  The point of late modref is to
analyze the function body as cleaned up as possible (so we do not get
confused, e.g., by dead memory accesses and other stuff we see before
inlining), but we do not want to lower the representation too much, since
that is generally lossy too.

I will look at the store merging issue.

Honza

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-14 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #17 from Jan Hubicka  ---
> > I guess PTA gets around by tracking points-to set also for non-pointer
> > types and consequently it also gives up on any such addition.
> 
> It does.  But note it does _not_ for POINTER_PLUS where it treats
> the offset operand as non-pointer.
> 
> > I think it is ipa-prop.c::unadjusted_ptr_and_unit_offset. It accepts
> > pointer_plus expression, but does not look through POINTER_PLUS.
> > We can restrict it further, but tracking base pointer is quite useful,
> > so it would be nice to not give up completely.
> 
> It looks like that function might treat that
> 
>  ADDR_EXPR <TARGET_MEM_REF ...>
> 
> as integer_zerop base.  It does
> 
>   if (TREE_CODE (op) == ADDR_EXPR) 
> {
>   poly_int64 extra_offset = 0; 
>   tree base = get_addr_base_and_unit_offset (TREE_OPERAND (op, 0),
>  &extra_offset);
>   if (!base)
> {
>   base = get_base_address (TREE_OPERAND (op, 0));
>   if (TREE_CODE (base) != MEM_REF)
> break;
>   offset_known = false;
> }
>   else
> {
>   if (TREE_CODE (base) != MEM_REF)
> break;
> 
> with a variable offset we fall to the TREE_CODE (base) != MEM_REF
> and will have offset_known == true.  Not sure what it does with
> the result though (it's not the address of a decl).
> 
> This function seems to oddly special-case != MEM_REF ... (maybe
> it wants to hande DECL_P () as finishing?

Hmm, the function was definitely not written with TARGET_MEM_REF in mind,
since it was originally used for IPA passes only.
We basically want to handle stuff like
 &var.foo
or
 &(ptr->foo)
In the second case we want to continue the SSA walk to hopefully work
out the origin of PTR.
ipa-modref then checks whether the base pointer is derived from a function
parameter or points to local or readonly memory to produce its summary.
> 
> Note get_addr_base_and_unit_offset will return NULL for
> a TARGET_MEM_REF <, ..., offset> but TARGET_MEM_REF
> itself if the base isn't an ADDR_EXPR, irrespective of whether
> the offset within it is constant or not.

Hmm, interesting.  I would expect it to interpret the semantics of the
TMR and return the base.
> 
> Not sure if the above is a problem, but it seems the only caller
> will just call points_to_local_or_readonly_memory_p on the
> ADDR_EXPR where refs_local_or_readonly_memory_p via
> points_to_local_or_readonly_memory_p will eventually do
> 
>   /* See if memory location is clearly invalid.  */
>   if (integer_zerop (t))
> return flag_delete_null_pointer_checks;
> 
> and that might be a problem.  As said, we rely on
> ADDR_EXPR <TARGET_MEM_REF ...> to be an address computation
> that's not subject to strict interpretation to allow IVOPTs
> doing this kind of optimization w/o introducing some kind of
> INTEGER_LEA <...>.  I know that's a bit awkward but we should
> make sure this is honored by IPA as well.
> 
> I'd say
> 
> diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
> index 74c9b4e1d1e..45a770cf940 100644
> --- a/gcc/ipa-fnsummary.cc
> +++ b/gcc/ipa-fnsummary.cc
> @@ -2642,7 +2642,8 @@ points_to_local_or_readonly_memory_p (tree t)
> return true;
>return !ptr_deref_may_alias_global_p (t, false);
>  }
> -  if (TREE_CODE (t) == ADDR_EXPR)
> +  if (TREE_CODE (t) == ADDR_EXPR
> +  && TREE_CODE (TREE_OPERAND (t, 0)) != TARGET_MEM_REF)
>  return refs_local_or_readonly_memory_p (TREE_OPERAND (t, 0));
>return false;
>  }
> 
> might eventually work?  Alternatively a bit less aggressive like
> the following.
> 
> diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
> index 74c9b4e1d1e..7c79adf6440 100644
> --- a/gcc/ipa-fnsummary.cc
> +++ b/gcc/ipa-fnsummary.cc
> @@ -2642,7 +2642,9 @@ points_to_local_or_readonly_memory_p (tree t)
> return true;
>return !ptr_deref_may_alias_global_p (t, false);
>  }
> -  if (TREE_CODE (t) == ADDR_EXPR)
> +  if (TREE_CODE (t) == ADDR_EXPR
> +  && (TREE_CODE (TREE_OPERAND (t, 0)) != TARGET_MEM_REF
> + || TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) != 
> INTEGER_CST))
>  return refs_local_or_readonly_memory_p (TREE_OPERAND (t, 0));
>return false;
>  }

Yes, those both look reasonable to me; perhaps the less aggressive one
would be better.
> 
> A "nicer" solution might be to add a informational operand
> to TARGET_MEM_REF, representing the base pointer to be used for
> alias/points-to purposes.  But if that's not invariant it might
> keep some otherwise unnecessary definition stmts live.

Yep, I see that forcing an extra IV to track the original semantics would
be trouble here.  I think that after ivopts we should be done with more
fancy propagation across loops.

However, to avoid ipa-modref summary degradation, perhaps scheduling the
pass before ivopts would make sense...

Thanks,
Honza

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-13 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #15 from Jan Hubicka  ---
> 
> IVOPTs does the above but it does it (or should) as
> 
>   offset = (uintptr)base2 - (uintptr)base1
>   val = *((T *)((uintptr)base1 + i + offset))
> 
> which is OK for points-to as no POINTER_PLUS_EXPR is involved so the
> resulting pointer points to both base1 and base2 (which isn't optimal
> but correct).
> 
> If we somehow get back a POINTER_PLUS that's where things go wrong.
> 
> Doing the above in C code would be valid input so we have to treat
> it correctly (OK, the standard only allows back-and-forth
> pointer-to-integer casts w/o any adjustment, but of course we relax
> this).

OK.  Modref tracks the base pointer of accesses and tries to prove that
they are function parameters.  This should imitate ivopts:
void
__attribute__ ((noinline))
set(int *a, unsigned long off)
{
  *(int *)((unsigned long)a + off) = 1;
}
int
test ()
{
  int a;
  int b = 0;
  set (&a, (unsigned long)&b - (unsigned long)&a);
  return b;
}

Here set gets following gimple at modref2 time:
__attribute__((noinline)) 
void set (int * a, long unsigned int off)
{
  long unsigned int a.0_1;
  long unsigned int _2;
  int * _3; 

  <bb 2> [local count: 1073741824]:
  a.0_1 = (long unsigned int) a_4(D);
  _2 = a.0_1 + off_5(D); 
  _3 = (int *) _2; 
  *_3 = 1; 
  return;

}

This is not pattern-matched, so modref does not think the access has a as
its base:

  stores:
  Base 0: alias set 1
Ref 0: alias set 1
  Every access

While for:

void
__attribute__ ((noinline))
set(int *a, unsigned long off)
{
  *(a+off/sizeof(int))=1;
}

we produce:

__attribute__((noinline))
void set (int * a, long unsigned int off)
{
  sizetype _1;
  int * _2;

  <bb 2> [local count: 1073741824]:
  _1 = off_3(D) & 18446744073709551612;
  _2 = a_4(D) + _1;
  *_2 = 1;
  return;
}

And this is understood:

  stores:
  Base 0: alias set 1
Ref 0: alias set 1
  access: Parm 0

If we consider it correct to optimize out the conversions to and from the
pointer type, then I suppose any addition of a pointer and an integer which
we do not see through means that we need to give up on tracking the base
completely.

I guess PTA gets around this by tracking points-to sets also for non-pointer
types, and consequently it also gives up on any such addition.

But what do we really get from relaxing this?
> 
> IVOPTs then in putting all of the stuff into 'offset' gets at
> trying a TARGET_MEM_REF based on a NULL base but that's invalid.
> We then resort to a LEA (ADDR_EXPR of TARGET_MEM_REF) to compute
> the address which gets us into some phishy argument that it's
> not valid to decompose ADDR_EXPR of TARGET_MEM_REF to
> POINTER_PLUS of the TARGET_MEM_REF base and the offset.  But
> that's how it is (points-to treats (address of) TARGET_MEM_REF
> as pointing to anything ...).
> 
> > A quick fix would be to run IPA modref before ivopts, but I do not see how 
> > such
> > transformation can work with rest of alias analysis (PTA etc)
> 
> It does.  Somewhere IPA modref interprets things wrongly, I didn't figure
> out here though.


I think it is ipa-prop.c::unadjusted_ptr_and_unit_offset.  It accepts
pointer_plus expressions, but does not look through POINTER_PLUS.
We can restrict it further, but tracking the base pointer is quite useful,
so it would be nice to not give up completely.

Honza
> 
> -- 
> You are receiving this mail because:
> You are on the CC list for the bug.

[Bug gcov-profile/113646] PGO hurts run-time of 538.imagick_r as much as 68% at -Ofast -march=native

2024-02-01 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113646

--- Comment #4 from Jan Hubicka  ---
> 
> With -fprofile-partial-training the znver4 LTO vs LTOPGO regression (on a 
> newer
> master) goes down from 66% to 54%.  
> 
> So far I did not find a way to easily train with the reference run (when I add
> "train_with = refrate" to the config, I always get "ERROR: The workload
> specified by train_with MUST be a training workload!")

I do that with a crude hack of simply rewriting the training data files with
the reference versions directly in SPEC.   I believe the problem here must
be that with PGO we somehow confuse the vectorizer.

I did not know there is a train_with option.  Perhaps hacking the spec
driver to not emit the error is easy enough.

[Bug ipa/113665] [11/12/13/14 regression] Regular for Loop results in Endless Loop with -O2 since r11-4987-g602c6cfc79ce4a

2024-01-30 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113665

--- Comment #8 from Jan Hubicka  ---
> Honza - ICF seems to fixup points-to sets when merging variables, so there
> should be a way to kill off flow-sensitive info inside prevailing bodies
> as well.  But would that happen before inlining the body?  Can you work
> on that?  I think comparing ranges would weaken ICF unnecessarily?

AFAIK ICF makes no changes to the winning function body. It basically relies
on the fact that early optimizations are local and thus arrive at the same
solutions for most of the metadata. So the only really easy fix is to make it
match value ranges, too.  I will check how often that fires in practice -
I can only think of split functions diverging, which is probably not
that bad in practice.

IPA-prop and IPA-PTA is only done after ICF.

We indeed discussed the possibility of merging functions whose alias sets
differ by clearing the conflicting sets, which is relatively important in
practice (increasing TBAA precision with LTO slowly degraded ICF effectivity
significantly), but got into the gory details of inventing a representation
which would make the inliner pick the right body (without alias sets
cleared). This never got off the ground (Martin got scared by the details
and I got it onto my ever-overflowing TODO list).

Honza

[Bug gcov-profile/113646] PGO hurts run-time of 538.imagick_r as much as 68% at -Ofast -march=native

2024-01-29 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113646

--- Comment #2 from Jan Hubicka  ---
> Did you try with -fprofile-partial-training (is that default on?  it probably
> should ...).  Can you please try training with the rate data instead of train

It is not on by default - the problem with partial training is that it
mostly nullifies any code size benefit of profile-use, and that is a
relatively noticeable aspect of it in real-world situations (for
GCC itself or Firefox, the overall size of the binary matters).

I need to work on this more, but we now have two-state optimize_size
predicates, and with level 1 we can turn off those -Os optimizations that
trade a lot of performance for size.

Honza
> to rule out a mismatch?

[Bug ipa/113478] -Os does not inline single instruction function

2024-01-19 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113478

--- Comment #4 from Jan Hubicka  ---
> Possibly, at least when we know it doesn't expand to a libatomic call?  OTOH
> even then a function just wrapping such call should probably be inlined,
> so the question is whether the problem that
> is estimated as too big compared to the call calling the function
> (OTOH a1.test () has no arguments while __atomic_load_1 has two).

If we really want to optimize for size, calling a function with one
parameter is shorter than calling a function with two parameters.  The
code size model takes into account whether the offline copy of the function
will disappear, and it also has some bias towards understanding that a
lot of comdat functions are not really shared across units.

The testcase calls the function 15 times, and I guess the wrapper function on
most architectures is shorter than 15 load-zero instruction sequences...

We now have -Os and -Oz and two-level optimize_size predicates. We may
make this less restrictive at the lower size optimization level. But when
optimizing for size, and if __atomic_load were an ordinary function call, I
think the decision is correct.

[Bug ipa/113478] -Os does not inline single instruction function

2024-01-19 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113478

--- Comment #2 from Jan Hubicka  ---
Probably is_inexpensive_builtin_p should return true here?

[Bug c++/109753] [13/14 Regression] pragma GCC target causes std::vector not to compile (always_inline on constructor)

2024-01-11 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109753

--- Comment #14 from Jan Hubicka  ---
> I think the issue might be that whoever is creating
> __static_initialization_and_destruction_0 fails to honor the active
> target pragma.  Which means back to my suggestion to have multiple ones
> when different target options are on the individual CTORs and any of them
> have always-inline (with always-inline we can't rely on an out-of-line copy
> to exist).

If I remember right, __static_initialization_and_destruction_0 calls all
static constructors of priority 0. So it really has no active pragma,
since that may change across the translation unit.

We also have similar code in IPA where we collect constructors
across the whole program.  The motivation was to get them inlined and obtain
better code locality. Back then Firefox had the iostream constructor in every
object file, and those ctors made the whole text segment be loaded on
demand during startup.

Probably the correct solution would be to group constructors by
compatible optimization/target pragmas in the inliner's definition.
A quick hack would be to generate wrapper calls which will "eat" the
always_inline...

[Bug tree-optimization/110852] [14 Regression] ICE: in get_predictor_value, at predict.cc:2695 with -O -fno-tree-fre and __builtin_expect() since r14-2219-geab57b825bcc35

2024-01-04 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110852

--- Comment #14 from Jan Hubicka  ---
> I thought the goal was to handle what is in predict-18.c, i.e.
> b * __builtin_expect (c, 0)
> or similar.  If it is about
> __builtin_expect_with_probability (b, 42, 0.25) *
> __builtin_expect_with_probability (c, 0, 0.42)
> sure, my version will merge the probabilities, while you'll pick the
> probability from
> the 0 case.

The probability from the 0 case is the better estimate, so I think it makes
sense to handle it right.  I did not gather many statistics on how often it
happens, but on my TODO list is to turn this into a value range predictor,
which may have a better chance of success. We can also handle other
constants than INTEGER_CST.

I will see if I can clean up the code a bit more or add a comment, since
it is indeed a bit confusing as written now.  Will also look into more
testcases.

Thanks a lot!
Honza

[Bug tree-optimization/110852] [14 Regression] ICE: in get_predictor_value, at predict.cc:2695 with -O -fno-tree-fre and __builtin_expect() since r14-2219-geab57b825bcc35

2024-01-04 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110852

--- Comment #11 from Jan Hubicka  ---
> > + int p1 = get_predictor_value (*predictor, *probability);
> > + int p2 = get_predictor_value (predictor2, probability2);
> > + /* If both predictors agrees, it does not matter from which
> 
> s/agrees/agree/
> 
> > + Consequently failing to fold both means that we will not suceed
> > determinging
> 
> s/suceed/succeed/;s/determinging/determining/

Fixed that, thanks!
> 
> Otherwise yes, but I think the code could be still simplified the way I had in
> my patch (i.e. drop parts of the r14-2219 changes, and simply assume that
> failed recursion for one operand is PRED_UNCONDITIONAL instead of returning
> early, and not requiring the operands are INTEGER_CSTs, just that the result 
> of
> the binop folds to INTEGER_CST.

I added the early exits to handle the following case.

a = b * c

If b is predicted to 0 with predictor1, while c is predicted to 1 with
predictor2, your version will predict a to be 0, but will merge
predictor1 and predictor2, leading to a lower probability than predictor1
alone.  So the early exit gives a somewhat higher chance of not losing
information.

The code is still lax if both b and c are predicted to 0, in which case
we can work out that the combined probability is at least the max of the two
predictor probabilities, but I was not sure if that is worth the extra
folding overhead.
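
In formulas (a sketch; let $p_1$ and $p_2$ be the probabilities that $b$ and $c$ respectively are 0), the product $a = b \cdot c$ satisfies

```latex
P(a = 0) \;=\; P(b = 0 \,\lor\, c = 0) \;\ge\; \max(p_1, p_2),
```

with equality to $p_1 + p_2 - p_1 p_2$ only when the two predicted events are independent, which the pass cannot assume; the lower bound $\max(p_1, p_2)$ holds unconditionally.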
> 
> -- 
> You are receiving this mail because:
> You are on the CC list for the bug.

[Bug tree-optimization/110852] [14 Regression] ICE: in get_predictor_value, at predict.cc:2695 with -O -fno-tree-fre and __builtin_expect() since r14-2219-geab57b825bcc35

2024-01-04 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110852

--- Comment #9 from Jan Hubicka  ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110852
> 
> --- Comment #7 from Jakub Jelinek  ---
> So, what about following patch (which also fixes the ICE, would of course need
> to add the testcase) and doesn't regress any predict-*.c tests)?
> 
> --- gcc/predict.cc.jj   2024-01-03 11:51:32.0 +0100
> +++ gcc/predict.cc  2024-01-04 16:28:55.041507010 +0100
> @@ -2583,44 +2583,36 @@ expr_expected_value_1 (tree type, tree o
>if (get_gimple_rhs_class (code) == GIMPLE_BINARY_RHS)
>  {
>tree res;
> -  tree nop0 = op0;
> -  tree nop1 = op1;
> -  if (TREE_CODE (op0) != INTEGER_CST)
> -   {
> - /* See if expected value of op0 is good enough to determine the
> result.  */
> - nop0 = expr_expected_value (op0, visited, predictor, probability);
> - if (nop0
> - && (res = fold_build2 (code, type, nop0, op1)) != NULL
> - && TREE_CODE (res) == INTEGER_CST)
> -   return res;
> - if (!nop0)
> -   nop0 = op0;
> -}

By removing the logic we lose the ability to optimize things like
  a = b * c;
where b is predicted to value 0 and c has no useful prediction on it.
> @@ -2631,6 +2623,9 @@ expr_expected_value_1 (tree type, tree o
> 
>   if (predictor2 < *predictor)
> *predictor = predictor2;
> + if (*predictor != PRED_BUILTIN_EXPECT
> + && *predictor != PRED_BUILTIN_EXPECT_WITH_PROBABILITY)
> +   *probability = -1;

This still can "upgrade" a prediction to a predictor of lower enum value
but higher probability, which is not a conservative thing to do.
> 
>   return res;
> }
I ended up with the following patch that also takes care of various cases
of phi merging and downgrades the predictor to the new
PRED_COMBINED_VALUE_PREDICTION, which can, like PRED_BUILTIN_EXPECT, hold
a custom probability but is not treated as FIRST_MATCH.
What do you think?

gcc/ChangeLog:

* predict.cc (expr_expected_value_1):
(get_predictor_value):
* predict.def (PRED_COMBINED_VALUE_PREDICTIONS):

diff --git a/gcc/predict.cc b/gcc/predict.cc
index 2e9b7dd07a7..cdfaea1e607 100644
--- a/gcc/predict.cc
+++ b/gcc/predict.cc
@@ -2404,16 +2404,29 @@ expr_expected_value_1 (tree type, tree op0, enum
tree_code code,
   if (!bitmap_set_bit (visited, SSA_NAME_VERSION (op0)))
return NULL;

-  if (gimple_code (def) == GIMPLE_PHI)
+  if (gphi *phi = dyn_cast <gphi *> (def))
{
  /* All the arguments of the PHI node must have the same constant
 length.  */
- int i, n = gimple_phi_num_args (def);
- tree val = NULL, new_val;
+ int i, n = gimple_phi_num_args (phi);
+ tree val = NULL;
+ bool has_nonzero_edge = false;
+
+ /* If we already proved that given edge is unlikely, we do not need
+to handle merging of the probabilities.  */
+ for (i = 0; i < n && !has_nonzero_edge; i++)
+   {
+ tree arg = PHI_ARG_DEF (phi, i);
+ if (arg == PHI_RESULT (phi))
+   continue;
+ profile_count cnt = gimple_phi_arg_edge (phi, i)->count ();
+ if (!cnt.initialized_p () || cnt.nonzero_p ())
+   has_nonzero_edge = true;
+   }

  for (i = 0; i < n; i++)
{
- tree arg = PHI_ARG_DEF (def, i);
+ tree arg = PHI_ARG_DEF (phi, i);
  enum br_predictor predictor2;

  /* If this PHI has itself as an argument, we cannot
@@ -2422,26 +2435,50 @@ expr_expected_value_1 (tree type, tree op0, enum
tree_code code,
 PHI args then we can still be sure that this is
 likely a constant.  So be optimistic and just
 continue with the next argument.  */
- if (arg == PHI_RESULT (def))
+ if (arg == PHI_RESULT (phi))
continue;

+ /* Skip edges which we already predicted as unlikely.  */
+ if (has_nonzero_edge)
+   {
+ profile_count cnt = gimple_phi_arg_edge (phi, i)->count ();
+ if (cnt.initialized_p () && !cnt.nonzero_p ())
+   continue;
+   }
  HOST_WIDE_INT probability2;
- new_val = expr_expected_value (arg, visited, &predictor2,
-&probability2);
+ tree new_val = expr_expected_value (arg, visited, &predictor2,
+ &probability2);
+ /* If we know nothing about value, give up.  */
+ if (!new_val)
+   return NULL;

- /* It is difficult to combine value predictors.  Simply assume
-that later predictor is weaker and take its prediction.  */
- if (*predictor < predictor2)
+ /* If this is a first edge, trust its prediction.  */
+ if 

[Bug tree-optimization/110852] [14 Regression] ICE: in get_predictor_value, at predict.cc:2695 with -O -fno-tree-fre and __builtin_expect() since r14-2219-geab57b825bcc35

2024-01-04 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110852

--- Comment #6 from Jan Hubicka  ---
> which fixes the ICE by preferring PRED_BUILTIN_EXPECT* over others.
> At least in this case when one operand is a constant and another one is
> __builtin_expect* result that seems like the right choice to me, the fact that
> one operand is constant doesn't mean the outcome of the binary operation is
> unconditionally constant when the other operand is __builtin_expect* based.

The code attempts to solve a problem with no good solution.  If you have two
probabilities that certain thing happens, the combination of them is
neither correct (since we do not know that the both probabilities are
independent) nor the resulting value corresponds to one of the two
predictors contributing to the outcome.

One exception is when one value is 100% sure, which is the case for
PRED_UNCONDITIONAL.  So I would say we could simply special-case 0% and
100% probability and pick the other predictor in that case.

> But reading the
> "This incorrectly takes precedence over more reliable heuristics predicting
> that call
> to cold noreturn is likely not going to happen."
> in the description makes me wonder whether that is what we want always 
> (though,
> I must say I don't understand the cold noreturn argument because
> PRED_COLD_FUNCTION is never returned from expr_expected_value_1.  But

Once expr_expected_value finishes its job, it attaches a predictor to an edge,
and later all predictions sitting on an edge are combined.  If you end
up declaring the prediction as BUILTIN_EXPECT_WITH_PROBABILITY, the logic
combining predictions will believe that the value is very reliable and
will ignore other predictors.  This is the PRED_FLAG_FIRST_MATCH mode.

So computing an uncertain probability and declaring it to be
BUILTIN_EXPECT_WITH_PROBABILITY is worse than declaring it to be a
predictor with the PRED_DS_THEORY merging mode
(which assumes that the value is determined by possibly unreliable
heuristics).

So I would go with special-casing 0% and 100% predictors (which can be
simply stripped and forgotten). For the rest we could probably introduce
PRED_COMBINED_VALUE, which would be like BUILTIN_EXPECT_WITH_PROBABILITY
but with the DS_THEORY merging mode.  It is probably better than nothing, but
certainly cannot be trusted anymore.
> 
> Oh, another thing is that before the patch there were two spots that merged 
> the
> predictors, one for the binary operands (which previously picked the weaker
> predictor and now picks stronger predictor), but also in the PHI handling
> (which still picks weaker predictor); shouldn't that be changed to match,
> including the != -1 handling?

PHI is even more fishy, since we have no idea of the probability of entering
the basic block via a given edge.  We can probably ignore basic blocks
that already have a profile_count of 0.

Otherwise we may try to do a DS theory combination of the incoming
values.  I can cook up a patch.

(also, fixing other profile-related issues is very high on my TODO now)

[Bug target/113233] LoongArch: target options from LTO objects not respected during linking

2024-01-04 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113233

--- Comment #3 from Jan Hubicka  ---
> Confirm.  But option save/restore has been always implemented:
> 
> .section.gnu.lto_.opts,"",@progbits
> .ascii  "'-fno-openmp' '-fno-openacc' '-fno-pie' '-fcf-protection"
> .ascii  "=none' '-mabi=lp64d' '-march=loongarch64' '-mfpu=64' '-m"
> .ascii  "simd=lasx' '-mcmodel=normal' '-mtune=loongarch64' '-flto"
> .ascii  "'\000"
> 
> So -msimd=lasx is correctly recorded.  Not sure why it does not work.

With LTO we need to mix code compiled with different sets of options.
For this reason we imply, for every function definition, an optimization
and a target attribute which record the flags.  So it seems the target
attribute is likely broken for this flag.

[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line

2023-12-06 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

--- Comment #20 from Jan Hubicka  ---
> 
> Live patching (user-space) doesn't depend on any particular alignment of
> functions, on x86-64 at least.  (The plan for other architectures wouldn't 
> need
> any specific alignment either).  Note that the above complaints about missing
> alignment is for kernel (!) ftrace/livepatching on arm64 (!), not on x86_64.

I had the impression that x86-64 also needs forced alignment so the patching
can always be done atomically.  But it does not make a big practical
difference if we go with a flag specifying the minimum function alignment.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-29 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #32 from Jan Hubicka  ---
> /tmp/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/stl_algobase.h:437:
> warning: 'void* __builtin_memcpy(void*, const void*, long unsigned int)'
> writing between 2 and 9223372036854775806 bytes into a region of size 0
> overflows the destination [-Wstringop-overflow=]

It warns on:

  template
struct __copy_move<_IsMove, true, random_access_iterator_tag>
{
  template
_GLIBCXX20_CONSTEXPR
static _Up*
__copy_m(_Tp* __first, _Tp* __last, _Up* __result)
{
  const ptrdiff_t _Num = __last - __first;
  if (__builtin_expect(_Num > 1, true))
__builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
  else if (_Num == 1)
std::__copy_move<_IsMove, false, random_access_iterator_tag>::
  __assign_one(__result, __first);
  return __result + _Num;
}
};

It is likely a false positive on a code path that never happens in real
code, but we now optimize it better.

Does it show an inlining path?

[Bug middle-end/112653] PTA should handle correctly escape information of values returned by a function

2023-11-27 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

--- Comment #15 from Jan Hubicka  ---
Thanks a lot for working on this!  I think it is quite an important part of
the puzzle of making the libstdc++ vector work reasonably well.
[Bug tree-optimization/112706] missed simplification in FRE

2023-11-24 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112706

--- Comment #3 from Jan Hubicka  ---
Thanks, the new pattern looks like a noticeable improvement :)
Base+offset is effective for alias analysis, and I suppose it happens
often enough for compares as well.
>   _76 = _71 + 4;
>   # .MEM_154 = VDEF <.MEM_153>
>   x_3(D)->D.25942._M_implD.25172.D.25249._M_finishD.25175 = _76;
>   # .MEM_7 = VDEF <.MEM_154>
>   D.26033 = 0;
>   # .MEM_157 = VDEF <.MEM_7>
>   *_76 = 0;
>   # PT = nonlocal escaped 
>   _82 = _71 + 8;
>   # .MEM_158 = VDEF <.MEM_157>
>   x_3(D)->D.25942._M_implD.25172.D.25249._M_finishD.25175 = _82;
>   # .MEM_8 = VDEF <.MEM_158>
>   D.26033 ={v} {CLOBBER(eol)};
>   # .MEM_9 = VDEF <.MEM_8>
>   D.26034 = 0;
>   if (_66 != _82)
> ```
> After pre (note the first comparison is gone but not the second one and maybe 
> a
> 3rd). So this patch helps but it looks like a PRE/VN improvement is still
> needed to fix the others.
I think it is missing predication in VN. At each execution of CCP or VN
we work out one conditional to be true, but we still account for both paths
in the value number of the pointer used in the next compare.

If vector used a base+size pair instead of base+endptr, VRP would help
here, but we cannot derive a range for finish - start...

[Bug tree-optimization/112678] [14 regression] Massive slowdown of compilation time with PGO

2023-11-23 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112678

--- Comment #2 from Jan Hubicka  ---
Seems we changed the default to locked increments.

jh@ryzen4:/tmp> cat t.C
void
test()
{
}
jh@ryzen4:/tmp> ~/trunk-install/bin/g++ -O2 -fprofile-generate t.C -S ; grep
lock t.s
lock addl   $1, __gcov0._Z4testv(%rip)
lock addl   %eax, __gcov0._Z4testv+4(%rip)
jh@ryzen4:/tmp> g++ -O2 -fprofile-generate t.C -S ; grep lock t.s
jh@ryzen4:/tmp> 

so r:20a3c74c347429c109bc7002285b735be83f6a0b

Honza

[Bug middle-end/112653] We should optimize memmove to memcpy using alias oracle

2023-11-22 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

--- Comment #5 from Jan Hubicka  ---
> but the issue is that test2 escapes which makes this conflict:

It is passed to memmove, which is noescape, and returned.  Why does local PTA
consider returned values to escape?

[Bug tree-optimization/111498] 951% profile quality regression between g:93996cfb308ffc63 (2023-09-18 03:40) and g:95d2ce05fb32e663 (2023-09-19 03:22)

2023-09-22 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111498

--- Comment #2 from Jan Hubicka  ---
> That just might cause a tid more early threading.  That is, expose latent
> profile updating issues elsewhere.  Looking at the graph we're also still very
> good compared to July.

Early threading should not cause profile disturbances since it happens
before the profile is estimated.

[Bug ipa/111157] [14 Regression] 416.gamess fails with a run-time abort when compiled with -O2 -flto after r14-3226-gd073e2d75d9ed4

2023-08-29 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111157

--- Comment #8 from Jan Hubicka  ---
> This is what I wanted to ask about.  Looking at the dumps, ipa-modref
> knows it is "killed."  Is that enough or does it need to be also not
> read to be know to be useless?

The killed info means that the data does not need to be stored before the
function call (since it will always be overwritten before being read).
So indeed that is what breaks with the ipa-cp/FRE transform now.

[Bug tree-optimization/110628] [14 regression] gcc.dg/tree-ssa/update-threading.c fails after r14-2383-g768f00e3e84123

2023-08-24 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110628

--- Comment #8 from Jan Hubicka  ---
patch posted
https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628231.html

[Bug ipa/111088] useless 'xor eax,eax' inserted when a value is not returned and icf

2023-08-21 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111088

--- Comment #3 from Jan Hubicka  ---
> But adds a return with a value. And then the inliner inlines foo into foo2 but
> we still have the return with a value around ...
I guess ICF can special-case an unused return value, but why is this not
taken care of by IPA-SRA?

[Bug tree-optimization/110628] [14 regression] gcc.dg/tree-ssa/update-threading.c fails after r14-2383-g768f00e3e84123

2023-08-17 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110628

--- Comment #6 from Jan Hubicka  ---
The mismatch happens on:
void foo (unsigned int x)
{
  if (x != 0x800 && x != 0x810)
abort ();
}

It is a bug in reassoc, turning:

void foo (unsigned int x)
{  
;;   basic block 2, loop depth 0, count 1073741824 (estimated locally, freq
1.), maybe hot
;;prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
;;pred:   ENTRY [always]  count:1073741824 (estimated locally, freq
1.) (FALLTHRU,EXECUTABLE)
  if (x_1(D) != 2048)
goto <bb 3>; [66.00%]
  else
goto <bb 5>; [34.00%]
;;succ:   3 [66.0% (guessed)]  count:708669600 (estimated locally, freq
0.6600) (TRUE_VALUE,EXECUTABLE)
;;5 [34.0% (guessed)]  count:365072224 (estimated locally, freq
0.3400) (FALSE_VALUE,EXECUTABLE)

;;   basic block 3, loop depth 0, count 708669600 (estimated locally, freq
0.6600), maybe hot
;;prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
;;pred:   2 [66.0% (guessed)]  count:708669600 (estimated locally, freq
0.6600) (TRUE_VALUE,EXECUTABLE)
  if (x_1(D) != 2064)
goto <bb 4>; [0.00%]
  else
goto <bb 5>; [100.00%]
;;succ:   4 [never]  count:0 (precise, freq 0.)
(TRUE_VALUE,EXECUTABLE)
;;5 [always]  count:708669600 (estimated locally, freq 0.6600)
(FALSE_VALUE,EXECUTABLE)

;;   basic block 4, loop depth 0, count 0 (precise, freq 0.), probably
never executed
;;prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;;pred:   3 [never]  count:0 (precise, freq 0.)
(TRUE_VALUE,EXECUTABLE)
  abort ();
;;succ:   



to:


void foo (unsigned int x)
{
  unsigned int _4;
  _Bool _5;

;;   basic block 2, loop depth 0, count 1073741824 (estimated locally, freq
1.), maybe hot
;;prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
;;pred:   ENTRY [always]  count:1073741824 (estimated locally, freq
1.) (FALLTHRU,EXECUTABLE)
  _4 = x_1(D) & 4294967279;
  _5 = _4 != 2048;
  if (_5 != 0)
goto <bb 3>; [66.00%]
  else
goto <bb 4>; [34.00%]
;;succ:   3 [66.0% (guessed)]  count:708669600 (estimated locally, freq
0.6600) (TRUE_VALUE,EXECUTABLE)
;;4 [34.0% (guessed)]  count:365072224 (estimated locally, freq
0.3400) (FALSE_VALUE,EXECUTABLE)

;;   basic block 3, loop depth 0, count 0 (precise, freq 0.), probably
never executed
;;   Invalid sum of incoming counts 708669600 (estimated locally, freq 0.6600),
should be 0 (precise, freq 0.)
;;prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
;;pred:   2 [66.0% (guessed)]  count:708669600 (estimated locally, freq
0.6600) (TRUE_VALUE,EXECUTABLE)
  abort ();
;;succ:

;;   basic block 4, loop depth 0, count 1073741824 (estimated locally, freq
1.), maybe hot
;;   Invalid sum of incoming counts 365072224 (estimated locally, freq 0.3400),
should be 1073741824 (estimated locally, freq 1.)
;;prev block 3, next block 1, flags: (NEW, REACHABLE, VISITED)
;;pred:   2 [34.0% (guessed)]  count:365072224 (estimated locally, freq
0.3400) (FALSE_VALUE,EXECUTABLE)
  return;

So it combines the two conditionals together but does not update the
outgoing probabilities of the combined conditional.
On x86-64 we already have in the cfg dump:

  _1 = x != 2048;
  _2 = x != 2064;
  _3 = _1 & _2;
  if (_3 != 0)
goto ; [INV]
  else

I wonder why the optimization diverges here?

[Bug tree-optimization/106293] [13/14 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022

2023-07-28 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293

--- Comment #19 from Jan Hubicka  ---
> This heuristic wants to catch
> 
>   
>   if (foo) abort ();
>   
> 
> and avoid sinking "too far" across a path with "similar enough"
> execution count (I think the original motivation was to fix some
> spilling / register pressure issue).  The loop depth test
> should be !(bb_loop_depth (best_bb) < bb_loop_depth (early_bb))

I am still concerned that loop_depth (bb1) < loop_depth (bb2)
does not really imply that bb1 is not in a different loop nest, with a
loop of significantly higher iteration count than bb2...
> so we shouldn't limit sinking to a more outer nest.  As we rule
> out > before this becomes ==.
> 
> It looks tempting to sink to the earliest place with the same
> execution count rather than the latest but the above doesn't
> really achive that (it doesn't look "upwards" but simply fails).
> With a guessed profile it's also going to be hard.

A statistically guessed profile works quite well for things like placement
of spills in IRA (not perfectly, of course), and this looks like a kind of
similar thing.  So perhaps it could work reasonably well...
> 
> And it in no way implements register pressure / spilling sensitivity
> (see also Ajits attempts at producing a patch that avoids sinking
> across a call).  All these are ultimatively doomed unless we at least
> consider a group of stmts together.

hmm, life is hard :)
Honza

[Bug middle-end/110832] 14% capacita -O2 regression between g:9fdbd7d6fa5e0a76 (2023-07-26 01:45) and g:ca912a39cccdd990 (2023-07-27 03:44) on zen3 and core

2023-07-27 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110832

--- Comment #2 from Jan Hubicka  ---
I tested that the profile change makes no difference.

[Bug target/110758] [14 Regression] 8% hmmer regression on zen1/3 with -Ofast -march=native -flto between g:8377cf1bf41a0a9d (2023-07-05 01:46) and g:3a61ca1b9256535e (2023-07-06 16:56); g:d76d19c9bc5

2023-07-21 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110758

--- Comment #2 from Jan Hubicka  ---
> I suspect this is most likely the profile updates changes ...
Quite possibly. The goal of this exercise is to figure out if there are
some bugs in the profile estimate, or whether passes somehow prefer a broken
profile, or if it is just bad luck.

Looking at sphinx and fatigue, it seems that LRA really may prefer the
increased profile counts in the peeled vectorized loop, since it does not
understand that putting a spill on the critical path through the
dependency graph of the code is bad for out-of-order execution.

[Bug tree-optimization/110628] [14 regression] gcc.dg/tree-ssa/update-threading.c fails after r14-2383-g768f00e3e84123

2023-07-13 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110628

--- Comment #3 from Jan Hubicka  ---
> -fdump-tree-all-blocks-details produced more than 100 dump files.  Which 
> one(s)
> do you want?
Can you just zip them an attach all?
Thank you!
Honza

[Bug ipa/110334] [13/14 Regresssion] unused functions not eliminated before LTO streaming

2023-07-11 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110334

--- Comment #23 from Jan Hubicka  ---
But it would be nice to see why the functions are not early inlined.

[Bug ipa/110334] [13/14 Regresssion] unused functions not eliminated before LTO streaming

2023-07-11 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110334

--- Comment #22 from Jan Hubicka  ---
I will cook up the patch to keep multiple variants of nodes pre-inline
and we will see how much that affects compile time & how hard it will be
to get unit size estimates right.

[Bug ipa/110334] [13/14 Regresssion] unused functions not eliminated before LTO streaming

2023-06-28 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110334

--- Comment #16 from Jan Hubicka  ---
> > We already have plenty of GF_CALL_ flags, so adding one should be easy?
> 
> We have 3 bits left :/  I was hoping that cgraph_edge lives long
> enough?  But I suppose we're not keeping them across the early opts
> pipeline.

Hmm, so we have too many flags.  Indeed, the problem is that we don't want to
keep callgraph edges across all the modifications the gimple optimization
passes make.
Eventually such annotations can probably go into a hash_map, just like we do
for EH regions etc.

Honza

[Bug tree-optimization/109689] [14 Regression] ICE at -O1 with "-ftree-vectorize": in check_loop_closed_ssa_def, at tree-ssa-loop-manip.cc:645 since r14-301-gf2d6beb7a4ddf1

2023-06-28 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109689

--- Comment #10 from Jan Hubicka  ---
> > So perhaps simply:
> >   rewrite_into_loop_closed_ssa (NULL, 0);
> > in case we unlooped in loop closed ssa form (which is not that common).
> > Would that be acceptable?
> 
> Yes, we do that in other places as well.
OK, I will test the fix.

Thanks
Honza

[Bug ipa/110334] [13/14 Regresssion] unused functions not eliminated before LTO streaming

2023-06-28 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110334

--- Comment #14 from Jan Hubicka  ---
> 
> why disallow caller->indirect_calls?
See testcase in comment #9
> 
> > +   return false;
> > +  for (cgraph_edge *e2 = callee->callees; e2; e2 = e2->next_callee)
> 
> I don't think this flys - it looks quadratic.  Can we compute this
> in the inline summary once instead?

I guess I can place a cache there.  I think this check will become more
global over time, so IMO it fits better here.
> 
> As for indirect calls, can we maybe mark initial direct GIMPLE call
> stmts as "always-inline" and only look at that marking, thus an
> indirect call will never become "always-inline"?  Iff cgraph edges
> prevail during all early inlining we could mark call edges for
> this purpose?

I also think we need call site specific info.
Tagging gimple call statements and copying the info to gimple edges will
probably be needed here.  We want to keep the info from early inlining
to late inlining since we output errors late.
We already have plenty of GF_CALL_ flags, so adding one should be easy?

Honza

[Bug ipa/110334] [13/14 Regresssion] unused functions not eliminated before LTO streaming

2023-06-26 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110334

--- Comment #11 from Jan Hubicka  ---
Hi,
what about this? It should make at least quite basic inlining into
always_inline functions happen. I do not think many critical always_inlines
have indirect calls in them.  The test for LTO is quite bad and I can
work on solving this incrementally (it would be nice to have this
tested and possibly backported).

diff --git a/gcc/ipa-inline.cc b/gcc/ipa-inline.cc
index efc8df7d4e0..dcec07e49e1 100644
--- a/gcc/ipa-inline.cc
+++ b/gcc/ipa-inline.cc
@@ -702,6 +702,38 @@ can_early_inline_edge_p (struct cgraph_edge *e)
   if (!can_inline_edge_p (e, true, true)
   || !can_inline_edge_by_limits_p (e, true, false, true))
 return false;
+  /* When inlining regular functions into always-inline functions
+ during early inlining watch for possible inline cycles.  */
+  if (DECL_DISREGARD_INLINE_LIMITS (caller->decl)
+  && lookup_attribute ("always_inline", DECL_ATTRIBUTES (caller->decl))
+  && (!DECL_DISREGARD_INLINE_LIMITS (callee->decl)
+ || !lookup_attribute ("always_inline", DECL_ATTRIBUTES (callee->decl))))
+{
+  /* If there are indirect calls, inlining may produce direct call.
+TODO: We may lift this restriction if we avoid errors on formerly
+indirect calls to always_inline functions.  Taking the address
+of an always_inline function is generally a bad idea and should
+have been declared as undefined, but sadly we allow this.  */
+  if (caller->indirect_calls || e->callee->indirect_calls)
+   return false;
+  for (cgraph_edge *e2 = callee->callees; e2; e2 = e2->next_callee)
+   {
+ struct cgraph_node *callee2 = e2->callee->ultimate_alias_target ();
+ /* As early inliner runs in RPO order, we will see uninlined
+always_inline calls only in the case of cyclic graphs.  */
+ if (DECL_DISREGARD_INLINE_LIMITS (callee2->decl)
+ || lookup_attribute ("always_inline", callee2->decl))
+   return false;
+ /* With LTO watch for case where function is later replaced
+by always_inline definition.
+TODO: We may either stop treating noninlined cross-module always
+inlines as errors, or we can extend decl merging to produce
+syntactic aliases and honor always_inline only in units it has
+been declared as such.  */
+ if (flag_lto && callee2->externally_visible)
+   return false;
+   }
+}
   return true;
 }

@@ -3034,18 +3066,7 @@ early_inliner (function *fun)

   if (!optimize
   || flag_no_inline
-  || !flag_early_inlining
-  /* Never inline regular functions into always-inline functions
-during incremental inlining.  This sucks as functions calling
-always inline functions will get less optimized, but at the
-same time inlining of functions calling always inline
-function into an always inline function might introduce
-cycles of edges to be always inlined in the callgraph.
-
-We might want to be smarter and just avoid this type of inlining.  */
-  || (DECL_DISREGARD_INLINE_LIMITS (node->decl)
- && lookup_attribute ("always_inline",
-  DECL_ATTRIBUTES (node->decl))))
+  || !flag_early_inlining)
 ;
   else if (lookup_attribute ("flatten",
 DECL_ATTRIBUTES (node->decl)) != NULL)

[Bug ipa/110334] [13/14 Regresssion] unused functions not eliminated before LTO streaming

2023-06-23 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110334

--- Comment #9 from Jan Hubicka  ---
Just so it is somewhere, here is a testcase showing that we can't inline leaf
functions into always_inlines unless we do some tracking of which calls
were formerly indirect calls.

We really overloaded always_inline from the original semantics of "drop
inlining heuristics" into "be sure that the result is inlined", while for
the second it does not make sense to take its address.
Clang apparently simply does not error on failed always_inlines, which
makes its life easier.

int n;
typedef void (*fnptr)();
fnptr get_me();
__attribute__ ((always_inline))
inline void test(void)
{
if (n < 10)
  (get_me())();
n++;
return;
}
fnptr get_me()
{
return test;
}
void
foo()
{
test();
}

[Bug ipa/110334] [13/14 Regresssion] unused functions not eliminated before LTO streaming

2023-06-23 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110334

--- Comment #8 from Jan Hubicka  ---
> > I was playing with the idea of warning when at lto time when comdats have
> > different command line options, but this triggers way too often in practice.
> 
> Really? :/
Yep, for example firefox consists of many extra libraries (skia, video
codecs, image format decoders...), each developed independently and LTOed
into one libxul.  Each of them has its own configure that tampers with
command line options.
These bring in libstdc++ comdats with different command line
sets (over 100 different command lines for std::vector).

Firefox is bit extreme, but it is common for other bigger projects, too.
> 
> I think it would be desirable to diagnose these, maybe with an option to
> selectively disable this specific diagnostic.  Because while it is not
> always a correctness issue it can be a performance issue as well.

I can dig out the patch then.  But it was simply printing something like
warning: merging comdat function with function of same name but
different flags
note: first difference is -f.
which was produced the same way we do diffs for optimization and target
attributes.

Now those -f flags were usually quite misleading, as they tended to be
internal flags implied by something else (such as -Ofast instead of
-O2).  Often the picked -f flag was obviously harmless and a false
positive, while some other -f flag might have been important.

So I concluded that it is not very easy information to interpret from a
user's perspective.

But indeed I agree that this is not a very obvious source of interesting
correctness & performance issues.
> 
> Beware of new always-inline calls then appearing after greedy inlining
> (though that's exactly the case that we try to avoid here).  I suppose
> you could disable inlining of a function which contains always-inline
> calls or simply functions that did not yet have the early inliner run
> on them (so keep the current behavior in cycles).  Beware of indirect
> always-inline calls then.
> 
> Btw, for Skia the issue is really that some auto-generated CTOR isn't
> marked always-inline but everything else is.  Maybe they should use
> flatten instead of always-inline ...

We disable inlining to always_inline during early inline, but not greedy
inline.  Both of them can turn indirect calls into direct calls.  So I was
thinking that as a first cut we can inline only callees with no indirect
calls and no inlinable direct calls into always_inlines with no indirect
calls.

Are you worried about the possibility of early opts inventing a new call to a
builtin function that is defined as always_inline?

[Bug libstdc++/110287] _M_check_len is expensive

2023-06-19 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #7 from Jan Hubicka  ---
> 
> There is no guarantee that std::vector::max_size() is PTRDIFF_MAX. It
> depends on the Allocator type, A. A user-defined allocator could have
> max_size() == 100.

If the inliner sees a path to the throw functions, it will not determine
_M_check_len as early inlinable.
Perhaps we can __builtin_constant_p it as well and check that
max_size () * sizeof ()
is close to PTRDIFF_MAX.

Thanks for the comments on the patches.  I will try to update the patch.

I was wondering about the allocators. As shown in the mail, optimizing
_M_check_len still leaves two independent throws for insanely large
ops.  Since the allocator is user replaceable, I guess we can not add a new
member function like safe_allocate or so.

We can use __builtin_unreachable to set the value range on the return
value.  For that to work during early optimizations we need to

 1) extend early VRP to retrofit the value determined by
__builtin_unreachable to the SSA name defined earlier, based on the fact
that the execution can not legally terminate in between
 2) teach the inliner to ignore conditionals guarding __builtin_unreachable
 3) add support for propagating a function's return value range from
_M_check_len to _M_reallocate_insert,
so it is correctly propagated to the allocator call.

This is not very easy, but can be generally useful elsewhere.

[Bug libstdc++/110287] _M_check_len is expensive

2023-06-18 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #5 from Jan Hubicka  ---
> Do you mean something like this?
I sent my own version, but yours looks nicer.
> 
> diff --git a/libstdc++-v3/include/bits/stl_vector.h
> b/libstdc++-v3/include/bits/stl_vector.h
> index 70ced3d101f..a4dbfeb8b5b 100644
> --- a/libstdc++-v3/include/bits/stl_vector.h
> +++ b/libstdc++-v3/include/bits/stl_vector.h
> @@ -1902,6 +1902,21 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
> return (__len < size() || __len > max_size()) ? max_size() : __len;
>}
> 
> +  // Called by _M_insert_aux etc.
> +  _GLIBCXX20_CONSTEXPR
> +  size_type
> +  _M_check_len_1(const char* __s) const
> +  {
> +   if (__builtin_constant_p(size()))
Perhaps rule this out for 32-bit ports? Or can we assume that half of the
address space can never be allocated in a single block?
> + {
> +   if (size() == 0)
> + return 1;
> +   else if (size() < max_size() / 2)
I think even this conditional can safely be assumed to be true,
since we can not allocate half of the 64-bit address space.
In general it is important not to fall through to _M_check_len.

As I noticed, we may want to use the assume attribute to make clear that
retval <= max_size ()
to avoid other unnecessary throws downstream.

Honza

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-31 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #12 from Jan Hubicka  ---
> /home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in
> function `main':
> :(.text.startup+0x1): undefined reference to `GMCommand'

I wonder if your plugin is configured correctly.  Can you try to build
with -flto -fuse-linker-plugin.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-14 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #5 from Jan Hubicka  ---
> Actually why didn't we copy the loop header in the first place?
Because it is considered to be a do-while loop already (thanks to
the in-loop conditional, do_while_loop_p is happy).

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-14 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #4 from Jan Hubicka  ---
> Rather, because store-motion out of a loop that might iterate zero times would
> create a data race.
Good point.  If we copied loop headers all the way to the store, the
problem would go away.  Also I assume we can still add a flag which is
set to true if the loop iterates, and then make the store conditional...

[Bug c++/106943] GCC building clang/llvm with LTO flags causes ICE in clang

2023-05-12 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106943

--- Comment #19 from Jan Hubicka  ---
> > Is there any need to over-engineer this like that? I would hope enabling
> > -fno-lifetime-dse globally would not be controversial for LLVM

It would be really nice to have the ranger bug fixed.  Since lifetime
DSE is all handled in the C++ FE, there is no good reason why it should not
work to LTO together objects compiled with the flag and without...

llvm is a really nice benchmark for LTO and PGO, so it would be nice if it
were not fragile and built with -O3 -flto.  In my testing the GCC-produced
binary with LTO+PGO is a lot smaller. It seems to also generate code
faster, but parsing is a bit slower, which I think may be related to
-fstrict-aliasing.

[Bug c++/106943] GCC building clang/llvm with LTO flags causes ICE in clang

2023-05-12 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106943

--- Comment #15 from Jan Hubicka  ---
> > Indeed it is quite long time problem with clang not building with lifetime
> > DSE and strict aliasing.  I wonder why this is not fixed on clang side?
> 
> Because the problems were not communicated? I knew that Firefox needed
> -flifetime-dse=1, but it's the first time I hear that any such problems in
> Clang/LLVM were identified.
> 
> I could not find any workaround for lifetime-dse in SUSE spec file for llvm16.
> Are you saying it was known and worked around somehow? Or it is not 
> manifesting
> because LLVM is built without LTO?

I think the openSUSE package opts out of LTO, probably for this reason.  I am
sometimes using LLVM as a benchmark of LTO and PGO, so it would be great
to have this enabled in the packages, but I had no time to do that so
far.  LLVM built with LTO and PGO builds quite a lot faster.  I was
filing a bug report for that some time ago and it seems that the bugreport
linked above has a quite good analysis of what breaks.

[Bug c++/108887] [13 Regression] ICE in process_function_and_variable_attributes since r13-3601

2023-03-09 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108887

--- Comment #8 from Jan Hubicka  ---
> Also, reset() is only defined in cgraph_node, and I need it to work on both
> functions and variables.
Aha, this is a good point. I forgot that.  I will make reset() work
on symbols in general.  I think it is the cleanest way ahead.
> 
> Clearing n->type seems to confuse things that expect all symbols to be either
> function or variable.
Yes, SYMBOL is not really used in the symbol table except for temporary
situations during construction. So I would indeed expect things to
be confused.

[Bug c++/101118] coroutines: unexpected ODR warning for coroutine frame type in LTO builds

2023-03-07 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101118

--- Comment #15 from Jan Hubicka  ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101118
> 
> --- Comment #14 from Iain Sandoe  ---
> (In reply to Jan Hubicka from comment #13)
> > > So .. for promotion of target expression temporaries to frame vars, one 
> > > of:
> > >  - a) we need to find a different way to name them
> > I think we can just count number of fields within a given frame type?
> 
> yeah I made a hack that did this (and resolves this PR) but I'd think we can
> find something neater, I'd like to c++-ify the sources some more, and create a
> class to manage the frame... ( maybe for GCC14 now ).
Thanks!
Since this is really part of the ABI, I wonder why it is not covered by
the IA-64 C++ ABI?  Was it ever updated for coroutines?

Honza

[Bug c++/101118] coroutines: unexpected ODR warning for coroutine frame type in LTO builds

2023-03-07 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101118

--- Comment #13 from Jan Hubicka  ---
> So .. for promotion of target expression temporaries to frame vars, one of:
>  - a) we need to find a different way to name them
I think we can just count number of fields within a given frame type?

Honza

[Bug c++/108887] [13 Regression] ICE in process_function_and_variable_attributes since r13-3601

2023-03-03 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108887

--- Comment #5 from Jan Hubicka  ---
> Perhaps, but shouldn't we also unlink_from_assembler_name_hash (node, false);?
> I think the point of the current removal is that we've discovered the mangling
> alias clashes with some other symbol.
cgraph_nodes are associated with decls and we allow multiple nodes to
share the same name.  So I think if you just reset it, things will work,
since a new node will be created for the clashing decl and the alias will
eventually be removed as unused (at least I hope so - this is a
slippery side-case)

[Bug c++/101118] coroutines: unexpected ODR warning for coroutine frame type in LTO builds

2023-03-03 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101118

--- Comment #8 from Jan Hubicka  ---
> 
> the synthesised functions (actor, destroy) are intended to be TU-local.
> the ramp function is what remains of the user's original function after the
> coroutine body is outlined - so that has the original signature of the user's
> function.
> 
> We do use counters to generate local symbol names for frame-promoted
> temporaries, so I suppose that there is a possibility that the name(s) that 
> are
> intended to be TU-local become visible across TUs in LTO; but those should be
> the names of coroutine frame entries (i.e. fields in a structure, not
> themselves global symbols).

It is the type that we get the warning on. So if it is never used to pass
data between translation units, the correct solution would be to put it
into the anonymous namespace.
(or avoid producing an ODR name completely, but that would lead to worse
code)

Honza

[Bug ipa/107931] [12/13 Regression] -Og causes always_inline to fail since r12-6677-gc952126870c92cf2

2023-02-21 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107931

--- Comment #20 from Jan Hubicka  ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107931
> 
> --- Comment #17 from rguenther at suse dot de  ---
> On Mon, 20 Feb 2023, jakub at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107931
> > 
> > --- Comment #16 from Jakub Jelinek  ---
> > As discussed elsewhere, always_inline is used in various ways, in some 
> > users I
> > guess would like to see an error if it wasn't inlined, but e.g. the uses in
> > glibc for _FORTIFY_SOURCE where many of the standard C or POSIX APIs have
> > extern inline wrappers with always_inline and gnu_inline attributes.  In 
> > that
> > case always_inline is used there for security reasons (not inlining it means
> > the security checks aren't done) but on the other side C/POSIX requires 
> > 
> > or  to work too.
> 
> With GNU extern inline always_inline it would work to never emit bodies
> of always_inline functions to get the errors diagnosed at link time.
> Possibly standard C inline plus always_inline would work the same if we
> then treat the out of line instantiation as the destination for indirect
> calls.
> 
> But I think the cgraph code is still not 100% accurate in reflecting
> reality with extern inlines (and possibly C 'inline' and instantiations),
> since it doesn't treat the inline copy and the out of line copy 
> separately.
Yep, when the inline version is replaced by the offline version we forget the
inline version. This is a remnant of the one-definition rule work Zack did
many years ago. He changed the C and C++ frontends to not produce duplicated
VAR/FUNCTION_DECLs with the same assembler names and do in-place
replacements instead.  This was never fully finished and we get
duplicates in some scenarios.

In the meantime cgraph knows how to handle this for LTO, and we also support
weakrefs and syntactic aliases, so keeping both definitions around is not
too hard from the cgraph point of view.  We will however need to convince
front-ends to not overwrite the original decl holding the extern inline
body.

Honza

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-28 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #39 from Jan Hubicka  ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552
> 
> --- Comment #35 from Vladimir Makarov  ---
> (In reply to Jakub Jelinek from comment #34)
> > Seems right now DECL_NONALIASED is only used on these coverage vars and on
> > Fortran caf tokens, so perhaps a quick workaround would be on the LRA side
> > never reread stuff from MEMs with VAR_P && DECL_NONALIASED MEM_EXPRs.  CCing
> > Vlad on that.
> 
> The following patch can do this:

Note that with threads we often get large profile mismatches when the
load/stores are hoisted out of the loop.  I.e. 

for ()
  gcov_count++;

to

i = gcov_count
for (...)
  i++
gcov_count = i

If the second loop is run in parallel a lot of increments may be lost.

I was wondering if we should not provide a flag to turn all counters
volatile.   That way we will still have race conditions on their updates
(and it would be cheaper than atomic), but we won't run into such wrong
code issues nor large profile mismatches.

[Bug bootstrap/107950] partial LTO linking of libbackend.a: gcc/gcc-rich-location.cc:207: undefined reference to `range_label_for_type_mismatch::get_text(unsigned int) const'

2023-01-16 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107950

--- Comment #9 from Jan Hubicka  ---
> 
> Feel free to grab my initial patch in c#0 and upstream it. I tried that some
> time ago in the following email thread:
> https://gcc.gnu.org/pipermail/gcc/2021-May/236096.html

Actually I was shooting for partially linking LTO streams just to make
individual WPAs cheaper.  With -flinker-output=nolto-rel we are
effectively disabling most of the benefits of LTO.
My main problem was arranging the partial link only when the stage1/2
compiler is used, so that bootstrap works with older GCCs which support LTO
but not partial linking yet.
> 
> > 
> > I also think it is the case where partial linking makes the symbol to be
> > pulled into LTO binary at the initial link time.  It should be optimized
> > away if linker was not complaining.
> 
> Optimize away during the partial linking? So you think it's a GCC issue when 
> it
> comes to partial linking?
No, with partial linking you merge all the individual object files into
a single one. So if the LTO frontend is not using some libbackend entry
points, you still get that code built into it. So I think it is correct
behaviour.

[Bug tree-optimization/107467] [12/13 Regression] Miscompilation involing -Os , -flto and -fno-strict-aliasing since r12-656-ga564da506f52be66

2023-01-13 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107467

--- Comment #9 from Jan Hubicka  ---
> 
> so it's ICFed compare_pairs having modref TBAA info that makes the
> stores dead.  I suppose ICF needs to reset / alter the modref summaries?

Well, the matching that ICF does should be enough to verify that modref
summaries are the same.  Modref only does what would happen after inlining,
and ICF should be safe WRT both modref and inlining.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #9 from Jan Hubicka  ---
> 
> Do you mean we should fix modeling of divisions there as well? I don't have
> latency/throughput measurements for those CPUs, nor access so I can run
> experiments myself, unfortunately.
> 
> I guess you mean just making a patch to model division units separately,
> leaving latency/throughput as in current incorrect models, and leave it to
> manufacturers to correct it? Alternatively, for AMD Bobcat and Bulldozer we
> might be able to crowd-source it eventually.
Actually for older cores I think the manufacturers do not care much.  I
still have a working Bulldozer machine and I can do some testing.
I think in the Bulldozer case I was basing the latency/throughput on data in
Agner Fog's manuals.  How do you test it?
Honza

[Bug tree-optimization/107715] TSVC s161 for double runs at zen4 30 times slower when vectorization is enabled

2022-11-16 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715

--- Comment #2 from Jan Hubicka  ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715
> 
> --- Comment #1 from Richard Biener  ---
> Because store data races are allowed with -Ofast masked stores are not used so
> we instead get
> 
>   vect__ifc__80.24_114 = VEC_COND_EXPR  vect__ifc__78.23_113>;
>   _ifc__80 = _58 ? _45 : _ifc__78;
>   MEM  [(double *)vectp_c.25_116] = vect__ifc__80.24_114;
> 
> which somehow is later turned into masked stores?  In fact we expand from
> 
>   vect__43.18_107 = MEM  [(double *) + ivtmp.75_134 * 1];
>   vect__ifc__78.23_113 = MEM  [(double *) + 8B +
> ivtmp.75_134 * 1];
>   _97 = .COND_FMA (mask__58.15_104, vect_pretmp_36.14_102,
> vect_pretmp_36.14_102, vect__43.18_107, vect__ifc__78.23_113);
>   MEM  [(double *) + 8B + ivtmp.75_134 * 1] = _97;
>   vect__38.29_121 = MEM  [(double *) + ivtmp.75_134 * 1];
>   vect__39.32_124 = MEM  [(double *) + ivtmp.75_134 * 1];
>   _98 = vect__35.11_99 >= { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
>   _100 = .COND_FMA (_98, vect_pretmp_36.14_102, vect__39.32_124,
> vect__38.29_121, vect__43.18_107);
>   MEM  [(double *) + ivtmp.75_134 * 1] = _100;
> 
> the vectorizer has optimize_mask_stores () which is supposed to replace
> .MASK_STORE with
> 
>  if (mask != { 0, 0, 0 ... })
>
> 
> and thus optimize the mask == 0 case.  But that only triggers for .MASK_STORE.
> 
> You can see this when you force .MASK_STORE via -O3 -ffast-math (without
> -fallow-store-data-races) you get this effect:

Yep, -fno-allow-store-data-races fixes the problem

jh@alberti:~/tsvc/bin> /home/jh/trunk-install/bin/gcc test.c -Ofast 
-march=native  -lm  
jh@alberti:~/tsvc/bin> perf stat ./a.out

 Performance counter stats for './a.out':

 37,289.50 msec task-clock:u  #1.000 CPUs utilized  
 0  context-switches:u#0.000 /sec   
 0  cpu-migrations:u  #0.000 /sec   
   431  page-faults:u #   11.558 /sec   
   137,411,365,539  cycles:u  #3.685 GHz   
  (83.33%)
   991,673,172  stalled-cycles-frontend:u #0.72% frontend cycles
idle (83.34%)
   506,793  stalled-cycles-backend:u  #0.00% backend cycles
idle  (83.34%)
 3,400,375,204  instructions:u#0.02  insn per cycle 
  #0.29  stalled cycles per
insn  (83.34%)
   200,235,802  branches:u#5.370 M/sec 
  (83.34%)
73,962  branch-misses:u   #0.04% of all branches   
  (83.33%)

  37.305121352 seconds time elapsed

  37.285467000 seconds user
   0.0 seconds sys


jh@alberti:~/tsvc/bin> /home/jh/trunk-install/bin/gcc test.c -Ofast 
-march=native  -lm  -fno-allow-store-data-races
jh@alberti:~/tsvc/bin> perf stat ./a.out

 Performance counter stats for './a.out':

667.95 msec task-clock:u  #0.999 CPUs utilized  
 0  context-switches:u#0.000 /sec   
 0  cpu-migrations:u  #0.000 /sec   
   367  page-faults:u #  549.439 /sec   
 2,434,906,671  cycles:u  #3.645 GHz   
  (83.24%)
19,681  stalled-cycles-frontend:u #0.00% frontend cycles
idle (83.24%)
12,495  stalled-cycles-backend:u  #0.00% backend cycles
idle  (83.24%)
 2,793,482,139  instructions:u#1.15  insn per cycle 
  #0.00  stalled cycles per
insn  (83.24%)
   598,879,536  branches:u#  896.588 M/sec 
  (83.78%)
50,649  branch-misses:u   #0.01% of all branches   
  (83.26%)

   0.668807640 seconds time elapsed

   0.66866 seconds user
   0.0 seconds sys

So I suppose it is L1 thrashing. l1-dcache-loads goes up from
2,000,413,936 to 11,044,576,207.

I suppose it would be too fancy for the vectorizer to work out the overall
memory consumption here :) It sort of should have all the info...

Honza

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #7 from Jan Hubicka  ---
> 53730 r btver2_fp_min_issue_delay
> 53760 r znver1_fp_transitions
> 93960 r bdver3_fp_transitions
> 106102 r lujiazui_core_check
> 106102 r lujiazui_core_transitions
> 196123 r lujiazui_core_min_issue_delay
> 
> What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?
Yes, I think that makes sense...

Honza

[Bug c++/107597] LTO causes static inline variables to get a non-uniqued global symbol

2022-11-11 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107597

--- Comment #2 from Jan Hubicka  ---
Hi,
What happens is that we read the symbol as:
  Visibility: externally_visible semantic_interposition
prevailing_def_ironly_exp public weak comdat comdat_group:_ZN12NonTemplated1xE
one_only
While in visibility pass we promote it to:
  Visibility: externally_visible semantic_interposition
prevailing_def_ironly_exp public comdat
So we dissolve the comdat group and turn off the weak flag. This leads
to better code generation, and we know it does not affect dynamic
linking since in a shared library all symbols are interposable.

So this is kind of a feature.  How are the ODR violations detected?
I wonder if keeping the weak flag disturbs something.

[Bug ipa/106991] new+delete pair not optimized by g++ at -O3 but optimized at -Os

2022-09-21 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106991

--- Comment #2 from Jan Hubicka  ---
> Looks like inlining decisions decide to inline new but not delete but for -Os
> we inline none and elide the new/delete pair.
> 
> Maybe we can devise some inline hints to keep pairs?

The inliner is mostly built around the assumption that inline decisions
are independent at each call site.  It would be possible to add
something like that though: we could add links to inline summaries to
hold the pairs, modify the can_inline and want_inline predicates to
understand them, and extend inline_call to do both inlines when needed.
It will cause some fallout since users of inline_call do not expect it
to modify other call sites.

I am not sure how good idea this would be though.  It seems to me that
it makes sense to treat them independently.

The reason we do not inline delete here is that the delete used is
different from the one defined. The function calls
  operator delete(void*, unsigned long)
while the testcase defines
  operator delete(void*)
Honza

[Bug middle-end/106408] PRE with infinite loops

2022-07-22 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106408

--- Comment #2 from Jan Hubicka  ---
> +  /* If block is a loop that is possibly infinite we should not
> +hoist across it.  */
> +  if (block->loop_father->header == block
> + && !finite_loop_p (block->loop_father))
> +   BB_MAY_NOTRETURN (block) = 1;
> +
Don't you also need to handle BBs that have a back edge out that is
not the latch of a loop?
(It is an ugly issue overall.)

Honza

[Bug ipa/102581] [12 Regression] ice in forced_merge, at ipa-modref-tree.h:352 with -fno-strict-aliasing and -O2 since r12-3202-gf5ff3a8ed4ca9173

2021-10-06 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102581

--- Comment #8 from Jan Hubicka  ---
Actually, this is a shorter patch - we should already notice that one
range is contained in the other, but we give up too early.

Honza

diff --git a/gcc/ipa-modref-tree.h b/gcc/ipa-modref-tree.h
index 6a9ed5ce54b..8e9b89b3e2c 100644
--- a/gcc/ipa-modref-tree.h
+++ b/gcc/ipa-modref-tree.h
@@ -110,8 +110,11 @@ struct GTY(()) modref_access_node
   if (!a.parm_offset_known)
 return false;
   /* Accesses are never below parm_offset, so look
- for smaller offset.  */
-  if (!known_le (parm_offset, a.parm_offset))
+ for smaller offset.
+ If access ranges are known, still allow merging
+ when bit offsets comparison passes.  */
+  if (!known_le (parm_offset, a.parm_offset)
+  && !range_info_useful_p ())
 return false;
   aoffset_adj = (a.parm_offset - parm_offset)
 << LOG2_BITS_PER_UNIT;
@@ -618,6 +621,7 @@ private:
found = true;
  if (!found && n->merge (*a, false))
found = restart = true;
+ gcc_checking_assert (found || !a->merge (*n, false));
  if (found)
{
  accesses->unordered_remove (i);

[Bug ipa/102581] [12 Regression] ice in forced_merge, at ipa-modref-tree.h:352 with -fno-strict-aliasing and -O2 since r12-3202-gf5ff3a8ed4ca9173

2021-10-06 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102581

--- Comment #7 from Jan Hubicka  ---
Hi,
the problem is that we assume that merge is symmetric (merging a into b
succeeds if and only if merging b into a succeeds). There was one
symmetrical path missing in the (fancy and a bit ugly) logic on what we
can merge losslessly.

I also added a sanity check for this assumption, so bugs like this
should fire more often.

Honza

diff --git a/gcc/ipa-modref-tree.h b/gcc/ipa-modref-tree.h
index 6a9ed5ce54b..61b4b16dbed 100644
--- a/gcc/ipa-modref-tree.h
+++ b/gcc/ipa-modref-tree.h
@@ -234,6 +234,17 @@ struct GTY(()) modref_access_node
  record_adjustments);
  return true;
}
+ if (known_size_p (a.size)
+ && (!known_size_p (size) || known_lt (size, a.size)))
+   {
+ if (((known_size_p (max_size) || known_size_p (a.max_size))
+  && !known_eq (max_size, a.max_size))
+  || !known_eq (offset1, aoffset1))
+   return false;
+ update (new_parm_offset, offset1, size, max_size,
+ record_adjustments);
+ return true;
+   }
  /* If sizes are same, we can extend the interval.  */
  if ((known_size_p (size) || known_size_p (a.size))
  && !known_eq (size, a.size))
@@ -618,6 +629,7 @@ private:
found = true;
  if (!found && n->merge (*a, false))
found = restart = true;
+ gcc_checking_assert (found || !a->merge (*n, false));
  if (found)
{
  accesses->unordered_remove (i);

[Bug tree-optimization/102446] [9/10/11/12 Regression] wrong code at -O3 on x86_64-linux-gnu

2021-09-22 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102446

--- Comment #4 from Jan Hubicka  ---
> Started with r5-6477-g3620b606822f80863488ca4883542d848d41f9f9
This only affects early inlining decisions, so it may be useful to
bisect this with --param early-inlining-insns=14

Honza

[Bug tree-optimization/45178] CDDCE doesn't eliminate conditional code in infinite loop

2021-08-26 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45178

--- Comment #4 from Jan Hubicka  ---
> and that marks a condition that has nothing to do with loop control.  I 
> suppose
> we can elide this when the loop has no exit (we are already marking backedges
> of irreducible loops).
> 
> But I never actually understood this particular part of CD-DCE.
Skipping this for loops with no exit edges should work - we will end up
removing some control flow inside, but the logic finding the path to
the next live stmt can probably be told to choose any path in an empty,
fully infinite loop.

Honza

[Bug tree-optimization/101909] 73% regression on tfft benchmark for -O2 -ftree-loop-vectorize compared to -O2 on zen hardware

2021-08-16 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101909

--- Comment #2 from Jan Hubicka  ---
> So that's znver1 (split AVX IIRC) compared to znver2?
Martin will know how to decode machine names.  I am never sure.
It is with generic, so split AVX does not make difference.

Honza

[Bug testsuite/101902] [12 regression] g++.dg/warn/uninit-1.C has excess errors after r12-2898

2021-08-15 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101902

--- Comment #1 from Jan Hubicka  ---
Hi,
i am testing

diff --git a/gcc/tree-ssa-uninit.c b/gcc/tree-ssa-uninit.c
index 5d7bc800419..d89ab5423cd 100644
--- a/gcc/tree-ssa-uninit.c
+++ b/gcc/tree-ssa-uninit.c
@@ -641,7 +641,7 @@ maybe_warn_pass_by_reference (gcall *stmt, wlimits &wlims)
wlims.always_executed = false;

   /* Ignore args we are not going to read from.  */
-  if (gimple_call_arg_flags (stmt, argno - 1) & EAF_UNUSED)
+  if (gimple_call_arg_flags (stmt, argno - 1) & (EAF_UNUSED | EAF_NOREAD))
continue;

   tree arg = gimple_call_arg (stmt, argno - 1);

[Bug middle-end/101829] problems with inline + __attribute__ ((malloc (deallocator)))

2021-08-09 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101829

--- Comment #2 from Jan Hubicka  ---
> It might be possible to inline such functions by creating a "stub" call either
> after or before the inlined function body where the "stub" would just be there
> to represent the attributes.
> 
> Say, inline a 'deallocator' fn
> 
>   my_free (x);
> 
> as
> 
>   
>   .IFN_STUB_DEALLOCATE (x);
> 
> or sth like that where those stubs are removed after any diagnostics with
> regard to pairing of malloc/free have been issued (not sure where that
> currently
> happens).

We have a similar issue with functions calling alloca (which we may
inline if we are able to insert a stack restore).  I was always too
lazy to implement this since we need to also insert it on the EH
cleanup edges, but perhaps it is time to teach the inliner such
tricks...
Honza

[Bug fortran/100724] -fwhole-program breaks module use

2021-05-25 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100724

--- Comment #6 from Jan Hubicka  ---
> Could the manual entry for -fwhole-program just be amended to clarify that 
> it's
> a fallback for when a linker plugin isn't available for -flto.  That may be
> what it was intended to say, but it's not clear to me.  I used -fwhole-program
> because it seemed to fit my case exactly.

It can also be used without LTO if your program has only one source
file and the FE is not producing duplicated decls...
Honza

[Bug c/100483] Extend -fno-semantic-interposition to global variables

2021-05-16 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100483

--- Comment #6 from Jan Hubicka  ---
> Thanks for the clarification. I misinterpreted the documentation.
> Then it seems that -fno-semantic-interposition is a very safe optimization for
> distributions to default to. Closing as intended.

Basically -fno-semantic-interposition is OK unless you want to play
tricks like interposing alternative malloc implementations. This comes
in very handy for some low-level libraries, but I did not see it being
used for interposing, say, some Qt symbols or so.

Honza
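The trade-off can be illustrated with hypothetical build commands (the file and library names here are made up):

```shell
# Default semantics: calls within libmylib.so to its own exported
# functions go through the PLT, so a preloaded malloc (or any other
# alternative definition) can interpose them at dynamic-link time.
gcc -O2 -fPIC -shared mylib.c -o libmylib.so

# With -fno-semantic-interposition the library binds references to its
# own definitions locally; such calls can be inlined or made direct,
# but interposing those symbols from outside no longer takes effect.
gcc -O2 -fPIC -shared -fno-semantic-interposition mylib.c -o libmylib.so
```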

[Bug analyzer/98599] [11 Regression] fatal error: Cgraph edge statement index out of range with -Os -flto -fanalyzer

2021-04-13 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98599

--- Comment #15 from Jan Hubicka  ---
> Should be fixed by the above patch, though it's more of a workaround for now;
> am still not sure about what's going on with clones.

I understand it now.  The problem is that the fixup is missed for one
gimple body after statements were renumbered by the analyzer.  The
reason is a confused check for has_gimple_body_p.  We do not really get
out of sync between the uids of a function and its clone, since we
stream only the clone, but the stream-in code expects uids to be
continuously increasing.

So apparently the analyzer is the first pass that uses UIDs of
statements at WPA time.

Sorry for taking so long to understand this.

diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
index ceb61bb300b..5903f75ac23 100644
--- a/gcc/lto/lto.c
+++ b/gcc/lto/lto.c
@@ -306,7 +306,7 @@ lto_wpa_write_files (void)
   cgraph_node *node;
   /* Do body modifications needed for streaming before we fork out
  worker processes.  */
-  FOR_EACH_FUNCTION_WITH_GIMPLE_BODY (node)
+  FOR_EACH_FUNCTION (node)
 if (!node->clone_of && gimple_has_body_p (node->decl))
   lto_prepare_function_for_streaming (node);

[Bug middle-end/99857] [11 Regression] FAIL: libgomp.c/declare-variant-1.c (test for excess errors) by r11-7926

2021-04-06 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99857

--- Comment #4 from Jan Hubicka  ---
> Honza stated that he's "looking into it",
> .
I do, I just got distracted by Easter.  The problem has to be
release_body happening mid offload-streaming, which is a bit odd by
itself.  Having a smaller testcase is nice - the libgomp one had quite
a few release_body calls at compile time.

Honza

[Bug lto/99898] Possible LTO object incompatibility on gcc-10 branch

2021-04-06 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99898

--- Comment #10 from Jan Hubicka  ---
> Many of the *.opt changes are target specific, so you'd need to test it also
> across all targets, and furthermore it depends on what exactly is being
> saved/restored, many options might be at the same spot.
> So perhaps we want to compute some hash of the options stuff (e.g. compute it
> by the awk scripts that emit options*.[ch]) and use that to determine LTO
> compatibility in addition to the version?

That would work.  One does not really need to do that in the LTO
header; simply stream the hash before streaming out the
optimization_node decl.  A bit sad would be that without version info
you have no indication whether you mixed a new compiler with old
objects or vice versa, but that is a minor annoyance I guess.  It would
be good if the compiler would just sorry(), claiming that it cannot
read object files created by a different version...

I believe we already save a diff from the default values rather than
streaming out all values.  An option would be to stream the option
names rather than indexes, so adding or removing a completely unrelated
option does not disturb the file format.

Honza

[Bug lto/99898] Possible LTO object incompatibility on gcc-10 branch

2021-04-06 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99898

--- Comment #8 from Jan Hubicka  ---
> Any *.opt changes can break the streaming of optimization or target option
> nodes.
> And from experience with gcc plugins we have such changes ~ each month even on
> release branches.
It may make sense to add a simple test to our regular testers that
either the new revision can consume old object files or the version was
updated :)

Honza

[Bug lto/99898] Possible LTO object incompatibility on gcc-10 branch

2021-04-06 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99898

--- Comment #6 from Jan Hubicka  ---
> I only recall backporting the streaming fixes early in the gcc 10
> timeframe (August); that was the reason for the September bump.
> Didn't we backport some new command line options/params breaking
> streaming of optimization nodes as usual?

We did, just a few hours after the bump (in common.opt). So there is a
small range of revisions where one can produce incompatible objects.
But I did not check lang-specific/target-specific options.

Honza

[Bug lto/99898] Possible LTO object incompatibility on gcc-10 branch

2021-04-06 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99898

--- Comment #5 from Jan Hubicka  ---
> The LTO minor saw a bump around Sep 10 last year already so the object files
> must be younger or LTO should complain.
> 
> I'm not aware of any specific change where we forgot the bumping but there 
> were
> a lot of changes and since we did already bump bumping again shouldn't cause
> any harm.  Still I'd like to be sure we're not seeing a genuine streaming bug
> here.

I only recall backporting the streaming fixes early in the gcc 10
timeframe (August); that was the reason for the September bump.
Didn't we backport some new command line options/params breaking
streaming of optimization nodes as usual?

honza

[Bug ipa/99309] [10/11 Regression] Segmentation fault with __builtin_constant_p usage at -O2

2021-03-31 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99309

--- Comment #7 from Jan Hubicka  ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99309
> 
> --- Comment #6 from Jakub Jelinek  ---
> (In reply to Jan Hubicka from comment #5)
> > As discussed, I can prepare patch to make inliner to redirect
> > __builtin_constant_p to __builtin_true whenever inliner detect that the
> > expression is compile time ocnstant.  This will avoid us eventually hitting
> > unreachable when late optimizations forget to make the transformation.
> 
> I'm quite worried about that, the point of guarding something with
> __builtin_constant_p is about the hope that the argument of that builtin will
> evaluate to constant in the conditional block guarded by it.
> By folding __builtin_constant_p to true if it sees it as constant but not
> actually propagating that constant to all uses of that expression we could 
> have
> the condition folded to true, but if SRA or FRE etc. is disabled or isn't able
> to optimize it, we wouldn't have it constant.
Yes, it is not good (this is why I did not implement it earlier).
However, you can trigger the same problem without inlining.  Just put
enough statements between the conditional and the use that the AO walk
oracle gives up...
Honza
