[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-12-13 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #11 from Richard Biener  ---
(In reply to Richard Biener from comment #10)
> Re-confirmed.  =15 vs =30 goes from
> 
>  backwards jump threading   :   0.58 ( 13%)
> 
> to
> 
>  backwards jump threading   :   7.00 ( 65%)
> 
> so it still shows exponential behavior for CFGs like
> 
>  if (a)
>...
>  if (b)
>...
>  if (c)
>...
> 
> because we explore all paths through the CFG that fit in some size limit.
> I'm not sure if there's any low hanging fruit besides this (like if we
> properly avoid re-doing local computes when visiting blocks as part of a
> path multiple times).

We do have --param max-jump-thread-paths, which puts a hard limit on the number
of paths we explore; the default might just be a bit high for this testcase.
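
For reference, the param can be lowered on the command line together with the
stmt copying limit; the values below are just arbitrary examples, not a
recommendation:

  gcc -O2 --param max-jump-thread-paths=16 \
      --param max-jump-thread-duplication-stmts=30 testcase.c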

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-12-13 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #10 from Richard Biener  ---
Re-confirmed.  =15 vs =30 goes from

 backwards jump threading   :   0.58 ( 13%)

to

 backwards jump threading   :   7.00 ( 65%)

so it still shows exponential behavior for CFGs like

 if (a)
   ...
 if (b)
   ...
 if (c)
   ...

because we explore all paths through the CFG that fit in some size limit.
I'm not sure if there's any low hanging fruit besides this (like if we
properly avoid re-doing local computes when visiting blocks as part of a
path multiple times).
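
For reference, a self-contained function with this shape (a made-up reduction,
not the actual PR testcase) would be something like:

  /* Each additional condition roughly doubles the number of paths the
     backwards threader can enumerate through the chain.  */
  void
  f (int a, int b, int c, int d, int *p)
  {
    if (a)
      p[0] = 1;
    if (b)
      p[1] = 1;
    if (c)
      p[2] = 1;
    if (d)
      p[3] = 1;
  }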

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-08-19 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.2                        |12.3

--- Comment #9 from Richard Biener  ---
GCC 12.2 is being released, retargeting bugs to GCC 12.3.

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-08-11 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #8 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:16b013c9d9b4d950f89821476e791bf18c1295df

commit r13-2020-g16b013c9d9b4d950f89821476e791bf18c1295df
Author: Richard Biener 
Date:   Tue Aug 9 13:37:26 2022 +0200

tree-optimization/106514 - revisit m_import compute in backward threading

This revisits how we compute imports later used for the ranger path
query during backwards threading.  The compute_imports function
of the path solver ends up pulling the SSA def chain of regular
stmts without limit and since it starts with just the gori imports
of the path exit, it misses some interesting names to translate
during path discovery.  In fact, with a still-empty path,
compute_imports does not look like the correct tool here.

The following instead implements the import compute during the path
discovery itself, and since we add the exit block we seed the initial
imports and interesting names from just the exit conditional.  When we
then process interesting names (aka imports whose definition we have not
yet seen) we prune local defs but add their uses, in a similar way to
what compute_imports would have done.
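
Schematically (a sketch with assumed variable names, not the actual patch
hunk), processing an interesting name whose definition turns out to be local
to the visited block then amounts to:

  /* Sketch: NAME is defined in BB, so drop it from the imports and
     queue its SSA operands as new interesting names instead.  */
  if (gimple_bb (SSA_NAME_DEF_STMT (name)) == bb)
    {
      bitmap_clear_bit (imports, SSA_NAME_VERSION (name));
      ssa_op_iter iter;
      tree use;
      FOR_EACH_SSA_TREE_OPERAND (use, SSA_NAME_DEF_STMT (name),
                                 iter, SSA_OP_USE)
        bitmap_set_bit (interesting, SSA_NAME_VERSION (use));
    }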

compute_imports is also lacking in its walking of the def chain
compared to range_def_chain::get_def_chain, which for example
handles &_1->x specially through range_op_handler and
gimple_range_operand1, so the code copies this.  A fix for
compute_imports will be done separately, also fixing the unbounded
walk there.

The patch also properly unwinds m_imports during the path discovery
backtracking, and from a debugging session I have verified that the two
sets now evolve as expected, whereas previously they behaved slightly
erratically.

Fortunately the m_imports set is now also shrunk significantly for
the PR69592 testcase (aka PR106514), so that there's an overall speedup
when increasing --param max-jump-thread-duplication-stmts through
15 -> 30 -> 60 -> 120: compile time goes from 1s -> 2s -> 13s -> 27s
without the patch to 1s -> 2s -> 4s -> 8s with it.

This runs into a latent issue in X which doesn't seem to expect
any PHI nodes with a constant argument on an edge inside the path.
But we now have those as interesting, for example for the ICEing
g++.dg/torture/pr100925.C, which just has something like

  if (i)
    x = 1;
  else
    x = 5;
  if (x == 1)
    ...

where we now have the path from if (i) to if (x) and the PHI for x
in the set of imports to consider for resolving x == 1, which IMHO
is exactly what we want.  The change to path_range_query::ssa_range_in_phi
papers over the issue and drops the range to varying instead of
crashing.  I didn't want to mess with this any further in this patch
(but I couldn't resist replacing the loop over PHI args with
PHI_ARG_DEF_FROM_EDGE, so mind the re-indenting).
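
For illustration (a sketch with assumed variable names, not the committed
hunk), picking the PHI argument via the path edge rather than iterating over
all arguments looks roughly like:

  /* E is the edge through which the path enters the PHI's block.  */
  tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
  if (TREE_CODE (arg) == SSA_NAME)
    /* An SSA name argument gets its range as seen on path entry.  */
    range_on_path_entry (r, arg);
  else
    /* A constant argument yields a singleton range instead of ICEing.  */
    r.set (arg, arg);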

PR tree-optimization/106514
* tree-ssa-threadbackward.cc (back_threader::find_paths_to_names):
Compute and unwind both m_imports and interesting on the fly during
path discovery.
(back_threader::find_paths): Compute the original m_imports
from just the SSA uses of the exit conditional.  Drop
handling single_succ_to_potentially_threadable_block.
* gimple-range-path.cc (path_range_query::ssa_range_in_phi): Handle
constant PHI arguments without crashing.  Use
PHI_ARG_DEF_FROM_EDGE.

* gcc.dg/tree-ssa/ssa-thread-19.c: Un-XFAIL.
* gcc.dg/tree-ssa/ssa-thread-20.c: New testcase.

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-08-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-08-10
     Ever confirmed|0                           |1
           Priority|P3                          |P2
             Status|UNCONFIRMED                 |NEW

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-08-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #7 from Richard Biener  ---
For the testcase m_imports is so big because we have

...
   [local count: 1073741824]:
  # c_1198 = PHI 
  _599 = MEM[(unsigned int *)b_1201(D) + 2792B];
  d_2401 = _599 + d_2399;
  if (d_2399 > d_2401)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 536870913]:
  c_2402 = c_1198 + 1;

   [local count: 1073741824]:
  # c_1199 = PHI 
  _600 = MEM[(unsigned int *)b_1201(D) + 2796B];
  d_2403 = _600 + d_2401;
  if (d_2401 > d_2403)
goto ; [50.00%]
  else
goto ; [50.00%]

so when back_threader::find_paths does ->compute_imports (.., bb 1200) we
walk up the whole d_2403 definition chain unbounded (for PHIs we restrict
to edges on the path, which is empty).  I realize that there's no good way
to pick up extra imports on the fly cheaply - we could handle it when
we prune local defs from the imports, at which point we could add operands,
but it's not clear to me that this will be a good trade-off.  In fact,
pruning imports looks suspicious, as the final path-range query will
be limited there?  Likewise, for any import we add via PHI translation
we fail to add local def operands - we're only getting those from the
initial import compute, which basically picks those from blocks dominating
the exit but no others.
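
A depth-limited variant of such an import walk (purely a sketch, helper name
and limit made up) could look like:

  /* Sketch: collect SSA operands into IMPORTS but stop after visiting
     LIMIT defining statements instead of walking the chain unbounded.  */
  static void
  bounded_import_walk (tree name, bitmap imports, unsigned limit)
  {
    auto_vec<tree> worklist;
    worklist.safe_push (name);
    while (!worklist.is_empty () && limit != 0)
      {
        limit--;
        gimple *def = SSA_NAME_DEF_STMT (worklist.pop ());
        if (!is_gimple_assign (def))
          continue;
        ssa_op_iter iter;
        tree use;
        FOR_EACH_SSA_TREE_OPERAND (use, def, iter, SSA_OP_USE)
          if (bitmap_set_bit (imports, SSA_NAME_VERSION (use)))
            worklist.safe_push (use);
      }
  }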

I will experiment with re-wiring this.

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-08-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #6 from Richard Biener  ---
So one now needs to bump the limit to 60 to get enough samples for perf.  We
then see

Samples: 55K of event 'cycles:u', Event count (approx.): 49013411833
Overhead   Samples  Command  Shared Object  Symbol
  51.19%     28195  cc1      cc1            [.] path_range_query::compute_ranges_in_block
  11.67%      6427  cc1      cc1            [.] path_range_query::adjust_for_non_null_uses
   9.20%      5069  cc1      cc1            [.] path_range_query::range_defined_in_block
   3.39%      1869  cc1      cc1            [.] bitmap_set_bit
   1.95%      1072  cc1      cc1            [.] back_threader::find_paths_to_names
   1.93%      1066  cc1      cc1            [.] bitmap_bit_p

compute_ranges_in_block is also at the top with =30, but adjust_for_non_null_uses
newly shows up with =60.

The compute_ranges_in_block slowness is attributed to

  // ...and then the rest of the imports.
  EXECUTE_IF_SET_IN_BITMAP (m_imports, 0, i, bi)
    {
      tree name = ssa_name (i);
      Value_Range r (TREE_TYPE (name));

      if (gimple_code (SSA_NAME_DEF_STMT (name)) != GIMPLE_PHI
	  && range_defined_in_block (r, name, bb))


plus

  gori_compute &g = m_ranger->gori ();
  bitmap exports = g.exports (bb);
  EXECUTE_IF_AND_IN_BITMAP (m_imports, exports, 0, i, bi)
    {
      tree name = ssa_name (i);
      Value_Range r (TREE_TYPE (name));
      if (g.outgoing_edge_range_p (r, e, name, *this))


For this testcase there seem to be a lot of imports but not many exports,
so range_defined_in_block is called very many times compared to
outgoing_edge_range_p, although the latter is comparatively more expensive.

For the path query I wonder why we are interested in computing (aka
updating the cache for) anything but the exports?  When we
compute the exports, why is the cache not lazily computed just for
the interesting names?  AFAICS we invalidate all local defs (but even
then, why?  We get to see a def exactly once, so why do we have to even
think about clearing something we should not have seen?)

That is, in path_range_query::compute_ranges

  while (1)
    {
      basic_block bb = curr_bb ();

      compute_ranges_in_block (bb);
      adjust_for_non_null_uses (bb);

      if (at_exit ())
	break;

      move_next ();
    }

I'd expect only a small portion of the actual compute_ranges_in_block
work to be done for all blocks and the real resolving work only for
the block ending the path?  Maybe the backwards threader is just using
the wrong (expensive) API here?  It does

 m_solver->compute_ranges (path, m_imports);
 m_solver->range_of_stmt (r, cond);


--

Btw, I wondered if path-range-query can handle parts of the path being a
"black box", aka skip to the immediate dominator instead of one of the
predecessor edges?  I _think_ analysis-wise this would be quite
straightforward, but of course we'd have to represent this somehow in the
path.  Maybe it works by simply leaving out the intermediate blocks?  Thus,

   B
   |\
   A
  / \
 C   D
  \ /
   E
   |\

the path would be from B to E, but we wouldn't care whether we go the C or
the D way, and when duplicating the path we'd simply duplicate the whole
diamond instead of duplicating only one branch, say A->D, while keeping the
edge A->C to the original block C, which defeats the threading of E to its
successor if we happen to go that way.
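
A made-up C shape of such a diamond (not from the PR) would be:

  void
  g (int a, int b, int *p)
  {
    int x = 5;
    if (b)              /* B: the condition the path starts at  */
      {
        x = 1;
        if (a)          /* A: head of the diamond  */
          p[0] = 1;     /* C  */
        else
          p[1] = 1;     /* D  */
      }
    /* E: x is fully determined by B alone; the C vs. D choice is
       irrelevant, yet a threading path currently has to pick one.  */
    if (x == 1)
      p[2] = 2;
  }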

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-08-09 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #5 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:409978d58dafa689c5b3f85013e2786526160f2c

commit r13-1998-g409978d58dafa689c5b3f85013e2786526160f2c
Author: Richard Biener 
Date:   Mon Aug 8 12:20:04 2022 +0200

tree-optimization/106514 - add --param max-jump-thread-paths

The following adds a limit for the exponential greedy search of
the backwards jump threader.  The idea is to limit the search
space in a way that the paths considered are the same as if the search
were done in BFS order rather than DFS.  In particular it stops considering
incoming edges into a block if the product of the in-degrees of
blocks on the path exceeds the specified limit.
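
In pseudo-code (a sketch; overall_path is the argument mentioned in the
ChangeLog below, and the generated param identifier is an assumption), the
check amounts to:

  /* Sketch: give up exploring further predecessors of BB once the
     number of paths reachable this way exceeds the limit.  */
  overall_path *= EDGE_COUNT (bb->preds);
  if (overall_path > (unsigned) param_max_jump_thread_paths)
    return;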

When considering the low stmt copying limit of 7 (or 1 when optimizing
for size), this means the degenerate case with maximum search
space is a sequence of conditions with no actual code

  B1
   |\
   | empty
   |/
  B2
   |\
   ...
  Bn
   |\

GIMPLE_CONDs are costed 2, an equivalent GIMPLE_SWITCH already 4, so
we reach 7 already with 3 middle conditions (B1 and Bn do not count).
The search space would be 2^4 == 16 to reach this.  The FSM threads
historically allowed for a thread length of 10 but are really looking
for a single multiway branch threaded across the backedge.  I've
chosen 64 as the default for the new parameter, which effectively
limits the outdegree of the switch statement (the cases reaching the
backedge) to that number (divided by 2 until I add some special
pruning for FSM threads due to the loop header indegree).  The
testcase ssa-dom-thread-7.c requires 56 at the moment (as said,
some special FSM-thread pruning of considered edges would bring
it down to half of that), but we now get one more threading
and quite a few more in the later threadfull passes.  This testcase
seems to be difficult to check for expected transforms.

The new testcases add the degenerate case we currently thread
(without deciding whether that's a good idea ...) plus one with
an appropriate limit that should prevent the threading.

This obsoletes the mentioned --param max-fsm-thread-length but
I am not removing it as part of this patch.  When the search
space is limited the thread stmt size limit effectively provides
max-fsm-thread-length.

The param with its default does not help PR106514 enough to unleash
path searching with the higher FSM stmt count limit.

PR tree-optimization/106514
* params.opt (max-jump-thread-paths): New.
* doc/invoke.texi (max-jump-thread-paths): Document.
* tree-ssa-threadbackward.cc (back_threader::find_paths_to_names):
Honor max-jump-thread-paths, take overall_path argument.
(back_threader::find_paths): Pass 1 as initial overall_path.

* gcc.dg/tree-ssa/ssa-thread-16.c: New testcase.
* gcc.dg/tree-ssa/ssa-thread-17.c: Likewise.
* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust.

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-08-05 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #4 from CVS Commits  ---
The master branch has been updated by Aldy Hernandez :

https://gcc.gnu.org/g:47964e766270f349f5b171bcd68ff7c1e60d85d8

commit r13-1973-g47964e766270f349f5b171bcd68ff7c1e60d85d8
Author: Aldy Hernandez 
Date:   Fri Aug 5 08:04:10 2022 +0200

Inline unsupported_range constructor.

An unsupported_range temporary is instantiated in every Value_Range
for completeness' sake and should be mostly a NOP.  However, it
shows up in the callgrind stats because it's not inlined.  This
fixes the oversight.
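
The general pattern (schematic, not the actual value-range.h diff) is simply
defining the member in the class body so it is implicitly inline:

  /* Schematic: a constructor defined inside the class body is implicitly
     inline, so the otherwise empty temporary no longer costs an
     out-of-line call.  */
  class example_range
  {
  public:
    example_range () { set_undefined (); }  /* was defined in the .cc file  */
    void set_undefined () { m_defined = false; }
  private:
    bool m_defined = true;
  };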

PR tree-optimization/106514

gcc/ChangeLog:

* value-range.cc (unsupported_range::unsupported_range): Move...
* value-range.h (unsupported_range::unsupported_range): ...here.
(unsupported_range::set_undefined): New.

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-08-04 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #3 from CVS Commits  ---
The master branch has been updated by Andrew Macleod :

https://gcc.gnu.org/g:8e34d92ef29a175b84cc7f5185db43656ae762bb

commit r13-1965-g8e34d92ef29a175b84cc7f5185db43656ae762bb
Author: Andrew MacLeod 
Date:   Thu Aug 4 12:22:59 2022 -0400

Loop over intersected bitmaps.

compute_ranges_in_block loops over the import list and then checks the
same bit in the exports.  It is more efficient to loop over the
intersection of the two bitmaps.
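
Schematically (process () is a placeholder, not a real function), the change
is:

  /* Before: walk all imports and test each bit against the exports.  */
  EXECUTE_IF_SET_IN_BITMAP (m_imports, 0, i, bi)
    if (bitmap_bit_p (exports, i))
      process (ssa_name (i));

  /* After: walk only the intersection of the two bitmaps.  */
  EXECUTE_IF_AND_IN_BITMAP (m_imports, exports, 0, i, bi)
    process (ssa_name (i));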

PR tree-optimization/106514
* gimple-range-path.cc (path_range_query::compute_ranges_in_block):
Use EXECUTE_IF_AND_IN_BITMAP to loop over 2 bitmaps.

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-08-03 Thread amacleod at redhat dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #1 from Andrew Macleod  ---
(In reply to Richard Biener from comment #0)

> 
> Part of the sub-optimality is probably the equiv chain becoming very long
> (can we simply limit that?) and clearing bits in all the very many
> bitmaps linked.  N

We certainly could, especially in the path version, which has that killing-def
issue that doesn't exist in the normal oracle.  The path oracle basically
takes a normal oracle, then bolts the path-following code onto it, and has to
deal with new definitions invalidating any existing equivalences.  I'd first
look for inefficiencies elsewhere, as we didn't spend a lot of time tweaking it
once it was working.

Given the way equivalences have to be matched, I'm not sure that we even need
to walk the list.  The new equivalence set for the killed def will only contain
itself, or any new equivalences encountered since the kill.  In order to be
equivalent, two names must be in each other's set, which they won't be.  I'm
not convinced we need to remove them at all.
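
As a sketch of that invariant (assuming equivalence sets are bitmaps of SSA
versions; the names are made up):

  /* Two names are only considered equivalent when each one is recorded
     in the other's set, so a stale one-sided entry is effectively dead.  */
  bool
  mutually_equiv_p (tree a, tree b, bitmap equiv_of_a, bitmap equiv_of_b)
  {
    return bitmap_bit_p (equiv_of_a, SSA_NAME_VERSION (b))
           && bitmap_bit_p (equiv_of_b, SSA_NAME_VERSION (a));
  }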

I'm also not sure why the path oracle changes the root oracle's requirement
that they be in the same equivalence set, not just in each other's.  I think it
has something to do with the transitory nature of the path equivalences/relations
vs the root oracle's "permanent" sets.  I think we can do better here too.

And finally, Aldy has a list of all the ssa-names in the path that are relevant
to the calculations in the path.  I suspect we can reduce any equivalence sets
immediately to just those names, as any on-entry ranges should reflect existing
equivalences.  In theory :-)

We'll see if any or all of those have any effect and get back to you.

[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query

2022-08-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |12.2
                 CC|                            |amacleod at redhat dot com
           Keywords|                            |compile-time-hog