[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 --- Comment #11 from Richard Biener --- (In reply to Richard Biener from comment #10) > Re-confirmed. =15 vs =30 goes from > > backwards jump threading : 0.58 ( 13%) > > to > > backwards jump threading : 7.00 ( 65%) > > so it still shows exponential behavior for CFGs like > > if (a) >... > if (b) >... > if (c) >... > > because we explore all paths through the CFG that fit in some size limit. > I'm not sure if there's any low hanging fruit besides this (like if we > properly avoid re-doing local computes when visiting blocks as part of a > path multiple times). We do have --param max-jump-thread-paths putting a hard limit on the number of branches in a paths we explore, the default might just be a bit high for this testcase.
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 --- Comment #10 from Richard Biener --- Re-confirmed. =15 vs =30 goes from backwards jump threading : 0.58 ( 13%) to backwards jump threading : 7.00 ( 65%) so it still shows exponential behavior for CFGs like if (a) ... if (b) ... if (c) ... because we explore all paths through the CFG that fit in some size limit. I'm not sure if there's any low hanging fruit besides this (like if we properly avoid re-doing local computes when visiting blocks as part of a path multiple times).
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 Richard Biener changed: What|Removed |Added Target Milestone|12.2|12.3 --- Comment #9 from Richard Biener --- GCC 12.2 is being released, retargeting bugs to GCC 12.3.
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 --- Comment #8 from CVS Commits --- The master branch has been updated by Richard Biener : https://gcc.gnu.org/g:16b013c9d9b4d950f89821476e791bf18c1295df commit r13-2020-g16b013c9d9b4d950f89821476e791bf18c1295df Author: Richard Biener Date: Tue Aug 9 13:37:26 2022 +0200 tree-optimization/106514 - revisit m_import compute in backward threading This revisits how we compute imports later used for the ranger path query during backwards threading. The compute_imports function of the path solver ends up pulling the SSA def chain of regular stmts without limit and since it starts with just the gori imports of the path exit it misses some interesting names to translate during path discovery. In fact with a still empty path this compute_imports function looks like not the correct tool. The following instead implements what it does during the path discovery and since we add the exit block we seed the initial imports and interesting names from just the exit conditional. When we then process interesting names (aka imports we did not yet see the definition of) we prune local defs but add their uses in a similar way as compute_imports would have done. compute_imports also is lacking in its walking of the def chain compared to range_def_chain::get_def_chain which for example handles &_1->x specially through range_op_handler and gimple_range_operand1, so the code copies this. A fix for compute_imports will be done separately, also fixing the unbound walk there. The patch also properly unwinds m_imports during the path discovery backtracking and from a debugging session I have verified the two sets evolve as expected now while previously behaving slightly erratic. Fortunately the m_imports set now also is shrunken significantly for the PR69592 testcase (aka PR106514) so that there's overall speedup when increasing --param max-jump-thread-duplication-stmts as 15 -> 30 -> 60 -> 120 from 1s -> 2s -> 13s -> 27s to with the patch 1s -> 2s -> 4s -> 8s. This runs into a latent issue in X which doesn't seem to expect any PHI nodes with a constant argument on an edge inside the path. But we now have those as interesting, for example for the ICEing g++.dg/torture/pr100925.C which just has sth like if (i) x = 1; else x = 5; if (x == 1) ... where we now have the path from if (i) to if (x) and the PHI for x in the set of imports to consider for resolving x == 1 which IMHO looks exactly like what we want. The path_range_query::ssa_range_in_phi papers over the issue and drops the range to varying instead of crashing. I didn't want to mess with this any further in this patch (but I couldn't resist replacing the loop over PHI args with PHI_ARG_DEF_FROM_EDGE, so mind the re-indenting). PR tree-optimization/106514 * tree-ssa-threadbackward.cc (back_threader::find_paths_to_names): Compute and unwind both m_imports and interesting on the fly during path discovery. (back_threader::find_paths): Compute the original m_imports from just the SSA uses of the exit conditional. Drop handling single_succ_to_potentially_threadable_block. * gimple-range-path.cc (path_range_query::ssa_range_in_phi): Handle constant PHI arguments without crashing. Use PHI_ARG_DEF_FROM_EDGE. * gcc.dg/tree-ssa/ssa-thread-19.c: Un-XFAIL. * gcc.dg/tree-ssa/ssa-thread-20.c: New testcase.
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 Richard Biener changed: What|Removed |Added Last reconfirmed||2022-08-10 Ever confirmed|0 |1 Priority|P3 |P2 Status|UNCONFIRMED |NEW
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 --- Comment #7 from Richard Biener --- For the testcase m_imports is so big because we have ... [local count: 1073741824]: # c_1198 = PHI _599 = MEM[(unsigned int *)b_1201(D) + 2792B]; d_2401 = _599 + d_2399; if (d_2399 > d_2401) goto ; [50.00%] else goto ; [50.00%] [local count: 536870913]: c_2402 = c_1198 + 1; [local count: 1073741824]: # c_1199 = PHI _600 = MEM[(unsigned int *)b_1201(D) + 2796B]; d_2403 = _600 + d_2401; if (d_2401 > d_2403) goto ; [50.00%] else goto ; [50.00%] so when back_threader::find_paths does ->compute_imports (.., bb 1200) we walk up the whole d_2403 definition chain unbound (for PHIs we restrict to edges on the path which is empty). I realize that there's no good way to pick up extra imports on the fly cheaply - we could handle it when we prune local defs from the imports at which point we could add operands but it's not clear to me that will be a good trade-off. In fact pruning imports looks suspicious as the final path-range query will be limited there? Likewise for any import we add via PHI-translation we fail to add local def operands - we're only getting those from the initial import compute which basically picks those from blocks dominating the exit but no others. I will experiment with re-wiring this.
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 --- Comment #6 from Richard Biener --- So one now needs to bump the limit to 60 to get enough samples for perf. Then we now see Samples: 55K of event 'cycles:u', Event count (approx.): 49013411833 Overhead Samples Command Shared Object Symbol 51.19% 28195 cc1 cc1 [.] path_range_query::compute_ranges_in_block 11.67% 6427 cc1 cc1 [.] path_range_query::adjust_for_non_null_uses 9.20% 5069 cc1 cc1 [.] path_range_query::range_defined_in_block 3.39% 1869 cc1 cc1 [.] bitmap_set_bit 1.95% 1072 cc1 cc1 [.] back_threader::find_paths_to_names 1.93% 1066 cc1 cc1 [.] bitmap_bit_p the compute_ranges_in_block is also top with 30 but adjust_for_non_null_uses pops up newly with 60. The compute_ranges_in_block slowness is attributed to // ...and then the rest of the imports. EXECUTE_IF_SET_IN_BITMAP (m_imports, 0, i, bi) { tree name = ssa_name (i); Value_Range r (TREE_TYPE (name)); if (gimple_code (SSA_NAME_DEF_STMT (name)) != GIMPLE_PHI && range_defined_in_block (r, name, bb)) plus gori_compute = m_ranger->gori (); bitmap exports = g.exports (bb); EXECUTE_IF_AND_IN_BITMAP (m_imports, exports, 0, i, bi) { tree name = ssa_name (i); Value_Range r (TREE_TYPE (name)); if (g.outgoing_edge_range_p (r, e, name, *this)) for this testcase there seem to be a lot of imports but not many exports so range_defined_in_block is called very many times compared to outgoing_edge_range_p but the latter is comparatively more expensive. For the path query I wonder why we are interested in computing (aka updating the cache) for any but the exports? When we compute the exports, why is the cache not lazily computed just for the interesting names? AFAICS we invalidate all local defs (but even then, why? we get to see a def exactly once, why do we have to even think about clearing sth we should not have seen?) That is, in path_range_query::compute_ranges while (1) { basic_block bb = curr_bb (); compute_ranges_in_block (bb); adjust_for_non_null_uses (bb); if (at_exit ()) break; move_next (); } I'd expect only a small portion of the actual compute_ranges_in_block work to be done for all blocks and the real resolving work only for the block ending the path? Maybe the backwards threader is just using the wrong (expensive) API here? It does m_solver->compute_ranges (path, m_imports); m_solver->range_of_stmt (r, cond); -- Btw, I wondered if path-range-query can handle parts of the path being a "black box", aka, skip to the immediate dominator instead of one of the predecessor edges? I _think_ analysis wise this would be quite straight forward but of course we'd have to represent this somehow in the path. Maybe it works by simply leaving out the intermediate blocks? Thus, B |\ A / \ C D \ / E \ the path would be from B to E but we don't care whether we go the C or D way, and when duplicating the path we'd simply duplicate the whole diamond instead of duplicating only one branch, say A->D, and keeping the edge A->C to the original block C, defeating the threading of E to its successor if we happen to go that way.
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 --- Comment #5 from CVS Commits --- The master branch has been updated by Richard Biener : https://gcc.gnu.org/g:409978d58dafa689c5b3f85013e2786526160f2c commit r13-1998-g409978d58dafa689c5b3f85013e2786526160f2c Author: Richard Biener Date: Mon Aug 8 12:20:04 2022 +0200 tree-optimization/106514 - add --param max-jump-thread-paths The following adds a limit for the exponential greedy search of the backwards jump threader. The idea is to limit the search space in a way that the paths considered are the same if the search were in BFS order rather than DFS. In particular it stops considering incoming edges into a block if the product of the in-degrees of blocks on the path exceeds the specified limit. When considering the low stmt copying limit of 7 (or 1 in the size optimize case) this means the degenerate case with maximum search space is a sequence of conditions with no actual code B1 |\ | empty |/ B2 |\ ... Bn |\ GIMPLE_CONDs are costed 2, an equivalent GIMPLE_SWITCH already 4, so we reach 7 already with 3 middle conditions (B1 and Bn do not count). The search space would be 2^4 == 16 to reach this. The FSM threads historically allowed for a thread length of 10 but is really looking for a single multiway branch threaded across the backedge. I've chosen the default of the new parameter to 64 which effectively limits the outdegree of the switch statement (the cases reaching the backedge) to that number (divided by 2 until I add some special pruning for FSM threads due to the loop header indegree). The testcase ssa-dom-thread-7.c requires 56 at the moment (as said, some special FSM thread pruning of considered edges would bring it down to half of that), but we now get one more threading and quite some more in later threadfull. This testcase seems to be difficult to check for expected transforms. The new testcases add the degenerate case we currently thread (without deciding whether that's a good idea ...) plus one with an approripate limit that should prevent the threading. This obsoletes the mentioned --param max-fsm-thread-length but I am not removing it as part of this patch. When the search space is limited the thread stmt size limit effectively provides max-fsm-thread-length. The param with its default does not help PR106514 enough to unleash path searching with the higher FSM stmt count limit. PR tree-optimization/106514 * params.opt (max-jump-thread-paths): New. * doc/invoke.texi (max-jump-thread-paths): Document. * tree-ssa-threadbackward.cc (back_threader::find_paths_to_names): Honor max-jump-thread-paths, take overall_path argument. (back_threader::find_paths): Pass 1 as initial overall_path. * gcc.dg/tree-ssa/ssa-thread-16.c: New testcase. * gcc.dg/tree-ssa/ssa-thread-17.c: Likewise. * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust.
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 --- Comment #4 from CVS Commits --- The master branch has been updated by Aldy Hernandez : https://gcc.gnu.org/g:47964e766270f349f5b171bcd68ff7c1e60d85d8 commit r13-1973-g47964e766270f349f5b171bcd68ff7c1e60d85d8 Author: Aldy Hernandez Date: Fri Aug 5 08:04:10 2022 +0200 Inline unsupported_range constructor. An unsupported_range temporary is instantiated in every Value_Range for completeness sake and should be mostly a NOP. However, it's showing up in the callgrind stats, because it's not inline. This fixes the oversight. PR tree-optimization/106514 gcc/ChangeLog: * value-range.cc (unsupported_range::unsupported_range): Move... * value-range.h (unsupported_range::unsupported_range): ...here. (unsupported_range::set_undefined): New.
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 --- Comment #3 from CVS Commits --- The master branch has been updated by Andrew Macleod : https://gcc.gnu.org/g:8e34d92ef29a175b84cc7f5185db43656ae762bb commit r13-1965-g8e34d92ef29a175b84cc7f5185db43656ae762bb Author: Andrew MacLeod Date: Thu Aug 4 12:22:59 2022 -0400 Loop over intersected bitmaps. compute_ranges_in_block loops over the import list and then checks the same bit in exports. It is nmore efficent to loop over the intersection of the 2 bitmaps. PR tree-optimization/106514 * gimple-range-path.cc (path_range_query::compute_ranges_in_block): Use EXECUTE_IF_AND_IN_BITMAP to loop over 2 bitmaps.
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 --- Comment #1 from Andrew Macleod --- (In reply to Richard Biener from comment #0) > > Part of the sub-optimality is probably the equiv chain becoming very long > (can we simply limit that?) and clearing bits in all the very many > bitmaps linked. N we certainly could. Especially in the path version which has that killing-def issue which doesn't exist in the normal oracle. The path oracle basically takes a normal oracle, then bolts the path following code onto it, and has to deal with new defintions invalidating any existing equivalences. I'd first look for inefficiencies elsewhere as we didnt spend a lot of time tweaking it once it was working. Given the way equivalences have to be matched, Im not sure that we even need to walk the list. The new equivalence set for the killed def will only contain itself, or any new equivalences encountered since the kill. In order to be equivalent, 2 names must be in each others set, which they won't be. I'm not convinced we need to remove them at all. Im also not sure why the path oracle changes the root oracle requirement that they be the same equivalence set, not just in each others. I think it has something to do with the transitory nature of the path equivalence/relations vs the root oracles "permanent" sets. I think we can do better here too. And finally, Aldy has a list of all the ssa-names in the path that are relevant to the calculations in the path. I suspect we can reduce any equivalence sets immediately to just those names, as any on-entry ranges should reflect existing equivalences. in theory :-) We'll see if any or all of those have any effect and get back to you.
[Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514 Richard Biener changed: What|Removed |Added Target Milestone|--- |12.2 CC||amacleod at redhat dot com Keywords||compile-time-hog