On 08/11/2016 08:02 AM, Jan Hubicka wrote:
Hi,
this patch adds early jump threading pass.  Jump threading is one of most
common cases where estimated profile becomes corrupted, because the branches
are predicted independently beforehand. This patch simply adds early mode to
jump threader that does not permit code growth and thus only win/win threads
are done before profile is constructed.

Naturally this also improves IPA decisions because code size/time is estimated
more acurately.
Excellent. One of the goals here was to enable you to run it early, so I'm glad to see it's working out.


It is not very cool to add another pass and the jump threader is already
run 5 times. I think incrementally we could drop one of late threaders at least.
I tried to measure compile time effects but they are in wash. Moreover the patch
pays for itself in cc1plus code size:
Most definitely we want to be dropping calls into the later passes. There's analysis to do, but the goal is to drop the old style threading from DOM/VRP completely. It may also be the case that one or more passes of the backwards/FSM threader can be avoided, but we should look at removing the DOM/VRP threading first.


Before patch to tweak size estimates: 27779964
Current mainline:                     27748900
With patch applied:                   27716173

So despite adding few functions the code size effect is positive which I think
is quite nice.

Given the fact that jump threading seems quite lightweit, I wonder why it is
guarded by flag_expensive_optimizations? Is it expensive in terms of code
size?
The DOM/VRP jump threading used to be very expensive because of iteration and the ssa updates. Both of those issues have since been addressed. The backwards/FSM threader is based on an algorithm that IIRC is cubic, and we also have some implementation issues that cause it to use far more time than it should.

However, in an early mode, we don't want to walk backwards through the CFG very far (if at all) and for an early mode we may not want to guard on flag_expensive_optimizations.


The effectivity of individual threading passes is as follows (for tramp3d)

      mainline                              with patch
pass  thread count     profile mismatches   thread count    profile mismatch
early                                       525
1     1853             1900                 316             337
2     4                812                  4               288
3     24               1450                 32              947
4     76               1457                 75              975

So at least tramp3d data seems to suggest that we can drop the second occurence
of jump threading and that most of the job thread1 does can be done by the
size restricted early version (the lower absolute counts are caused by the
fact that threadable paths gets duplicated by the inliner). 50% drop in
profile distortion is not bad. I wonder why basically every threaded paths seems
to introduce a mismatch.
I don't think the backwards/FSM threader tries to update the profile data at all right now.


The patch distorts testusite somewhat, in most cases we only want to update
dump files or disable early threading:

+XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
+XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
+FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
+FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 
1
+FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around 
loop and would copy too many statements"
+FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps 
threaded: 1" 1
+FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
+FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 
1" 1
+FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
+FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, 1\\\\]"

This testcase is the now probably unnecesary heuristics (it may still be
relevant in cases we do not thread because of size bounds but its effectivity
may not be worth the maintenance cost):

+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate 
"loop exit heuristics of edge[^:]*:" 3

If the patch seems acceptable, I will do the updates. One option why I did
not do that is that it seems to be now posisble to pass parameters to passes
from passes.def, so perhaps we do not need early_thread_jumps, but doing so is
consistent with way we handle other early passes.

Bootstrapped/regtested x86_64-linux
Honza

        * passes.def (pass_early_thread_jumps): Schedule after forwprop.
        * tree-pass.h (make_pass_early_thread_jumps): Declare.
        * tree-ssa-threadbackward.c (fsm_find_thread_path,
        fsm_find_thread_path, profitable_jump_thread_path,
        fsm_find_control_statement_thread_paths,
        find_jump_threads_backwards): Add speed_p parameter.
        (pass_data_early_thread_jumps): New pass.
        (make_pass_early_thread_jumps): New function.
I'll try to take a look at the details shortly.

jeff

Reply via email to