On October 18, 2018 11:05:32 PM GMT+02:00, Richard Sandiford 
<richard.sandif...@arm.com> wrote:
>Richard Biener <rguent...@suse.de> writes:
>> On Thu, 18 Oct 2018, Richard Sandiford wrote:
>>
>>> Richard Biener <rguent...@suse.de> writes:
>>> > PR63155 made me pick up this old work from Steven, it turns our
>>> > linked-list implementation to a two-mode one with one being a
>>> > splay tree featuring O(log N) complexity for find/remove.
>>> >
>>> > Over Stevens original patch I added a bitmap_tree_to_vec helper
>>> > that I use from the debug/print methods to avoid changing view
>>> > there.  In theory the bitmap iterator could get a "stack"
>>> > as well and we could at least support EXECUTE_IF_SET_IN_BITMAP.
>>> >
>>> > This can be used to fix the two biggest bottlenecks in the PRs
>>> > testcase, namely SSA propagator worklist handling and out-of-SSA
>>> > coalesce list building.  perf shows the following data, first
>>> > unpatched, second patched - also watch the thrid coulumn (samples)
>>> > when comparing percentages.
>>> >
>>> > -O0
>>> > -   18.19%    17.35%           407  cc1      cc1               [.]
>bitmap_set_b▒
>>> >    - bitmap_set_bit                                               
>            ▒
>>> >       + 8.77% create_coalesce_list_for_region                     
>            ▒
>>> >       + 4.21% calculate_live_ranges                               
>            ▒
>>> >       + 2.02% build_ssa_conflict_graph                            
>            ▒
>>> >       + 1.66% insert_phi_nodes_for                                
>            ▒
>>> >       + 0.86% coalesce_ssa_name                      
>>> > patched:
>>> > -   12.39%    10.48%           129  cc1      cc1               [.]
>bitmap_set_b▒
>>> >    - bitmap_set_bit                                               
>            ▒
>>> >       + 5.27% calculate_live_ranges                               
>            ▒
>>> >       + 2.76% insert_phi_nodes_for                                
>            ▒
>>> >       + 1.90% create_coalesce_list_for_region                     
>            ▒
>>> >       + 1.63% build_ssa_conflict_graph                            
>            ▒
>>> >       + 0.35% coalesce_ssa_name                           
>>> >
>>> > -O1
>>> > -   17.53%    17.53%           842  cc1      cc1               [.]
>bitmap_set_b▒
>>> >    - bitmap_set_bit                                               
>            ▒
>>> >       + 12.39% add_ssa_edge                                       
>            ▒
>>> >       + 1.48% create_coalesce_list_for_region                     
>            ▒
>>> >       + 0.82% solve_constraints                                   
>            ▒
>>> >       + 0.71% calculate_live_ranges                               
>            ▒
>>> >       + 0.64% add_implicit_graph_edge                             
>            ▒
>>> >       + 0.41% insert_phi_nodes_for                                
>            ▒
>>> >       + 0.34% build_ssa_conflict_graph                      
>>> > patched:
>>> > -    5.79%     5.00%           167  cc1      cc1               [.]
>bitmap_set_b▒
>>> >    - bitmap_set_bit                                               
>            ▒
>>> >       + 1.41% add_ssa_edge                                        
>            ▒
>>> >       + 0.88% calculate_live_ranges                               
>            ▒
>>> >       + 0.75% add_implicit_graph_edge                             
>            ▒
>>> >       + 0.68% solve_constraints                                   
>            ▒
>>> >       + 0.48% insert_phi_nodes_for                                
>            ▒
>>> >       + 0.45% build_ssa_conflict_graph                   
>>> >
>>> > -O3
>>> > -   12.37%    12.34%          1145  cc1      cc1               [.]
>bitmap_set_b▒
>>> >    - bitmap_set_bit                                               
>            ▒
>>> >       + 9.14% add_ssa_edge                                        
>            ▒
>>> >       + 0.80% create_coalesce_list_for_region                     
>            ▒
>>> >       + 0.69% add_implicit_graph_edge                             
>            ▒
>>> >       + 0.54% solve_constraints                                   
>            ▒
>>> >       + 0.34% calculate_live_ranges                               
>            ▒
>>> >       + 0.27% insert_phi_nodes_for                                
>            ▒
>>> >       + 0.21% build_ssa_conflict_graph                 
>>> > -    4.36%     3.86%           227  cc1      cc1               [.]
>bitmap_set_b▒
>>> >    - bitmap_set_bit                                               
>            ▒
>>> >       + 0.98% add_ssa_edge                                        
>            ▒
>>> >       + 0.86% add_implicit_graph_edge                             
>            ▒
>>> >       + 0.64% solve_constraints                                   
>            ▒
>>> >       + 0.57% calculate_live_ranges                               
>            ▒
>>> >       + 0.32% build_ssa_conflict_graph                            
>            ▒
>>> >       + 0.29% mark_all_vars_used_1                                
>            ▒
>>> >       + 0.20% insert_phi_nodes_for                                
>            ▒
>>> >       + 0.16% create_coalesce_list_for_region             
>>> >
>>> >
>>> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>>> 
>>> What's the performance like for more reasonable testcases?  E.g. is
>there
>>> a measurable change either way in --enable-checking=release for some
>gcc
>>> .iis at -g or -O2 -g?
>>
>> I did a quick check on my set of cc1files (still .i, .ii ones tend to
>> be unusable quite quickly...).  Most of them compile too quickly
>> to make any difference appear other than noise.  Multi-second
>differences
>> like for PR63155 should be the exception but our O(n) bitmap
>> implementation really makes some parts of GCC quadratic where it
>> doesn't appear so.
>>
>> Is there a reason you expect it to be ever slower?
>
>During recent stage3s I've tried to look at profiles of cc1plus
>to see whether there was something easy we could do to improve
>compile times.  And bitmap operations always showed up near the
>top of the profile.  There were always some pathological queries
>in which the linear search really hurt, but whenever I tried "simple"
>ways to avoid the obvious cases, they made those queries quicker
>but slowed things down overall.  It seemed that adding any extra logic
>to the queries hurted because even a small slowdown in common lookups
>overwhelmed a big saving in less common lookups.

Yeah. I also noticed some 'obvious' shortcomings in the heuristics...
I guess in the end well predicted branches in the out of line code are 
important... 

>
>But there again I was looking to speed up more "normal" cases, not ones
>like the PR.
>
>FWIW I've tried it on a local x86_64 box and it slows down current
>optabs.ii at -O2 -g by ~0.35% (reproducable).  I see similar slowdowns
>for the other files I tried.  But that's hardly a huge amount, and
>probably a price worth paying for the speed-up in the PR.

Just to make sure what to reproduce - this is with checking disabled? And with 
or without the hunks actually making use of the splay tree path? 

Richard. 

>Thanks,
>Richard

Reply via email to