[Bug tree-optimization/65478] New: crafty performance regression

hubicka at gcc dot gnu.org Thu, 19 Mar 2015 15:12:13 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478


            Bug ID: 65478
           Summary: crafty performance regression
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org

As reported to me privately by Igor, crafty SPEC2k benchmark is slower since
r219863 which decreased inline-unit-growth.

I looked into the reason and it is the fact that we do not inline NextMove
function. The function itself is big:

Inline summary for NextMove/18 inlinable                                        
  self time:       177                                                          
  global time:     0                                                            
  self size:       284                                                          
  global size:     0                                                            
  min size:       0                                                             
  self stack:      0                                                            
  global stack:    0                                                            
    size:228.000000, time:150.114000, predicate:(true)                          
    size:3.000000, time:2.000000, predicate:(not inlined)                       
    size:4.000000, time:0.649000, predicate:(op0 changed)                       
    size:4.000000, time:5.946000, predicate:(op1 changed)                       
    size:2.000000, time:1.486000, predicate:(op1 == 0)                          
    size:2.000000, time:1.486000, predicate:(op1 != 0)                          

One quite wrong estiamte I see is the following:

  switch (_45) <default: <L90>, case 1: <L0>, case 2: <L4>, case 3: <L35>, case
4: <L40>, case 5: <L43>, case 6: <L50>, case 8: <L51>, case 9: <L68>, case 10:
<L92>>
                freq:1.00 size: 20 time:  6                                     
                Accounting size:20.00, time:6.00 on predicate:(true)            

This assumes jump tree implementation of switch.  We should discover dense
switch statements and estimate the jumptable.

The function overall is estimated as relatively uncool for inlining
Considering NextMove/2405 with 284 size                                         
 to be inlined into Search.constprop/4352 in unknown:-1                         
 Estimated badness is -0.081348, frequency 0.69.                                
    Badness calculation for Search.constprop/4352 -> NextMove/2405              
      size growth 273, time 174 inline hints: cross_module array_index          
      -0.040674: guessed profile. frequency 0.694000, count 0 caller count 0
time w/o inlining 397.838000, time w inlining 386.734000 overall growth 277
(current) 0 (original)
      Adjusted by hints -0.081348                                               
  not inlinable: Search.constprop/4352 -> NextMove/2405, --param
inline-unit-growth limit reached

I am not 100% sure what makes it interesting, though it sounds natural as it is
part of the innermost loop.

The function is called once, but because ipa-cp decides to clone Search
function we turn it into function called twice:
Estimating effects for Search/3356, base_time: 245.                             
   Estimating body: Search/3356                                                 
   Known to be false:                                                           
   size:725 time:245                                                            
 - estimates for value -32768 for param #0: time_benefit: 1, size: 725          
   Estimating body: Search/3356                                                 
   Known to be false: op4 > 62, op4 changed                                     
   size:707 time:242                                                            
 - estimates for value 2 for param #4: time_benefit: 52, size: 707              
   Estimating body: Search/3356                                                 
   Known to be false:                                                           
   size:725 time:245                                                            
 - estimates for value 0 for param #5: time_benefit: 1, size: 725               
   Estimating body: Search/3356                                                 
   Known to be false:                                                           
   size:725 time:245                                                            
 - estimates for value 1 for param #5: time_benefit: 1, size: 725               

Evaluating opportunities for Search/3356.                                       
 - considering value 2 for param #4 (caller_count: 3)                           
     good_cloning_opportunity_p (time: 52, size: 707, freq_sum: 11298) ->
evaluation: 830, threshold: 500
  Creating a specialized node of Search/3356.                                   
    adding an extra known scalar value 1 for param #5                           
    replacing param #4 with const 2                                             
    replacing param #5 with const 1                                             
                Accounting size:3.00, time:2.00 on new predicate:(not inlined)  
                Accounting size:352.50, time:100.61 on new predicate:(op4 <=
62)
                Accounting size:21.00, time:6.87 on new predicate:(op2 changed)
&& (op4 <= 62)
                Accounting size:15.00, time:1.57 on new predicate:(op2 == 0) &&
(op4 <= 62)
                Accounting size:15.00, time:1.65 on new predicate:(op2 != 0) &&
(op4 <= 62)
                Accounting size:8.00, time:3.13 on new predicate:(op3 changed)
&& (op4 <= 62)
                Accounting size:1.00, time:0.00 on new predicate:(op2 changed)
&& (op3 <= 239) && (op4 <= 62)
                Accounting size:5.00, time:0.01 on new predicate:(op3 <= 239)
&& (op4 <= 62)
                Accounting size:1.00, time:0.00 on new predicate:(op3 changed)
&& (op3 > 239) && (op4 <= 62)
                Accounting size:1.00, time:0.00 on new predicate:(op2 changed)
&& (op3 > 239) && (op4 <= 62)
                Accounting size:5.00, time:0.01 on new predicate:(op3 > 239) &&
(op4 <= 62)
                Accounting size:41.00, time:0.09 on new predicate:(op3 > 120)
&& (op4 <= 62)
                Accounting size:4.00, time:0.00 on new predicate:(op3 changed)
&& (op3 > 120) && (op4 <= 62)
                Accounting size:3.00, time:0.00 on new predicate:(op3 <= 179)
&& (op3 > 120) && (op4 <= 62)
                Accounting size:2.00, time:0.00 on new predicate:(op3 changed)
&& (op3 > 179) && (op3 > 120) && (op4 <= 62)
                Accounting size:3.00, time:0.00 on new predicate:(op3 > 179) &&
(op3 > 120) && (op4 <= 62)
                Accounting size:1.00, time:0.28 on new predicate:(op2 == 1) &&
(op4 <= 62)
                Accounting size:1.00, time:0.73 on new predicate:(op2 != 1) &&
(op4 <= 62)
                Accounting size:0.50, time:0.24 on new predicate:(op4 <= 62) &&
(not inlined)
     the new node is Search.constprop/4352.                                     
 - considering value 1 for param #5 (caller_count: 10)                          
     good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) ->
evaluation: 1, threshold: 500
     good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) ->
evaluation: 1, threshold: 500


Evaluating opportunities for Search/3356.                                       
 - adding an extra caller Search.constprop/4352 of Search.constprop/4352        
 - adding an extra caller Search.constprop/4352 of Search.constprop/4352        
 - considering value 1 for param #5 (caller_count: 8)                           
     good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) ->
evaluation: 1, threshold: 500
     good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) ->
evaluation: 1, threshold: 500


Inline summary for Search/30 inlinable                                          
  self time:       245                                                          
  global time:     0                                                            
  self size:       725                                                          
  global size:     0                                                            
  min size:       0                                                             
  self stack:      0                                                            
  global stack:    0                                                            
    size:0.000000, time:0.000000, predicate:(true)                              
    size:3.000000, time:2.000000, predicate:(not inlined)                       
    size:2.000000, time:2.000000, predicate:(op4 changed)                       
    size:352.500000, time:100.615000, predicate:(op4 <= 62)                     
    size:10.000000, time:0.552000, predicate:(op4 changed) && (op4 <= 62)       
    size:21.000000, time:6.867000, predicate:(op2 changed) && (op4 <= 62)       
    size:15.000000, time:1.573000, predicate:(op2 == 0) && (op4 <= 62)          
    size:15.000000, time:1.654000, predicate:(op2 != 0) && (op4 <= 62)          
    size:8.000000, time:3.133000, predicate:(op3 changed) && (op4 <= 62)        
    size:1.000000, time:0.001000, predicate:(op2 changed) && (op3 <= 239) &&
(op4 <= 62)
    size:5.000000, time:0.005000, predicate:(op3 <= 239) && (op4 <= 62)         
    size:1.000000, time:0.001000, predicate:(op3 changed) && (op3 > 239) &&
(op4 <= 62)
    size:1.000000, time:0.001000, predicate:(op2 changed) && (op3 > 239) &&
(op4 <= 62)
    size:5.000000, time:0.005000, predicate:(op3 > 239) && (op4 <= 62)          
    size:5.000000, time:0.073000, predicate:(op4 changed) && (op3 > 120) &&
(op4 <= 62)
    size:41.000000, time:0.085000, predicate:(op3 > 120) && (op4 <= 62)         
    size:4.000000, time:0.002000, predicate:(op3 changed) && (op3 > 120) &&
(op4 <= 62)
    size:3.000000, time:0.000000, predicate:(op3 <= 179) && (op3 > 120) && (op4
<= 62)
    size:2.000000, time:0.000000, predicate:(op3 changed) && (op3 > 179) &&
(op3 > 120) && (op4 <= 62)
    size:3.000000, time:0.000000, predicate:(op3 > 179) && (op3 > 120) && (op4
<= 62)
    size:1.000000, time:0.285000, predicate:(op2 == 1) && (op4 <= 62)           
    size:1.000000, time:0.734000, predicate:(op2 != 1) && (op4 <= 62)           
    size:0.500000, time:0.244000, predicate:(op4 <= 62) && (not inlined)        
    size:1.000000, time:0.389000, predicate:(op1 changed) && (op4 > 62) && (not
inlined)
  array index:(op4 changed)                                                     

I suppose we can do the following
 - bump inline-unit-growth to 25 so the function is inlined (this is quite
ugly)
 - consider crafty to not be large unit 
   Unit growth for small function inlining: 27793->31749 (14%)
 - convince ipa-cp to either not duplicate Search or duplicate NextMove too
 - make arrayindex to be more important than it is
 - analyze switch better
 - figure out why inling happens and strenghten analysis.

I will continue looking into this tomorrow.
Honza

[Bug tree-optimization/65478] New: crafty performance regression

Reply via email to