https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478
Bug ID: 65478 Summary: crafty performance regression Product: gcc Version: 5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org As reported to me privately by Igor, crafty SPEC2k benchmark is slower since r219863 which decreased inline-unit-growth. I looked into the reason and it is the fact that we do not inline NextMove function. The function itself is big: Inline summary for NextMove/18 inlinable self time: 177 global time: 0 self size: 284 global size: 0 min size: 0 self stack: 0 global stack: 0 size:228.000000, time:150.114000, predicate:(true) size:3.000000, time:2.000000, predicate:(not inlined) size:4.000000, time:0.649000, predicate:(op0 changed) size:4.000000, time:5.946000, predicate:(op1 changed) size:2.000000, time:1.486000, predicate:(op1 == 0) size:2.000000, time:1.486000, predicate:(op1 != 0) One quite wrong estiamte I see is the following: switch (_45) <default: <L90>, case 1: <L0>, case 2: <L4>, case 3: <L35>, case 4: <L40>, case 5: <L43>, case 6: <L50>, case 8: <L51>, case 9: <L68>, case 10: <L92>> freq:1.00 size: 20 time: 6 Accounting size:20.00, time:6.00 on predicate:(true) This assumes jump tree implementation of switch. We should discover dense switch statements and estimate the jumptable. The function overall is estimated as relatively uncool for inlining Considering NextMove/2405 with 284 size to be inlined into Search.constprop/4352 in unknown:-1 Estimated badness is -0.081348, frequency 0.69. Badness calculation for Search.constprop/4352 -> NextMove/2405 size growth 273, time 174 inline hints: cross_module array_index -0.040674: guessed profile. frequency 0.694000, count 0 caller count 0 time w/o inlining 397.838000, time w inlining 386.734000 overall growth 277 (current) 0 (original) Adjusted by hints -0.081348 not inlinable: Search.constprop/4352 -> NextMove/2405, --param inline-unit-growth limit reached I am not 100% sure what makes it interesting, though it sounds natural as it is part of the innermost loop. The function is called once, but because ipa-cp decides to clone Search function we turn it into function called twice: Estimating effects for Search/3356, base_time: 245. Estimating body: Search/3356 Known to be false: size:725 time:245 - estimates for value -32768 for param #0: time_benefit: 1, size: 725 Estimating body: Search/3356 Known to be false: op4 > 62, op4 changed size:707 time:242 - estimates for value 2 for param #4: time_benefit: 52, size: 707 Estimating body: Search/3356 Known to be false: size:725 time:245 - estimates for value 0 for param #5: time_benefit: 1, size: 725 Estimating body: Search/3356 Known to be false: size:725 time:245 - estimates for value 1 for param #5: time_benefit: 1, size: 725 Evaluating opportunities for Search/3356. - considering value 2 for param #4 (caller_count: 3) good_cloning_opportunity_p (time: 52, size: 707, freq_sum: 11298) -> evaluation: 830, threshold: 500 Creating a specialized node of Search/3356. adding an extra known scalar value 1 for param #5 replacing param #4 with const 2 replacing param #5 with const 1 Accounting size:3.00, time:2.00 on new predicate:(not inlined) Accounting size:352.50, time:100.61 on new predicate:(op4 <= 62) Accounting size:21.00, time:6.87 on new predicate:(op2 changed) && (op4 <= 62) Accounting size:15.00, time:1.57 on new predicate:(op2 == 0) && (op4 <= 62) Accounting size:15.00, time:1.65 on new predicate:(op2 != 0) && (op4 <= 62) Accounting size:8.00, time:3.13 on new predicate:(op3 changed) && (op4 <= 62) Accounting size:1.00, time:0.00 on new predicate:(op2 changed) && (op3 <= 239) && (op4 <= 62) Accounting size:5.00, time:0.01 on new predicate:(op3 <= 239) && (op4 <= 62) Accounting size:1.00, time:0.00 on new predicate:(op3 changed) && (op3 > 239) && (op4 <= 62) Accounting size:1.00, time:0.00 on new predicate:(op2 changed) && (op3 > 239) && (op4 <= 62) Accounting size:5.00, time:0.01 on new predicate:(op3 > 239) && (op4 <= 62) Accounting size:41.00, time:0.09 on new predicate:(op3 > 120) && (op4 <= 62) Accounting size:4.00, time:0.00 on new predicate:(op3 changed) && (op3 > 120) && (op4 <= 62) Accounting size:3.00, time:0.00 on new predicate:(op3 <= 179) && (op3 > 120) && (op4 <= 62) Accounting size:2.00, time:0.00 on new predicate:(op3 changed) && (op3 > 179) && (op3 > 120) && (op4 <= 62) Accounting size:3.00, time:0.00 on new predicate:(op3 > 179) && (op3 > 120) && (op4 <= 62) Accounting size:1.00, time:0.28 on new predicate:(op2 == 1) && (op4 <= 62) Accounting size:1.00, time:0.73 on new predicate:(op2 != 1) && (op4 <= 62) Accounting size:0.50, time:0.24 on new predicate:(op4 <= 62) && (not inlined) the new node is Search.constprop/4352. - considering value 1 for param #5 (caller_count: 10) good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) -> evaluation: 1, threshold: 500 good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) -> evaluation: 1, threshold: 500 Evaluating opportunities for Search/3356. - adding an extra caller Search.constprop/4352 of Search.constprop/4352 - adding an extra caller Search.constprop/4352 of Search.constprop/4352 - considering value 1 for param #5 (caller_count: 8) good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) -> evaluation: 1, threshold: 500 good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) -> evaluation: 1, threshold: 500 Inline summary for Search/30 inlinable self time: 245 global time: 0 self size: 725 global size: 0 min size: 0 self stack: 0 global stack: 0 size:0.000000, time:0.000000, predicate:(true) size:3.000000, time:2.000000, predicate:(not inlined) size:2.000000, time:2.000000, predicate:(op4 changed) size:352.500000, time:100.615000, predicate:(op4 <= 62) size:10.000000, time:0.552000, predicate:(op4 changed) && (op4 <= 62) size:21.000000, time:6.867000, predicate:(op2 changed) && (op4 <= 62) size:15.000000, time:1.573000, predicate:(op2 == 0) && (op4 <= 62) size:15.000000, time:1.654000, predicate:(op2 != 0) && (op4 <= 62) size:8.000000, time:3.133000, predicate:(op3 changed) && (op4 <= 62) size:1.000000, time:0.001000, predicate:(op2 changed) && (op3 <= 239) && (op4 <= 62) size:5.000000, time:0.005000, predicate:(op3 <= 239) && (op4 <= 62) size:1.000000, time:0.001000, predicate:(op3 changed) && (op3 > 239) && (op4 <= 62) size:1.000000, time:0.001000, predicate:(op2 changed) && (op3 > 239) && (op4 <= 62) size:5.000000, time:0.005000, predicate:(op3 > 239) && (op4 <= 62) size:5.000000, time:0.073000, predicate:(op4 changed) && (op3 > 120) && (op4 <= 62) size:41.000000, time:0.085000, predicate:(op3 > 120) && (op4 <= 62) size:4.000000, time:0.002000, predicate:(op3 changed) && (op3 > 120) && (op4 <= 62) size:3.000000, time:0.000000, predicate:(op3 <= 179) && (op3 > 120) && (op4 <= 62) size:2.000000, time:0.000000, predicate:(op3 changed) && (op3 > 179) && (op3 > 120) && (op4 <= 62) size:3.000000, time:0.000000, predicate:(op3 > 179) && (op3 > 120) && (op4 <= 62) size:1.000000, time:0.285000, predicate:(op2 == 1) && (op4 <= 62) size:1.000000, time:0.734000, predicate:(op2 != 1) && (op4 <= 62) size:0.500000, time:0.244000, predicate:(op4 <= 62) && (not inlined) size:1.000000, time:0.389000, predicate:(op1 changed) && (op4 > 62) && (not inlined) array index:(op4 changed) I suppose we can do the following - bump inline-unit-growth to 25 so the function is inlined (this is quite ugly) - consider crafty to not be large unit Unit growth for small function inlining: 27793->31749 (14%) - convince ipa-cp to either not duplicate Search or duplicate NextMove too - make arrayindex to be more important than it is - analyze switch better - figure out why inling happens and strenghten analysis. I will continue looking into this tomorrow. Honza