LTO inliner -- sensitivity to increasing register pressure
Honza,
  Seeing your recent patches relating to inliner heuristics for LTO, I thought I should mention some related work I'm doing.

By way of introduction, I've recently joined the IBM LTC's PPC Toolchain team, working on gcc performance. We have not generally seen good results using LTO on IBM power processors, and one of the problems seems to be excessive inlining that results in the generation of excessive spill code. So, I have set out to tackle this by doing some analysis at the time of the inliner pass to compute something analogous to register pressure, which is then used to shut down inlining of routines that have a lot of pressure.

The analysis is basically a liveness analysis on the SSA names per basic block, looking for the maximum number live in any block. I've been using "liveness pressure" as a shorthand name for this.

This can then be used in two ways.

1) want_inline_function_to_all_callers_p at present always says to inline things that have only one call site, without regard to size or what this may do to the register allocator downstream. In particular, BZ2_decompress in bzip2 gets inlined, and this causes the pressure reported downstream for the int register class to increase 10x. Looking at some combination of pressure in caller/callee may help avoid this kind of situation.

2) I also want to experiment with adding the liveness pressure in the callee into the badness calculation in edge_badness used by inline_small_functions. The idea here is to first inline functions that are less likely to cause register allocator difficulty downstream.

I am just at the point of getting a prototype working; I will get a patch you could take a look at posted next week. In the meantime, do you have any comments or feedback?

Thanks,
   Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Re: LTO inliner -- sensitivity to increasing register pressure
Hello,

> We have not generally seen good results using LTO on IBM power
> processors and one of the problems seems to be excessive inlining that
> results in the generation of excessive spill code. So, I have set out
> to tackle this by doing some analysis at the time of the inliner pass
> to compute something analogous to register pressure, which is then used
> to shut down inlining of routines that have a lot of pressure.

This is interesting. I sort of planned to add register pressure logic, but always thought it is somewhat hard to do at the GIMPLE level in a way that would work for all CPUs.

> The analysis is basically a liveness analysis on the SSA names per
> basic block and looking for the maximum number live in any block. I've
> been using liveness pressure as a shorthand name for this.

I believe this is usually called width.

> 2) I also want to experiment with adding the liveness pressure in the
> callee into the badness calculation in edge_badness used by
> inline_small_functions. The idea here is to try to inline functions
> that are less likely to cause register allocator difficulty downstream
> first.

Sounds interesting. I am very curious whether you can get consistent improvements with this. I only implemented logic for large stack frames, but in C++ code it often seems to do more harm than good.
If you find examples of bad inlining, can you also file them in bugzilla? Perhaps the individual cases could be handled better by improving IRA.

Honza
Re: LTO inliner -- sensitivity to increasing register pressure
Do you witness similar problems with LTO+FDO?

My concern is that it can be tricky to get the register pressure estimate right. The register pressure problem is created by downstream components (code motion etc.) but only exposed by the inliner. If you want to get it 'right' (i.e., not exposing the problems), you will need to bake knowledge of the downstream components (possibly bugs) into the analysis, which might not be a good thing to do longer term.

David

On Fri, Apr 18, 2014 at 9:43 AM, Aaron Sawdey acsaw...@linux.vnet.ibm.com wrote:
> Honza,
> Seeing your recent patches relating to inliner heuristics for LTO, I
> thought I should mention some related work I'm doing.
Re: LTO inliner -- sensitivity to increasing register pressure
On Fri, Apr 18, 2014 at 10:26 AM, Jan Hubicka hubi...@ucw.cz wrote:
> If you find examples of bad inlining, can you also file them in
> bugzilla? Perhaps the individual cases could be handled better by
> improving IRA.
Yes -- I think this is the right time to do this regardless.

David
Re: LTO inliner -- sensitivity to increasing register pressure
On Fri, 2014-04-18 at 19:26 +0200, Jan Hubicka wrote:
> This is interesting. I sort of planned to add register pressure logic,
> but always thought it is somewhat hard to do at the GIMPLE level in a
> way that would work for all CPUs.

Yes, this is just meant to measure something representative of register pressure. Different architectures would probably want different thresholds for this.

> If you find examples of bad inlining, can you also file them in
> bugzilla? Perhaps the individual cases could be handled better by
> improving IRA.

Yes, I can do that for the case that's happening in bzip2.

  Aaron
Re: LTO inliner -- sensitivity to increasing register pressure
What I've observed on power is that LTO alone reduces performance, and LTO+FDO is not significantly different from FDO alone.

I agree that an exact estimate of the register pressure would be a difficult problem. I'm hoping that something that approximates potential register pressure downstream will be sufficient to help inlining decisions.

  Aaron

On Fri, 2014-04-18 at 10:36 -0700, Xinliang David Li wrote:
> Do you witness similar problems with LTO+FDO?
Re: LTO inliner -- sensitivity to increasing register pressure
> What I've observed on power is that LTO alone reduces performance and
> LTO+FDO is not significantly different than FDO alone.

On SPEC2k6? This is quite surprising; for our (well, SUSE's) SPEC testers (AMD64), LTO seems an off-noise win on SPEC2k6:
http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/recent.html
http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006/recent.html

I do not see why PPC should be significantly more constrained by register pressure. I do not have a head-to-head comparison of FDO and FDO+LTO for SPEC;
http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006-patched-FDO/index.html
shows a noticeable drop in calculix and gamess. Martin profiled calculix and tracked it down to a loop that is not trained but is hot in the reference run; that makes it optimized for size.

http://dromaeo.com/?id=219677,219672,219965,219877
compares Firefox's dromaeo runs with default build, LTO, FDO and LTO+FDO. Here the benefits of LTO and FDO seem to add up nicely.

> I agree that an exact estimate of the register pressure would be a
> difficult problem. I'm hoping that something that approximates
> potential register pressure downstream will be sufficient to help
> inlining decisions.

Yep, register pressure and I-cache overhead estimates are used for inline decisions by some compilers. I am mostly concerned about the metric suffering from the GIGO principle if we mix together too many estimates that are somewhat wrong by their nature. This is why I have mostly tried to focus on size/time estimates and not add too many other metrics. But perhaps it is time to experiment with these, since we have obviously pushed the current infrastructure mostly to its limits.

Honza
Re: LTO inliner -- sensitivity to increasing register pressure
> What I've observed on power is that LTO alone reduces performance and
> LTO+FDO is not significantly different than FDO alone.
>
> I agree that an exact estimate of the register pressure would be a
> difficult problem. I'm hoping that something that approximates
> potential register pressure downstream will be sufficient to help
> inlining decisions.

One (orthogonal) way to deal with this problem would also be to disable inlining of functions called once when the edge frequency is low, i.e. adding to check_callers something like

  edge->frequency < CGRAPH_FREQ_BASE / 2

if you want to disqualify all calls that have only a 50% chance of being executed during a function invocation. Does something like that help in your cases? It would help in the case Linus complained about:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

The difficulty here is that disabling inlines on not-so-important paths may prevent SRA and other optimizations, so it may in turn also penalize the hot path. I saw this in some cases where EH cleanup code was optimized for size. Perhaps SRA can also be extended to handle cases where non-SRAable code is on a cold path?

Honza
Re: LTO inliner -- sensitivity to increasing register pressure
On Fri, Apr 18, 2014 at 12:27 PM, Jan Hubicka hubi...@ucw.cz wrote:
> Yep, register pressure and I-cache overhead estimates are used for
> inline decisions by some compilers. I am mostly concerned about the
> metric suffering from the GIGO principle if we mix together too many
> estimates that are somewhat wrong by their nature.

I like the word GIGO here. Getting inline signals right requires deep analysis (including interprocedural analysis). Different signals/hints may also come with different quality, and thus different weights.
Another challenge is how to quantify cycle savings/overhead more precisely. With that, we can abandon the threshold-based scheme -- any callsite with a net saving will be considered.

David
Re: LTO inliner -- sensitivity to increasing register pressure
On Fri, Apr 18, 2014 at 12:51 PM, Jan Hubicka hubi...@ucw.cz wrote:
> The difficulty here is that disabling inlines on not-so-important paths
> may prevent SRA and other optimizations, so it may in turn also
> penalize the hot path. I saw this in some cases where EH cleanup code
> was optimized for size.

Yes. The callsite may be cold, but the profile-scaled callee body may still be hot. Inlining the callee allows more context to be passed. Similarly for hot callers, cold callees may also expose more information (e.g., better alias info) to the caller.

David

> Perhaps SRA can also be extended to handle cases where non-SRAable code
> is on a cold path?
Re: LTO inliner -- sensitivity to increasing register pressure
> Another challenge is how to quantify cycle savings/overhead more
> precisely. With that, we can abandon the threshold based scheme -- any
> callsite with a net saving will be considered.

Inline hints are intended to do this -- at the moment we bump the limits up when we estimate big speedups for the inlining, and with today's patch and FDO we bypass the thresholds when we know from FDO that the call matters.

Concerning your other email, indeed we should consider heavy callees (in Open64 terminology) that consume a lot of time, and not skip those call sites. An easy way would be to replace the maybe_hot_edge predicate by a maybe_hot_call that simply multiplies the count and estimated time. (We probably ought to get rid of the time capping and use wider arithmetic, too.) I wonder if that is not too local, and whether we should instead try to estimate the cumulative time of the function and get more aggressive on inlining over the whole path leading to hot code.

Honza
Re: LTO inliner -- sensitivity to increasing register pressure
On Fri, Apr 18, 2014 at 2:16 PM, Jan Hubicka hubi...@ucw.cz wrote:
> Inline hints are intended to do this -- at the moment we bump the
> limits up when we estimate big speedups for the inlining.
> Concerning your other email, indeed we should consider heavy callees
> (in Open64 terminology) that consume a lot of time, and not skip those
> call sites. An easy way would be to replace the maybe_hot_edge
> predicate by a maybe_hot_call that simply multiplies the count and
> estimated time.

That's what we did in the Google branches. We had two heuristics -- hot caller and hot callee:

1) For the hot caller heuristic, other simple analysis is checked: a) global working set size; b) a callsite argument check -- a very simple check to guess whether inlining this callsite would sharpen analysis.

2) We had not tuned the hot callee heuristic by doing more analysis -- simply turning it on using hotness does not make a noticeable difference. Other hints are needed.

David