On Fri, Apr 18, 2014 at 12:27 PM, Jan Hubicka <[email protected]> wrote:
>> What I've observed on power is that LTO alone reduces performance and
>> LTO+FDO is not significantly different than FDO alone.
>
> On SPEC2k6?
>
> This is quite surprising; for our (well, SUSE's) SPEC testers (AMD64), LTO
> seems an off-noise win on SPEC2k6:
> http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/recent.html
> http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006/recent.html
>
> I do not see why PPC should be significantly more constrained by register
> pressure.
>
> I do not have a head-to-head comparison of FDO and FDO+LTO for SPEC;
> http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006-patched-FDO/index.html
> shows a noticeable drop in calculix and gamess.
> Martin profiled calculix and tracked it down to a loop that is not trained
> but is hot in the reference run. That makes it optimized for size.
>
> http://dromaeo.com/?id=219677,219672,219965,219877
> compares Firefox's dromaeo runs with the default build, LTO, FDO, and
> LTO+FDO. Here the benefits of LTO and FDO seem to add up nicely.
>
>> I agree that an exact estimate of the register pressure would be a
>> difficult problem. I'm hoping that something that approximates potential
>> register pressure downstream will be sufficient to help inlining
>> decisions.
>
> Yep, register pressure and I-cache overhead estimates are used for inline
> decisions by some compilers.
>
> I am mostly concerned about the metric suffering from the GIGO principle if
> we mix together too many estimates that are somewhat wrong by their nature.
> This is why I mostly tried to focus on size/time estimates and not add too
> many other metrics. But perhaps it is time to experiment with these, since
> we have obviously pushed the current infrastructure mostly to its limits.
>
I like the word GIGO here. Getting inline signals right requires deep
analysis (including interprocedural analysis). Different signals/hints may
also come with different quality and thus different weights. Another
challenge is how to quantify cycle savings/overhead more precisely. With
that, we could abandon the threshold-based scheme -- any callsite with a
net saving would be considered.

David

> Honza
>>
>> Aaron
>>
>> On Fri, 2014-04-18 at 10:36 -0700, Xinliang David Li wrote:
>> > Do you witness similar problems with LTO+FDO?
>> >
>> > My concern is that it can be tricky to get the register pressure
>> > estimate right. The register pressure problem is created by downstream
>> > components (code motion etc.) but only exposed by the inliner. If you
>> > want to get it 'right' (i.e., not exposing the problems), you will
>> > need to bake knowledge of the downstream components (possibly bugs)
>> > into the analysis, which might not be a good thing to do longer term.
>> >
>> > David
>> >
>> > On Fri, Apr 18, 2014 at 9:43 AM, Aaron Sawdey
>> > <[email protected]> wrote:
>> > > Honza,
>> > >   Seeing your recent patches relating to inliner heuristics for LTO,
>> > > I thought I should mention some related work I'm doing.
>> > >
>> > > By way of introduction, I've recently joined the IBM LTC's PPC
>> > > Toolchain team, working on gcc performance.
>> > >
>> > > We have not generally seen good results using LTO on IBM power
>> > > processors, and one of the problems seems to be excessive inlining
>> > > that results in the generation of excessive spill code. So, I have
>> > > set out to tackle this by doing some analysis at the time of the
>> > > inliner pass to compute something analogous to register pressure,
>> > > which is then used to shut down inlining of routines that have a lot
>> > > of pressure.
>> > >
>> > > The analysis is basically a liveness analysis on the SSA names per
>> > > basic block, looking for the maximum number live in any block.
>> > > I've been using "liveness pressure" as a shorthand name for this.
>> > >
>> > > This can then be used in two ways.
>> > > 1) want_inline_function_to_all_callers_p at present always says to
>> > > inline things that have only one call site, without regard to size or
>> > > what this may do to the register allocator downstream. In particular,
>> > > BZ2_decompress in bzip2 gets inlined, and this causes the pressure
>> > > reported downstream for the int register class to increase 10x.
>> > > Looking at some combination of pressure in caller/callee may help
>> > > avoid this kind of situation.
>> > > 2) I also want to experiment with adding the liveness pressure in the
>> > > callee into the badness calculation in edge_badness used by
>> > > inline_small_functions. The idea here is to try to inline first the
>> > > functions that are less likely to cause register allocator difficulty
>> > > downstream.
>> > >
>> > > I am just at the point of getting a prototype working; I will get a
>> > > patch you could take a look at posted next week. In the meantime, do
>> > > you have any comments or feedback?
>> > >
>> > > Thanks,
>> > >    Aaron
>> > >
>> > > --
>> > > Aaron Sawdey, Ph.D.  [email protected]
>> > > 050-2/C113  (507) 253-7520  home: 507/263-0782
>> > > IBM Linux Technology Center - PPC Toolchain
>> > >
>> >
>>
>> --
>> Aaron Sawdey, Ph.D.  [email protected]
>> 050-2/C113  (507) 253-7520  home: 507/263-0782
>> IBM Linux Technology Center - PPC Toolchain
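The "liveness pressure" idea Aaron describes is classic backward liveness
dataflow, taking the peak live-set size as a pressure proxy. The sketch below
is only an illustration of that idea, not the prototype itself: the CFG
encoding, the `liveness_pressure` helper, and the per-block approximation
(the larger of the live-in and live-out counts) are all assumptions made for
the example; a real implementation would track liveness per SSA statement.

```python
from collections import defaultdict

def liveness_pressure(blocks, succs):
    """blocks: {name: (defs, uses)}; succs: {name: [successor names]}.
    Returns a rough pressure estimate: the largest number of names live
    on entry to or exit from any single block."""
    live_in = defaultdict(set)
    live_out = defaultdict(set)
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for b, (defs, uses) in blocks.items():
            out = set().union(*(live_in[s] for s in succs.get(b, [])))
            inn = uses | (out - defs)    # live-in = use ∪ (live-out − def)
            if inn != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b] = inn, out
                changed = True
    return max(max(len(live_in[b]), len(live_out[b])) for b in blocks)

# Diamond CFG: entry defines a and b; each arm uses one and defines c;
# exit uses c.  Pressure peaks at 2 (a and b both live across the branch).
blocks = {
    "entry": ({"a", "b"}, set()),
    "then":  ({"c"}, {"a"}),
    "else":  ({"c"}, {"b"}),
    "exit":  (set(), {"c"}),
}
succs = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"]}
print(liveness_pressure(blocks, succs))  # → 2
```

Inlining a callee then amounts to merging its blocks into the caller's CFG
and re-running the same analysis, which is where the BZ2_decompress-style
blowup would show up before the register allocator ever sees it.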

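Aaron's second idea, biasing the inline ordering by callee pressure, can be
pictured with a toy priority queue. This is not GCC's actual edge_badness
formula; the growth/frequency ratio, the pressure cap, the penalty shape, and
all the numbers below are invented for illustration. The point is only that,
between two call sites with similar size growth and heat, the one whose
callee carries less liveness pressure gets inlined first.

```python
import heapq

def edge_badness(growth, freq, callee_pressure, pressure_cap=32):
    """Toy badness: size growth per unit of call frequency, scaled up
    when the callee's liveness pressure exceeds a cap (lower is better)."""
    penalty = 1.0 + max(0, callee_pressure - pressure_cap) / pressure_cap
    return (growth / max(freq, 1e-9)) * penalty

# (call site, size growth, call frequency, callee liveness pressure)
edges = [
    ("hot_small",  20, 1000.0,  8),   # small, hot, low pressure
    ("hot_spilly", 20, 1000.0, 96),   # same size and heat, high pressure
    ("cold_small", 20,    1.0,  8),   # cold call site
]
heap = [(edge_badness(g, f, p), name) for name, g, f, p in edges]
heapq.heapify(heap)
while heap:                            # inline in order of increasing badness
    print(heapq.heappop(heap)[1])      # hot_small, hot_spilly, cold_small
```

With this shape of penalty, pressure only reorders otherwise-comparable
candidates rather than vetoing them outright, which keeps the heuristic from
overreacting when the pressure estimate itself is noisy (the GIGO concern
raised above).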