LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Aaron Sawdey

Honza,
  Seeing your recent patches relating to inliner heuristics for LTO, I 
thought I should mention some related work I'm doing.


By way of introduction, I've recently joined the IBM LTC's PPC Toolchain 
team, working on gcc performance.


We have not generally seen good results using LTO on IBM power processors 
and one of the problems seems to be excessive inlining that results in the 
generation of excessive spill code. So, I have set out to tackle this by 
doing some analysis at the time of the inliner pass to compute something 
analogous to register pressure, which is then used to shut down inlining of 
routines that have a lot of pressure.


The analysis is basically a liveness analysis on the SSA names, done per 
basic block, looking for the maximum number live in any block. I've been 
using "liveness pressure" as a shorthand name for this.
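Roughly, the computation is a backward live-range scan per block with a
running maximum. The sketch below uses made-up stand-in types, not GCC's
real GIMPLE/SSA API, and ignores PHI nodes for brevity:

  /* Illustrative sketch only -- stand-in types, not GCC internals.
     Each statement records which SSA name indices it defines/uses.  */
  #include <string.h>

  #define MAX_SSA_NAMES 1024

  struct stmt {
    int ndefs, nuses;
    int defs[4], uses[4];
  };

  struct bblock {
    int nstmts;
    struct stmt *stmts;             /* in execution order */
    char live_out[MAX_SSA_NAMES];   /* from a standard dataflow solve */
  };

  /* Maximum number of simultaneously live SSA names inside one block:
     start from live-out, walk statements backward, kill defs, gen uses. */
  static int
  block_width (const struct bblock *bb)
  {
    char live[MAX_SSA_NAMES];
    memcpy (live, bb->live_out, sizeof live);

    int count = 0;
    for (int i = 0; i < MAX_SSA_NAMES; i++)
      count += live[i];
    int max = count;

    for (int s = bb->nstmts - 1; s >= 0; s--)
      {
        const struct stmt *st = &bb->stmts[s];
        for (int i = 0; i < st->ndefs; i++)
          if (live[st->defs[i]]) { live[st->defs[i]] = 0; count--; }
        for (int i = 0; i < st->nuses; i++)
          if (!live[st->uses[i]]) { live[st->uses[i]] = 1; count++; }
        if (count > max)
          max = count;
      }
    return max;
  }

  /* The per-function "liveness pressure" is the max over all blocks.  */
  static int
  function_width (const struct bblock *bbs, int nblocks)
  {
    int max = 0;
    for (int b = 0; b < nblocks; b++)
      {
        int w = block_width (&bbs[b]);
        if (w > max)
          max = w;
      }
    return max;
  }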


This can then be used in two ways.
1) want_inline_function_to_all_callers_p at present always says to inline 
things that have only one call site without regard to size or what this may 
do to the register allocator downstream. In particular, BZ2_decompress in 
bzip2 gets inlined and this causes the pressure reported downstream for the 
int register class to increase 10x. Looking at some combination of pressure 
in caller/callee may help avoid this kind of situation.
2) I also want to experiment with adding the liveness pressure in the 
callee into the badness calculation in edge_badness, used by 
inline_small_functions. The idea here is to try to inline functions that 
are less likely to cause register allocator difficulty downstream first 
(both ideas are sketched below).
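A hedged sketch of both ideas; want_inline_function_to_all_callers_p,
edge_badness and inline_small_functions are real GCC functions, but the
helpers and the threshold below are hypothetical placeholders:

  /* Hypothetical integration sketch -- the helpers and the cap are
     placeholders, not actual GCC code.  */

  #define PRESSURE_CAP 32   /* per-target tuning knob */

  /* 1) Veto "inline because there is a single call site" when the
        combined caller + callee pressure looks dangerous.  */
  static int
  single_call_site_inline_ok_p (int caller_pressure, int callee_pressure)
  {
    return caller_pressure + callee_pressure <= PRESSURE_CAP;
  }

  /* 2) Fold callee pressure into the badness used to order inline
        candidates: in GCC lower badness is inlined first, so scaling
        badness up for high-pressure callees considers them later,
        or not at all.  */
  static long
  pressure_adjusted_badness (long badness, int callee_pressure)
  {
    if (callee_pressure > PRESSURE_CAP)
      badness *= callee_pressure / PRESSURE_CAP + 1;
    return badness;
  }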


I am just at the point of getting a prototype working; I will post a patch 
you can take a look at next week. In the meantime, do you have any 
comments or feedback?


Thanks,
   Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Jan Hubicka
Hello,
 Honza,
   Seeing your recent patches relating to inliner heuristics for LTO,
 I thought I should mention some related work I'm doing.
 
 By way of introduction, I've recently joined the IBM LTC's PPC
 Toolchain team, working on gcc performance.
 
 We have not generally seen good results using LTO on IBM power
 processors and one of the problems seems to be excessive inlining
 that results in the generation of excessive spill code. So, I have
 set out to tackle this by doing some analysis at the time of the
 inliner pass to compute something analogous to register pressure,
 which is then used to shut down inlining of routines that have a lot
 of pressure.

This is interesting.  I sort of planned to add register pressure logic
but always thought it is somewhat hard to do at the GIMPLE level in a way
that would work for all CPUs.
 
 The analysis is basically a liveness analysis on the SSA names, done per
 basic block, looking for the maximum number live in any block.
 I've been using "liveness pressure" as a shorthand name for this.

I believe this is usually called "width".
 
 This can then be used in two ways.
 1) want_inline_function_to_all_callers_p at present always says to
 inline things that have only one call site without regard to size or
 what this may do to the register allocator downstream. In
 particular, BZ2_decompress in bzip2 gets inlined and this causes the
 pressure reported downstream for the int register class to increase
 10x. Looking at some combination of pressure in caller/callee may
 help avoid this kind of situation.
 2) I also want to experiment with adding the liveness pressure in
 the callee into the badness calculation in edge_badness used by
 inline_small_functions. The idea here is to try to inline functions
 that are less likely to cause register allocator difficulty
 downstream first.

Sounds interesting.  I am very curious if you can get consistent improvements
with this.  I only implemented logic for large stack frames, but in C++ code
it often seems to do more harm than good.

If you find examples of bad inlining, can you also file them in bugzilla?
Perhaps the individual cases could be handled better by improving IRA.

Honza
 


Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Xinliang David Li
Do you witness similar problems with LTO + FDO?

My concern is that it can be tricky to get the register pressure estimate
right. The register pressure problem is created by downstream components
(code motion etc.) but only exposed by the inliner.  If you want to get it
'right' (i.e., not exposing the problems), you will need to bake knowledge
of the downstream components (possibly bugs) into the analysis, which might
not be a good thing to do longer term.

David




Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Xinliang David Li
On Fri, Apr 18, 2014 at 10:26 AM, Jan Hubicka hubi...@ucw.cz wrote:
 Sounds interesting.  I am very curious if you can get consistent improvements
 with this.  I only implemented logic for large stack frames, but in C++ code
 it often seems to do more harm than good.

 If you find examples of bad inlining, can you also file them in bugzilla?
 Perhaps the individual cases could be handled better by improving IRA.

Yes -- I think this is the right thing to do regardless.

David





Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Aaron Sawdey
On Fri, 2014-04-18 at 19:26 +0200, Jan Hubicka wrote:
 This is interesting.  I sort of planned to add register pressure logic
 but always thought it is somewhat hard to do at the GIMPLE level in a way
 that would work for all CPUs.

Yes, this is just meant to try to measure something that is
representative of register pressure. Different architectures would
probably want different thresholds for this.

 If you find examples of bad inlining, can you also file them in bugzilla?
 Perhaps the individual cases could be handled better by improving IRA.

Yes, I can do that for the case that's happening in bzip2.

  Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Aaron Sawdey
What I've observed on power is that LTO alone reduces performance and
LTO+FDO is not significantly different than FDO alone.

I agree that an exact estimate of the register pressure would be a
difficult problem. I'm hoping that something that approximates potential
register pressure downstream will be sufficient to help inlining
decisions. 

  Aaron


-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Jan Hubicka
 What I've observed on power is that LTO alone reduces performance and
 LTO+FDO is not significantly different than FDO alone.
On SPEC2k6?

This is quite surprising; for our (well, SUSE's) SPEC testers (AMD64) LTO
seems an off-noise win on SPEC2k6:
http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/recent.html
http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006/recent.html

I do not see why PPC should be significantly more constrained by register
pressure.  

I do not have a head-to-head comparison of FDO and FDO+LTO for SPEC;
http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006-patched-FDO/index.html
shows a noticeable drop in calculix and gamess.
Martin profiled calculix and tracked the drop down to a loop that is not
exercised during training but is hot in the reference run.  That makes it
optimized for size.

http://dromaeo.com/?id=219677,219672,219965,219877
compares Firefox's dromaeo runs with default build, LTO, FDO and LTO+FDO.
Here the benefits of LTO and FDO seem to add up nicely.
 
 I agree that an exact estimate of the register pressure would be a
 difficult problem. I'm hoping that something that approximates potential
 register pressure downstream will be sufficient to help inlining
 decisions. 

Yep, register pressure and I-cache overhead estimates are used for inline
decisions by some compilers.

I am mostly concerned about the metric suffering from the GIGO principle if
we mix together too many estimates that are somewhat wrong by their nature.
This is why I have mostly tried to focus on size/time estimates and not add
too many other metrics. But perhaps it is time to experiment with these,
since we have obviously pushed the current infrastructure to its limits.

Honza
 


Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Jan Hubicka
 What I've observed on power is that LTO alone reduces performance and
 LTO+FDO is not significantly different than FDO alone.
 
 I agree that an exact estimate of the register pressure would be a
 difficult problem. I'm hoping that something that approximates potential
 register pressure downstream will be sufficient to help inlining
 decisions. 

One (orthogonal) way to deal with this problem would also be to disable
inlining of functions called once when the edge frequency is low, i.e.
adding to check_callers something like

  edge->frequency < CGRAPH_FREQ_BASE / 2

if you want to disqualify all calls that have only a 50% chance of being
executed during a function invocation.
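Schematically (a sketch, not the real check_callers code; the struct below
is a stand-in for GCC's cgraph_edge):

  /* edge->frequency is scaled so that CGRAPH_FREQ_BASE means "executed
     once per invocation of the caller", so the comparison is effectively
     a probability test.  */
  #define CGRAPH_FREQ_BASE 1000   /* GCC's scale */

  struct call_edge { int frequency; };

  /* True when the call executes in fewer than half of the caller's
     invocations -- a candidate to disqualify from called-once inlining. */
  static int
  cold_single_call_p (const struct call_edge *e)
  {
    return e->frequency < CGRAPH_FREQ_BASE / 2;
  }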

Does something like that help in your cases?

It would help in the case Linus complained about
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

The difficulty here is that disabling inlining on not-so-important paths
may prevent SRA and other optimizations, so it may in turn also penalize
the hot path. I saw this in some cases where EH cleanup code was optimized
for size.

Perhaps SRA can also be extended to handle cases where non-SRAable code is
on a cold path?

Honza


Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Xinliang David Li
On Fri, Apr 18, 2014 at 12:27 PM, Jan Hubicka hubi...@ucw.cz wrote:
 I am mostly concerned about the metric suffering from the GIGO principle if
 we mix together too many estimates that are somewhat wrong by their nature.
 This is why I have mostly tried to focus on size/time estimates and not add
 too many other metrics. But perhaps it is time to experiment with these,
 since we have obviously pushed the current infrastructure to its limits.


I like the word GIGO here. Getting inline signals right requires deep
analysis (including interprocedural analysis). Different signals/hints
may also come with different quality and thus different weights.

Another challenge is how to quantify cycle savings/overhead more
precisely. With that, we can abandon the threshold-based scheme -- any
callsite with a net saving will be considered.
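For illustration, such a net-saving test might reduce to arithmetic like
the following (all names and terms here are assumptions, not an existing
GCC formula):

  /* Illustrative only: inline when estimated cycles saved exceed the
     estimated cost of the code growth; no fixed threshold involved.  */
  static long
  net_cycle_saving (long call_count,         /* profiled executions   */
                    long per_call_overhead,  /* call/return cycles    */
                    long context_gain,       /* savings from inlining */
                    long growth_bytes,       /* code size increase    */
                    long cost_per_byte)      /* i-cache/memory cost   */
  {
    long saving = call_count * (per_call_overhead + context_gain);
    long cost = growth_bytes * cost_per_byte;
    return saving - cost;                    /* inline if positive    */
  }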

David



Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Xinliang David Li
On Fri, Apr 18, 2014 at 12:51 PM, Jan Hubicka hubi...@ucw.cz wrote:
 The difficulty here is that disabling inlining on not-so-important paths
 may prevent SRA and other optimizations, so it may in turn also penalize
 the hot path. I saw this in some cases where EH cleanup code was optimized
 for size.


yes. The callsite may be cold, but the profile scaled callee body may
still be hot. Inlining the callee allows more context to be passed.
Similarly for hot callers, cold callees may also expose more
information (e.g, better alias info) to the caller.

David




Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Jan Hubicka
 
 I like the word GIGO here. Getting inline signals right requires deep
 analysis (including interprocedural analysis). Different signals/hints
 may also come with different quality and thus different weights.

 Another challenge is how to quantify cycle savings/overhead more
 precisely. With that, we can abandon the threshold-based scheme -- any
 callsite with a net saving will be considered.

Inline hints are intended to do this -- at the moment we bump the limits up
when we estimate big speedups from the inlining, and with today's patch and
FDO we bypass the thresholds when we know from FDO that the call matters.

Concerning your other email, indeed we should consider heavy callees (in
Open64 terminology) that consume a lot of time, and not skip those call
sites.  An easy way would be to replace the maybe_hot_edge predicate by a
maybe_hot_call that simply multiplies the count and the estimated time.
(We probably ought to get rid of the time capping and use wider arithmetic
too.)
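As a sketch (placeholder cutoff and types; maybe_hot_edge exists in GCC,
but the maybe_hot_call below is only the proposal, not existing code):

  #include <stdint.h>

  /* Placeholder cutoff; a real one would come from profile summaries
     and --param knobs.  */
  #define HOT_CALL_CUTOFF 100000

  /* Unlike maybe_hot_edge, which looks at the call count alone, weight
     the count by the callee's estimated time so that heavy-but-rare and
     cheap-but-frequent calls compare on one scale.  64-bit arithmetic
     avoids the capping mentioned above.  */
  static int
  maybe_hot_call_p (uint64_t edge_count, uint64_t callee_time)
  {
    return edge_count * callee_time > HOT_CALL_CUTOFF;
  }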

I wonder if that is not too local, and if we should not try to estimate
the cumulative time of the function and get more aggressive about inlining
over the whole path leading to hot code.

Honza


Re: LTO inliner -- sensitivity to increasing register pressure

2014-04-18 Thread Xinliang David Li
On Fri, Apr 18, 2014 at 2:16 PM, Jan Hubicka hubi...@ucw.cz wrote:
 Inline hints are intended to do this -- at the moment we bump the limits up
 when we estimate big speedups from the inlining, and with today's patch and
 FDO we bypass the thresholds when we know from FDO that the call matters.

 Concerning your other email, indeed we should consider heavy callees (in
 Open64 terminology) that consume a lot of time, and not skip those call
 sites.  An easy way would be to replace the maybe_hot_edge predicate by a
 maybe_hot_call that simply multiplies the count and the estimated time.
 (We probably ought to get rid of the time capping and use wider arithmetic
 too.)

That's what we did in the Google branches. We had two heuristics -- hot
caller and hot callee heuristics.

1) For the hot caller heuristic, other simple analysis is checked: a)
global working set size; b) a callsite argument check -- a very simple
check to guess whether inlining this callsite would sharpen analysis
(sketched below).

2) We have not tuned the hot callee heuristic with more analysis --
simply turning it on using hotness does not make a noticeable difference.
Other hints are needed.
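The argument check in 1b) might look roughly like this (hypothetical
structures; the actual Google-branch implementation is not shown in this
thread):

  /* Hypothetical sketch of a callsite argument check: guess that
     inlining will sharpen analysis when some argument is a constant or
     the address of a local, since that context flows into the body.  */
  struct call_arg { int is_constant; int is_local_address; };
  struct callsite { int nargs; struct call_arg *args; };

  static int
  args_would_sharpen_analysis_p (const struct callsite *cs)
  {
    for (int i = 0; i < cs->nargs; i++)
      if (cs->args[i].is_constant || cs->args[i].is_local_address)
        return 1;
    return 0;
  }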

David



