On Fri, Apr 30, 2010 at 12:07 PM, Xinliang David Li <davi...@google.com> wrote:
>
> On Fri, Apr 30, 2010 at 11:12 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> >> >
> >> > Interesting. My plan for profiling with LTO is to ultimately make it
> >> > a linktime transform. This will be more difficult with WHOPR (i.e.
> >> > instrumenting needs function bodies that are not available at WPA
> >> > time), but I believe it is solvable: just assign uids to the edges
> >> > and do instrumentation at ltrans. Then we will save the cgraph
> >> > profile in some easier way so WHOPR can read it in and read the
> >> > rest of the stuff in ltrans. This would involve shipping the correct
> >> > profiles for a given function etc., so it will be a bit of an
> >> > implementation challenge.
> >>
> >> This can be tricky -- to maximize FDO benefit, the
> >> profile-use/annotation needs to happen early which means
> >> instrumentation also needs to happen early (to avoid cfg mismatches).
> >
> > I don't see much problem in this particular area.
> >
> > The GCC optimization queue is organized in a way that we first do early
> > optimizations that are all intended to be simple cleanups without
> > size/speed tradeoffs. Then we do IPA and late optimizations that are
> > both driven by profile (estimated or read).
> > Profile reading happens early because we use the same infrastructure
> > for gcov and profile feedback. This is not giving profile feedback
> > better benefit, quite the converse, since early passes may not be able
> > to update the profile precisely and we also get higher profile overhead.
> >
> > So I think decoupling gcov and profile feedback and pushing profile
> > feedback back in the queue is going to be a win.
> >
>
> There are two parts of profile-feedback
> 1) cfg edge counts annotation.
>
> For this part, yes, most of the early phases (other than possibly
> einline-2) do not need/depend on it, and can probably be pushed back (in
> fact the static/guessed profile pass is later).
>
> 2) value profile transformations:
>
> This part may benefit more from doing early -- not only because of
> more cleanups, but also due to the requirement for getting more
> precise inline summary.
>
> >
> > Yes, optimization must match, but with LTO this is not a problem and in
> > general the early optimization should be stable wrt memory layout
> > (nothing else changes). This used to be exercised before profiling was
> > updated to tree level in 4.x.
> >
> You mean CFG layout is stable? But ccp, copy_prop, dce, tail recursion
> etc. all can change cfg.
>
> >
> > I would be very interested in the low overhead support - there is a lot
> > to gain especially because the profiling results are less dependent on
> > setup and can be better reused. I know part of the code was contributed
> > (the support for reading not 100% valid profiles). Is there any extra
> > info available on this?
> >
>
> For profile smoothing, Neil may point to more information.
Sorry for the *very* delayed response, but some email filters went a bit
wild.

Profile smoothing does a good job of taking imprecise profiles and
fixing them up. This doesn't address the stale profile problem with GCC's
instrumentation-based FDO profile collection. There are checks which
completely discard profiles if the function line numbers (IIRC) do not
match. I have some patches I've been meaning to send upstream which help
ease this restriction (i.e., add the ability to retain more of a stale
profile), but this opens up many bugs which I've been incrementally
squashing throughout the rest of the compiler.

>
> > Main problem IMO is how to get profile into WHOPR without having
> > function bodies.
> > I guess we will end up with summarizing the info in a WHOPR-friendly
> > way and letting it stream the other counters to LTRANS that will
> > annotate the function body once read in from the file.
> >>
>
> I am a little lost here :)
>
> >>
> >> >
> >> >> 2) comdat function resolution -- since LIPO uses aux module functions
> >> >> for inlining purpose only, it has the freedom to choose which copy to
> >> >> use. The current scheme chooses copy in current module with priority
> >> >> for better profile data context sensitivity (see below)
> >> >
> >> > This is interesting. How do you solve the problem when a given comdat
> >> > function "loses"? I.e. it is replaced at linktime by another function
> >> > that may or may not be profiled from another unit?
> >>
> >> Whatever function that is selected will have profile data (assuming it
> >> is called at runtime) -- but the profile data are merged from different
> >> contexts including from calls in different modules. For instance, if
> >> both a.C and b.C define foo, and b.C:foo is selected at runtime, and
> >> a.C:foo is not inlined (after instrumentation) anywhere in a.C, then
> >> a.C:foo won't have any profile data, and b.C:foo has merged profile
> >> data resulting from calls in both a.C and b.C.
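
To make the a.C/b.C scenario above concrete, here is a minimal sketch in
Python. It is not GCC code and does not reflect how GCC stores or merges
profiles; the function names, the tuple layout, and the call counts are all
hypothetical. It only illustrates the bookkeeping: every call lands on the
copy the linker selected, so the other copy ends up with no counts, and
proportionally rescaling an all-zero body at inline time still yields zeros.

```python
from collections import Counter


def merge_comdat_profiles(call_sites):
    """Accumulate entry counts per function copy (hypothetical model).

    Each element of `call_sites` is (calling_module, selected_copy, calls).
    The copy chosen at link time receives every call, whichever module
    the call came from.
    """
    counts = Counter()
    for _calling_module, selected_copy, calls in call_sites:
        counts[selected_copy] += calls
    return counts


def scale_inlined_counts(block_counts, entry_count, call_edge_count):
    """Rescale a callee's block counts to one call edge when inlining.

    A copy that never ran (entry_count == 0) has nothing to scale, so its
    inlined body keeps zero counts even if the call edge itself is hot.
    """
    if entry_count == 0:
        return [0 for _ in block_counts]
    return [c * call_edge_count // entry_count for c in block_counts]


# Assumed numbers: 900 calls from a.C and 100 from b.C, all resolved to b.C:foo.
merged = merge_comdat_profiles([
    ("a.C", "b.C:foo", 900),
    ("b.C", "b.C:foo", 100),
])
unprofiled_copy = scale_inlined_counts([0, 0, 0], merged["a.C:foo"], 900)
profiled_copy = scale_inlined_counts([1000, 600, 400], merged["b.C:foo"], 100)
```

With these assumed inputs, `merged["b.C:foo"]` is 1000 while `merged["a.C:foo"]`
is 0, so the body of a.C:foo inlined at the 900-count edge still carries only
zero counts.
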
> >
> > Yes, but this is what I am concerned about. Without LTO, at least when
> > compiling a.C with profile feedback, we will have foo with 0 counts.
> > We might however work out that calls of foo are frequent and decide to
> > inline foo. We will take the counts and rescale, resulting in inlining
> > foo optimized for size
>
> Not always ideal though -- scaling does not expose whether foo is hot
> or not (the call edge may be cold, but is still worth inlining).
>
> > .
> >
> > When comdats are resolved within LTO, this will not be deal, but LTO
> > still produces comdats that are later resolved with library etc., so we
> > don't solve the problem this way.
> > At the very least we should be able to figure out that we have a
> > function that has no profile and do something more sane.
>
> You mean LTO does not discard duplicate bodies? Why?
>
> >
> > Do you have any idea how common these scenarios are?
>
> I don't have direct data, but I think it can be common.
>
> Thanks,
>
> David
>
> >
> > Honza
> >

-- 
Neil Vachharajani
Google
650-214-1804