Re: [PATCH] Add working-set size and hotness information to fdo summary (issue6465057)

Xinliang David Li Tue, 21 Aug 2012 10:18:10 -0700

On Tue, Aug 21, 2012 at 12:34 AM, Jan Hubicka <[email protected]> wrote:
>> Teresa has done some tunings for the unroller so far. The inliner
>> tuning is the next step.
>>
>> >
>> > What concerns me is that it is greatly inaccurate - you have no idea how
>> > many instructions a given counter is guarding, and that can differ quite a
>> > lot. Also, inlining/optimization makes working sets significantly different
>> > (by a factor of 100 for tramp3d).
>>
>> The pre ipa-inline working set is the one that is needed for ipa
>> inliner tuning. For post-ipa inline code increase transformations,
>> some update is probably needed.
>>
>> > But on the other hand any solution at this level will be greatly
>> > inaccurate. So I am curious how reliable the data you get from this can be.
>> > How do you take this into account in the heuristics?
>>
>> This effort is just the first step to allow good heuristics to develop.
>>
>> >
>> > It seems to me that for this use perhaps the simple logic in histogram
>> > merging, maximizing the number of BBs for a given bucket, will work well?
>> > It is inaccurate, but we are working with greatly inaccurate data anyway.
>> > Except for degenerate cases, the small and unimportant runs will have
>> > small BB counts, while large runs will have larger counts, and those are
>> > the ones we optimize for anyway.
>>
>> The working set curve for each type of applications contains lots of
>> information that can be mined. The inaccuracy can also be mitigated by
>> more data 'calibration'.
>
> Sure, I think I am leaning towards trying the solution 2) with maximizing
> counter count merging (probably it would make sense to rename it from BB count
> since it is not really BB count and thus it is misleading) and we will see how
> well it works in practice.
>
> We get the benefit of far fewer issues with profile locking/unlocking, and we
> lose a bit of precision on BB counts. I tend to believe that the error will
> not be that important in practice. Another loss is more histogram streaming
> into each gcda file, but with skipping zero entries it should not be a major
> overhead problem, I hope.
>
> What do you think?
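
For illustration, the max-merge idea discussed above could be sketched roughly
as below. This is only a sketch under stated assumptions, not code from the
patch: the bucket layout, HIST_SIZE, and the names hist_bucket, hist_merge_max
and hist_nonzero_entries are all hypothetical. Summing the cumulative value
across runs is likewise an assumption made for the example. The second helper
counts the entries that would be streamed if zero buckets are skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical histogram layout; none of these names come from the patch.  */
#define HIST_SIZE 8

struct hist_bucket
{
  uint32_t num_counters; /* Counters whose value falls in this bucket.  */
  uint64_t cum_value;    /* Sum of the counter values in this bucket.  */
};

/* Merge SRC into DST bucket by bucket.  The counter count is merged by
   taking the maximum of the two runs (inaccurate, but needs no global
   view of the profile); the cumulative value is summed (an assumption).  */
static void
hist_merge_max (struct hist_bucket *dst, const struct hist_bucket *src)
{
  assert (dst != NULL && src != NULL);
  for (size_t i = 0; i < HIST_SIZE; i++)
    {
      if (src[i].num_counters > dst[i].num_counters)
        dst[i].num_counters = src[i].num_counters;
      dst[i].cum_value += src[i].cum_value;
    }
}

/* Return how many buckets would be written out if buckets with no
   counters are skipped when streaming the histogram.  */
static size_t
hist_nonzero_entries (const struct hist_bucket *h)
{
  assert (h != NULL);
  size_t n = 0;
  for (size_t i = 0; i < HIST_SIZE; i++)
    if (h[i].num_counters != 0)
      n++;
  return n;
}
```

With this shape, a small training run contributes little to any bucket and is
dominated by the larger runs, which matches the intuition quoted above.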
>>
>> >>
>> >>
>> >> >  2) Do we plan to add some features in the near future that will
>> >> >     require global locking anyway?
>> >> >     I guess LIPO itself does not count, since it streams its data into
>> >> >     an independent file, as you mentioned earlier, and locking the LIPO
>> >> >     file is not that hard.
>> >> >     Does LIPO stream everything into that common file, or does it use a
>> >> >     combination of gcda files and a common summary?
>> >>
>> >> Actually, LIPO module grouping information is stored in gcda files.
>> >> It is also stored in a separate .imports file (one per object) ---
>> >> this is primarily used by our build system for dependence information.
>> >
>> > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO
>> > behave on GCC bootstrap?
>>
>> We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
>> main problem for application build -- the link time (for debug build)
>> is.
>
> I was primarily curious how LIPO's runtime analysis fares in the situation
> where you do very many small train runs on a rather large app (sure, GCC is
> small compared to Google's use case ;).



There will be a race, but as Teresa mentioned, there is a good chance
that the process which finishes the merge last is also the final
overrider of the LIPO summary data.


>>
>> > (i.e. it does a lot more work in the libgcov module per each
>> > invocation, so I am curious if it is practically useful at all).
>> >
>> > With an LTO-based solution a lot can probably be pushed to link time?
>> > Before actual GCC starts from the linker plugin, the LIPO module can read
>> > gcov CFGs from gcda files and do all the merging/updating/CFG construction
>> > that is currently performed at runtime, right?
>>
>> The dynamic cgraph build and analysis is still done at runtime.
>> However, with the new implementation, FE is no longer involved. Gcc
>> driver is modified to understand module grouping, and lto is used to
>> merge the streamed output from aux modules.
>
> I see. Are there any fundamental reasons why it can not be done at link-time
> when all gcda files are available?

For build parallelism, the decision should be made as early as
possible -- that is what makes LIPO 'light'.

> Why the grouping is not done inside linker
> plugin?

It is not delayed into link time. In fact linker plugin is not even involved.

David


>
> Honza
>>
>>
>> David
