On Tue, Aug 21, 2012 at 12:34 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> Teresa has done some tunings for the unroller so far. The inliner
>> tuning is the next step.
>>
>> >
>> > What concerns me is that it is greatly inaccurate - you have no idea how
>> > many instructions a given counter is guarding and it can differ quite a
>> > lot. Also inlining/optimization makes working sets significantly
>> > different (by a factor of 100 for tramp3d).
>>
>> The pre ipa-inline working set is the one that is needed for ipa
>> inliner tuning. For post-ipa inline code increase transformations,
>> some update is probably needed.
>>
>> > But on the other hand any solution at this level will be
>> > greatly inaccurate. So I am curious how reliable data you can get from
>> > this? How do you take this into account for the heuristics?
>>
>> This effort is just the first step to allow good heuristics to develop.
>>
>> >
>> > It seems to me that for this use perhaps the simple logic in histogram
>> > merging maximizing the number of BBs for a given bucket will work well?
>> > It is inaccurate, but we are working with greatly inaccurate data
>> > anyway. Except for degenerate cases, the small and unimportant runs
>> > will have small BB counts, while large runs will have larger counts and
>> > those are the ones we optimize for anyway.
>>
>> The working set curve for each type of application contains lots of
>> information that can be mined. The inaccuracy can also be mitigated by
>> more data 'calibration'.
>
> Sure, I think I am leaning towards trying solution 2) with maximizing
> counter count merging (probably it would make sense to rename it from BB
> count since it is not really a BB count and thus it is misleading) and we
> will see how well it works in practice.
>
> We have the benefit of much fewer issues with profile locking/unlocking and
> we lose a bit of precision on BB counts. I tend to believe that the error
> will not be that important in practice.
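A minimal C sketch of the per-bucket "maximize the counter count" merge proposed above. The bucket layout (`hist_bucket` with `num_counters`, `min_value`, `cum_value`) and the function name `hist_merge_max` are hypothetical, not the libgcov API; summing `cum_value` and keeping the smaller `min_value` are assumptions about how the remaining fields would be combined.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical histogram bucket: one bucket per range of counter values. */
typedef struct {
  uint32_t num_counters; /* how many counters fall into this bucket */
  uint64_t min_value;    /* smallest counter value in this bucket */
  uint64_t cum_value;    /* sum of the counter values in this bucket */
} hist_bucket;

/* Merge the histogram SRC (from one training run) into DST, bucket by
   bucket.  The counter count keeps the larger of the two values instead of
   being recomputed from the merged counters, so no global view of all
   counters is needed; the result is only an approximation.
   Assumption: execution counts from separate runs add up, so cum_value
   is summed. */
void hist_merge_max(hist_bucket *dst, const hist_bucket *src, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    if (src[i].num_counters == 0)
      continue; /* an empty source bucket contributes nothing */

    /* Keep the smaller minimum; an empty destination takes the source's. */
    if (dst[i].num_counters == 0 || src[i].min_value < dst[i].min_value)
      dst[i].min_value = src[i].min_value;

    /* The "maximize the counter count" rule. */
    if (src[i].num_counters > dst[i].num_counters)
      dst[i].num_counters = src[i].num_counters;

    dst[i].cum_value += src[i].cum_value;
  }
}
```

Because counters are never redistributed between buckets, the merged counter counts are only an estimate; that is the loss of precision the quoted proposal accepts.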
> Another loss is more histogram streaming into each gcda file, but with
> skipping zero entries it should not be a major overhead problem I hope.
>
> What do you think?
>>
>> >>
>> >> > 2) Do we plan to add some features in near future that will anyway
>> >> > require global locking?
>> >> > I guess LIPO itself does not count since it streams its data into
>> >> > an independent file as you mentioned earlier and locking the LIPO
>> >> > file is not that hard.
>> >> > Does LIPO stream everything into that common file, or does it use a
>> >> > combination of gcda files and a common summary?
>> >>
>> >> Actually, LIPO module grouping information is stored in gcda files.
>> >> It is also stored in a separate .imports file (one per object) ---
>> >> this is primarily used by our build system for dependence information.
>> >
>> > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO
>> > behave on GCC bootstrap?
>>
>> We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
>> main problem for application build -- the link time (for debug build)
>> is.
>
> I was primarily curious how LIPO's runtime analysis fares in the situation
> where you do very many small train runs on a rather large app (sure GCC is
> small compared to Google's use case ;).
There will be a race, but as Teresa mentioned, there is a good chance that
the process that finishes the merge last is also the final writer of the
LIPO summary data.

>>
>> > (i.e. it does a lot more work in the libgcov module per each
>> > invocation, so I am curious if it is practically useful at all).
>> >
>> > With an LTO-based solution a lot can probably be pushed to link time?
>> > Before the actual GCC starts from the linker plugin, the LIPO module can
>> > read gcov CFGs from gcda files and do all the merging/updating/CFG
>> > construction that is currently performed at runtime, right?
>>
>> The dynamic cgraph build and analysis is still done at runtime.
>> However, with the new implementation, the FE is no longer involved. The
>> GCC driver is modified to understand module grouping, and LTO is used to
>> merge the streamed output from aux modules.
>
> I see. Are there any fundamental reasons why it can not be done at
> link-time when all gcda files are available?

For build parallelism, the decision should be made as early as possible --
that is what makes LIPO 'light'.

> Why is the grouping not done inside the linker plugin?

It is not delayed until link time. In fact, the linker plugin is not even
involved.

David

>
> Honza
>>
>> David
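The working set curve discussed in the thread can be read off a counter histogram. The C sketch below is a hypothetical illustration, not GCC's implementation: the names `hist_bucket`, `working_set` and `compute_working_set` are made up, and it assumes buckets are ordered by increasing counter value. It walks from the hottest bucket down until the accumulated counts reach a cutoff given in per mille of the total, and reports how many counters that takes and the smallest counter value among them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical histogram bucket, ordered by increasing counter value. */
typedef struct {
  uint32_t num_counters; /* how many counters fall into this bucket */
  uint64_t min_value;    /* smallest counter value in this bucket */
  uint64_t cum_value;    /* sum of the counter values in this bucket */
} hist_bucket;

/* One point on the working set curve. */
typedef struct {
  uint32_t num_counters; /* counters needed to reach the cutoff */
  uint64_t min_value;    /* smallest counter value among them */
} working_set;

/* Return the working set covering CUTOFF_PERMILLE / 1000 of all counted
   executions.  Whole buckets are taken, so the counter count can
   overshoot the exact answer. */
working_set compute_working_set(const hist_bucket *hist, size_t n,
                                unsigned cutoff_permille)
{
  uint64_t total = 0;
  for (size_t i = 0; i < n; i++)
    total += hist[i].cum_value;

  if (cutoff_permille > 1000)
    cutoff_permille = 1000;
  /* floor(total * cutoff / 1000) without overflowing the multiplication. */
  uint64_t target = total / 1000 * cutoff_permille
                    + total % 1000 * cutoff_permille / 1000;

  working_set ws = {0, 0};
  uint64_t acc = 0;
  /* Walk from the hottest bucket down to the coldest. */
  for (size_t i = n; i-- > 0;) {
    if (hist[i].num_counters == 0)
      continue;
    acc += hist[i].cum_value;
    ws.num_counters += hist[i].num_counters;
    ws.min_value = hist[i].min_value;
    if (acc >= target)
      break;
  }
  return ws;
}
```

Evaluating this at several cutoffs (for example 900, 990 and 999 per mille) gives a coarse working set curve; its bucket granularity is one source of the inaccuracy both sides of the thread acknowledge.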