On Tue, Jul 03, 2012 at 02:45:29PM +0200, Tobias Grosser wrote:
> On 07/03/2012 02:24 PM, Richard Guenther wrote:
>> On Tue, 3 Jul 2012, Tobias Grosser wrote:
>>
>>> On 07/03/2012 01:56 PM, Richard Guenther wrote:
>>>> On Tue, 3 Jul 2012, Tobias Grosser wrote:
>>>>
>>>>> On 07/03/2012 01:15 PM, Richard Guenther wrote:
>>>>>>
>>>>>> This merges the last bit from the graphite ISL branch - an integrated optimizer based on ISL. To quote Tobias:
>>>>>>
>>>>>> "The isl scheduling optimizer implements the scheduling algorithm first developed in Pluto [1]. Pluto has shown significant speedups and is nowadays even implemented in the IBM XL-C compiler. The implementation of this pass is a first draft and was copied largely from Polly. We still need to adapt the code to the gcc coding style and we need to tune the isl scheduler. At the moment we get reasonable compile times (at most a 2x-3x slowdown) and first speedups. We now need to tune the compile time and start to investigate which optimizations and heuristics need to be tuned in our reimplementation.
>>>>>>
>>>>>> [1] http://pluto-compiler.sourceforge.net/"
>>>>>>
>>>>>> Micha kindly did the code adaptation to the gcc coding style and I renamed the flag to -floop-nest-optimize (from -floop-optimize-isl). We both agree that such an integrated LNO is the way to go, superseding the individual graphite transforms we have now. We might even be able to drop the separate blocking & strip-mining transforms we have right now in favor of this?
>>>>>
>>>>> Thanks Micha for adapting the style to gcc.
>>>>>
>>>>> I would like to point out that this pass is still very experimental and not tuned at all. Specifically, it was only tested on polybench with one specific set of flags. Even there we did not only get speedups: due to missing heuristics, some benchmarks also showed large slowdowns. When using it on even slightly different benchmarks or with slightly different flags, infinite compile time or large performance regressions may show up! This optimizer may obviously also contain bugs that lead to miscompiles.
>>>>>
>>>>> Also, the loop nest optimizer will not be very effective as long as pre and licm are scheduled before graphite.
>>>>
>>>> I have noticed the change to disable those on the graphite-isl branch, but I fail to see how we can avoid handling PREd/LIMd loops from within polyhedral optimizations. In fact, even the user may have performed PRE or LIM at the source level, so the place to address this issue is surely within graphite/ISL.
>>>
>>> You can still handle those loops; however, the PRE/LIM will introduce a lot of additional dependences, which will block transformations. Such dependences can be removed e.g. with array expansion or by undoing some of the PRE/LIM transformations, but this is a difficult problem, especially if you don't want to increase memory usage too much.
>>>
>>> I do not see any benefit from PRE and LIM before graphite, as at the very best their transformations will be reverted (if such an optimization is ever written). However, scheduling PRE and LIM after graphite makes perfect sense (especially as there are probably a lot of new opportunities).
>>>
>>> Moving such passes after graphite obviously does not solve the case of manual PRE and LIM done by the user.
>>> However, it will allow us to optimize the non-manually-PREd code a lot better.
>>
>> I suppose it might make sense to not perform GRAPHITE stuff from within our main loop optimization pipeline.
>
> I have no idea. In general, all passes that help scalar evolution and that canonicalize the code are good to execute before graphite.
>
> It may make sense to create a new block of loop optimizations with the preparing transformations + graphite, before the normal loop optimizations. This makes sense in general, as graphite only does high-level optimizations and the resulting code should anyway have the normal low-level loop optimizations applied afterwards. Also, this would keep the schedule identical if graphite is not enabled, and would only add compile-time overhead with graphite enabled.
>
>> Are there any passes that benefit GRAPHITE that currently run before it from inside loop?
>>
>>   NEXT_PASS (pass_tree_loop);
>>     {
>>       struct opt_pass **p = &pass_tree_loop.pass.sub;
>>       NEXT_PASS (pass_tree_loop_init);
>>       NEXT_PASS (pass_lim);
>>       NEXT_PASS (pass_copy_prop);
>>       NEXT_PASS (pass_dce_loop);
>>       NEXT_PASS (pass_tree_unswitch);
>>       NEXT_PASS (pass_scev_cprop);
>>       NEXT_PASS (pass_record_bounds);
>>       NEXT_PASS (pass_check_data_deps);
>>       NEXT_PASS (pass_loop_distribution);
>>       NEXT_PASS (pass_copy_prop);
>>       NEXT_PASS (pass_graphite);
>>
>> I suppose scev_cprop helps because it removes scalar dependences from outside of the loops; unswitch might help because graphite does not handle conditional code (? then if-conversion would help, too).
>
> Sounds reasonable. I did not investigate this yet.
>
>> I would say invariant motion from LIM does not hurt either; it is probably store motion that does (it introduces scalar dependences from outside of the loop).
>>
>> We've already moved the PRE-like predictive commoning after loop optimizations, and PRE has several measures to not introduce dependences that would confuse vectorization. The vectorizer only handles scalar reductions, so it requires the store motion done by LIM. The vectorizer meanwhile handles invariant loads just fine though (I'll look into whether we can handle stores to invariant locations, too).
>
> I think that needs some further investigation. It may make sense to try e.g. 2mm from polybench and look for a pass sequence that allows the isl optimizer to jump in. Unfortunately I don't have time to try it myself, but I am obviously available in case you need help here. (You need to take aliasing into account and probably want to use C99 arrays; a sketch of such a kernel follows below.)
>
>> Re-ordering passes is always something tedious though ;)
>
> Yes. The idea I mentioned above should not change the default pass ordering.
>
> Cheers
> Tobi
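As a concrete illustration of the experiment suggested above, here is a minimal sketch of a 2mm-like kernel written the way Tobias recommends: C99 variable-length array parameters (so the subscripts stay affine rather than being linearized by hand) and restrict qualifiers to settle the aliasing question up front. The file name, matrix size, initialization values and the exact flag combination are assumptions made for this sketch; only -floop-nest-optimize itself comes from this thread.

  /* twomm.c: a 2mm-like kernel for experimenting with
     -floop-nest-optimize.  The file name, the size N and the init
     values are made up for this sketch.  Try e.g.
       gcc -O3 -floop-nest-optimize twomm.c
     and compare against plain -O3.  */

  #include <stdio.h>
  #include <stdlib.h>

  #define N 512

  /* C99 VLA parameters keep the accesses as affine two-dimensional
     subscripts, and restrict tells the compiler the three operands
     cannot alias, so the data dependences graphite sees are exactly the
     array dependences.  Note that the sum is accumulated directly into
     C[i][j]: routing it through a scalar temporary (the kind of code
     PRE/LIM store motion produces) would add a loop-carried scalar
     dependence that the isl scheduler has to respect, restricting the
     schedules it can pick.  */
  static void
  matmul (int n, double A[restrict n][n], double B[restrict n][n],
          double C[restrict n][n])
  {
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        {
          C[i][j] = 0.0;
          for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];
        }
  }

  int
  main (void)
  {
    double (*A)[N] = malloc (sizeof (double[N][N]));
    double (*B)[N] = malloc (sizeof (double[N][N]));
    double (*T)[N] = malloc (sizeof (double[N][N]));
    double (*D)[N] = malloc (sizeof (double[N][N]));
    if (!A || !B || !T || !D)
      return 1;

    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        {
          A[i][j] = (double) (i * j % 7);
          B[i][j] = (double) ((i + j) % 13);
        }

    /* Two chained matrix products, roughly what polybench's 2mm does.  */
    matmul (N, A, B, T);
    matmul (N, T, B, D);

    printf ("%f\n", D[N / 2][N / 2]);  /* keep the result live */
    free (A); free (B); free (T); free (D);
    return 0;
  }

Whether the isl scheduler actually picks a new schedule for this nest will of course depend on its heuristics and on where pre/lim run, which is exactly what the thread above is discussing; the sketch is only meant as a starting point for that investigation.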
On a slightly different topic, I thought there was a plan at one point to make -ftree-loop-linear the default for -O3 when gcc was built with graphite support. Has this idea been abandoned?

       Jack