I seem to have inadvertently annoyed Steven Bosscher on IRC, so by way of an apology and explanation I thought I'd post my current opinion and thinking on the optimizations performed by GCC's middle-end, both at the tree-level in the tree-ssa optimizers and at the RTL-level in the existing RTL optimizers. The intention is (i) to explain my current position, understanding and misunderstandings, and (ii) for people to disagree and/or convince me otherwise.
My current opinion is that for the medium-term, let's arbitrarily say prior to GCC 5.x, GCC will continue to need to perform optimizations at both the tree-level and the RTL-level. The tree-level I see as fundamentally responsible for high-level and machine-independent transformations, and the RTL optimizers for the lower-level machine-dependent transformations. It's unlikely that RTL, or any instruction-based representation of a program, will ever go away, and the RTL passes corresponding to instruction selection (combine), register allocation (reload) and scheduling will almost certainly remain RTL-level tasks/passes. But with the advent of the tree-ssa infrastructure, the other RTL "optimization" passes, notably CSE, GCSE, loop, if-conversion, and jump bypassing, may not be long for this world, and certainly won't be preserved in their current incarnations.

The difficulty here is that by expressing a program as low-level insns, we expose optimization opportunities that aren't visible as trees. The RTL expanders and the backend's pattern expanders inadvertently introduce redundancies and inefficiencies, or, more commonly, by providing machine-specific idioms they open up code transformations, for example with SUBREGs, that just didn't occur previously. For example, expand_mult may produce a sequence of shifts and additions for an integer multiplication. If there are two integer multiplications by a constant in a basic block, CSE will be able to identify shared terms between their shift/add sequences that aren't visible between the tree-level multiplications. One might argue that the shifts/adds should be exposed at the tree-level. But not only would this require detailed knowledge of the instructions available on a machine, it would also obfuscate the multiplication from the tree-ssa optimizers, such as iv-opts, etc. Instead, representing the multiplication at the tree-level as a MULT_EXPR, propagating constants into it, and letting expand do its thing gets the best of both worlds.
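To make that concrete, here's a hypothetical C-level sketch (the function names are mine, and the exact shift/add sequences GCC chooses depend on target costs) of how expand_mult might lower two constant multiplications, exposing a shared subterm to RTL-level CSE that a tree-level MULT_EXPR keeps hidden:

```c
#include <assert.h>

/* Hypothetical sketch: expand_mult can lower x * 10 and x * 40 into
   shift/add sequences.  Written out, both contain the subexpression
   (x << 3) + (x << 1), which RTL-level CSE can compute once and share;
   at the tree-level the two MULT_EXPRs have no visible common term. */
int mul10(int x)
{
    return (x << 3) + (x << 1);          /* x*8 + x*2 == x * 10 */
}

int mul40(int x)
{
    return ((x << 3) + (x << 1)) << 2;   /* (x * 10) * 4 == x * 40 */
}
```

After CSE, the `(x << 3) + (x << 1)` term would be evaluated once and the second multiplication becomes a single extra shift, a saving that only exists once the insns have been expanded.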
This doesn't constrain us to keep all of GCC's original RTL optimizers though. Unlike gcc 3.x and earlier, we now expect tree-ssa to do most of the heavy lifting. One goal is to have the initial RTL that we generate be pretty close to its final form, perhaps only requiring minor clean-ups. I foresee a world in which RTL transformations become far more local. Perhaps the RTL-level loop optimizers will disappear completely. Although RTL expansion may introduce new loops, these tend to be rare, and the expanders have all the information they need to hoist/sink invariant expressions and unroll/peel themselves. Certainly there should be no need for RTL-level loop optimizations to do loop unrolling or other large-scale reorganization. Similarly, CSE shouldn't need to process more than a single basic block, and GCSE shouldn't need to move anything other than simple expressions. The quality of alias analysis at the RTL-level shouldn't be an issue.

This brave new world also has a profound impact on how backends are written. Previously, because all optimization, with the exception of "fold" and "expand", was performed at the RTL-level, backends played tricks to virtualize their hardware to the middle-end in an attempt to allow the RTL optimizers to understand what was going on. Virtual high-level patterns would be introduced for the benefit of the RTL optimizers, only to be split later. The problem with this is that by splitting/lowering after the RTL optimizers have run, any inefficiencies that are introduced are too late to be cleaned up. Perhaps one of the most notorious examples is the handling of DImode on IA-32: the i386.c backend pretends that it has native DImode support prior to reload.
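For concreteness, here is a hypothetical C-level sketch (the names and interface are mine, not the backend's) of what decomposing a DImode operation into SImode lowpart/highpart operations amounts to, once the pretence of native 64-bit support is dropped on a 32-bit target:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of DImode lowering on a 32-bit target: a
   64-bit addition split into two 32-bit operations plus a carry,
   which is the form the register allocator actually has to work
   with after the backend's late splitters have run. */
void add_di_split(uint32_t a_lo, uint32_t a_hi,
                  uint32_t b_lo, uint32_t b_hi,
                  uint32_t *r_lo, uint32_t *r_hi)
{
    uint32_t lo = a_lo + b_lo;       /* lowpart addition            */
    uint32_t carry = lo < a_lo;      /* carry out of the lowpart    */
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;     /* highpart plus carry         */
}
```

Whether this split happens before or after the RTL optimizers run is exactly the design decision at issue: split late and the optimizers see one clean DImode insn but can't clean up the split's fallout; split early and they see (and can optimize) the real two-register dance.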
Whilst in gcc 3.x this was essential (constant propagation and constant folding wouldn't work with highparts and lowparts, and loop analysis, for example, would be unable to do anything with a loop using a "long long" counter once the loop exit test had been decomposed into highpart and lowpart subtests), the late splitting leads to terrible register allocation and numerous dead loads/stores to/from memory. SPECint's crafty suffers. With tree-ssa, however, these design decisions can and need to be re-evaluated. Now that a "long long" for-loop will be reversed at the tree-level, the x86 backend is far better off decomposing DImode early and getting the benefit of the RTL optimizers to eliminate the dead loads/stores etc. One side-effect of such a move is that although we generate better code on IA-32, we've again slightly increased our dependence on the clean-up optimizations performed by the RTL optimizers.

Given that many of GCC's RTL optimizers are therefore destined to be deleted, or gutted and stripped to their bare minimum, it makes sense to think about how this is done and how they are organized. Both at the tree-level and the RTL-level, great efforts have been made to centralize expression-level transformations into shared routines. At the tree-level, most transformations are performed by "fold", and at the RTL-level there have been efforts to centralize this intelligence in "simplify_rtx". These routines are common across many optimizers in their class; they represent a fundamental "engine" of a compiler's code transformations. I'm going to go a bit weird now and claim that it's this set of transformations that first got me interested in contributing to GCC. The one advantage of a 20-year-old open source compiler over other compilers is the encyclopedic set of transformations and optimizations that it has acquired over the years.
Whilst to many maintainers the unfathomable size and complexity of "fold" was a burden, to me it was a large asset (or inheritance). The rules that are encoded there are rarely if ever described in compiler texts, and accumulating them is a worthy investment. For me, an extremely interesting post on the topic was Dan Berlin's http://gcc.gnu.org/ml/gcc/2001-05/msg00445.html. The benefit of being developed by a community is that these optimizations accumulate over time, continually upping the ante. There may have been many clever hacker tricks in Digital's VAX compilers, but to all intents and purposes they died when the code died.

Back to reality and a bit of history. Of course, things weren't always so simple. Once upon a time, this knowledge was randomly distributed throughout a compiler that had no fewer than three or four RTL simplification routines. The comment in simplify-rtx.c explains the situation:

> This is the preferred entry point into the simplification routines;
> however, we still allow passes to call the more specific routines.
>
> Right now GCC has three (yes, three) major bodies of RTL simplification
> code that need to be unified.
>
> 1. fold_rtx in cse.c. This code uses various CSE specific
> information to aid in RTL simplification.
>
> 2. simplify_rtx in combine.c. Similar to fold_rtx, except that
> it uses combine specific information to aid in RTL
> simplification.
>
> 3. The routines in this file.
>
> Long term we want to only have one body of simplification code; to
> get to that state I recommend the following steps:
>
> 1. Pour over fold_rtx & simplify_rtx and move any simplifications
> which are not pass dependent state into these routines.
>
> 2. As code is moved by #1, change fold_rtx & simplify_rtx to
> use this routine whenever possible.
>
> 3. Allow for pass dependent state to be provided to these
> routines and add simplifications based on the pass dependent
> state. Remove code from cse.c & combine.c that becomes
> redundant/dead.
>
> It will take time, but ultimately the compiler will be easier to
> maintain and improve. It's totally silly that when we add a
> simplification that it needs to be added to 4 places (3 for RTL
> simplification and 1 for tree simplification. */

Whilst I have no issue with deleting cse.c in its entirety (I believe that the majority of its interesting transformations are encoded in cselib.c, and GCC's GCSE pass performs a cselib sweep over instructions whilst doing its thing), I do think that removing it completely before we salvage what we can would be throwing the baby out with the bath water. The reason we can even talk about removing CSE is that so much effort has gone into centralizing transformations so they can be performed by cselib, GCSE and combine. Whilst performing operations in modes narrower than word_mode might seem like an unnecessary optimization, or one that's already performed at the tree-level, there are several open PRs about it. One that affects the AVR target, for example, is that the comparison "(lt (sign_extend:HI foo:QI) (const_int 0))" isn't currently optimized to "(lt foo:QI (const_int 0))". This is also an interesting case where the word_mode and sizeof(int) visible to the tree-ssa optimizers don't accurately reflect the size of a register in the backend. In all of your testing on s390*, ppc, ppc64, i386, i686, amd64, alpha, hppa and ia64, you never tried an 8-bit or 16-bit target, for which this transformation may be more than just "nice". Notice also that there is nothing CSE-specific about the code that is being deleted. Just because we want CSE to go away shouldn't affect whether this micro-optimization can be performed by GCC's RTL optimizers or not. So, to clarify some points: I feel that the transformations performed by fold and those performed by simplify_rtx should be synchronized, not only by moving from the RTL-level to the tree-level, but also by moving from the tree-level to RTL.
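To spell out the AVR case above: the pattern (lt (sign_extend:HI foo:QI) (const_int 0)) asks whether the 16-bit sign-extension of an 8-bit value is negative, and that is true exactly when the 8-bit value itself is negative, so the extension is redundant. A C-level equivalence check (the function names are mine, purely for illustration):

```c
#include <assert.h>

/* The widened comparison the RTL currently contains: sign-extend
   the 8-bit value to 16 bits, then test against zero.  Corresponds
   to (lt (sign_extend:HI x:QI) (const_int 0)). */
int lt0_widened(signed char x)
{
    return (short)x < 0;
}

/* The narrow comparison it could be simplified to, saving the
   extension instructions on an 8-bit target such as AVR.
   Corresponds to (lt x:QI (const_int 0)). */
int lt0_narrow(signed char x)
{
    return x < 0;
}
```

On a 32-bit host both forms cost the same, which is why the missed simplification is easy to dismiss; on a target whose registers are a single byte wide, the sign-extension is real extra code.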
For the immediate future, GCC has a clear need for both *infrastructures*, even though some of the passes built on them need to go. For example, "(x >> 3) << 3" is transformed to "x & ~7" by simplify_rtx, but isn't yet simplified by fold. During the transition period, I think that new optimizations are best added wherever appropriate. If that means adding loop prefetches at the RTL-level or the tree-level, so be it. Likewise for adding if-conversions to either ifcvt.c or tree-if-conv.c. GCC's development effort is based upon contributions, and if people/companies feel that achieving goal X is more appropriate in infrastructure Y, then it's their call. If Z feels this conflicts with Z's goals, Z should work harder. Ultimately, the competition between the two approaches will be decided: one will succeed and the other will be removed. I can understand why this can be frustrating; just wanting things to happen isn't sufficient. When Canqun Yang submits a patch to the old RTL loop optimizer that gives a 2.5% improvement on IA-64 SPECfp, it appears on the surface to be a step in the wrong direction. Of course, this is competition, and it's usually healthy. What it does reveal is that (i) the old RTL loop optimizer did a better job, and was easier to modify for this task, than the new one, and (ii) all attempts to move prefetching to the tree-level have (allegedly) done a worse job than the current aging implementation. Interestingly, though, all of Canqun's changes were generic tweaks to the way prefetching is done, and are applicable to the new RTL loop optimizer with almost no changes. I believe that tree-ssa is ultimately a superior optimization technology which will obviate the need for many of GCC's RTL optimizers. Of course, the situation at the moment is that even with all of the new optimizers *and* all of the old optimizers, GCC 4.x typically produces code of about the same quality as before: sometimes much better, sometimes worse.
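A quick illustration of that fold/simplify_rtx gap, shown for unsigned operands where the equivalence holds unconditionally in C (function names are mine, for illustration only):

```c
#include <assert.h>

/* simplify_rtx rewrites (x >> 3) << 3 as x & ~7; at the time of
   writing, fold performs no equivalent tree-level simplification.
   Both forms clear the low three bits. */
unsigned shift_form(unsigned x)
{
    return (x >> 3) << 3;    /* round down via a shift pair */
}

unsigned mask_form(unsigned x)
{
    return x & ~7u;          /* round down via a single mask */
}
```

The mask form needs one instruction and no temporary where the shift pair needs two, which is exactly the kind of encyclopedic micro-transformation worth keeping in one shared engine rather than scattered per pass.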
Even with the help of all of the old optimizers, just switching off RTL optimizations wholesale might not yet be on the map. Certainly the belief that there's no RTL vs. tree-ssa argument to be had is misguided or premature; neither approach is vastly, or even clearly, superior to the other. I'd like to think that the reason performance is about the same is mostly due to duplicated effort, but with rival compilers still significantly ahead on benchmarks, clearly there's room for improvement. What I should say to soften the above is that the fact that compile-time has only regressed slightly is perhaps a *miracle* of tree-ssa. For a fairer head-to-head comparison, tree-ssa should be allowed to consume far more time and expend far more effort. Alternatively, if tree-ssa generates RTL that requires far less effort from the RTL optimizers, such that the CPU-time overheads of the RTL passes have shrunk dramatically, then perhaps there's less pressure to remove them, as they are no longer the compile-time hogs they used to be.

As for Paolo's patch, I'm still reviewing it. I've been off work sick for the past few days, and, as explained above, investigating and understanding the impact of what look like simple changes requires far more effort than simply deleting a few lines. In fact, this form of patch is probably harder to review than it is to write. As for code-coverage arguments, I've argued before that whilst they can provide convincing supporting evidence, for some aspects of the compiler they can't, by their nature, be sufficient. Yes, they demonstrate the change is safe. Indeed, if I didn't care so much about the quality of the code that GCC generates, I could just say "OK". Instead, I need to convince myself either (i) that these transformations aren't worth the effort, or (ii) that they can or should be performed elsewhere at the RTL-level. Ultimately, I may decide the patch is perfect as written. Removing code from the compiler is always harder than adding it.
RTL by its very nature is platform-specific (cf. cc0 etc.), and the mind-boggling number of command-line options and input files means that DO-178B-style coverage of the GCC sources isn't likely within my lifetime. Let's face it: even if this transformation could no longer be triggered where it currently is, it was obviously useful when it was written and added to the compiler, and could potentially be profitable somewhere else. Being a GCC maintainer is difficult. If simply pushing patches through a RedHat or SuSE build farm were sufficient, life would be much simpler. Instead, because automated testing of all the possible interactions can never be sufficient, our development practices rely on intelligent individuals to give each change the consideration it deserves. The fact that I've been slow in doing so (or that nobody else has reviewed it) is an issue. But such diligence, far from being unnecessary, is one of the things that distinguishes maintainers from contributors. Perhaps I just think, and care, and worry too much.

[All comments and criticisms welcome. Just because some of my points are well reasoned and argued doesn't mean that I'm not completely out-of-my-mind insane on others.]

Roger
--