I seem to have inadvertently annoyed Steven Bosscher on IRC,
so by way of an apology and explanation I thought I'd post my current
opinion and thinking on the optimizations performed by GCC's middle-end
both at the tree-level in the tree-ssa optimizers and at the
RTL-level in the existing RTL optimizers.  The intention is to
(i) explain my current position, understanding and misunderstandings
and (ii) for people to disagree and/or convince me otherwise.


My current opinion is that for the medium-term, let's arbitrarily
say prior to GCC 5.x, GCC will continue to need to perform optimizations
at both the tree-level and the RTL-level.  The tree-level, I see as
fundamentally responsible for high-level and machine independent
transformations, and the RTL optimizers for the lower-level
machine-dependent transformations.

It's unlikely that RTL, or an instruction-based, representation of
a program will ever go away.  And the tasks corresponding to
instruction selection (combine), register allocation (reload) and
scheduling will almost certainly remain RTL-level passes.

But with the advent of the tree-ssa infrastructure, the other RTL
"optimization" passes, notably CSE, GCSE, loop, if-conversion, and
jump bypassing, may not be long for this world.  And they certainly
won't be preserved in their current incarnations.


The difficulty here is that by expressing a program as low-level
insns, we expose optimization opportunities not visible as trees.
The RTL expanders and the backend's pattern expanders inadvertently
introduce redundancies and inefficiencies or, more commonly, by
providing machine-specific idioms open up code transformations,
for example with SUBREGs, that just didn't occur previously.  For
example, expand_mult may produce a sequence of shifts and additions
for an integer multiplication.  If there are two integer
multiplications by a constant in a basic block, CSE will be able
to identify shared terms between their shift/add sequences that
aren't visible to the tree-level multiplications.

One might argue that the shifts/adds should be exposed at the
tree-level.  But not only would this require detailed knowledge
of the instructions available on a machine, it would also
obfuscate the multiplication from the tree-ssa optimizers, such
as iv-opts, etc...  Instead, representing the multiplication at
the tree-level as a MULT_EXPR, propagating constants into it,
and letting expand do its thing gets the best of both worlds.


This doesn't constrain us to keep all of GCC's original RTL
optimizers though.  Unlike gcc 3.x and earlier, we now expect
tree-ssa to do most of the heavy lifting.  One goal is for the
initial RTL that we generate to be pretty close to its final
form, perhaps only requiring minor clean-ups.

I foresee a world in which RTL transformations become far more
local.  Perhaps the RTL-level loop optimizers will disappear
completely.  Although RTL expansion may introduce new loops,
these tend to be rare, and the expanders have all the information
they need to hoist/sink invariant expressions and unroll/peel
themselves.  Certainly there should be no need for RTL-level
loop optimizations to do loop unrolling or other large-scale
reorganization.

Similarly, CSE shouldn't need to process more than a single
basic block, and GCSE shouldn't need to move anything other
than simple expressions.  The quality of alias analysis at
the RTL-level shouldn't be an issue.


This brave new world also has a profound impact on how backends
are written.  Previously, because with the exception of "fold"
and "expand", all optimization was performed on the RTL-level,
backends played tricks to virtualize their hardware to the
middle-end in an attempt to allow the RTL optimizers to understand
what was going on.   Virtual high-level patterns would be introduced
for the benefit of the RTL optimizers, only to be split later.  The
problem with this is that by splitting/lowering after the RTL
optimizers any inefficiencies that are introduced are too late
to be cleaned up.  Perhaps one of the most notorious examples of
this is the handling of DImode on IA-32.  The i386.c backend
pretends that it has native DImode support prior to reload.
Whilst in gcc 3.x this was essential (constant propagation and
constant folding wouldn't work with highparts and lowparts, and
the loop analysis, for example, would be unable to do anything
with a loop using a "long long" counter once the loop exit test
had been decomposed into highpart and lowpart subtests), the
late splitting leads to terrible register allocation and numerous
dead loads/stores to/from memory.  SPECint's crafty suffers.

With tree-ssa, however, these design decisions can and need to be
re-evaluated.  Now that a "long long" for-loop will be reversed at
tree-level, the x86 backend is far better off decomposing DImode
early, and getting the benefit of the RTL optimizers to eliminate
the dead loads/stores etc...

One side-effect of such a move is that although we generate better
code on IA-32, we've again slightly increased our dependence on the
clean-up optimizations performed by the RTL optimizers.



Given that many of GCC's RTL optimizers are therefore destined to
be deleted, or gutted and stripped to their bare minimum, it makes
sense to think about how this is done and how they are organized.

Both at the tree-level and RTL, great efforts have been made to
centralize expression-level transformations into shared routines.
At the tree-level, most transformations are performed by "fold"
and at the RTL-level there have been efforts to centralize this
intelligence in "simplify_rtx".  These routines are common across
many optimizers in their class; they represent a fundamental
"engine" of a compiler's code transformations.


I'm going to go a bit weird now and claim that it's this set of
transformations that first got me interested in contributing to
GCC.  The one advantage of a 20-year-old open source compiler
over other compilers is the encyclopedic set of transformations
and optimizations that it has acquired over the years.  Whilst to
many maintainers the unfathomable size and complexity of "fold"
was a burden, to me it was a large asset (or inheritance).  The
rules that are encoded are rarely if ever described in compiler
texts, and accumulating them is a worthy investment.  For me,
an extremely interesting post on the topic was Dan Berlin's
http://gcc.gnu.org/ml/gcc/2001-05/msg00445.html.  The benefit
of being developed by a community is that these optimizations
accumulate over time, continually upping the ante.  There may
have been many clever hacker tricks in Digital's VAX compilers,
but to all intents and purposes they died when the code died.



Back to reality and a bit of history.  Of course, things weren't
always so simple.  Once upon a time, this knowledge was randomly
distributed throughout a compiler that had no fewer than three
or four RTL simplification routines.  The comment in simplify-rtx.c
explains the situation:


>  This is the preferred entry point into the simplification routines;
>  however, we still allow passes to call the more specific routines.
>
>  Right now GCC has three (yes, three) major bodies of RTL simplification
>  code that need to be unified.
>
>       1. fold_rtx in cse.c.  This code uses various CSE specific
>          information to aid in RTL simplification.
>
>       2. simplify_rtx in combine.c.  Similar to fold_rtx, except that
>          it uses combine specific information to aid in RTL
>          simplification.
>
>       3. The routines in this file.
>
>  Long term we want to only have one body of simplification code; to
>  get to that state I recommend the following steps:
>
>       1. Pour over fold_rtx & simplify_rtx and move any simplifications
>          which are not pass dependent state into these routines.
>
>       2. As code is moved by #1, change fold_rtx & simplify_rtx to
>          use this routine whenever possible.
>
>       3. Allow for pass dependent state to be provided to these
>          routines and add simplifications based on the pass dependent
>          state.  Remove code from cse.c & combine.c that becomes
>          redundant/dead.
>
>   It will take time, but ultimately the compiler will be easier to
>   maintain and improve.  It's totally silly that when we add a
>   simplification that it needs to be added to 4 places (3 for RTL
>   simplification and 1 for tree simplification.  */


Whilst I have no issue in principle with deleting cse.c in its
entirety (I believe that the majority of its interesting
transformations are encoded in cselib.c, and GCC's GCSE pass performs
a cselib sweep over instructions whilst doing its thing), I do think
that removing it completely before we salvage what we can would be
throwing the baby out with the bath water.  The reason we can even
talk about removing CSE is that so much effort has gone into
centralizing transformations so they can be performed by cselib,
GCSE and combine.

Whilst performing operations in modes narrower than word_mode might
seem like an unnecessary optimization, or one that's already performed
at the tree-level, there are several open PRs about such cases.  One
that affects the AVR target, for example, is that the comparison
"(lt (sign_extend:HI foo:QI) (const_int 0))" isn't currently
optimized to "(lt foo:QI (const_int 0))".  This is also an interesting
case where the word_mode and sizeof(int) visible to the tree-ssa
optimizers don't accurately reflect the size of a register in the
backend.  In all of your testing on s390*, ppc, ppc64, i386, i686,
amd64, alpha, hppa and ia64, you never tried an 8-bit or 16-bit
target for which this transformation may be more than just "nice".


Notice also there is nothing CSE specific about the code that is
being deleted.  Just because we want CSE to go away, shouldn't
affect whether this micro-optimization can be performed by GCC's
RTL optimizers or not.

So to clarify some points, I feel that the transformations that
are performed by fold and those performed by simplify_rtx should be
synchronized: not only by moving from RTL-level to tree-level, but
also by moving from tree-level to RTL.  For the immediate future, GCC
has a clear need of both *infrastructures*, even though some of the
passes built on them need to go.  For example, "(x >> 3) << 3"
is transformed to "x & ~7" by simplify_rtx, but isn't yet simplified
by fold.  During the transition period, I think that new optimizations
are best added wherever appropriate.  If adding loop prefetches at
the RTL-level or tree-level, so be it.  Likewise for adding
if-conversions to either if-cvt.c or tree-if-conv.c.  GCC's development
effort is based upon contributions, and if people/companies feel that
achieving goal X is more appropriate in infrastructure Y, then it's
their call.  If Z feels this conflicts with Z's goals, Z should
work harder.  Ultimately, the competition between the two approaches
will be decided and one will succeed and the other be removed.

I can understand why this can be frustrating.  Just wanting things
to happen isn't sufficient.  When Canqun Yang submits a patch to the
old RTL loop optimizer that gives a 2.5% improvement on IA-64 SPECfp,
it appears on the surface to be a step in the wrong direction.  Of
course, this is competition and it's usually healthy.  What it does
reveal is that (i) the old RTL loop optimizer did a better job, and
was easier to modify for this task than the new one, and (ii) all
attempts to move prefetching to the tree-level have (allegedly)
done a worse job than the current aging implementation.  Interestingly
though, all of Canqun's changes were generic tweaks to the way
prefetching is done, and are applicable to the new RTL loop optimizer
with almost no changes.


I believe that tree-ssa is ultimately a superior optimization technology
which will obviate the need for many of GCC's RTL optimizers.  Of course,
the situation at the moment is that even with all of the new
optimizers *and* all of the old optimizers, GCC 4.x typically
produces code of about the same quality as before: sometimes much
better, sometimes worse.  Just switching off RTL optimizations
wholesale might not yet be on the map.
Certainly the belief that there's no RTL vs. tree-ssa argument to be
had is misguided or premature.  Neither approach is vastly or even
clearly superior to the other.  I'd like to think that the reason
performance is the same is mostly due to duplicated effort, but with
rival compilers still being significantly ahead on benchmarks, clearly
there's still room for improvement.

What I should say to soften the above is that the fact that
compile-time has only regressed slightly is perhaps a *miracle*
of tree-ssa.  For a
fairer head-to-head comparison, tree-ssa should be allowed to consume
far more time and expend far more effort.  Alternatively, if tree-ssa
generates RTL that requires far less effort from the RTL optimizers,
such that the CPU-time overheads of the RTL passes shrink dramatically,
then perhaps there's less pressure to remove them, as they are no
longer the compile-time hogs they used to be.


As for Paolo's patch: I'm still reviewing it.  I've been off-work
sick for the past few days, and as explained above investigating and
understanding the impact of what look like simple changes requires
far more effort than simply deleting a few lines.  In fact, this form
of patch is probably harder to review than it is to write.

As for code coverage arguments, I've argued before that whilst they
can provide convincing supporting evidence, for some aspects of the
compiler they can't, by their nature, be sufficient.  Yes, they
demonstrate that a change is safe.  Indeed, if I didn't care so
much about the quality of the code that GCC generates I could just
say "OK".  Instead, I need to convince myself either (i) that these
transformations aren't worth the effort, or (ii) that they can or
should be performed elsewhere at the RTL-level.  Ultimately, I may
decide the patch is perfect as written.  Removing code from the
compiler is always harder than adding it.  RTL by its very nature
is platform specific (c.f. cc0 etc...), and the mind-boggling
number of command line options and input files means that DO-178B
style coverage of GCC source isn't likely within my lifetime.

Let's face it, even if this transformation no longer could ever
be triggered where it currently is, it was obviously useful when
it was written and added to the compiler, and could potentially
be profitable somewhere else.


Being a GCC maintainer is difficult.  If simply pushing patches on
a RedHat or SuSE build farm was sufficient, life would be much
simpler.  Instead, because automated testing of the possible interactions
can never be sufficient, our development practices rely on intelligent
individuals to give each change the consideration it deserves.  The
fact that I've been slow in doing so (or that nobody else has reviewed
it) is an issue.  But the willingness to apply such diligence is
one of the things that distinguishes maintainers from contributors.

Perhaps I just think, and care, and worry too much.


[All comments and criticisms welcome.  Just because some of my points
are well reasoned and argued doesn't mean that I'm not completely
out-of-my-mind insane on others.]

Roger
--
