I basically agree with everything you said, but I still think we'll be
able to improve performance quite a bit over time. I guess I'd say
there is a lot of "mid-level hanging fruit" to be picked. Perhaps our
type checker is optimal in the Big O sense, but I am quite confident
that the hidden constants are much larger than they have to be.
Here are some relatively simple things we could do to improve
performance in type-checking, off the top of my head:
- Right now I think we allocate a fair number of empty vectors. If we
used @[] or Option<~[]>, we could avoid allocation in the empty case and
lower overall memory use (a sketch of the Option pattern follows this list).
- Convert structural records in the compiler to structs.
- Caching for method lookup results (*) and possibly other similar things.
- More use of arenas (eventually, we're not quite ready for this yet).
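For the empty-vector point, here is a minimal sketch of the Option-wrapping
pattern, written in modern syntax for clarity; the type and field names are
made up, not anything in the compiler today:

```rust
// Hypothetical names; the point is that the common empty case is a
// plain None and needs no heap allocation at all.
#[allow(dead_code)]
struct Bound; // stand-in for a real bound type

struct TypeParam {
    bounds: Option<Vec<Bound>>, // None == "no bounds", no allocation
}

impl TypeParam {
    fn bounds(&self) -> &[Bound] {
        // Callers still see a slice either way.
        self.bounds.as_deref().unwrap_or(&[])
    }
}

fn main() {
    let tp = TypeParam { bounds: None };
    assert!(tp.bounds().is_empty()); // empty case: nothing was allocated
}
```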
Longer term, we could refactor the compiler to support parallel
compilation and to track dependencies.
Niko
(*) This is non-trivial. But right now we definitely do a lot of work
for every method lookup and I'm certain we could cache some of it. A rough
sketch of the kind of caching I mean follows.
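This is only a sketch in modern syntax; the key and result types are
stand-ins for whatever the compiler would actually use:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the compiler's own types.
type TypeId = u32;
type DefId = u32;

#[derive(Default)]
struct MethodCache {
    // (receiver type, method name) -> resolved method, if any
    hits: HashMap<(TypeId, String), Option<DefId>>,
}

impl MethodCache {
    fn lookup(
        &mut self,
        recv: TypeId,
        name: &str,
        slow_path: impl FnOnce() -> Option<DefId>,
    ) -> Option<DefId> {
        if let Some(cached) = self.hits.get(&(recv, name.to_string())) {
            return *cached; // cache hit: skip the expensive search
        }
        let result = slow_path(); // the work we currently redo every time
        self.hits.insert((recv, name.to_string()), result);
        result
    }
}

fn main() {
    let mut cache = MethodCache::default();
    // First call runs the slow path; the second is a cache hit.
    assert_eq!(cache.lookup(1, "len", || Some(42)), Some(42));
    assert_eq!(cache.lookup(1, "len", || unreachable!()), Some(42));
}
```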
Patrick Walton wrote:
I thought I'd begin a discussion as to how to improve performance of
the rustc compiler.
First off, I do not see any way we can improve *general* compiler
performance by an order of magnitude. The language is simply designed
to favor safety and expressivity over compile-time performance. Other
than code generation, typechecking is by far the longest pass for
large Rust programs. But there is an upper bound on how fast we can make
typechecking, because the language requires subtyping, generics, and a
variant of Hindley-Milner type inference. This means that these common
tricks cannot be used:
1. Fast C-like typechecking won't work because we need to solve for
type variables. For instance, the type of `let x = [];` or `let y =
None;` is determined from use, unlike in, for example, C++, Java, C#, or
Go. (See the example after this list.)
2. Fast ML-like "type equality can be determined with a pointer
comparison" tricks will not work, because we have subtyping and must
recur on type structure to unify.
3. Nominal types in general cannot be represented as a simple integer
"class ID", as in early Java. They require a heap-allocated vector to
represent the type parameter substitutions.
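To make point 1 concrete, here is a small example of inference from use
(modern syntax; the concrete types are arbitrary):

```rust
// Neither binding has a known type at its `let`; a later use fixes it,
// so a single forward pass over the function is not enough.
fn main() {
    let mut x = Vec::new();   // element type unknown here
    let y = None;             // payload type unknown here

    x.push(3u32);             // this use determines x: Vec<u32>
    let _: Option<&str> = y;  // this use determines y: Option<&str>

    println!("{:?}", x);
}
```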
In general, the low-hanging fruit for overall compiler performance is
mostly picked at this point. I would put an upper bound on the
performance improvement available across all stages of a self-hosted
build of the Rust compiler at 20% or so. The reasons for this are:
1. Typechecking and LLVM code generation are mostly optimal. When
compiling `rustc`, the time spent in these two passes dwarfs all the
others. Typechecking cannot be algorithmically improved, and LLVM code
generation is about as straightforward as it can possibly be. The
remaining performance issues in these two passes are generally due to
allocating too much, but allocation and freeing in Rust account for no
more than 15% of the compile time. Thus even if we spent all our time on
the allocator and got its cost down to a theoretical zero, we would
only improve performance by 15% or so.
2. LLVM optimizations end up dominating compile time when they're
turned on (75% of compile time). However, the Rust compiler, like most
Rust (or C++) code, is dependent on LLVM optimizations for good
performance. So if you turn off optimization, you have a slow
compiler. But if you turn on optimization, the vast majority of your
self-hosting time is spent in LLVM optimizations. The obvious way
around this catch-22 is to spend a lot of time hand-writing into our
own compiler the optimizations that LLVM would have performed, in
order to improve performance at -O0, but I don't think that's a
particularly good use of our time, and it would hurt the compiler's
maintainability.
There are, however, some more situational things we can do.
# General code generation performance
* We can make `~"string"` allocations (and some others, like `~[1, 2,
3, 4, 5]`) turn into calls to the moral equivalent of `strdup`. This
improves some workloads, such as the PGP key in cargo (which should
just be a constant in the first place). `rustc` still allocates a lot
of strings like this, so this might improve the LLVM portion of
`rustc`'s compilation speed. (A sketch of the idea follows this list.)
* Visitor glue should be optional; you should have to opt into its
generation, like Haskell's `Data.Typeable`. This would potentially
remove 15% of our code size and improve our code generation
performance by a similar amount, but, as Graydon points out, it is
needed for precise-on-the-heap GC. Perhaps we could use conservative
GC at -O0, and thus reduce the amount of visitor glue we need to
generate for unoptimized builds.
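Here is a sketch of what the `strdup`-like lowering for `~"string"` could
look like, in modern syntax; `str_dup` is a hypothetical helper, not an
existing rustc or libcore function:

```rust
// The literal stays a static constant; the owned copy is produced by a
// single allocate-and-memcpy call rather than element-by-element code.
fn str_dup(constant: &'static str) -> String {
    let mut s = String::with_capacity(constant.len());
    s.push_str(constant); // one memcpy into the fresh allocation
    s
}

fn main() {
    // What a `~"string"` literal could lower to:
    let owned = str_dup("string");
    assert_eq!(owned, "string");
}
```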
# -O0 performance
For -O0 (which is the default), we get kicked off LLVM's fast
instruction selector far too often. We need to stop generating the
instructions that cause LLVM to bail out to the slow SelectionDAG path
at -O0.
This only affects -O0, but since that's the most common case that
matters for compilation speed, that's fine. Note that these
optimizations are severely limited in what they can do for
self-hosting performance, for the reasons stated above.
* Invoke instructions cause FastISel bailouts. This means that we
can't use the C++ exception infrastructure for task failure if we want
fast builds. Graydon has work on an optional return-value-based
unwinding mode which is nearing completion. I have a patch in review
for a "disable_unwinding" flag, which disables unwinding on failure;
this should be safe to turn on for libsyntax and librustc, since they
have no need to handle failure gracefully, and doing so improves the
LLVM portion of -O0 compile times by 1.9x.
* Visitor glue used to cause FastISel bailouts, but this is fixed in
incoming.
* Switch instructions cause FastISel bailouts. Pattern matching on
enums (and sometimes on integers too) generates these. Drop and take
glue on enums generates these too. This shouldn't be too hard to fix.
* Integer division instructions result in FastISel bailouts on x86. We
generate a lot of these because our vector lengths are in bytes. We
could change that, or we could try to hack LLVM, or we could turn
integer divisions into function calls to libcore at -O0. (Note that
integer division turns into a function call *anyway* on ARM, since ARM
has no integer divide instruction. So I'm inclined to try the last
one; a sketch follows.)
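As a sketch of that last option, the -O0 lowering could call an ordinary
library routine instead of emitting a divide instruction; the name and
exact placement in libcore are hypothetical:

```rust
// The single divide instruction then lives in this one out-of-line
// function instead of at every division site in the generated code.
#[inline(never)]
pub fn udiv_usize(dividend: usize, divisor: usize) -> usize {
    dividend / divisor
}

fn main() {
    // e.g. converting a vector length in bytes into an element count
    let len_in_bytes = 64usize;
    let elem_size = 8usize;
    assert_eq!(udiv_usize(len_in_bytes, elem_size), 8);
}
```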
# Memory allocation performance
Our memory allocation is suboptimal in several ways. I do not think
that fixing these will improve compiler performance as long as you
aren't already swapping, but I'll list them anyway.
* We do not have our own allocator; we just use the system malloc.
However, we need to trace all allocations in order to clean up @ cycles
on task death, so we thread all allocations into a doubly-linked list.
This wastes a lot of memory on the next and previous pointers (see the
sketch after this list). We could fix this by using an allocator that
allows us to trace allocations. I would be surprised if fixing this had
a huge impact on performance, but maybe it would bump some allocations
that were previously in higher storage classes into the TINY class,
which generally has a fast path in the allocator. And, of course, it
would reduce swapping when self-hosting if you don't have enough memory.
* We don't clean up @ cycles until task death. Fixing this will, in
all likelihood, worsen the compiler's performance. However, its memory
usage will improve.
* ~ allocations don't really need to be linked into any list or be
traceable, *unless* they contain @ pointers, at which point they do
need to be traceable. Fixing this will improve memory usage and
improve performance by a negligible amount.
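For concreteness, here is roughly what the per-allocation list overhead
described in the first bullet looks like; the field names and layout are
illustrative, not the runtime's actual box header:

```rust
// Every @ box carries two extra words of list links so the cycle
// collector can walk all live boxes at task death.
#[allow(dead_code)]
struct BoxHeader {
    ref_count: usize,
    prev: *mut BoxHeader, // one word of pure bookkeeping
    next: *mut BoxHeader, // another word of pure bookkeeping
    // ...type descriptor, then the payload itself
}

fn main() {
    println!("list overhead per box: {} bytes",
             2 * std::mem::size_of::<*mut BoxHeader>());
}
```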
# External metadata
We currently read external crate metadata in its entirety during a few
phases of the compiler. This dominates compilation time only for small
programs; in larger programs such as rustc, the cost quickly shrinks to
nothing compared to the rest of the compilation. However, since
newcomers to Rust generally compile small
programs, this is most of the cost they see. Also, this constitutes
the majority of the time that our test suite takes. Finally, this is
the performance bottleneck for the REPL.
Improving this will not improve the compilation speed of self-hosting
by more than 1%. The biggest benefit of fixing this is that small
programs will appear to compile instantly, which improves the first
impressions of Rust a lot for those used to fast builds in other
languages.
* External metadata reading takes a long time (0.3 s). I'm not sure
whether all of this is necessary, as I'm not too familiar with this pass.
* Language item collection reads all the items in external crates to
look for language items (another 0.3 s). This is silly and is easy to
fix; we just add a new table to the metadata that specifies def IDs
for language items (a sketch follows this list).
* Name resolution has to read all the items in external crates
(another 0.3 s). This was the easiest way to approximate the 0.5 name
resolution semantics. (The actual semantics were basically
unimplementable, but this algorithm got close enough to work in
practice -- usually.) With the new semantics in Rust 0.6 we should be
able to do better here and avoid reading modules until they're
actually referenced. Unfortunately, fixing this will require rewriting
resolve, which is a month's worth of work.
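A sketch of the proposed language-item table; the entry layout and the
example names and IDs here are made up, and the real encoding would live
in the crate metadata:

```rust
// One small fixed list of (name, def ID) pairs instead of a scan of
// every item in the external crate.
struct LangItemEntry {
    name: &'static str, // e.g. "add", "drop"
    def_id: u32,        // index of the defining item within the crate
}

const LANG_ITEMS: &[LangItemEntry] = &[
    LangItemEntry { name: "add",  def_id: 17 },
    LangItemEntry { name: "drop", def_id: 42 },
];

fn main() {
    for entry in LANG_ITEMS {
        println!("{} -> def id {}", entry.name, entry.def_id);
    }
}
```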
# Stack switching
* We could run rustc with a large stack and avoid stack switching.
This is functionality we need for Servo anyway. This might improve
compiler performance by 1% or so.
None of these optimizations will improve the `rustc` self-hosting time
by anything approaching an order of magnitude. However, I think they
could have a positive impact on the experience for newcomers to Rust.
Patrick
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev