I thought I'd begin a discussion about how to improve the performance of the rustc compiler.

First off, I do not see any way we can improve *general* compiler performance by an order of magnitude. The language is simply designed to favor safety and expressivity over compile-time performance. Other than code generation, typechecking is by far the longest pass when compiling large Rust programs. But there is an upper bound on how fast we can make typechecking, because the language requires subtyping, generics, and a variant of Hindley-Milner type inference. This means that these common tricks cannot be used:

1. Fast C-like typechecking won't work, because we need to solve for type variables. For instance, the type of `let x = [];` or `let y = None;` is determined from later use, unlike in, say, C++, Java, C#, or Go, where a local's type is known at its declaration. (A schematic sketch of points 1-3 follows this list.)

2. Fast ML-like "type equality can be determined with a pointer comparison" tricks will not work, because we have subtyping and must recurse over the type structure to unify.

3. Nominal types in general cannot be represented as a simple integer "class ID", as in early Java. They require a heap-allocated vector to represent the type parameter substitutions.
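
To make these three constraints concrete, here is a schematic sketch (written in today's Rust syntax, with all names invented; the real typechecker is, of course, far more involved):

```rust
// Point 1: `y`'s type is not known at the `let`; only the later use
// pins it down, so the checker must carry an unsolved type variable.
fn inference_from_use() -> i32 {
    let y = None;   // type: Option<?T>, with ?T unsolved here
    y.unwrap_or(5)  // this use solves ?T = i32
}

// Point 2: unification has to recurse over type structure; a pointer
// comparison cannot decide whether two types unify when unsolved
// variables (and subtyping checks) can appear at any depth.
enum Ty {
    Int,
    Var(u32),             // an unsolved inference variable
    Fn(Box<Ty>, Box<Ty>), // argument type, return type
}

fn unify(a: &Ty, b: &Ty) -> bool {
    match (a, b) {
        // A real solver records a substitution here; elided for brevity.
        (Ty::Var(_), _) | (_, Ty::Var(_)) => true,
        (Ty::Int, Ty::Int) => true,
        (Ty::Fn(a1, r1), Ty::Fn(a2, r2)) => unify(a1, a2) && unify(r1, r2),
        _ => false,
    }
}

// Point 3: a nominal type is not a bare integer ID; it is an ID plus a
// heap-allocated vector of type-parameter substitutions.
struct NominalTy {
    def_id: u32,     // which struct or enum definition
    substs: Vec<Ty>, // e.g. the K and V of a two-parameter type
}

fn main() {
    assert_eq!(inference_from_use(), 5);
    let f = Ty::Fn(Box::new(Ty::Var(0)), Box::new(Ty::Int));
    let g = Ty::Fn(Box::new(Ty::Int), Box::new(Ty::Int));
    assert!(unify(&f, &g));
    let map_ty = NominalTy { def_id: 0, substs: vec![Ty::Int, Ty::Int] };
    assert_eq!(map_ty.def_id, 0);
    assert_eq!(map_ty.substs.len(), 2);
}
```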

In general, the low-hanging fruit for general compiler performance has mostly been picked at this point. I would put an upper bound on the performance improvements available across all stages of a self-hosted build of the Rust compiler at 20% or so. The reasons for this are:

1. Typechecking and LLVM code generation are mostly optimal. When compiling `rustc`, the time spent in these two passes dwarfs all the others. Typechecking cannot be improved algorithmically (for the reasons above), and LLVM code generation is about as straightforward as it can possibly be. The remaining performance issues in these two passes are generally due to allocating too much, but allocation and freeing account for no more than 15% of the compile time. Thus, even if we spent all our time on the allocator and got its cost down to a theoretical zero, we would only improve performance by 15% or so.

2. LLVM optimizations dominate compile time when they're turned on (around 75% of the total). However, the Rust compiler, like most Rust (or C++) code, depends on LLVM optimizations for good performance. So if you turn off optimization, you have a slow compiler; if you turn it on, the vast majority of your self-hosting time is spent in LLVM's optimization passes. The obvious way around this catch-22 is to spend a lot of time manually writing the optimizations that LLVM would have performed into our compiler, so that -O0 output is fast, but I don't think that's a particularly good use of our time, and it would hurt the compiler's maintainability.

There are, however, some more situational things we can do.

# General code generation performance

* We can make `~"string"` allocations (and some others, like `~[1, 2, 3, 4, 5]`) turn into calls to the moral equivalent of `strdup`; see the sketch after this list. This improves some workloads, such as the PGP key in cargo (which should just be a constant in the first place). `rustc` still allocates a lot of strings like this, so this might improve the LLVM portion of `rustc`'s compilation speed.

* Visitor glue should be optional; you should have to opt into its generation, like Haskell's `Data.Typeable`. This would potentially remove 15% of our code size and improve our code generation performance by a similar amount, but, as Graydon points out, it is needed for precise-on-the-heap GC. Perhaps we could use conservative GC at -O0, and thus reduce the amount of visitor glue we need to generate for unoptimized builds.
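
To illustrate the string case from the first bullet, here is the moral equivalent of `strdup` (in today's syntax, with an invented helper name): the literal's bytes already live in static memory, so making the owned copy is one allocation plus one memcpy, rather than code that builds the value up piecemeal.

```rust
// Hypothetical lowering helper: copy a static string literal into a
// single fresh heap allocation, `strdup`-style.
fn dup_str_literal(lit: &'static str) -> String {
    let mut buf = String::with_capacity(lit.len()); // one allocation
    buf.push_str(lit);                              // one memcpy
    buf
}

fn main() {
    // Stands in for the old `~"string"` expression.
    let s = dup_str_literal("string");
    assert_eq!(s, "string");
}
```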

# -O0 performance

For -O0 (which is the default), we get kicked off LLVM's fast instruction selector (FastISel) far too often. We need to stop generating the instructions that cause LLVM to bail out to the slow SelectionDAG path at -O0.

This only affects -O0, but since that's the most common case that matters for compilation speed, that's fine. Note that these optimizations are severely limited in what they can do for self-hosting performance, for the reasons stated above.

* Invoke instructions cause FastISel bailouts. This means that we can't use the C++ exception infrastructure for task failure if we want fast builds. Graydon has work on an optional return-value-based unwinding mode, which is nearing completion (the first sketch after this list shows the idea). I have a patch in review for a "disable_unwinding" flag, which disables unwinding on failure; this should be safe to turn on for libsyntax and librustc, since they have no need to handle failure gracefully, and doing so improves compile-time -O0 LLVM performance by 1.9x.

* Visitor glue used to cause FastISel bailouts, but this is fixed on the incoming branch.

* Switch instructions cause FastISel bailouts. Pattern matching on enums (and sometimes on integers too) generates these, and drop and take glue for enums does as well. This shouldn't be too hard to fix.

* Integer division instructions cause FastISel bailouts on x86. We generate a lot of these because our vector lengths are stored in bytes. We could change that, we could try to hack LLVM, or we could turn integer divisions into function calls to libcore at -O0 (a sketch of such a routine follows this list). Note that integer division turns into a function call *anyway* on ARM, since ARM has no integer divide instruction, so I'm inclined to try the last option.
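
For the invoke-instruction bullet above, here is a schematic of what return-value-based unwinding looks like (modern syntax, invented names): every fallible call returns a status that the caller checks and propagates, so no `invoke` instructions or landing pads are generated, and FastISel stays on its fast path.

```rust
enum Outcome<T> {
    Ok(T),
    Failed, // the callee "failed"; the caller must clean up and propagate
}

fn callee(x: i32) -> Outcome<i32> {
    if x < 0 {
        return Outcome::Failed; // the moral equivalent of `fail`
    }
    Outcome::Ok(x * 2)
}

fn caller(x: i32) -> Outcome<i32> {
    // The check-and-propagate step the compiler would insert after every
    // call, replacing the invoke/landing-pad machinery.
    let doubled = match callee(x) {
        Outcome::Ok(v) => v,
        Outcome::Failed => return Outcome::Failed,
    };
    Outcome::Ok(doubled + 1)
}

fn main() {
    assert!(matches!(caller(2), Outcome::Ok(5)));
    assert!(matches!(caller(-1), Outcome::Failed));
}
```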
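And for the integer-division bullet, a sketch of the kind of out-of-line routine the compiler could call at -O0 instead of emitting a hardware `div` (the name and the shift-and-subtract implementation are illustrative, not the real libcore code):

```rust
// Restoring (shift-and-subtract) long division. `rem` is wider than u64
// so the shift below cannot overflow.
pub fn rt_udiv64(num: u64, den: u64) -> u64 {
    assert!(den != 0, "divide by zero");
    let mut quot: u64 = 0;
    let mut rem: u128 = 0;
    // Walk the dividend from the most significant bit down.
    for i in (0..64).rev() {
        rem = (rem << 1) | u128::from((num >> i) & 1);
        if rem >= u128::from(den) {
            rem -= u128::from(den);
            quot |= 1 << i;
        }
    }
    quot
}

fn main() {
    assert_eq!(rt_udiv64(1000, 7), 142);
    assert_eq!(rt_udiv64(u64::MAX, 1), u64::MAX);
}
```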

# Memory allocation performance

Our memory allocation is suboptimal in several ways. I do not think that improving it will improve compiler performance as long as you aren't already swapping, but I'll list them anyway.

* We do not have our own allocator; we just use the system malloc. However, we need to trace all allocations in order to clean up @ cycles on task death, so we thread every allocation onto a doubly-linked list (see the sketch after this list). The next and previous pointers waste a lot of memory. We could fix this by using an allocator that lets us trace allocations directly.

I would be surprised if fixing this had a huge impact on performance, but maybe it would bump some allocations that were previously in higher storage classes down into the TINY class, which generally has a fast path in the allocator. And, of course, it would reduce swapping when self-hosting if you don't have enough memory.

* We don't clean up @ cycles until task death. Fixing this will, in all likelihood, worsen the compiler's performance. However, its memory usage will improve.

* ~ allocations don't really need to be linked into any list or be traceable, *unless* they contain @ pointers, at which point they do need to be traceable. Fixing this will improve memory usage and improve performance by a negligible amount.
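
To make the first bullet concrete, here is roughly the shape of the per-box header being described (field names invented; the real runtime header differs): every managed allocation carries two extra pointers purely so the runtime can walk a per-task list of live boxes at task death.

```rust
// Sketch of a traceable allocation header: two list pointers per box.
#[allow(dead_code)] // fields exist only to be walked by the runtime
struct BoxHeader {
    prev: *mut BoxHeader, // previous box in the task's allocation list
    next: *mut BoxHeader, // next box in the task's allocation list
    // ... type descriptor, reference count, then the user's data ...
}

fn main() {
    // The overhead this bullet complains about: 16 bytes per box on a
    // 64-bit target, before any user data.
    println!("per-box overhead: {} bytes", std::mem::size_of::<BoxHeader>());
}
```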

# External metadata

We currently read external crate metadata in its entirety during a few phases of the compiler. This dominates the compilation time of small programs only; in larger programs such as rustc, the cost quickly shrinks to nothing compared to the rest of the compilation. However, since newcomers to Rust generally compile small programs, this is most of the cost they see. It also constitutes the majority of the time our test suite takes. Finally, it is the performance bottleneck for the REPL.

Improving this will not improve the compilation speed of self-hosting by more than 1%. The biggest benefit of fixing it is that small programs will appear to compile instantly, which greatly improves first impressions of Rust for those used to fast builds in other languages.

* External metadata reading takes a long time (0.3 s). I'm not sure whether all of this is necessary, as I'm not too familiar with this pass.

* Language item collection reads all the items in external crates to look for language items (another 0.3 s). This is silly and easy to fix: we just add a new table to the metadata that maps each language item to its def ID (see the sketch after this list).

* Name resolution has to read all the items in external crates (another 0.3 s). This was the easiest way to approximate the Rust 0.5 name resolution semantics. (The actual semantics were basically unimplementable, but this algorithm got close enough to work in practice -- usually.) With the new semantics in Rust 0.6, we should be able to do better here and avoid reading modules until they're actually referenced. Unfortunately, fixing this will require rewriting resolve, which is a month's worth of work.
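
A sketch of the table proposed in the language-item bullet above (all names invented): the crate encodes a small map from language item to def ID in its metadata, and collection becomes a handful of lookups instead of a walk over every external item.

```rust
use std::collections::HashMap;

// Which crate an item came from, and its index within that crate.
type DefId = (u32, u32);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum LangItem {
    Fail,
    Exchange,
    // ... one variant per language item ...
}

fn main() {
    // At encode time, the crate writes this table into its metadata...
    let mut table: HashMap<LangItem, DefId> = HashMap::new();
    table.insert(LangItem::Fail, (1, 42));

    // ...and at load time, collection is a direct lookup, with no need
    // to deserialize every item in the crate.
    assert_eq!(table.get(&LangItem::Fail), Some(&(1, 42)));
    assert_eq!(table.get(&LangItem::Exchange), None);
}
```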

# Stack switching

* We could run rustc with a large stack and avoid stack switching. This is functionality we need for Servo anyway. This might improve compiler performance by 1% or so.
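
A sketch of the big-stack approach, using today's `std::thread` API purely for illustration (the size here is arbitrary): start the main task with one large stack up front so the stack-growth-and-switch path is never taken.

```rust
use std::thread;

fn main() {
    thread::Builder::new()
        .stack_size(512 * 1024 * 1024) // one big 512 MiB stack, chosen arbitrarily
        .spawn(|| {
            // ... run the whole compiler in here ...
        })
        .expect("failed to spawn compiler thread")
        .join()
        .expect("compiler thread panicked");
}
```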

None of these optimizations will improve the `rustc` self-hosting time by anything approaching an order of magnitude. However, I think they could have a positive impact on the experience for newcomers to Rust.

Patrick