I basically agree with everything you said, but I still think we'll be
able to improve performance quite a bit over time. I guess I'd say
there is a lot of "mid-level hanging fruit" to be picked. Perhaps our
type checker is optimal in the Big O sense, but I am quite confident
that the hidden constants are much larger than they have to be.
Here are some relatively simple things we could do to improve
performance in type-checking, off the top of my head:
- Right now I think we allocate a fair number of empty vectors. If we
used @[] or Option<~[]>, we could avoid allocation in the empty case and
lower overall memory use (a sketch of the Option pattern follows this list).
- Convert structural records in the compiler to structs.
- Caching for method lookup results (*) and possibly other similar things.
- More use of arenas (eventually, we're not quite ready for this yet).
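For the empty-vector point, here is a minimal sketch of the Option-wrapping
pattern, written in modern syntax for clarity; the type and field names are
made up, not anything in the compiler today:

```rust
// Hypothetical names; the point is that the common empty case is a
// plain None and needs no heap allocation at all.
#[allow(dead_code)]
struct Bound; // stand-in for a real bound type

struct TypeParam {
    bounds: Option<Vec<Bound>>, // None == "no bounds", no allocation
}

impl TypeParam {
    fn bounds(&self) -> &[Bound] {
        // Callers still see a slice either way.
        self.bounds.as_deref().unwrap_or(&[])
    }
}

fn main() {
    let tp = TypeParam { bounds: None };
    assert!(tp.bounds().is_empty()); // empty case: nothing was allocated
}
```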
Longer term, we could refactor the compiler to support parallel
compilation and to track dependencies.
Niko
(*) This is non-trivial. But right now we definitely do a lot of work
for every method lookup and I'm certain we could cache some of it. A rough
sketch of the kind of caching I mean follows.
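This is only a sketch in modern syntax; the key and result types are
stand-ins for whatever the compiler would actually use:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the compiler's own types.
type TypeId = u32;
type DefId = u32;

#[derive(Default)]
struct MethodCache {
    // (receiver type, method name) -> resolved method, if any
    hits: HashMap<(TypeId, String), Option<DefId>>,
}

impl MethodCache {
    fn lookup(
        &mut self,
        recv: TypeId,
        name: &str,
        slow_path: impl FnOnce() -> Option<DefId>,
    ) -> Option<DefId> {
        if let Some(cached) = self.hits.get(&(recv, name.to_string())) {
            return *cached; // cache hit: skip the expensive search
        }
        let result = slow_path(); // the work we currently redo every time
        self.hits.insert((recv, name.to_string()), result);
        result
    }
}

fn main() {
    let mut cache = MethodCache::default();
    // First call runs the slow path; the second is a cache hit.
    assert_eq!(cache.lookup(1, "len", || Some(42)), Some(42));
    assert_eq!(cache.lookup(1, "len", || unreachable!()), Some(42));
}
```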
Patrick Walton wrote:
I thought I'd begin a discussion as to how to improve performance of
the rustc compiler.
First off, I do not see any way we can improve *general* compiler
performance by an order of magnitude. The language is simply designed
to favor safety and expressivity over compile-time performance. Other
than code generation, typechecking is by far the longest pass for
large Rust programs. But there is an upper bound on how fast we can make
typechecking, because the language requires subtyping, generics, and a
variant of Hindley-Milner type inference. This means that these common
tricks cannot be used:
1. Fast C-like typechecking won't work because we need to solve for
type variables. For instance, the type of `let x = [];` or `let y =
None;` is determined from use, unlike in, for example, C++, Java, C#, or
Go. (See the example after this list.)
2. Fast ML-like "type equality can be determined with a pointer
comparison" tricks will not work, because we have subtyping and must
recur on type structure to unify.
3. Nominal types in general cannot be represented as a simple integer
"class ID", as in early Java. They require a heap-allocated vector to
represent the type parameter substitutions.
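To make point 1 concrete, here is a small example of inference from use
(modern syntax; the concrete types are arbitrary):

```rust
// Neither binding has a known type at its `let`; a later use fixes it,
// so a single forward pass over the function is not enough.
fn main() {
    let mut x = Vec::new();   // element type unknown here
    let y = None;             // payload type unknown here

    x.push(3u32);             // this use determines x: Vec<u32>
    let _: Option<&str> = y;  // this use determines y: Option<&str>

    println!("{:?}", x);
}
```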
In general, the low-hanging fruit for overall compiler performance is
mostly picked at this point. I would put an upper bound on the
performance improvement available across all stages of a self-hosted
build of the Rust compiler at 20% or so. The reasons for this are:
1. Typechecking and LLVM code generation are mostly optimal. When
compiling `rustc`, the time spent in these two passes dwarfs all the
others. Typechecking cannot be algorithmically improved, and LLVM code
generation is about as straightforward as it can possibly be. The
remaining performance issues in these two passes are generally due to
allocating too much, but allocation and freeing in Rust account for no
more than 15% of the compile time. Thus even if we spent all our time on
the allocator and got its cost down to a theoretical zero, we would
only improve performance by 15% or so.
2. LLVM optimizations end up dominating compile time when they're
turned on (75% of compile time). However, the Rust compiler, like most
Rust (or C++) code, is dependent on LLVM optimizations for good
performance. So if you turn off optimization, you have a slow
compiler. But if you turn on optimization, the vast majority of your
self-hosting time is spent in LLVM optimizations. The obvious way
around this catch-22 is to spend a lot of time hand-writing into our
own compiler the optimizations that LLVM would have performed, in
order to improve performance at -O0, but I don't think that's a
particularly good use of our time, and it would hurt the compiler's
maintainability.
There are, however, some more situational things we can do.
# General code generation performance
* We can make `~"string"` allocations (and some others, like `~[1, 2,
3, 4, 5]`) turn into calls to the moral equivalent of `strdup`. This
improves some workloads, such as the PGP key in cargo (which should
just be a constant in the first place). `rustc` still allocates a lot
of strings like this, so this might improve the LLVM portion of
`rustc`'s compilation speed. (A sketch of the idea follows this list.)
* Visitor glue should be optional; you should have to opt into its
generation, like Haskell's `Data.Typeable`. This would potentially
remove 15% of our code size and improve our code generation
performance by a similar amount, but, as Graydon points out, it is
needed for precise-on-the-heap GC. Perhaps we could use conservative
GC at -O0, and thus reduce the amount of visitor glue we need to
generate for unoptimized builds.
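Here is a sketch of what the `strdup`-like lowering for `~"string"` could
look like, in modern syntax; `str_dup` is a hypothetical helper, not an
existing rustc or libcore function:

```rust
// The literal stays a static constant; the owned copy is produced by a
// single allocate-and-memcpy call rather than element-by-element code.
fn str_dup(constant: &'static str) -> String {
    let mut s = String::with_capacity(constant.len());
    s.push_str(constant); // one memcpy into the fresh allocation
    s
}

fn main() {
    // What a `~"string"` literal could lower to:
    let owned = str_dup("string");
    assert_eq!(owned, "string");
}
```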
# -O0 performance
For -O0 (which is the default), we get kicked off LLVM's fast
instruction selector far too often. We need to stop generating the
instructions that cause LLVM to bail out to the slow SelectionDAG path
at -O0.
This only affects -O0, but since that's the most common case that
matters for compilation speed, that's fine. Note that these
optimizations are severely limited in what they can do for
self-hosting performance, for the reasons stated above.
* Invoke instructions cause FastISel bailouts. This means that we
can't use the C++ exception infrastructure for task failure if we want
fast builds. Graydon has work on an optional return-value-based
unwinding mode which is nearing completion. I have a patch in review
for a "disable_unwinding" flag, which disables unwinding on failure;
this should be safe to turn on for libsyntax and librustc, since they
have no need to handle failure gracefully, and doing so improves the
LLVM portion of -O0 compile times by 1.9x.
* Visitor glue used to cause FastISel bailouts, but this is fixed in
incoming.
* Switch instructions cause FastISel bailouts. Pattern matching on
enums (and sometimes on integers too) generates these. Drop and take
glue on enums generates these too. This shouldn't be too hard to fix.
* Integer division instructions result in FastISel bailouts on x86. We
generate a lot of these because our vector lengths are in bytes. We
could change that, or we could try to hack LLVM, or we could turn
integer divisions into function calls to libcore at -O0. (Note that
integer division turns into a function call *anyway* on ARM, since ARM
has no integer divide instruction. So I'm inclined to try the last
one; a sketch follows.)
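As a sketch of that last option, the -O0 lowering could call an ordinary
library routine instead of emitting a divide instruction; the name and
exact placement in libcore are hypothetical:

```rust
// The single divide instruction then lives in this one out-of-line
// function instead of at every division site in the generated code.
#[inline(never)]
pub fn udiv_usize(dividend: usize, divisor: usize) -> usize {
    dividend / divisor
}

fn main() {
    // e.g. converting a vector length in bytes into an element count
    let len_in_bytes = 64usize;
    let elem_size = 8usize;
    assert_eq!(udiv_usize(len_in_bytes, elem_size), 8);
}
```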
# Memory allocation performance
Our memory allocation is suboptimal in several ways. I do not think
that fixing these will improve compiler performance as long as you
aren't already swapping, but I'll list them anyway.
* We do not have our own allocator; we just use the system malloc.
However, we need to trace all allocations in order to clean up @ cycles
on task death, so we thread all allocations into a doubly-linked list.
This wastes a lot of memory on the next and previous pointers (see the
sketch after this list). We could fix this by using an allocator that
allows us to trace allocations. I would be surprised if fixing this had
a huge impact on performance, but maybe it would bump some allocations
that were previously in higher storage classes into the TINY class,
which generally has a fast path in the allocator. And, of course, it
would reduce swapping when self-hosting if you don't have enough memory.
* We don't clean up @ cycles until task death. Fixing this will, in
all likelihood, worsen the compiler's performance. However, its memory
usage will improve.
* ~ allocations don't really need to be linked into any list or be
traceable, *unless* they contain @ pointers, at which point they do
need to be traceable. Fixing this will improve memory usage and
improve performance by a negligible amount.
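For concreteness, here is roughly what the per-allocation list overhead
described in the first bullet looks like; the field names and layout are
illustrative, not the runtime's actual box header:

```rust
// Every @ box carries two extra words of list links so the cycle
// collector can walk all live boxes at task death.
#[allow(dead_code)]
struct BoxHeader {
    ref_count: usize,
    prev: *mut BoxHeader, // one word of pure bookkeeping
    next: *mut BoxHeader, // another word of pure bookkeeping
    // ...type descriptor, then the payload itself
}

fn main() {
    println!("list overhead per box: {} bytes",
             2 * std::mem::size_of::<*mut BoxHeader>());
}
```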
# External metadata
We currently read external crate metadata in its entirety during a few
phases of the compiler. This dominates compilation time only for small
programs; in larger programs such as rustc, the cost quickly shrinks to
nothing compared to the rest of the compilation. However, since
newcomers to Rust generally compile small
programs, this is most of the cost they see. Also, this constitutes
the majority of the time that our test suite takes. Finally, this is
the performance bottleneck for the REPL.
Improving this will not improve the compilation speed of self-hosting
by more than 1%. The biggest benefit of fixing this is that small
programs will appear to compile instantly, which improves the first
impressions of Rust a lot for those used to fast builds in other
languages.
* External metadata reading takes a long time (0.3 s). I'm not sure
whether all of this is necessary, as I'm not too familiar with this pass.
* Language item collection reads all the items in external crates to
look for language items (another 0.3 s). This is silly and is easy to
fix; we just add a new table to the metadata that specifies def IDs
for language items (a sketch follows this list).
* Name resolution has to read all the items in external crates
(another 0.3 s). This was the easiest way to approximate the 0.5 name
resolution semantics. (The actual semantics were basically
unimplementable, but this algorithm got close enough to work in
practice -- usually.) With the new semantics in Rust 0.6 we should be
able to do better here and avoid reading modules until they're
actually referenced. Unfortunately, fixing this will require rewriting
resolve, which is a month's worth of work.
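A sketch of the proposed language-item table; the entry layout and the
example names and IDs here are made up, and the real encoding would live
in the crate metadata:

```rust
// One small fixed list of (name, def ID) pairs instead of a scan of
// every item in the external crate.
struct LangItemEntry {
    name: &'static str, // e.g. "add", "drop"
    def_id: u32,        // index of the defining item within the crate
}

const LANG_ITEMS: &[LangItemEntry] = &[
    LangItemEntry { name: "add",  def_id: 17 },
    LangItemEntry { name: "drop", def_id: 42 },
];

fn main() {
    for entry in LANG_ITEMS {
        println!("{} -> def id {}", entry.name, entry.def_id);
    }
}
```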
# Stack switching
* We could run rustc with a large stack and avoid stack switching.
This is functionality we need for Servo anyway. This might improve
compiler performance by 1% or so.
None of these optimizations will improve the `rustc` self-hosting time
by anything approaching an order of magnitude. However, I think they
could have a positive impact on the experience for newcomers to Rust.
Patrick
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev