On 11/21/2017 11:58 AM, Ali Çehreli wrote:
LDC, the LLVM-based D compiler, has been adding Link Time
Optimization capabilities over the last several releases. [...]
This talk will look at the results of applying LTO to one set
of applications, eBay's TSV utilities. [...]
Jon Degenhardt is a member of eBay's Search Science team.
[...] D quickly became his favorite programming language, one
he uses whenever he can.
On Friday, 15 December 2017 at 03:08:35 UTC, Ali Çehreli wrote:
This should be live now:
Great! I've added some comments there, pasted here:
Jon, thanks for the extensive talk and testing on LTO!
And thanks for recording / broadcasting :-)
(times are approximate)
7:45 Full vs Thin LTO further clarification: Full LTO is
single-threaded optimization and codegen (comparable with putting
all source in one module). Thin LTO loads each module separately and
imports the functions it needs from other modules; after that,
optimization and codegen happen in parallel for each module (and
normal linking happens afterwards). LTO's capabilities stem from
having access to the source code of functions in other modules, and
knowing which functions are internal to the program (so that they
can be removed, given a non-ABI-conformant calling convention, etc.,
also discussed around 41:30); the importing+optim that happens at the
start of Thin LTO gives you that, with the added advantage of
parallel optim+codegen afterwards.
14:00 If the question was: do you need all libraries to be in
IR: no. LTO works with mixed IR-object files and normal object
files and libraries. Even if linking with non-IR libraries, it
helps to know that no other object file references a symbol (so
you can internalize it and generate better code). But indeed, for
_much_ better optimization potential: the more source you have
compiled with LTO enabled the better.
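To make the "internalize it" point concrete, here is a toy model in Python (not LDC or LLVM code). It assumes each object file is described only by the symbols it defines and references; the `ObjectFile` type, the `internalizable` helper and all symbol names are hypothetical. A symbol defined in an IR object can be internalized when no native (non-IR) object references it and it is not an entry point:

```python
from dataclasses import dataclass, field


@dataclass
class ObjectFile:
    """Hypothetical description of one linker input."""
    name: str
    is_ir: bool  # True: LLVM IR (built with LTO), False: native machine code
    defines: set = field(default_factory=set)
    references: set = field(default_factory=set)


def internalizable(objects, entry_points=("main",)):
    """Return symbols defined in IR objects that nothing outside the IR set needs."""
    # Anything a native object references must stay externally visible.
    needed_by_native = set()
    for obj in objects:
        if not obj.is_ir:
            needed_by_native |= obj.references

    # Only symbols defined in IR objects are candidates at all.
    ir_defined = set()
    for obj in objects:
        if obj.is_ir:
            ir_defined |= obj.defines

    return ir_defined - needed_by_native - set(entry_points)


# Mixed link: two IR objects plus two native ones (all names made up).
objects = [
    ObjectFile("app.o", True, {"main", "parseLine", "helper"},
               {"parseLine", "helper", "memcpy"}),
    ObjectFile("util.o", True, {"splitFields"}),
    ObjectFile("libc.a", False, {"memcpy"}),
    ObjectFile("plugin.o", False, {"hook"}, {"helper"}),
]

print(sorted(internalizable(objects)))  # ['parseLine', 'splitFields']
```

`helper` has to stay visible because the native `plugin.o` references it; the other two can be treated as internal, which is what opens the door to removal or a custom calling convention.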
15:30 Whole source optimization at D-level has indeed higher
potential; at the moment I don't think we do many optimizations
that are only possible at D-level (and so they are done at IR
level; or not at all... I'm working e.g. on devirtualization).
Extra remark: the first step towards that is a much deeper and
well-defined spec of D semantics, in abstract machine terms.
15:45 Testing == contributing! And your testing has greatly
improved LDC's LTO, thanks!
15:50 The ldc-build-runtime tool was made by Martin Kinkelin,
and as you mention it is the enabler for most of your work.
16:15 LDC LTO Windows == integrating LLD into LDC (or using
~30:00 IIRC, the performance regression is due to cross-module
inlining/optim (as you mention), which we get for free with LTO
:-) (that is not to say that we wouldn't like to do
cross-module inlining without LTO)
33:20 Compilation time. LTO skips machine codegen during the
normal compilation, as machine codegen is done in the LTO linking
step. So the slowdown with Thin LTO may not be too much (Thin LTO
being a parallel build). An extreme case where LTO may actually
result in faster codegen: if you have 1 million template function
instantiations in CTFE, but they are not called during runtime,
LTO may easily discard them before they reach the optimization
and machine codegen stage. In such a case, LTO may very well be
faster (optimized machine codegen is time consuming); however,
the IR does have to be created and written to disk, and then read
from disk, and that takes time too... Overall, Thin LTO is slower
than a normal `-O3` build, though only by a small ratio, and it also
does more work (the added optimization). The compile speed
difference between Full LTO and Thin LTO is very large (Full LTO
is several times slower).
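The "discard them before codegen" case can be sketched as a reachability pass over a call graph. Again this is a toy model in Python, not what LLVM actually runs; the `reachable` and `split_live_dead` helpers and the function names in the example graph are made up for illustration. Functions that cannot be reached from the program's roots never need optimization or machine codegen:

```python
from collections import deque


def reachable(call_graph, roots):
    """Breadth-first walk: every function reachable from the given roots."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def split_live_dead(call_graph, roots):
    """Return (live, dead): only `live` would go on to optimization and codegen."""
    all_functions = set(call_graph)
    for callees in call_graph.values():
        all_functions.update(callees)
    live = reachable(call_graph, roots)
    return live, all_functions - live


# Hypothetical program: two template instantiations are only used at compile time.
call_graph = {
    "main": ["run"],
    "run": ["format"],
    "ctfeOnly!1": ["ctfeHelper"],
    "ctfeOnly!2": ["ctfeHelper"],
}

live, dead = split_live_dead(call_graph, ["main"])
print(sorted(live))  # ['format', 'main', 'run']
print(sorted(dead))  # ['ctfeHelper', 'ctfeOnly!1', 'ctfeOnly!2']
```

With a million entries like `ctfeOnly!N`, the dead set is where the time would be saved; the IR for them still had to be written and read back, which this sketch does not model.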
39:40 Indeed, D doesn't require codegen of templates if we can
prove that it is already codegenned in the library itself: i.e.
you _have_ to _link_ with a template-only library. In C++,
codegen of templates is mandatory (afaik), and thus you do not
have to link with a template-only library (e.g. header files
only). In D, this culling of template codegen is done to increase
compile speed; in that sense not a fair comparison with C++. For
cross-module inlining / inlining of templated functions: in C++
all template code is available in each codegenned module, so LTO
is not needed to improve things; in D, using LTO makes template
code available that otherwise wouldn't ---> larger (potentially
much larger) relative gains with LTO for D. (this is somewhat
particular to LDC currently; GDC does better cross-module
inlining; try LDC's `-enable-cross-module-inlining`)
56:40 I fully share your thinking that cross-module inlining is
the main source of performance gains.
Can't wait to see the results of LTO on Weka.io's (LARGE)
applications. Work in progress...!
Could you add the reference links in the comment section there
too? (can't click on blue links in the video ;-)
Clearly very interested in what your PGO testing will show. :-)