Hi Everyone, TL;DR: Making things faster. Architectural evalation.
as some of you might be aware I've been working on making execution of larger queries in postgresl faster. While working on "batched execution" I came to the conclusion that, while necessary, isn't currently showing a large benefit because expression evaluation and tuple deforming are massive bottlenecks. I'm posting a quite massive series of WIP patches here, to get some feedback. Tuple deforming is slow because of two reasons: 1) It's the first thing that accesses tuples, i.e. it'll often incur cache misses. That's partially fundamental, but also partially can be addressed, e.g. through changing the access order in heap as in . 2) Tuple deforming has a lot of unpredicatable branches, because it has to cope with various types of fields. We e.g. perform alignment in a lot of unneeded cases, do null checks for NOT NULL columns et al. I tried to address 2) by changing the C implementation. That brings some measurable speedups, but it's not huge. A bigger speedup is making slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers; but it's still not huge. Finally I turned to just-in-time (JIT) compiling the code for tuple deforming. That doesn't save the cost of 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01). The first part is done in 0008, the JITing in 0012. Expression evaluation and projection is another major bottleneck. 1) Our recursive expression evaluation puts a *lot* of pressure on the stack. 2) There's a lot of indirect function calls when recursing to other expression nodes. These are hard to predict, because the same node type (say ExecEvalAnd()) is used in different parts of an expression tree, and invokes different sub-nodes. 3) The function calls to operators and other functions are hard to predict, leading to a significant number of pipeline stalls. 4) There's a fair amount of pg_list.h list style iteration going on, those are cache and pipeline inefficient. After some experimenting I came to the conclusion that the recursive processing is a fundamental impediment to making this faster. I've converted (0006) expression processing and projection into an opcode dispatch based interpreter. That yields, especially for complex expressions and larger projections a significant speedup in itself. But similarly to the deforming, expression evaluation remains a bottleneck after that, primarily because there's still a lot of unpredictable jump and calls, and because loads/stores have to be complex (e.g. ExprContext->ecxt_innertuple->tts_values[i]/tts_isnull[i] for a single scalar var evaluation). Using the opcode based representation of expression evaluation (as it's nearly linear, and has done a lot of the lookups ahead of time), it's actually quite easy to *After JITing expression evaluation itself is more than ten times faster than before*. But unfortunately that doesn't mean that queries are ten times faster - usually we'll hit bottlenecks elsewhere relatively soon. WRT to expression evaluation, the biggest cost afterwards are the relatively high overhead V1 function calls - register based parameter passing is a lot faster. After experimenting a bit with doing JITing manually (a lot of eye-stabbing kind of fun), I chose to use LLVM. An overview of the patch-queue so far: 0001 Make get_last_attnums more generic. Boring prerequisite. 0002 More efficient AggState->pertrans iteration. Relatively boring minor optimization, but it turns out to be a easily hit bottleneck. Will commit independently. 0003 Avoid materializing SRFs in the FROM list. 0004 Allow ROWS FROM to return functions as single record column. 0005 Basic implementation of targetlist SRFs via ROWS FROM. 0006 Remove unused code related to targetlist SRFs. These are basically just pre-requisites for the faster expression evaluation, and discussed elsewhere . This implementation is *NOT* going to survive, because we ended coming to the conclusion that using a separate executor node to expand SRFs is a btter plan. But the new expression evaluation code won't be able to handle SRFs... 0007 WIP: Optimize slot_deform_tuple() significantly. This a) turns tuple deforming into an opcode based dispatch loop (using computed goto on gcc/clang). b) moves a lot of the logic from slot_deform_tuple() callsites into itself - that turns out to be more efficient. I'm not entirely sure it's worth doing the opcode based dispatch part, if we're going to also do the JIT bit - it's a fair amount of code, and the speed difference only matters on large amounts of rows. 0008 WIP: Faster expression processing and targetlist projection. This, functionally nearly complete, patch turns expression evaluation (and tuple deforming as a special case of that) into a "mini language" which is interpreted using either a while(true) switch(opcode) or computed goto to jump from opcode to opcode. It does so by moving a lot more of the code for expression evaluation to initialization time and building a linear series of steps to evaluate expressions, thereby removing all recursion from expression processing. This nearly entirely gets rid of the stack usage cost of expression evaluation (we pretty much never recurse except for subplans). Being able to remove, now redundant, calls to check_stack_depth() is a noticeable benefit, it turns out that that check has a noticeable performance impact (as it aparently forces to actually use the stack, instead of just renumbering registers inside the CPU). The new representation and evaluation is functionally nearly complete (there's a single regression test failure, and I know why that is), but the code needs a fair amount of polishing. I do absolutely think that the fundamentals of this are the right way to go, and I'm going to work hard on polishing the patch up. But this isn't something that we can easily do in parts, and it's a huge ass patch. So I'd like to have at least some more buyin before wasting even more time on this. 0009 WIP: Add minimal keytest implementation. More or less experimental patch that tries to implement simple expression of the OpExpr(ScalarVar, Const) into a single expression evaluation step. The benefits probably aren't big enough iff we do end up doing JITing of expressions. 0010 WIP: Add configure infrastructure to enable LLVM. 0011 WIP: Beginning of a LLVM JIT infrastructure. Very boring preliminary patches to add --with-llvm and some minimal infrastructure to handle LLVM. If we go this way, JITed stuff needs to be tied to resource owners, and we need some other centralized infrastructure. 0012 Heavily-WIP: JITing of tuple deforming. This, in a not-yet-that-nice manner, implements a JITed version of the per-column stuff that slot_deform_tuple() does. It currently always deforms all columns, which obviously would have to change. There's also considerable additional performance improvements possible. With this patch the per-column overhead (minus bitmap handling, which 0007 moved into a separate loop), drops from 10%+ into low single digits for a number of queries. Afterwards the biggest cost is VARSIZE_ANY() for varlena columns (which atm isn't inlined). That is, besides the initial cache-miss when accessing tuple->t_hoff, which JITing can do nothing about :( This can be enabled/disabled using the new jit_tuple_deforming GUC. To make this production ready in some form, we'd have to come up with a way to determine when it's worth doing JITing. The easiest way would be to do so after N slot_deform_tuple() calls or such, another way would be to do it based on cost estimates. 0013 WIP: ExprEval: Make threaded dispatch use a separate field. Boring preliminary patch. Increases memory usage a bit, needs to be thought through more. 0014 Heavily-WIP: JITed expression evaluation. This is the most-interesting bit performance wise. A few common types of expressions are JITed. Scalar value accesses, function calls, boolean expressions, aggregate references. This can be enabled using the new jit_expressions GUC. Even for the supported expression types I've taken some shortcuts (e.g. strict functions aren't actually strict). The performance benefits are quite noticeable. For TPCH ExecEvalExpr() (which is where 0008 moved all of expression evaluation/projection) goes from being the top profile entry, to barely noticeable, with the JITed function usually not showing up in the top five entries anymore. After the patch it becomes very clear that our function call infrastructure is a serious bottlenecks. Passing all the arguments via memory, and, even worse, forcing isnull/values to be on separate cachelines, has significant performance implications. It also becomes quite noticeable that nodeAgg's transition function invocation doesn't go through ExecEvalExpr() but does that itself - which leads to constant mispredictions if several transition values exist. While the JIT code is relatively verbose, it turns out to not actually be that hard to write after some startup pains. All the JITing of expressions that exists so far was basically written in ~10 hours. This also needs some heuristics about when JITing is appropriate. Compiling an expression that's only executed once is never going to be faster than doing the interpretation (it at least needs a writable allocation for the code, and then a remap to make that code read-only and executable). A trace based approach (everything executed at least a thousand times) or cost based (all queries costing more than 100000 should be JITed) could make sense. It's worthwhile to note that at the moment this is a per-query-execution JIT, not something that can trivially be cached for prepared statements. That'll need further infrastructure. 0015 Super-Heavily-WIP: LLVM perf integration. This very very very preliminary patch (including some copy-pasted GPL code!) creates /proc/perf-<pid>.map files, which allows perf to show useful symbols for profile hits to JIT expressions. I plan to push this towards LLVM, so this isn't something PG will have to do, but it's helpful for evaluation. I eventually plan to start separate threads about some of the parts in here, but I think the overal picture needs some discussion first. Q: Why LLVM and not a hand-rolled JIT? A: Because hand-rolling a JIT is probably hard to scale to multiple maintainers, and multiple platforms. I started down the path of doing a hand-rolled x86 JIT, and that'd also be doable (faster compilation, slower execution basically); but I doubt we'd end up having that on different architectures on platforms. Not to speak of things like proper debugger and profiler integration. I'm not entirely convinced that that's the right path. It might also be a transitional step, towards doing our completely own JIT. But I think it's a sensible step. Q: Why LLVM and not $jit-toolkit A: Because all the other JIT stuff I looked at was either really unportable (mostly x86 linux only), inconveniently licensed (like e.g. gcc's jit library) or nearly unmaintained (luajit's stuff for example). I might have missed something, but ISTM that atm the choice is between hand-rolling and using LLVM. Q: Does this actually inline functions from the backend? A: No. That probably is something desirable in the future, but to me that seems like it should be a separate step. The current one's big enough. It's also further increases compilation times, so quite possibly we only want to do so based on another set of heuristics. Q: ? Comments? Questions? Regards, Andres  https://archives.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de  https://www.postgresql.org/message-id/20160523005327.v2tr7obytitxc...@alap3.anarazel.de -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers