At PGCon, in [1], and in various other threads and conferences, we've
talked about our executor performance needing improvements to perform
well in non-OLTP (i.e. analytics) workloads, and about how we can get
there.

I've played with various techniques on and off over the last few weeks,
including making slot_deform_tuple() et al. faster, batched execution
for a few node types (SeqScan, hashed Agg), avoiding linked lists in
executor data structures, and smaller micro-optimizations.

As motivation, here's a somewhat juicy example of the benefits that can
be gained (parallelism disabled, since results vary too much):

SELECT l_linestatus, SUM(l_quantity)
FROM lineitem
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
non-optimized: 9075.835 ms
    optimized: 5194.024 ms

And that's far from all the possible gains for that query; the batching
in particular is far from optimal.

So far I've mostly looked at performance for TPC-H, not because it's a
particularly good workload, but because it's readily available.  If
somebody has good suggestions for an analytics-heavy workload, I'd like
to hear them.

My observations about the performance bottlenecks I found, and partially
addressed, are in rough order of importance (there's interdependence
between most of them):

1) Cache misses cost us a lot; doing more predictable accesses can make
   a huge difference. Without addressing that, many other bottlenecks
   don't matter all that much.  I see *significant* performance
   improvements for large seqscans (when the results are used) simply
   by memcpy()'ing the current target block.

   This is partially an intrinsic problem of analyzing a lot of data,
   and partially a consequence of our access patterns being bad.
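
   To illustrate the access-pattern point, here's a toy C sketch (not
   PostgreSQL code; names and sizes are made up): summing values
   referenced from a shared block directly, versus first memcpy()'ing
   the block into a local buffer so subsequent reads hit warm cache
   lines:

```c
#include <assert.h>
#include <string.h>

#define BLCKSZ 8192             /* stand-in for a disk block's size */

/* Scattered reads directly into the (possibly cold) shared block. */
static long
sum_block_direct(const char *block, const int *offsets, int ntuples)
{
    long sum = 0;

    for (int i = 0; i < ntuples; i++)
        sum += block[offsets[i]];
    return sum;
}

/* Same computation, but stream the block into a local copy first:
 * one predictable sequential read, then scattered accesses hit warm
 * cache lines instead of going back to memory. */
static long
sum_block_copied(const char *block, const int *offsets, int ntuples)
{
    char    local[BLCKSZ];
    long    sum = 0;

    memcpy(local, block, BLCKSZ);
    for (int i = 0; i < ntuples; i++)
        sum += local[offsets[i]];
    return sum;
}
```

   Both return the same result, of course; the difference only shows up
   in cache behaviour on blocks that aren't already cache-resident.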

2) Barring 1), tuple deforming is the biggest bottleneck in nearly all
   queries I looked at. There are various places where we trigger
   deforming, most ending in either slot_deform_tuple() or
   heap_getattr().

   This can be optimized significantly, but even after that it's still a
   major issue.

   Part of the problem is that we sometimes don't know how many columns
   to access (especially in index- and hashing-related code); another
   part is that we deform a lot of columns that we'll never need, just
   because we need a subsequent one.

   A further part is that our tuple format requires a number of
   relatively expensive operations to access data (e.g. alignment
   computations, checking the null bitmap).
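
   To make the format cost concrete, here's a toy sketch (a
   hypothetical packed row of 4-byte ints with a null bitmap, *not* the
   real heap tuple format): because attribute offsets aren't fixed,
   fetching attribute n means walking attributes 0..n-1, consulting the
   null bitmap and realigning at every step:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A set bit in the null bitmap means "attribute present" (non-null). */
#define ATT_PRESENT(bits, i)  (((bits)[(i) >> 3] >> ((i) & 7)) & 1)
#define ALIGN4(off)           (((off) + 3) & ~(size_t) 3)

/* Fetch attribute 'attno' from a packed row of 4-byte ints: every
 * earlier attribute must be stepped over, with a null-bitmap check and
 * an alignment computation per step. */
static int32_t
fetch_attr(const uint8_t *nullbits, const char *data, int attno,
           int *isnull)
{
    size_t  off = 0;
    int32_t val = 0;

    for (int i = 0; i <= attno; i++)
    {
        if (!ATT_PRESENT(nullbits, i))
            continue;               /* null attributes occupy no space */
        off = ALIGN4(off);          /* per-attribute alignment */
        memcpy(&val, data + off, sizeof(val));
        off += sizeof(val);
    }
    *isnull = !ATT_PRESENT(nullbits, attno);
    return val;
}
```

   A null (or, in the real format, variable-width) column anywhere
   before the target defeats simple offset caching, which is why we end
   up deforming many early columns just to reach a later one.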

3) Our 1-by-1 tuple flow in the executor has two major issues:

   The biggest is that we perform a *lot* of repetitive work for each
   processed tuple. E.g. looking at nodeAgg.c's agg_fill_hash_table(),
   for each single tuple we execute ExecProcNode(), ExecSeqScan(),
   heap_getnext(), lookup_hash_entry(), LookupTupleHashEntry(), ... and
   many more. Each of these has noticeable per-call state-setup cost.
   When executing on batches of tuples, a lot of that setup work can be
   done once per batch, instead of once per tuple. Part of that the
   compiler can do automatically, part of that has to be done by hand.

   Also very significantly, due to the amount of code executed, there's
   barely any working branch prediction, leading to massive amounts of
   pipeline stalls and instruction cache misses.

   There's also the fact that some future optimizations around using
   SIMD are easier when looking at batches of tuples, but I'm not
   planning to work on that.
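
   In toy form (illustrative names, not the executor's actual API), the
   difference between tuple-at-a-time and batch-at-a-time looks like
   this; setup() stands in for the per-call state setup that
   ExecProcNode() etc. each redo:

```c
#include <assert.h>

typedef struct AggState
{
    long    sum;            /* the actual aggregation work */
    int     setup_calls;    /* how often per-call setup ran */
} AggState;

/* Stand-in for per-call state setup (lookups, pinning, etc.). */
static void
setup(AggState *state)
{
    state->setup_calls++;
}

/* Tuple-at-a-time: setup cost is paid once per tuple. */
static void
agg_one(AggState *state, int value)
{
    setup(state);
    state->sum += value;
}

/* Batch-at-a-time: the same setup is amortized over the whole batch,
 * and the inner loop is tight enough for the branch predictor (and,
 * eventually, SIMD) to do well. */
static void
agg_batch(AggState *state, const int *values, int nvalues)
{
    setup(state);
    for (int i = 0; i < nvalues; i++)
        state->sum += values[i];
}
```

   Both paths compute the same sum; only the number of setup() calls
   differs, which is exactly the repetitive work a batched executor
   avoids.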

4) Various missing micro-optimizations have to be performed before more
   architectural issues become visible. E.g. [2] causes such bad
   slowdowns in hash-agg workloads that other bottlenecks are hidden.

5) Per-tuple heap_getnext() makes it hard to fix 3), and has similar
   issues itself. Using a per-block heap_getmany() that fills a whole
   batch at once is noticeably faster (and more convenient in a batched
   world anyway).
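
   Interface-wise, I'm thinking of something shaped roughly like this
   (toy types; heap_getmany() is a proposed name, not an existing
   function):

```c
#include <assert.h>

typedef struct ScanDesc
{
    const int  *tuples;     /* stand-in for tuples on the current block */
    int         ntuples;
    int         pos;        /* scan position within the block */
} ScanDesc;

/* Fill 'out' with up to 'maxitems' tuples in one call; return the
 * number actually copied, 0 once the scan is exhausted. */
static int
scan_getmany(ScanDesc *scan, int *out, int maxitems)
{
    int n = 0;

    while (n < maxitems && scan->pos < scan->ntuples)
        out[n++] = scan->tuples[scan->pos++];
    return n;
}
```

   The consumer then loops while scan_getmany() returns a nonzero
   count, consuming a whole batch per call, which pairs naturally with
   the batched execution in 3).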

6) The use of linked lists adds noticeable instruction-count overhead
   and branch misses. Just replacing {FuncExprState,BoolExprState}->args
   with an array gives a ~10%ish boost for queries with a bunch of
   quals.  Another example that appears to hurt noticeably, but which
   I've not experimentally fixed, is AggState->hash_needed.

   To a large degree these seem fairly easily fixable; it's kinda boring
   work, but not really hard.
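
   The instruction/branch difference is easy to see in miniature (toy
   structs, not the actual FuncExprState fields): the list walk needs a
   data-dependent load of ->next to find each element, while the array
   walk is a predictable sequential loop:

```c
#include <assert.h>
#include <stddef.h>

typedef struct ListCell
{
    int              value;
    struct ListCell *next;
} ListCell;

/* Linked-list walk: each iteration waits on the load of lc->next,
 * and the loop-exit branch depends on pointer data. */
static int
count_true_list(const ListCell *lc)
{
    int n = 0;

    for (; lc != NULL; lc = lc->next)
        n += (lc->value != 0);
    return n;
}

/* Array walk: the next address is computed, not loaded, so the CPU
 * can prefetch ahead and the trip count is predictable. */
static int
count_true_array(const int *values, int nvalues)
{
    int n = 0;

    for (int i = 0; i < nvalues; i++)
        n += (values[i] != 0);
    return n;
}
```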

I'm planning to start individual subthreads for some of these, with
in-detail discussions and/or patches. But I wanted to present this on a
higher level once, before spending even more time.

Have I missed concrete issues others noticed? Missed significant
improvements (please, please, only realistic stuff)?

Unfortunately these items are heavily interdependent. For some queries
you'll need issue n) addressed before n+2) shows a benefit; for others
it's the other way around.  Given the size of the overall task, the
only way I can see this being reasonably addressable is to try to get
smaller steps in individually, even if they're not a huge win on their
own, and then focus on the larger architectural changes (primarily 3)).

Comments so far?  Other suggestions on how to get improvements in this
area merged?


Andres Freund


Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)