I would like to work on push-based executor [1] during GSoC, so I'm
writing to introduce myself and start the discussion of the project. I
think I should mention beforehand that the subject is my master's
thesis topic, and I have already started working on it. This letter is
not (obviously) a ready proposal but rather initial point to talk over
the concept. Below you can see a short review of the idea, description
of benefits for the community, details, related work and some info
about me.

*Brief review*
The idea is described at the wiki page [1] and in the letter [2]. I
propose to replace current ExecProcNode interface between execution
nodes with function called, say, pushTuple that pushes the ready tuple
to the current node's parent.

*Benefits for the community*
Why would we want this? In general, because Postgres executor is slow
for CPU-bound queries and this approach should accelerate it. [4] and
[5] argue that this model results in better code and data locality,
and that JIT compilation makes the difference even more drastic.

Besides, while working on this, in order to study the effects of model
change I will try to investigate the Postgres executor's performance
in both models extensively. For instance, it is commonly accepted that
current Volcano-style model leads to poor usage of modern CPUs
pipelining abilities and large percent of branch mispredictions. I am
going to see whether, where and when this is true in Postgres;
profiling results should be useful for any further executor

*Project details*
Technically, I am planning to implement this as follows. Common for
all nodes code which needs to be changed is in execMain.c and
execProcnode.c; standard_ExecutorRun in execMain.c now should start
execution of all leaf nodes in proper order instead of pulling tuples
one-by-one from top-level node. By 'proper' order here I mean that
inner nodes will be run first, outer nodes second, so that when the
first tuple from outer side of some node arrives to it, the node
already received all its tuples from the inner side.

How we 'start' execution of a leaf? Recall that now instead of
ExecProcNode we have pushTuple function with following signature:

bool pushTuple(TupleTableSlot *slot, PlanState *node, PlanState *pusher)

'slot' is the tuple we push. 'node' is a receiver of tuple, 'pusher'
is sender of the tuple, its parent is 'node'. We need 'pusher' only to
distinguish inner and outer pushes. This function returns true if
'node' is still accepting tuples after the push, false if not,
e.g. Limit node can return false after required number of tuples were
passed. We also add the convention that when a node has nothing to
push anymore, it calls pushTuple with slot=NULL to let parent know
that it is done. So, to start execution of a leaf,
standard_ExecutorRun basically needs to call pushTuple(NULL, leaf,
NULL) once. Leaf nodes are a special case because pusher=NULL; another
obvious special case is top-level node: it calls pushTuple(slot, NULL,
node), such call will push the slot to the destination
((*dest->receiveSlot) (slot, dest) in current code).

Like ExecProcNode, pushTuple will call the proper implementation, e.g.
pushTupleToLimit. Such implementations will contain the code similar
to its analogue (e.g. ExecLimit), but, very roughly, where we have

return slot;

now, in push model we will have

bool parent_accepts_tuples = pushTuple(slot, node->parent, node);

and then we will continue execution if parent_accepts_tuples is true
or exit if not.

Complex nodes require more complicated modifications to preserve the
correct behaviour and be efficient. The latter leads to some
architectural issues: for example, efficient SeqScan should call
pushTuple from function doing similar to what heapgettups_pagemode
currently does, otherwise, we would need to save/restore its state
(lines, page, etc) every time for each tuple. On the other hand, it is
not nice to call pushTuple from there because currently access level
(heapam.c) knows nothing about PlanStates. Such issues will need to be
addressed and discussed with the community.

Currently, I have a prototype (pretty much WIP) which implements this
model for SeqScan, Limit, Hash and Hashjoin nodes.

Since TPC-H benchmarks are de facto standard to evaluate such things,
I am planning to to use them for testing. BTW, I’ve written a couple
of scripts to automate this job [16], although it seems that everyone
who tests TPC-H ends up with writing his own version.

Now, it is clear that rewriting all nodes with full support in such a
manner is huge work. Besides, we still don't know quantitative profit
of this model.  Because of that, I do not propose any timeline right
now; instead, we should decide which part of this work (if any) is
going to be done in the course of GSoC. Probably, all TPC-H queries
with and without index support is a good initial target, but this
needs to be discussed. Anyway, I don't think that the result will be a
patch ready to be merged into Postgres master straight away, because
it is rather radical change; and it seems that supporting both
executors simultaneously is also a bad idea because many code would be
duplicated in this case.

*Related work*
There are other works aimed for improving executor performance. I can
mention at least four approaches:
* JITing things [6][10][17]
* batched and/or vectorized execution [7][8][9]
* expression evaluation optimizations [10][17]
* slot deforming optimizations [10]

The latter two are orthogonal to the proposed project.
Batched execution and JIT can be applied together, and some study [10]
shows benefits of such combined approach.

While batched execution and push-based execution model can be applied
together too, they seemingly solve more or less same problems -- code
and data locality, avoiding reloading node's state and better use of
modern CPU features.  However, batched execution requires massive
changes to the current code and seems harder to implement; IIRC I have
seen some patches on this at mailing lists, but I am not aware which
work is the most complete as of now. It is not easy to compare these
approaches theoretically; probably, the best way to estimate their
effect is to implement them and run benchmarks.

Relationship between JIT-compilation and push-based execution model is
interesting: paper [5] shows that HyPer system with JIT + push runs 4x
times faster on some queries than JIT + pull. It should be noted that
HyPer uses column-wise storage though.
Full query compiler developed at ISP RAS [6] speeds up query
processing 2-5 times on TPC-H queries and exploits push-based
execution model. However, supporting it requires implementing executor
logic twice: in plain C for usual interpreter and in LLVM API for JIT
compiler. Ideally we would like to write the code once and be able to
use it either with and without JIT compilation.
There is an ongoing work at ISP RAS to make this possible using
automatic runtime code specialization; however, experiments have
showed that specialization of original Postgres C code doesn't give
significant improvement because of Volcano-style model. We expect that
after making a switch to push-based model in Postgres code we will
achieve speedup comparable to full-query JIT using runtime

*About me*
My name is Arseny Sher. Currently, I am studying master's degree at
CMC MSU [12] and working at ISP RAS [13]. Earlier I got bachelor's
degree at the same faculty. I started working with Postgres at the end
of October and I love its extensibility and excellent quality of
code. My previous work was mainly connected with big graphs
computation (keywords are Spark, GraphX, Scala, GraphLab). I also did
some scala.js coding for Russian philologists and participated in
project for IMDG performance comparison, doing mostly devops stuff
(Docker, Ansible, Python). Here are my stackoverflow [14] & github
accounts [15].

The idea of this project was born when my colleagues working on JITing
Postgres realized that runtime specialization of C code written in
push-based architecture should be much more efficient than
specializing existing code (see 'Related work' section), and at the
time I’ve decided that I want my thesis to be connected with

I am ready to work full-time this summer and I think that push-based
execution of all TPC-H queries is quite an achievable goal; however, I
haven't yet studied all required nodes in detail and I will make more
exact estimations if the community supports this project.

P.S. Should letters like this go to hackers or students mailing list?
The latter seems more suitable, but it looks rather dead...


[4] Efficiently Compiling Efficient Query Plans for Modern Hardware,
[5] Compiling Database Queries into Machine Code,
[6] LLVM Cauldron, slides
[11] Vectorization vs. Compilation in Query Execution
[12] https://cs.msu.ru/en
[13] http://www.ispras.ru/en/
[14] http://stackoverflow.com/users/4014587/ars
[15] https://github.com/arssher
[16] https://github.com/arssher/pgtpch

Arseny Sher

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to