Re: [HACKERS] asynchronous and vectorized execution

Kouhei Kaigai Mon, 09 May 2016 18:39:58 -0700

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Robert Haas
> Sent: Tuesday, May 10, 2016 2:34 AM
> To: [email protected]
> Subject: [HACKERS] asynchronous and vectorized execution
> 
> Hi,
> 
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there.  I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).  To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:
> 
> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.
>
Is the parallel aware Append node sufficient to run multiple nodes
asynchronously? (Sorry, I couldn't have enough time to code the feature
even though we had discussion before.)
If a part of child-nodes are blocked by I/O or other heavy stuff, it
cannot enqueue the results into shm_mq, thus, Gather node naturally
skip nodes that are not ready.
In the above example, scan on foreign-table takes longer lead time than
local scan. If Append can map every child nodes on individual workers,
local scan worker begins to return tuples at first, then, mixed tuples
shall be returned eventually.


However, the process internal asynchronous execution may be also beneficial
in case when cost of shm_mq is not ignorable (e.g, no scan qualifiers
are given to worker process). I think it allows to implement pre-fetching
very naturally.

> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior,

My concern about ExecProcNode is, it is constructed with a large switch
... case statement. It involves tons of comparison operation at run-time.
If we replace this switch ... case by function pointer, probably, it make
performance improvement. Especially, OLAP workloads that process large
amount of rows.

> since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.  I think that's
> probably right.   For example, consider a 5-table join where all of
> the joins are implemented as hash tables.  If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on.  What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat.  If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse.
>
I can agree with the above concern from my experience. Each HashJoin
step needs to fill up TupleTableSlot for each depth. Mostly, it is
just relocation of the attributes in case of multi-tables joins.

If HashJoin could gather five underlying hash-tables at once, it can
reduce unnecessary setup of intermediation tuples.
A position example is GpuHashJoin in PG-Strom. It constructs multi-
relations hash table, then, produce joined tuples at once.
Its performance is generally good.

> But even just ignoring the CPU
> cache aspect of it for a minute, suppose you want to write a loop to
> perform a hash join.  The inner loop fetches the next tuple from the
> probe table and does a hash lookup.  Right now, fetching the next
> tuple from the probe table means calling a function which in turn
> calls another function which probably calls another function which
> probably calls another function and now about 4 layers down we
> actually get the next tuple.  If the scan returned a batch of tuples
> to the hash join, fetching the next tuple from the batch would
> probably be 0 or 1 function calls rather than ... more.  Admittedly,
> you've got to consider the cost of marshaling the batches but I'm
> optimistic that there are cycles to be squeezed out here.  We might
> also want to consider storing batches of tuples in a column-optimized
> rather than row-optimized format so that iterating through one or two
> attributes across every tuple in the batch touches the minimal number
> of cache lines.
>
> Obviously, both of these are big projects that could touch a large
> amount of executor code, and there may be other ideas, in addition to
> these, which some of you may be thinking about that could also touch a
> large amount of executor code.  It would be nice to agree on a way
> forward that minimizes code churn and maximizes everyone's attempt to
> contribute without conflicting with each other.  Also, it seems
> desirable to enable, as far as possible, incremental development - in
> particular, it seems to me that it would be good to pick a design that
> doesn't require massive changes to every node all at once.  A single
> patch that adds some capability to every node in the executor in one
> fell swoop is going to be too large to review effectively.
> 
> My proposal for how to do this is to make ExecProcNode function as a
> backward-compatibility wrapper.  For asynchronous execution, a node
> might return a not-ready-yet indication, but if that node is called
> via ExecProcNode, it means the caller isn't prepared to receive such
> an indication, so ExecProcNode will just wait for the node to become
> ready and then return the tuple.
>
Backward compatibility is good. In addition, child node may want to
know the context when it is called. It may want to switch the behavior
according to the caller's expectation. For example, it may be beneficial
if SeqScan makes more aggressive prefetching on asynchronous execution.

Also, can we consider which data format will be returned from the child
node during the planning stage? It affects to the cost of inter-node
data exchange. If a pair of parent-node and child-node supports its
special data format (like as existing HashJoin and Hash doing), it shall
be a discount factor of cost estimation.

> Similarly, for vectorized execution,
> a node might return a bunch of tuples all at once.  ExecProcNode will
> extract the first one and return it to the caller, and subsequent
> calls to ExecProcNode will iterate through the rest of the batch, only
> calling the underlying node-specific function when the batch is
> exhausted.  In this way, code that doesn't know about the new stuff
> can continue to work pretty much as it does today.  Also, and I think
> this is important, nodes don't need the permission of their parent
> node to use these new capabilities.  They can use them whenever they
> wish, without worrying about whether the upper node is prepared to
> deal with it.  If not, ExecProcNode will paper over the problem.  This
> seems to me to be a good way to keep the code simple.
>
> For asynchronous execution, I have gone so far as to mock up a bit of
> what this might look like.  This shouldn't be taken very seriously at
> this point, but I'm attaching a few very-much-WIP patches to show the
> direction of my line of thinking.  Basically, I propose to have
> ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
> by putting them into a new PlanState member called "result", which is
> just a Node * so that we can support multiple types of results,
> instead of returning them.  There is also a result_ready boolean, so
> that a node can return without setting this Boolean to engage
> asynchronous behavior.  This triggers an "event loop", which
> repeatedly waits for FDs chosen by waiting nodes to become readable
> and/or writeable and then gives the node a chance to react.
> Eventually, the waiting node will stop waiting and have a result
> ready, at which point the event loop will give the parent of that node
> a chance to run.  If that node consequently becomes ready, then its
> parent gets a chance to run.  Eventually (we hope), the node for which
> we're waiting becomes ready, and we can then read a result tuple.
> With some more work, this seems like it can handle the FDW case, but I
> haven't worked out how to make it handle the related parallel query
> case.  What we want there is to wait not for the readiness of an FD
> but rather for some other process involved in the parallel query to
> reach a point where it can welcome assistance executing that node.  I
> don't know exactly what the signaling for that should look like yet -
> maybe setting the process latch or something.
> 
> By the way, one smaller executor project that I think we should also
> look at has to do with this comment in nodeSeqScan.c:
> 
> static bool
> SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
> {
>         /*
>          * Note that unlike IndexScan, SeqScan never use keys in 
> heap_beginscan
>          * (and this is very bad) - so, here we do not check are keys ok or 
> not.
>          */
>         return true;
> }
> 
> Some quick prototyping by my colleague Dilip Kumar suggests that, in
> fact, there are cases where pushing down keys into heap_beginscan()
> could be significantly faster.  Some care is required here because any
> functions we execute as scan keys are run with the buffer locked, so
> we had better not run anything very complicated.  But doing this for
> simple things like integer equality operators seems like it could save
> quite a few buffer lock/unlock cycles and some other executor overhead
> as well.
> 
> Thoughts, ideas, suggestions, etc. very welcome.
>

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <[email protected]>


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] asynchronous and vectorized execution

Reply via email to