On Tue, May 10, 2016 at 7:57 PM, Andres Freund <and...@anarazel.de> wrote:
>> 1. asynchronous execution, by which I mean the ability of a node to
>> somehow say that it will generate a tuple eventually, but is not yet
>> ready, so that the executor can go run some other part of the plan
>> tree while it waits. [...] It is also a problem
>> for parallel query: in a parallel sequential scan, the next worker can
>> begin reading the next block even if the current block hasn't yet been
>> received from the OS. Whether or not this will be efficient is a
>> research question, but it can be done. However, imagine a parallel
>> scan of a btree index: we don't know what page to scan next until we
>> read the previous page and examine the next-pointer. In the meantime,
>> any worker that arrives at that scan node has no choice but to block.
>> It would be better if the scan node could instead say "hey, thanks for
>> coming but I'm really not ready to be on-CPU just at the moment" and
>> potentially allow the worker to go work in some other part of the
>> query tree. For that worker to actually find useful work to do
>> elsewhere, we'll probably need it to be the case either that the
>> table is partitioned or that the original query involves UNION ALL,
>> but those are not silly cases to worry about, particularly if we get
>> native partitioning in 9.7.
>
> I have to admit I'm not that convinced about the speedups in the !fdw
> case. There seem to be a lot of easier avenues for performance
> improvements.
What I'm talking about is a query like this:

SELECT * FROM inheritance_tree_of_foreign_tables WHERE very_rarely_true;

What we do today is run the remote query on the first child table to
completion, then start it on the second child table, and so on. Sending
all the queries at once can bring a speed-up of a factor of N to a
query with N children, and it's completely independent of every other
speed-up that we might attempt. This has been under discussion for
years on FDW-related threads as a huge problem that we need to fix
someday, and I really don't see how it's sane not to try. The shape of
what that looks like is of course arguable, but saying the optimization
isn't valuable blows my mind. Whether you care about this case or not,
this is also important for parallel query.

> FWIW, I've even hacked something up for a bunch of simple queries, and
> the performance improvements were significant. Besides it only being a
> weekend hack project, the big thing I got stuck on was considering how
> to exactly determine when to batch and not to batch.

Yeah. I think we need a system for signalling nodes as to when they
will be run to completion. But a Boolean is somehow unsatisfying; LIMIT
1000000 is more like no LIMIT than it is like LIMIT 1. I'm tempted to
add a numTuples field to every ExecutorState and give upper nodes some
way to set it, as a hint.

>> For asynchronous execution, I have gone so far as to mock up a bit of
>> what this might look like. This shouldn't be taken very seriously at
>> this point, but I'm attaching a few very-much-WIP patches to show the
>> direction of my line of thinking. Basically, I propose to have
>> ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return
>> tuples by putting them into a new PlanState member called "result",
>> which is just a Node * so that we can support multiple types of
>> results, instead of returning them.
>
> What different types of results are you envisioning?
TupleTableSlots and TupleTableVectors, mostly. I think the stuff that
is currently going through MultiExecProcNode() could probably be folded
in as just another type of result.

>> Some care is required here because any
>> functions we execute as scan keys are run with the buffer locked, so
>> we had better not run anything very complicated. But doing this for
>> simple things like integer equality operators seems like it could
>> save quite a few buffer lock/unlock cycles and some other executor
>> overhead as well.
>
> Hm. Do we really have to keep the page locked in the page-at-a-time
> mode? Shouldn't the pin suffice?

I think we need a lock to examine MVCC visibility information. A pin
is enough to prevent a tuple from being removed, but not from having
its xmax and cmax overwritten at almost but not quite exactly the same
time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers