On Thu, Jan 21, 2016 at 4:26 AM, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
> I put some consideration and trial on callbacks as a means to
> async (early) execution.
Thanks for working on this.

>> > Suppose we equip each EState with the ability to fire "callbacks".
>> > Callbacks have the signature:
>> >
>> > typedef bool (*ExecCallback)(PlanState *planstate, TupleTableSlot
>> > *slot, void *context);
>> >
>> > Executor nodes can register immediate callbacks to be run at the
>> > earliest possible opportunity using a function like
>> > ExecRegisterCallback(estate, callback, planstate, slot, context).
>> > They can register deferred callbacks that will be called when a file
>> > descriptor becomes ready for I/O, or when the process latch is set,
>> > using a call like ExecRegisterFileCallback(estate, fd, event,
>> > callback, planstate, slot, context) or
>> > ExecRegisterLatchCallback(estate, callback, planstate, slot, context).
>
> I considered this. The immediate callbacks seem fine, but using a
> latch or fds to signal tuple availability doesn't seem to fit
> callbacks stored in the estate. They are deferrable until the
> parent's tuple request, and events of that kind can be handled at
> that time, as ExecGather does now. However, some kind of
> synchronization/waiting mechanism like a latch or select() is
> needed anyway.

I am not entirely sure I understand what you are trying to say here,
but if I do understand it then I disagree.  Consider an Append node
with 1000 children, each a ForeignScan.  What we want to do is fire
off all 1000 remote queries and then return a tuple from whichever
ForeignScan becomes ready first.  What we do NOT want to do is fire
off all 1000 remote queries and then have to either (a) iterate
through all 1000 subplans, checking repeatedly whether each is ready,
or (b) pick one of those 1000 subplans to wait for and ignore the
other 999 until the one we picked becomes ready.  I don't see any way
to achieve what we need here without some way to convince the
executor to do a select() across all 1000 fds and take some action
based on which one becomes read-ready.
> A callback is usable for not-so-common, one-shot event-driven
> operations such as error handling. For this case, the operation can
> be async execution of a node and the event can be "just before
> ExecProcNode on the topmost node". The first patch attached allows
> async-capable nodes to register callbacks during the Init phase and
> executes them just before the Exec phase on the topmost node. It
> greatly reduces the additional code as a result. My first
> impression from the word "callbacks" was this.

This strikes me as pretty much uninteresting.  I bet if you test this
you'll find that it doesn't make anything faster.  You're kicking
things off asynchronously only a tiny fraction of a second before you
would have started them anyway.  What we need is a way to proceed
asynchronously through the entire execution of the query, not just at
the beginning.

I understand that your merge-join-double-gather example can benefit
from this, but it does so by kicking off both subplans before we know
for sure that we'll want to read from both subplans.  I admit that
optimization rarely kicks in, but it's a potentially huge savings of
effort when it does.

> Instead, in the second patch, I modified ExecProcNode to return
> async status in the EState. It will be EXEC_READY or EXEC_EOT (end
> of table/no more tuples) for non-async-capable nodes, and
> async-capable nodes can set it to EXEC_NOT_READY, which indicates
> that there could be more tuples but none are available yet.
>
> Async-aware nodes such as Append can go to the next child if the
> predecessor returned EXEC_NOT_READY. If all !EXEC_EOT nodes
> returned EXEC_NOT_READY, Append will wait using some signaling
> mechanism (it busy-waits now instead). As an example, the second
> patch modifies ExecAppend to handle this and modifies ExecSeqScan
> to return EXEC_NOT_READY with a certain probability as an emulation
> of asynchronous tuple fetching. The UNION ALL query above returns
> results interleaved among the three tables as a result.
I think the idea of distinguishing between "end of tuples" and "no
tuples currently ready" is a good one.  I am not particularly excited
about this particular signalling mechanism, though.  I am not sure why
you want to put this in the EState - isn't that shared between all
nodes in the plan tree, and doesn't that therefore create a risk of
confusion?  What I might imagine instead is giving ExecProcNode a
second argument that functions as an out parameter and is only
meaningful when TupIsNull(result).  It could, for example, be a
Boolean indicating whether more tuples might become available later.
Maybe an enum is better, but let's suppose a Boolean for now.  At the
top of ExecProcNode we would do *async_pending = false.  Then, we'd
pass async_pending down to whichever ExecWhatever function is
potentially async-capable, and it could set *async_pending = true if
it wants.

>> Thanks for the attentive explanation. My concern about this is
>> the latency of synchronizing one by one for every tuple between
>> the producer and the consumer. My previous patch is not
>> asynchronous on every tuple, so it can give a pure gain without
>> any loss from tuple-wise synchronization. But this looks clean and
>> I like it, so I'll consider it.
>>
>> > It seems pretty straightforward to fit Gather into this infrastructure.
>>
>> Yes.
>
> If Gather's children became a regular node struct with a name like
> Worker(Node), instead of the non-Node structure they are now, we
> could generalize the tuple-synchronization mechanism so that it can
> be used by other nodes such as ForeignScan. Append(ForeignScan,
> ForeignScan, ...) with async tuple passing can spread load across
> multiple foreign servers, so I suppose it is preferable if no
> penalty exists.

I don't quite understand what you mean by saying that Gather's
children are not a "regular node struct".  Which struct are you
talking about?  I do agree that transferring tuples one by one seems
like it might be inefficient.
Gather suffered from a problem of this nature which was repaired, at
least partially, by commit bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.
I'm wondering if we need some kind of tuple-buffering abstraction.  A
tuple buffer could have an on-receipt-of-tuples callback; if it
doesn't have one, then the tuples can be read out synchronously, one
by one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers