> One of the problems which the remaining parallel query patches need to > solve is to correlate Plan or PlanState nodes across multiple > backends. This need arises in a couple of cases. First, if a group > of workers are all executing the same plan nodes, and es_instrument != > 0, then we'll accumulate instrumentation data in each worker, but we > need to collect all of the instrumentation data from each individual > worker and add the totals from each counter in each worker to the > master's totals, so that what finally gets reported includes the > effort spent in all workers. And of course the instrumentation data > for any given node in the worker needs to be added to the > *corresponding node* in the master, not any random node. > > Second, for nodes that are actually parallel aware, like the proposed > Partial Seq Scan node, there's bound to be some shared state. In the > case of a Partial Seq Scan, for example, we need to keep track of > which blocks are yet to be scanned. Amit's patch is currently silly > about this: it assumes there will be only one Partial Seq Scan in a > plan, which makes it easy to find the relevant state. But that's not > an assumption we want to be stuck with. Here again, if there are two > or more Partial Seq Scans in the plan passed to every worker, perhaps > with an Append node on top of them, then we need a way to match them > up across all the backends, so that each group of corresponding nodes > is looking at the same bit of shared state. > > To meet these needs, what I propose to do is have > set_plan_references() assign an integer ID to every plan node that is > unique across the entire plan tree. Then, when we ship part of the > plan tree off to a worker to be executed, the integer IDs will also > get copied (by copyfuncs.c support yet to be added, but it should be > pretty boilerplate ... I think), so that whatever plan node #42 is in > the parallel group leader will also be plan node #42 in the worker. > When I started out, I had the idea of trying to number the PlanState > nodes rather than the Plan nodes, and count on the fact the master and > the worker would traverse the query tree in the same order; since the > worker had only a subtree of the whole plan, it's numbering would be > offset from the master's, but the idea was that you could fix this by > adding an offset. This turns out to not to work, at least not the way > I tried to do it, because ExecInitNode() visits all initPlans > everywhere in the tree before recursing over the main tree, so you > only get identical numbering everywhere if you start with the same > exact tree. In any case, relying on two traversals in separate > processes to visit everything in the same order seems a bit fragile; > putting the ID into the plan node - which is what will actually be > passed down to the worker - seems much more robust. > > It would be nice if we could assign the integer IDs with no gaps, and > with the IDs of a node's children immediately following that of their > parent. The advantage of this is that if you want to have a data > structure for every node in the tree passed to some worker - like a > struct Instrumentation in dynamic shared memory - you can just create > an array and index it by (node ID) - (node ID of uppermost child > passed to worker), and every slot will be in use, so no memory is > wasted and no lookup table is needed. > I entirely agree with the idea of plan-node identifier, however, uncertain whether the node-id shall represent physical location on the dynamic shared memory segment, because (1) Relatively smaller number of node type needs shared state, thus most of array items are empty. (2) Extension that tries to modify plan-tree using planner_hook may need to adjust node-id also.
Even though shm_toc_lookup() has to walk on the toc entries to find out the node-id, it happens at once on beginning of the executor at background worker side. I don't think it makes a significant problem. Thanks, > Assigning the IDs in > set_plan_references() ... almost succeeds at this. The removal of > SubqueryScan nodes as part of that process could create gaps in the > numbering sequence, but probably few enough that we could just ignore > the problem and waste an entry in whatever arrays we create whenever > that optimization applies. > > My main concern with this design is how future-proof it is. I know > Tom is working on ripping the planner apart and putting it back > together again to allow the upper levels of the planner to use paths, > and the discussion of the SS_finalize_plan() changes made some > reference to possibly wanting to mess with set_plan_references(). I'm > not sure if any of those changes would disturb what I'm proposing > here. > > Thoughts? Better ideas? Concept patch attached. > -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kai...@ak.jp.nec.com> -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers