On Fri, Feb 6, 2015 at 12:34 PM, Robert Haas <robertmh...@gmail.com> wrote:
> 4.
Obviously that went out a bit too soon. Anyway, what I think we should do here is back up a bit and talk about what the problems are that we need to solve here and how each of them should be solved. I think there is some good code in this patch, but we really need to think about what the interfaces should look like and achieve a clean separation of concerns.

Looking at the code for the non-parallel SeqScan node, there are basically two things going on here:

1. We call heap_getnext() to get the next tuple and store it into a TupleTableSlot.

2. Via ExecScan(), we do projection and apply quals.

My first comment here is that I think we should actually teach heapam.c about parallelism. In other words, let's have an interface like this:

extern Size heap_parallelscan_estimate(Snapshot snapshot);
extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
                                         Relation relation, Snapshot snapshot);
extern HeapScanDesc heap_beginscan_parallel(ParallelHeapScanDesc);

So the idea is that if you want to do a parallel scan, you call heap_parallelscan_estimate() to figure out how much space to reserve in your dynamic shared memory segment. Then you call heap_parallelscan_initialize() to initialize the chunk of memory once you have it. Then each backend that wants to assist in the parallel scan calls heap_beginscan_parallel() on that chunk of memory and gets its own HeapScanDesc. Then, they can all call heap_getnext() just as in the non-parallel case. The ParallelHeapScanDesc needs to contain the relation OID, the snapshot, the ending block number, and a current-block counter. Instead of automatically advancing to the next block, they use one of Andres's nifty new atomic ops to bump the current-block counter and then scan the block just before the new value. All this seems pretty straightforward, and if we decide to later change the way the relation gets scanned (e.g. in 1GB chunks rather than block-by-block) it can be handled here pretty easily.
Now, let's suppose that we have this interface and for some reason we don't care about quals and projection - we just want to get the tuples back to the master. It's easy enough to create a parallel context that fires up a worker and lets the worker call heap_beginscan_parallel() and then heap_getnext() in a loop, but what does it do with the resulting tuples? We need a tuple queue that can be used to send the tuples back to the master. That's also pretty easy: just set up a shm_mq and use shm_mq_send() to send each tuple. Use shm_mq_receive() in the master to read them back out.

The only thing we need to be careful about is that the tuple descriptors match. It must be that they do, because the way the current parallel context patch works, the master is guaranteed to hold a lock on the relation from before the worker starts up until after it dies. But we could stash the tuple descriptor in shared memory and cross-check that it matches just to be sure.

Anyway, this doesn't seem terribly complex, although we might want to wrap some abstraction around it somehow so that every kind of parallelism that uses tuple queues can benefit from it. Perhaps this could even be built into the parallel context machinery somehow, or maybe it's something executor-specific. At any rate it looks simpler than what you've got now.

The complicated part here seems to me to be figuring out what we need to pass from the parallel leader to the parallel worker to create enough state for quals and projection. If we want to be able to call ExecScan() without modification, which seems like a good goal, we're going to need a ScanState node, which is going to need to contain valid pointers to (at least) a ProjectionInfo, an ExprContext, and a List of quals. That in turn is going to require an ExecutorState. Serializing those things directly doesn't seem very practical; what we instead want to do is figure out what we can pass that will allow easy reconstruction of those data structures.
Right now, you're passing the target list, the qual list, the range table, and the params, but the range table doesn't seem to be getting used anywhere. I wonder if we need it. If we could get away with passing just the target list, qual list, and params, we'd be doing pretty well, I think. But I'm not sure exactly what that looks like.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers