On Sat, Sep 28, 2019 at 4:20 AM Bruce Momjian <br...@momjian.us> wrote:
> On Wed, Sep 4, 2019 at 06:18:31PM +1200, Thomas Munro wrote:
> > A few years back[1] I experimented with a simple readiness API that
> > would allow Append to start emitting tuples from whichever Foreign
> > Scan has data available, when working with FDW-based sharding. I used
> > that primarily as a way to test Andres's new WaitEventSet stuff and my
> > kqueue implementation of that, but I didn't pursue it seriously
> > because I knew we wanted a more ambitious async executor rewrite and
> > many people had ideas about that, with schedulers capable of jumping
> > all over the tree etc.
> >
> > Anyway, Stephen Frost pinged me off-list to ask about that patch, and
> > asked why we don't just do this naive thing until we have something
> > better. It's a very localised feature that works only between Append
> > and its immediate children. The patch makes it work for postgres_fdw,
> > but it should work for any FDW that can get its hands on a socket.
> >
> > Here's a quick rebase of that old POC patch, along with a demo. Since
> > 2016, Parallel Append landed, but I didn't have time to think about
> > how to integrate with that so I did a quick "sledgehammer" rebase that
> > disables itself if parallelism is in the picture.
>
> Yes, sharding has been waiting on parallel FDW scans. Would this work
> for parallel partition scans if the partitions were FDWs?
Yeah, this works for partitions that are FDWs (as shown), but only for Append, not for Parallel Append. So you'd have parallelism in the sense that your N remote shard servers are all doing stuff at the same time, but it couldn't be in a parallel query on your 'home' server, which is probably good for things that push down aggregation and bring back just a few tuples from each shard, but bad for anything wanting to ship back millions of tuples to chew on locally. Do you think that'd be useful enough on its own?

The problem is that parallel-safe non-partial plans (like postgres_fdw scans) are exclusively 'claimed' by one process under Parallel Append, so with the patch as posted, if you modified it to allow parallelism it would probably give correct answers, but nothing prevents a single process from claiming and starting all the scans and then waiting for them to be ready, while the other processes miss out on doing any work at all. There's probably some kludgy solution involving not letting any one worker start more than X scans, and some space cadet solution involving passing sockets around and teaching libpq to hand over connections at certain controlled phases of the protocol (due to lack of threads), but nothing like that has jumped out as the right path so far.

One idea that seems promising but requires a bunch more infrastructure is to offload the libpq multiplexing to a background worker that owns all the sockets, and have it push tuples into a multi-consumer shared memory queue that regular executor processes could read from. I have been wondering whether that would be best done by each FDW implementation, or whether there is a way to build generic infrastructure for converting parallel-safe executor nodes into partial plans by the use of a 'Scatter' (opposite of Gather) node that can spread the output of any node over many workers. If you had that, you'd still want a way for Parallel Append to be readiness-based, but it would probably look a bit different to this patch, because it'd need to use (vapourware) multi-consumer shm queue readiness, not fd readiness. And another kind of fd-readiness multiplexing would be going on inside the new (vapourware) worker that handles all the libpq connections (and maybe other kinds of work for other FDWs that are able to expose a socket).
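In case it helps to see the mechanism concretely, here's a minimal standalone sketch of the kind of fd-readiness multiplexing involved, using plain poll() rather than the server's WaitEventSet machinery. It's not the patch itself; the shard connection strings and the query are made up, and error handling is minimal:

/*
 * Sketch: issue a query on several libpq connections up front, then
 * poll() their sockets and drain whichever shard answers first,
 * instead of reading the connections in declaration order.
 */
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

#define NSHARDS 3

int
main(void)
{
	const char *conninfo[NSHARDS] = {
		"host=shard1 dbname=test",	/* hypothetical shard servers */
		"host=shard2 dbname=test",
		"host=shard3 dbname=test"
	};
	PGconn	   *conns[NSHARDS];
	struct pollfd pfds[NSHARDS];
	int			remaining = NSHARDS;
	int			i;

	for (i = 0; i < NSHARDS; i++)
	{
		conns[i] = PQconnectdb(conninfo[i]);
		if (PQstatus(conns[i]) != CONNECTION_OK ||
			!PQsendQuery(conns[i], "SELECT count(*) FROM t"))
		{
			fprintf(stderr, "shard %d: %s", i, PQerrorMessage(conns[i]));
			exit(1);
		}
		pfds[i].fd = PQsocket(conns[i]);
		pfds[i].events = POLLIN;
	}

	/* Consume results in arrival order. */
	while (remaining > 0)
	{
		PGresult   *res;

		if (poll(pfds, NSHARDS, -1) < 0)
			exit(1);

		for (i = 0; i < NSHARDS; i++)
		{
			if (!(pfds[i].revents & POLLIN))
				continue;
			if (!PQconsumeInput(conns[i]))
				exit(1);
			if (PQisBusy(conns[i]))
				continue;		/* incomplete result, keep waiting */

			while ((res = PQgetResult(conns[i])) != NULL)
			{
				if (PQresultStatus(res) == PGRES_TUPLES_OK)
					printf("shard %d: %s\n", i, PQgetvalue(res, 0, 0));
				PQclear(res);
			}
			pfds[i].fd = -1;	/* negative fd => poll() ignores it */
			remaining--;
		}
	}

	for (i = 0; i < NSHARDS; i++)
		PQfinish(conns[i]);
	return 0;
}

The patch does effectively this with WaitEventSet inside Append, asking each Foreign Scan for its socket; the hypothetical socket-owning background worker mentioned above would run much the same loop, except forwarding tuples into shared memory queues instead of consuming the results itself.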