On Sat, Sep 28, 2019 at 4:20 AM Bruce Momjian <br...@momjian.us> wrote:
> On Wed, Sep 4, 2019 at 06:18:31PM +1200, Thomas Munro wrote:
> > A few years back[1] I experimented with a simple readiness API that
> > would allow Append to start emitting tuples from whichever Foreign
> > Scan has data available, when working with FDW-based sharding. I used
> > that primarily as a way to test Andres's new WaitEventSet stuff and my
> > kqueue implementation of that, but I didn't pursue it seriously
> > because I knew we wanted a more ambitious async executor rewrite and
> > many people had ideas about that, with schedulers capable of jumping
> > all over the tree etc.
> >
> > Anyway, Stephen Frost pinged me off-list to ask about that patch, and
> > asked why we don't just do this naive thing until we have something
> > better. It's a very localised feature that works only between Append
> > and its immediate children. The patch makes it work for postgres_fdw,
> > but it should work for any FDW that can get its hands on a socket.
> >
> > Here's a quick rebase of that old POC patch, along with a demo. Since
> > 2016, Parallel Append landed, but I didn't have time to think about
> > how to integrate with that so I did a quick "sledgehammer" rebase that
> > disables itself if parallelism is in the picture.
>
> Yes, sharding has been waiting on parallel FDW scans. Would this work
> for parallel partition scans if the partitions were FDWs?
Yeah, this works for partitions that are FDWs (as shown), but only for Append, not for Parallel Append. So you'd have parallelism in the sense that your N remote shard servers are all doing stuff at the same time, but it couldn't be in a parallel query on your 'home' server, which is probably good for things that push down aggregation and bring back just a few tuples from each shard, but bad for anything wanting to ship back millions of tuples to chew on locally. Do you think that'd be useful enough on its own?

The problem is that parallel-safe non-partial plans (like postgres_fdw scans) are exclusively 'claimed' by one process under Parallel Append, so with the patch as posted, if you modified it to allow parallelism it would probably give correct answers, but nothing prevents a single process from claiming and starting all the scans and then waiting for them to be ready, while the other processes miss out on doing any work at all. There's probably some kludgy solution involving not letting any one worker start more than X scans, and some space cadet solution involving passing sockets around and teaching libpq to hand over connections at certain controlled phases of the protocol (due to lack of threads), but nothing like that has jumped out as the right path so far.

One idea that seems promising but requires a bunch more infrastructure is to offload the libpq multiplexing to a background worker that owns all the sockets, and have it push tuples into a multi-consumer shared memory queue that regular executor processes could read from. I have been wondering whether that would be best done by each FDW implementation, or whether there is a way to build generic infrastructure for converting parallel-safe executor nodes into partial plans by the use of a 'Scatter' (opposite of Gather) node that can spread the output of any node over many workers. If you had that, you'd still want a way for Parallel Append to be readiness-based, but it would probably look a bit different to this patch, because it'd need to use (vapourware) multi-consumer shm queue readiness, not fd readiness. And another kind of fd-readiness multiplexing would be going on inside the new (vapourware) worker that handles all the libpq connections (and maybe other kinds of work for other FDWs that are able to expose a socket).
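In case it helps to see the mechanism concretely, here's a minimal standalone sketch of the kind of fd-readiness multiplexing involved, using plain poll() rather than the server's WaitEventSet machinery. It's not the patch itself; the shard connection strings and the query are made up, and error handling is minimal:

/*
 * Sketch: issue a query on several libpq connections up front, then
 * poll() their sockets and drain whichever shard answers first,
 * instead of reading the connections in declaration order.
 */
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

#define NSHARDS 3

int
main(void)
{
	const char *conninfo[NSHARDS] = {
		"host=shard1 dbname=test",	/* hypothetical shard servers */
		"host=shard2 dbname=test",
		"host=shard3 dbname=test"
	};
	PGconn	   *conns[NSHARDS];
	struct pollfd pfds[NSHARDS];
	int			remaining = NSHARDS;
	int			i;

	for (i = 0; i < NSHARDS; i++)
	{
		conns[i] = PQconnectdb(conninfo[i]);
		if (PQstatus(conns[i]) != CONNECTION_OK ||
			!PQsendQuery(conns[i], "SELECT count(*) FROM t"))
		{
			fprintf(stderr, "shard %d: %s", i, PQerrorMessage(conns[i]));
			exit(1);
		}
		pfds[i].fd = PQsocket(conns[i]);
		pfds[i].events = POLLIN;
	}

	/* Consume results in arrival order. */
	while (remaining > 0)
	{
		PGresult   *res;

		if (poll(pfds, NSHARDS, -1) < 0)
			exit(1);

		for (i = 0; i < NSHARDS; i++)
		{
			if (!(pfds[i].revents & POLLIN))
				continue;
			if (!PQconsumeInput(conns[i]))
				exit(1);
			if (PQisBusy(conns[i]))
				continue;		/* incomplete result, keep waiting */

			while ((res = PQgetResult(conns[i])) != NULL)
			{
				if (PQresultStatus(res) == PGRES_TUPLES_OK)
					printf("shard %d: %s\n", i, PQgetvalue(res, 0, 0));
				PQclear(res);
			}
			pfds[i].fd = -1;	/* negative fd => poll() ignores it */
			remaining--;
		}
	}

	for (i = 0; i < NSHARDS; i++)
		PQfinish(conns[i]);
	return 0;
}

The patch does effectively this with WaitEventSet inside Append, asking each Foreign Scan for its socket; the hypothetical socket-owning background worker mentioned above would run much the same loop, except forwarding tuples into shared memory queues instead of consuming the results itself.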