Re: [HACKERS] [POC] Faster processing at Gather node

Robert Haas Fri, 20 Oct 2017 07:27:57 -0700

On Wed, Oct 18, 2017 at 4:30 PM, Andres Freund <and...@anarazel.de> wrote:
> Hm. I'm a bit confused/surprised by that. What'd be the worst that can
> happen if we don't immediately detect that the other side is detached?
> At least if we only do so in the non-blocking paths, the worst that
> seems that could happen is that we read/write at most one superflous
> queue entry, rather than reporting an error? Or is the concern that
> detaching might be detected *too early*, before reading the last entry
> from a queue?


Detaching too early is definitely a problem.  A worker is allowed to
start up, write all of its results into the queue, and then detach
without waiting for the leader to read those results.  (Reading
messages that weren't really written would be a problem too, of
course.)

> I'm not convinced by this. Imo the multiplying largely comes from
> superflous actions, like the per-entry SetLatch calls here, rather than
> per batch.
>
> After all I'd benchmarked this *after* an experimental conversion of
> shm_mq to not use spinlocks - so there's actually no external barrier
> providing these guarantees that could be combined with SetLatch()'s
> barrier.

OK.

>> All that having been said, a batch variant for reading tuples in bulk
>> might make sense.  I think when I originally wrote this code I was
>> thinking that one process might be filling the queue while another
>> process was draining it, and that it might therefore be important to
>> free up space as early as possible.  But maybe that's not a very good
>> intuition.
>
> I think that's a sensible goal, but I don't think that has to mean that
> the draining has to happen after every entry. If you'd e.g. have a
> shm_mq_receivev() with 16 iovecs, that'd commonly be only part of a
> single tqueue mq (unless your tuples are > 4k).  I'll note that afaict
> shm_mq_sendv() already batches its SetLatch() calls.

But that's a little different -- shm_mq_sendv() sends one message
collected from multiple places.  There's no more reason for it to wake
up the receiver before the whole message is written that there would
be for shm_mq_send(); it's the same problem.

> I think one important thing a batched drain can avoid is that a worker
> awakes to just put one new tuple into the queue and then sleep
> again. That's kinda expensive.

Yes.  Or - part of a tuple, which is worse.

>> >    b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
>> >       queue whenever it's not empty when a tuple is ready.
>>
>> Batching them with what?  I hate to postpone sending tuples we've got;
>> that sounds nice in the case where we're sending tons of tuples at
>> high speed, but there might be other cases where it makes the leader
>> wait.
>
> Yea, that'd need some smarts. How about doing something like batching up
> locally as long as the queue contains less than one average sized batch?

I don't follow.

> No, I don't think that's necesarily true. If that worker's queue is full
> every-time, then yes. But I think a common scenario is that the
> "current" worker only has a small portion of the queue filled. Draining
> that immediately just leads to increased cacheline bouncing.

Hmm, OK.

> I'd not previously thought about this, but won't staying sticky to the
> current worker potentially cause pins on individual tuples be held for a
> potentially long time by workers not making any progress?

Yes.

>> >    It might also be a good idea to use a more directed form of wakeups,
>> >    e.g. using a per-worker latch + a wait event set, to avoid iterating
>> >    over all workers.
>>
>> I don't follow.
>
> Well, right now we're busily checking each worker's queue. That's fine
> with a handful of workers, but starts to become not that cheap pretty
> soon afterwards. In the surely common case where the workers are the
> bottleneck (because that's when parallelism is worthwhile), we'll check
> each worker's queue once one of them returned a single tuple. It'd not
> be a stupid idea to have a per-worker latch and wait for the latches of
> all workers at once. Then targetedly drain the queues of the workers
> that WaitEventSetWait(nevents = nworkers) signalled as ready.

Hmm, interesting.  But we can't completely ignore the process latch
either, since among other things a worker erroring out and propagating
the error to the leader relies on that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [POC] Faster processing at Gather node

Reply via email to