On Tue, Oct 17, 2017 at 5:39 PM, Andres Freund <and...@anarazel.de> wrote:
> The precise query doesn't really matter, the observations here are more
> general, I hope.
> 1) nodeGather.c re-projects every row from workers. As far as I can tell
>    that's now always exactly the same targetlist as it's coming from the
>    worker. Projection capability was added in 8538a6307049590 (without
>    checking whether it's needed afaict), but I think it was in turn
>    often obsoleted by 992b5ba30dcafdc222341505b072a6b009b248a7.  My
>    measurement shows that removing the projection yields quite massive
>    speedups in queries like yours, which is not too surprising.

That seems like an easy and worthwhile optimization.

>    I suspect this just needs a tlist_matches_tupdesc check + an if
>    around ExecProject(). And a test, right now tests pass unless
>    force_parallel_mode is used even if just commenting out the
>    projection unconditionally.

So, for this to fail, we'd need a query that uses parallelism but
where the target list contains a parallel-restricted function.  Also,
the function should really be such that we'll reliably get a failure
rather than only with some small probability.  I'm not quite sure how
to put together such a test case, but there's probably some way.

> 2) The spinlocks on both the sending and receiving side are quite hot:
>    Rafia's query, leader:
> +   36.16%  postgres  postgres            [.] shm_mq_receive
> +   19.49%  postgres  postgres            [.] s_lock
> +   13.24%  postgres  postgres            [.] SetLatch
>    To test that theory, here are the timings for
>    1) spinlocks present
>       time: 6593.045
>    2) spinlock acquisition replaced by *full* memory barriers, which on
>       x86 is a lock; addl $0,0(%%rsp)
>       time: 5159.306
>    3) spinlocks replaced by read/write barriers as appropriate.
>       time: 4687.610
>    By my understanding of shm_mq.c's logic, something like 3) ought to
>    be doable in a correct manner. There should, in normal
>    circumstances, be only one end modifying each of the protected
>    variables. Doing so instead of using full-blown atomics with locked
>    instructions is very likely to yield considerably better performance.

I think the sticking point here will be the handling of the
mq_detached flag.  I feel like I fixed a bug at some point where this
had to be checked or set under the lock at the same time we were
checking or setting mq_bytes_read and/or mq_bytes_written, but I don't
remember the details.  I like the idea, though.

Not sure what happened to #3 on your list... you went from #2 to #4.

> 4) Modulo computations in shm_mq are expensive:
>    that we end up with a full blown div makes sense - the compiler can't
>    know anything about ringsize, therefore it can't optimize to a mask.
>    I think we should force the size of the ringbuffer to be a power of
>    two, and use a mask instead of a size for the buffer.

This seems like it would require some redesign.  Right now we let the
caller choose any size they want and subtract off our header size to
find the actual queue size.  We can waste up to MAXALIGN-1 bytes, but
that's not much.  This would waste up to half the bytes provided,
which is probably not cool.

> 5) There's a *lot* of latch interactions. The biggest issue actually is
>    the memory barrier implied by a SetLatch, waiting for the latch
>    barely shows up.
>    Commenting out the memory barrier - which is NOT CORRECT - improves
>    timing:
>    before: 4675.626
>    after: 4125.587
>    The correct fix obviously is not to break latch correctness. I think
>    the big problem here is that we perform a SetLatch() for every read
>    from the latch.

I think it's a little bit of an overstatement to say that commenting
out the memory barrier is not correct.  When we added that code, we
removed this comment:

- * Presently, when using a shared latch for interprocess signalling, the
- * flag variable(s) set by senders and inspected by the wait loop must
- * be protected by spinlocks or LWLocks, else it is possible to miss events
- * on machines with weak memory ordering (such as PPC).  This restriction
- * will be lifted in future by inserting suitable memory barriers into
- * SetLatch and ResetLatch.

It seems to me that in any case where the data is protected by an
LWLock, the barriers we've added to SetLatch and ResetLatch are
redundant.  I'm not sure if that's entirely true in the spinlock case,
because S_UNLOCK() is only documented to have release semantics, so
maybe the load of latch->is_set could be speculated before the lock is
dropped.  But I do wonder if we're just multiplying barriers endlessly
instead of trying to think of ways to minimize them (e.g. have a
variant of SpinLockRelease that acts as a full barrier instead of a
release barrier, and then avoid a second barrier when checking the
latch state).

All that having been said, a batch variant for reading tuples in bulk
might make sense.  I think when I originally wrote this code I was
thinking that one process might be filling the queue while another
process was draining it, and that it might therefore be important to
free up space as early as possible.  But maybe that's not a very good
intuition.

>    b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
>       queue whenever it's not empty when a tuple is ready.

Batching them with what?  I hate to postpone sending tuples we've got;
that sounds nice in the case where we're sending tons of tuples at
high speed, but there might be other cases where it makes the leader
wait.

> 6) I've observed, using strace, debug outputs with timings, and top with
>    a short interval, that quite often only one backend has sufficient
>    work, while other backends are largely idle.

Doesn't that just mean we're bad at choosing how many workers to use?
If one worker can't outrun the leader, it's better to have the other
workers sleep and keep the one lucky worker on CPU than to keep
context switching.  Or so I would assume.

>    One fairly drastic solution would be to move away from a
>    single-producer-single-consumer per-worker queue to a global
>    queue. That'd avoid fairness issues between the individual workers,
>    at the price of potential added contention. One disadvantage is that
>    such a combined queue approach is not easily applicable for gather
>    merge.

It might also lead to more contention.

>    One less drastic approach would be to try to drain the queue
>    fully in one batch, and then move to the next queue. That'd avoid
>    triggering a small wakeup for each individual tuple, as one
>    currently would get without the 'stickiness'.

I don't know if that is better but it seems worth a try.

>    It might also be a good idea to use a more directed form of wakeups,
>    e.g. using a per-worker latch + a wait event set, to avoid iterating
>    over all workers.

I don't follow.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
