On Wed, Feb 7, 2018 at 1:41 PM, Andres Freund <and...@anarazel.de> wrote:
> Well, it's more than just systems like that - for 64bit atomics we
> sometimes do fall back to spinlock based atomics on 32bit systems, even
> if they support 32 bit atomics.
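Right -- and the counters in play here are exactly that case: with the
per-queue spinlock gone, mq_bytes_read and mq_bytes_written are 64-bit
atomics, so on a 32-bit build without native 8-byte atomics every access
to them can go through the emulated, spinlock-backed path.  Just to
illustrate what that exercises -- this is a sketch only, with a made-up
helper name, not the actual patch text:

#include "postgres.h"
#include "port/atomics.h"

/*
 * Sketch of how the space accounting reads the 64-bit counters once the
 * spinlock is gone.  On 32-bit builds lacking native 8-byte atomics,
 * each pg_atomic_read_u64() here may be backed by a spinlock-based
 * emulation rather than a plain load.
 */
static uint64
sketch_used_ring_bytes(pg_atomic_uint64 *mq_bytes_read,
					   pg_atomic_uint64 *mq_bytes_written)
{
	uint64		rb = pg_atomic_read_u64(mq_bytes_read);
	uint64		wb = pg_atomic_read_u64(mq_bytes_written);

	/* bytes written never lags bytes read, so this can't underflow */
	return wb - rb;
}

So the question is how much that emulation actually costs on such a
build.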
I built with -m32 on my laptop and tried "select aid, count(*) from
pgbench_accounts group by 1 having count(*) > 1" on pgbench at scale
factor 100 with pgbench_accounts_pkey dropped and
max_parallel_workers_per_gather set to 10, on (a) commit
0b5e33f667a2042d7022da8bef31a8be5937aad1 (I know this is a little old,
but I think it doesn't matter), (b) same plus shm-mq-less-spinlocks-v3,
and (c) same plus shm-mq-less-spinlocks-v3 and
shm-mq-reduce-receiver-latch-set-v2.

(a) 16563.790 ms, 16625.257 ms, 16496.062 ms
(b) 17217.051 ms, 17157.745 ms, 17225.755 ms
    [median to median, +3.9% vs. (a)]
(c) 15491.947 ms, 15455.840 ms, 15452.649 ms
    [median to median, -7.0% vs. (a), -10.2% vs. (b)]

Do you think that's a problem?  If it is, what do you think we should do
about it?  It seems to me that it's probably OK because (1) with both
patches we still come out ahead and (2) 32-bit systems will presumably
continue to become rarer as time goes on, but you might disagree.

>> > Hm, do all paths here guarantee that mq->mq_detached won't be stored on
>> > the stack / register in the first iteration, and then not reread any
>> > further? I think it's fine because every branch of the if below ends up
>> > in a syscall / barrier, but it might be worth noting that here.
>>
>> Aargh. I hate compilers. I added a comment. Should I think about
>> changing mq_detached to pg_atomic_uint32 instead?
>
> I think a pg_compiler_barrier() would suffice to alleviate my concern,
> right? If you wanted to go for an atomic, using pg_atomic_flag would
> probably be more appropriate than pg_atomic_uint32.

Hmm, all right, I'll add pg_compiler_barrier().

>> >> -		/* Write as much data as we can via a single memcpy(). */
>> >> +		/*
>> >> +		 * Write as much data as we can via a single memcpy(). Make sure
>> >> +		 * these writes happen after the read of mq_bytes_read, above.
>> >> +		 * This barrier pairs with the one in shm_mq_inc_bytes_read.
>> >> +		 */
>> >
>> > s/above/above. Otherwise a newer mq_bytes_read could become visible
>> > before the corresponding reads have fully finished./?
>>
>> I don't find that very clear.  A newer mq_bytes_read could become
>> visible whenever, and a barrier doesn't prevent that from happening.
>
> Well, my point was that the barrier prevents the write to mq_bytes_read
> becoming visible before the corresponding reads have finished.  Which
> then would mean the memcpy() could overwrite them.  And a barrier *does*
> prevent that from happening.

I think we're talking about the same thing, but not finding each other's
explanations very clear.

>> Hmm, I'm not sure I understand what you're suggesting, here.  In
>> general, we return with the data for the current message unconsumed,
>> and then consume it the next time the function is called, so that
>> (except when the message wraps the end of the buffer) we can return a
>> pointer directly into the buffer rather than having to memcpy().  What
>> this patch does is postpone consuming the data further, either until
>> we can free up at least a quarter of the ring buffer or until we need
>> to wait for more data.  It seemed worthwhile to free up space in the
>> ring buffer occasionally even if we weren't to the point of waiting
>> yet, so that the sender has an opportunity to write new data into that
>> space if it happens to still be running.
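To make sure we're talking about the same mechanism, the receiver-side
deferral I'm describing amounts to roughly the following -- a paraphrase
of the logic inside shm_mq.c, not the literal patch, and the helper
names are invented for illustration:

/*
 * Instead of calling shm_mq_inc_bytes_read() for every chunk we
 * consume, the handle just accumulates the count in
 * mqh_consume_pending.
 */
static void
sketch_note_bytes_consumed(shm_mq_handle *mqh, Size nbytes)
{
	mqh->mqh_consume_pending += nbytes;
}

/*
 * Settle up -- i.e. advance mq_bytes_read and poke the sender's latch --
 * only once the pending count reaches a quarter of the ring, or when
 * we're about to sleep waiting for more data anyway.
 */
static void
sketch_settle_up_if_needed(shm_mq_handle *mqh, bool about_to_wait)
{
	shm_mq	   *mq = mqh->mqh_queue;

	if (mqh->mqh_consume_pending > mq->mq_ring_size / 4 ||
		(about_to_wait && mqh->mqh_consume_pending > 0))
	{
		shm_mq_inc_bytes_read(mq, mqh->mqh_consume_pending);
		mqh->mqh_consume_pending = 0;
	}
}

The quarter-of-the-ring threshold is only there so that the sender gets
some space back periodically without us doing the SetLatch() dance on
every read.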
> What I'm trying to suggest is that instead of postponing an update of
> mq_bytes_read (by storing the amount of already-performed reads in
> mqh_consume_pending), we continue to update mq_bytes_read but only set
> the latch if your above thresholds are crossed. That way a burst of
> writes can fully fill the ringbuffer, but the cost of doing a SetLatch()
> is amortized. In my testing SetLatch() was the expensive part, not the
> necessary barriers in shm_mq_inc_bytes_read().

OK, I'll try to check how feasible that would be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company