Hi,

On 2017-12-04 10:50:53 -0500, Robert Haas wrote:
> Subject: [PATCH 1/2] shm-mq-less-spinlocks-v2


> + * mq_sender and mq_bytes_written can only be changed by the sender.
> + * mq_receiver and mq_sender are protected by mq_mutex, although, importantly,
> + * they cannot change once set, and thus may be read without a lock once this
> + * is known to be the case.

I don't recall our conversation around this anymore, and haven't read
down far enough to see the relevant code. Lest I forget: such
constructs often need careful use of barriers.
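
For the archives, the pattern I mean is roughly this (a toy sketch of
the publish/consume idiom, not code from the patch; all names made up):

    #include "postgres.h"
    #include "port/atomics.h"

    typedef struct Thing
    {
        int         payload;
        bool        initialized;    /* set once, false -> true */
    } Thing;

    static void
    publish(Thing *t, int value)
    {
        t->payload = value;
        pg_write_barrier();         /* payload visible before the flag */
        t->initialized = true;
    }

    static int
    consume(Thing *t)
    {
        /* real code would sleep on a latch, not spin on a plain bool */
        while (!t->initialized)
            pg_compiler_barrier();  /* force rereading the flag */
        pg_read_barrier();          /* pairs with pg_write_barrier above */
        return t->payload;
    }

Without that pairing, the reader can observe initialized == true while
the store to payload is still in flight.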


> - * mq_detached can be set by either the sender or the receiver, so the mutex
> - * must be held to read or write it.  Memory barriers could be used here as
> - * well, if needed.
> + * mq_bytes_read and mq_bytes_written are not protected by the mutex.  Instead,
> + * they are written atomically using 8 byte loads and stores.  Memory barriers
> + * must be carefully used to synchronize reads and writes of these values with
> + * reads and writes of the actual data in mq_ring.
> + *
> + * mq_detached needs no locking.  It can be set by either the sender or the
> + * receiver, but only ever from false to true, so redundant writes don't
> + * matter.  It is important that if we set mq_detached and then set the
> + * counterparty's latch, the counterparty must be certain to see the change
> + * after waking up.  Since SetLatch begins with a memory barrier and ResetLatch
> + * ends with one, this should be OK.

s/should/is/ or similar?


Perhaps a short benchmark of shm_mq on 32-bit systems wouldn't hurt?
I suspect there won't be much of a performance impact, but it's
probably worth checking.
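
Something like this would do as a first cut (rough sketch only; assumes
a throwaway function in a test module, and that the interesting cost on
32-bit is the emulated 64 bit atomics):

    #include "postgres.h"
    #include "port/atomics.h"
    #include "portability/instr_time.h"

    static void
    time_atomic_u64_reads(int loops)
    {
        pg_atomic_uint64 counter;
        instr_time  start;
        instr_time  end;
        uint64      sink = 0;
        int         i;

        pg_atomic_init_u64(&counter, 0);

        INSTR_TIME_SET_CURRENT(start);
        for (i = 0; i < loops; i++)
            sink += pg_atomic_read_u64(&counter);
        INSTR_TIME_SET_CURRENT(end);

        INSTR_TIME_SUBTRACT(end, start);
        elog(LOG, "%d u64 atomic reads: %ld us (sink " UINT64_FORMAT ")",
             loops, (long) INSTR_TIME_GET_MICROSEC(end), sink);
    }

Comparing master vs. patched shm_mq throughput via test_shm_mq on an
actual 32-bit build would be the more direct check, of course.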


>   * mq_ring_size and mq_ring_offset never change after initialization, and
>   * can therefore be read without the lock.
>   *
> - * Importantly, mq_ring can be safely read and written without a lock.  Were
> - * this not the case, we'd have to hold the spinlock for much longer
> - * intervals, and performance might suffer.  Fortunately, that's not
> - * necessary.  At any given time, the difference between mq_bytes_read and
> + * At any given time, the difference between mq_bytes_read and

Hm, why did you remove the first part about mq_ring itself?


> @@ -848,18 +868,19 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
>  
>       while (sent < nbytes)
>       {
> -             bool            detached;
>               uint64          rb;
> +             uint64          wb;
>  
>               /* Compute number of ring buffer bytes used and available. */
> -             rb = shm_mq_get_bytes_read(mq, &detached);
> -             Assert(mq->mq_bytes_written >= rb);
> -             used = mq->mq_bytes_written - rb;
> +             rb = pg_atomic_read_u64(&mq->mq_bytes_read);
> +             wb = pg_atomic_read_u64(&mq->mq_bytes_written);
> +             Assert(wb >= rb);
> +             used = wb - rb;

Just to make sure my understanding is correct: No barriers needed here
because "bytes_written" is only written by the sending backend, and
"bytes_read" cannot lap it. Correct?


>               Assert(used <= ringsize);
>               available = Min(ringsize - used, nbytes - sent);
>  
>               /* Bail out if the queue has been detached. */
> -             if (detached)
> +             if (mq->mq_detached)

Hm, do all paths here guarantee that mq->mq_detached won't be read just
once into a register in the first iteration, and then never reread? I
think it's fine because every branch of the if below ends up in a
syscall / barrier, but it might be worth noting that here.
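
Self-contained illustration of what I'm worried about (made-up names):

    #include "postgres.h"
    #include "port/atomics.h"

    static bool detached_flag;      /* stands in for mq->mq_detached */

    static void
    wait_until_detached(void)
    {
        while (!detached_flag)
        {
            /*
             * Without at least a compiler barrier here, the compiler may
             * load detached_flag once and spin forever.  In shm_mq the
             * WaitLatch()/ResetLatch() calls reached on every path imply
             * such a barrier, which is what makes the plain read OK.
             */
            pg_compiler_barrier();
        }
    }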


> +                     /*
> +                      * Since mq->mqh_counterparty_attached is known to be true at this
> +                      * point, mq_receiver has been set, and it can't change once set.
> +                      * Therefore, we can read it without acquiring the spinlock.
> +                      */
> +                     Assert(mqh->mqh_counterparty_attached);
> +                     SetLatch(&mq->mq_receiver->procLatch);

Perhaps mention that this could lead to spuriously signalling the wrong
backend in case of detach, but that that is fine?

>                       /* Skip manipulation of our latch if nowait = true. */
>                       if (nowait)
> @@ -934,10 +953,18 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
>               }
>               else
>               {
> -                     Size            offset = mq->mq_bytes_written % (uint64) ringsize;
> -                     Size            sendnow = Min(available, ringsize - offset);
> +                     Size            offset;
> +                     Size            sendnow;
> +
> +                     offset = wb % (uint64) ringsize;
> +                     sendnow = Min(available, ringsize - offset);

I know the scheme isn't new, but I do find it not immediately obvious
that 'wb' is short for 'bytes_written'.


> -                     /* Write as much data as we can via a single memcpy(). */
> +                     /*
> +                      * Write as much data as we can via a single memcpy(). Make sure
> +                      * these writes happen after the read of mq_bytes_read, above.
> +                      * This barrier pairs with the one in shm_mq_inc_bytes_read.
> +                      */

s/above/above. Otherwise a newer mq_bytes_read could become visible
before the corresponding reads have fully finished./?

Could you also add a comment as to why you think a read barrier isn't
sufficient? IIUC that's the case because we need to prevent reordering
in both directions: we can neither start writing based on a "too new"
bytes_read, nor afford the writes to mq_ring being reordered to before
the barrier. Correct?
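
To spell the pairing out (sender side abridged from the patch; the
receiver side is my guess at what shm_mq_inc_bytes_read() looks like):

    /* sender, in shm_mq_send_bytes() */
    rb = pg_atomic_read_u64(&mq->mq_bytes_read);
    ...
    pg_memory_barrier();    /* full fence: the stores below must not be
                             * reordered before the load of mq_bytes_read,
                             * and a read barrier would only order loads */
    memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
           (char *) data + sent, sendnow);

    /* receiver, in shm_mq_inc_bytes_read() */
    pg_memory_barrier();    /* reads from mq_ring must complete before we
                             * advertise the space as reusable */
    pg_atomic_write_u64(&mq->mq_bytes_read,
                        pg_atomic_read_u64(&mq->mq_bytes_read) + n);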


> +                     pg_memory_barrier();
>                       memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
>                                  (char *) data + sent, sendnow);
>                       sent += sendnow;

Btw, this mq_ring_offset stuff seems a bit silly; why don't we use
proper padding / a union in the struct to make it unnecessary to do
that bit of offset calculation every time? I think it currently
prevents efficient address calculation instructions from being used.
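
One way (untested sketch; this renames the mq_detached accesses, the
union member names are made up, and it assumes int64 has MAXALIGN
alignment on the platform):

    typedef struct shm_mq
    {
        slock_t     mq_mutex;
        PGPROC     *mq_receiver;
        PGPROC     *mq_sender;
        pg_atomic_uint64 mq_bytes_read;
        pg_atomic_uint64 mq_bytes_written;
        Size        mq_ring_size;
        union
        {
            bool        detached;
            int64       pad;    /* force mq_ring to be MAXALIGN'd */
        }           mq_detached_u;
        char        mq_ring[FLEXIBLE_ARRAY_MEMBER];
    } shm_mq;

mq_ring would then always start at a compile-time-known, properly
aligned offset, so &mq->mq_ring[offset] needs no extra addition.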


> From 666d33a363036a647dde83cb28b9d7ad0b31f76c Mon Sep 17 00:00:00 2001
> From: Robert Haas <rh...@postgresql.org>
> Date: Sat, 4 Nov 2017 19:03:03 +0100
> Subject: [PATCH 2/2] shm-mq-reduce-receiver-latch-set-v1

> -     /* Consume any zero-copy data from previous receive operation. */
> -     if (mqh->mqh_consume_pending > 0)
> +     /*
> +      * If we've consumed an amount of data greater than 1/4th of the ring
> +      * size, mark it consumed in shared memory.  We try to avoid doing this
> +      * unnecessarily when only a small amount of data has been consumed,
> +      * because SetLatch() is fairly expensive and we don't want to do it
> +      * too often.
> +      */
> +     if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)
>       {

Hm. Why are we doing this at the level of updating the variables,
rather than at the level of the SetLatch() calls?
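
I.e. roughly this shape instead (sketch; assumes shm_mq_inc_bytes_read()
no longer sets the latch itself, and mqh_latch_pending is a made-up
field):

    /* always publish what we've consumed... */
    shm_mq_inc_bytes_read(mq, mqh->mqh_consume_pending);
    mqh->mqh_latch_pending += mqh->mqh_consume_pending;
    mqh->mqh_consume_pending = 0;

    /* ...but only wake the sender every ring_size/4 bytes */
    if (mqh->mqh_latch_pending > mq->mq_ring_size / 4)
    {
        SetLatch(&mq->mq_sender->procLatch);
        mqh->mqh_latch_pending = 0;
    }

That way the sender always sees a fresh bytes_read whenever it does
look, and only the latch traffic is throttled.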

Greetings,

Andres Freund
