Hi,

On 2017-12-04 10:50:53 -0500, Robert Haas wrote:
> Subject: [PATCH 1/2] shm-mq-less-spinlocks-v2

> + * mq_sender and mq_bytes_written can only be changed by the sender.
> + * mq_receiver and mq_sender are protected by mq_mutex, although, importantly,
> + * they cannot change once set, and thus may be read without a lock once this
> + * is known to be the case.

I don't recall our conversation around this anymore, and haven't read
down far enough to see the relevant code. Lest I forget: such constructs
often need careful use of barriers.

> - * mq_detached can be set by either the sender or the receiver, so the mutex
> - * must be held to read or write it.  Memory barriers could be used here as
> - * well, if needed.
> + * mq_bytes_read and mq_bytes_written are not protected by the mutex.  Instead,
> + * they are written atomically using 8 byte loads and stores.  Memory barriers
> + * must be carefully used to synchronize reads and writes of these values with
> + * reads and writes of the actual data in mq_ring.
> + *
> + * mq_detached needs no locking.  It can be set by either the sender or the
> + * receiver, but only ever from false to true, so redundant writes don't
> + * matter.  It is important that if we set mq_detached and then set the
> + * counterparty's latch, the counterparty must be certain to see the change
> + * after waking up.  Since SetLatch begins with a memory barrier and ResetLatch
> + * ends with one, this should be OK.

s/should/is/ or similar?

Perhaps a short benchmark of the 8 byte loads and stores on 32bit
systems using shm_mq wouldn't hurt? I suspect there won't be much of a
performance impact, but it's probably worth checking.
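To spell the pairing out (a minimal sketch of my understanding, not the
patch's actual code; shown for a sender-side detach):

    /* detaching process: the flag must be visible before the wakeup */
    mq->mq_detached = true;                 /* plain store, only false -> true */
    SetLatch(&mq->mq_receiver->procLatch);  /* begins with a memory barrier */

    /* counterparty, after waking up */
    ResetLatch(MyLatch);                    /* ends with a memory barrier */
    if (mq->mq_detached)                    /* hence can't see a stale false */
        return SHM_MQ_DETACHED;

The barrier at the start of SetLatch() orders the mq_detached store
before the latch becoming set, and the one at the end of ResetLatch()
orders the later mq_detached load after the reset, so the flag can't be
missed once the latch has been seen set.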
>   * mq_ring_size and mq_ring_offset never change after initialization, and
>   * can therefore be read without the lock.
>   *
> - * Importantly, mq_ring can be safely read and written without a lock.  Were
> - * this not the case, we'd have to hold the spinlock for much longer
> - * intervals, and performance might suffer.  Fortunately, that's not
> - * necessary.  At any given time, the difference between mq_bytes_read and
> + * At any given time, the difference between mq_bytes_read and

Hm, why did you remove the first part about mq_ring itself?

> @@ -848,18 +868,19 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
>
>  	while (sent < nbytes)
>  	{
> -		bool		detached;
>  		uint64		rb;
> +		uint64		wb;
>
>  		/* Compute number of ring buffer bytes used and available. */
> -		rb = shm_mq_get_bytes_read(mq, &detached);
> -		Assert(mq->mq_bytes_written >= rb);
> -		used = mq->mq_bytes_written - rb;
> +		rb = pg_atomic_read_u64(&mq->mq_bytes_read);
> +		wb = pg_atomic_read_u64(&mq->mq_bytes_written);
> +		Assert(wb >= rb);
> +		used = wb - rb;

Just to make sure my understanding is correct: no barriers are needed
here because mq_bytes_written is only ever written by the sending
backend itself, and mq_bytes_read cannot lap it. Correct?

>  		Assert(used <= ringsize);
>  		available = Min(ringsize - used, nbytes - sent);
>
>  		/* Bail out if the queue has been detached. */
> -		if (detached)
> +		if (mq->mq_detached)

Hm, do all paths here guarantee that mq->mq_detached won't be cached on
the stack / in a register in the first iteration, and then never reread?
I think it's fine because every branch of the if below ends up in a
syscall / barrier, but it might be worth noting that here.

> +			/*
> +			 * Since mq->mqh_counterparty_attached is known to be true at this
> +			 * point, mq_receiver has been set, and it can't change once set.
> +			 * Therefore, we can read it without acquiring the spinlock.
> +			 */
> +			Assert(mqh->mqh_counterparty_attached);
> +			SetLatch(&mq->mq_receiver->procLatch);

Perhaps mention that this could lead to spuriously signalling the wrong
backend in case of a detach, but that that is fine?

>  		/* Skip manipulation of our latch if nowait = true. */
>  		if (nowait)
> @@ -934,10 +953,18 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
>  		}
>  		else
>  		{
> -			Size		offset = mq->mq_bytes_written % (uint64) ringsize;
> -			Size		sendnow = Min(available, ringsize - offset);
> +			Size		offset;
> +			Size		sendnow;
> +
> +			offset = wb % (uint64) ringsize;
> +			sendnow = Min(available, ringsize - offset);

I know the scheme isn't new, but I do find it not immediately obvious
that 'wb' is short for 'bytes_written'.

> -			/* Write as much data as we can via a single memcpy(). */
> +			/*
> +			 * Write as much data as we can via a single memcpy(). Make sure
> +			 * these writes happen after the read of mq_bytes_read, above.
> +			 * This barrier pairs with the one in shm_mq_inc_bytes_read.
> +			 */

s/above/above. Otherwise a newer mq_bytes_read could become visible
before the corresponding reads have fully finished./?

Could you also add a comment as to why you think a read barrier isn't
sufficient? IIUC that's because we need ordering in both directions: the
read of mq_bytes_read must not be allowed to move down past the writes
into mq_ring, nor can we afford those writes being reordered to before
the barrier. A read barrier only orders reads against other reads.
Correct?

> +			pg_memory_barrier();
>  			memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
>  				   (char *) data + sent, sendnow);
>  			sent += sendnow;

Btw, this mq_ring_offset stuff seems a bit silly; why don't we use
proper padding / a union in the struct to make it unnecessary to do that
bit of offset calculation every time? I think it currently prevents
efficient address calculation instructions from being used.
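I.e. something like the following (a hypothetical sketch only;
shm_mq_padded and SHM_MQ_RING are invented names, and the header fields
are abbreviated):

    typedef struct shm_mq
    {
        slock_t             mq_mutex;
        PGPROC             *mq_receiver;
        PGPROC             *mq_sender;
        pg_atomic_uint64    mq_bytes_read;
        pg_atomic_uint64    mq_bytes_written;
        Size                mq_ring_size;
        bool                mq_detached;
    } shm_mq;

    /* pad the header out to a fixed, MAXALIGN'd size */
    typedef union shm_mq_padded
    {
        shm_mq      hdr;
        char        pad[MAXALIGN(sizeof(shm_mq))];
    } shm_mq_padded;

    /* the ring then starts at a compile-time constant offset */
    #define SHM_MQ_RING(mqp)  ((char *) (mqp) + sizeof(shm_mq_padded))

With that, &SHM_MQ_RING(mq)[offset] is just base + constant displacement
+ index, which fits the addressing modes of most ISAs, and both the
mq_ring_offset field and its load in the hot path go away.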
> From 666d33a363036a647dde83cb28b9d7ad0b31f76c Mon Sep 17 00:00:00 2001
> From: Robert Haas <rh...@postgresql.org>
> Date: Sat, 4 Nov 2017 19:03:03 +0100
> Subject: [PATCH 2/2] shm-mq-reduce-receiver-latch-set-v1

> -	/* Consume any zero-copy data from previous receive operation. */
> -	if (mqh->mqh_consume_pending > 0)
> +	/*
> +	 * If we've consumed an amount of data greater than 1/4th of the ring
> +	 * size, mark it consumed in shared memory.  We try to avoid doing this
> +	 * unnecessarily when only a small amount of data has been consumed,
> +	 * because SetLatch() is fairly expensive and we don't want to do it
> +	 * too often.
> +	 */
> +	if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)
>  	{

Hm. Why are we doing this at the level of updating the variables, rather
than at the SetLatch() calls?

Greetings,

Andres Freund
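PS: To be concrete about that last question, what I'm wondering about is
something along these lines on the receive side, keeping the shared
counter current but applying the threshold only to the expensive wakeup
(a hypothetical sketch, not a patch; consumed_since_signal is an
invented receiver-local counter):

    /* always advertise the consumed bytes to the sender ... */
    pg_memory_barrier();    /* ring reads must complete before this */
    pg_atomic_write_u64(&mq->mq_bytes_read,
                        pg_atomic_read_u64(&mq->mq_bytes_read) + nbytes);

    /* ... but only pay for SetLatch() once enough has accumulated */
    consumed_since_signal += nbytes;
    if (consumed_since_signal > mq->mq_ring_size / 4)
    {
        SetLatch(&mq->mq_sender->procLatch);
        consumed_since_signal = 0;
    }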