On 2017-10-18 15:46:39 -0400, Robert Haas wrote:
> > 2) The spinlocks on both the sending and receiving side are quite hot:
> >
> >    rafia query leader:
> > +   36.16%  postgres  postgres            [.] shm_mq_receive
> > +   19.49%  postgres  postgres            [.] s_lock
> > +   13.24%  postgres  postgres            [.] SetLatch
> >
> >    To test that theory, here are the timings for
> >    1) spinlocks present
> >       time: 6593.045
> >    2) spinlock acquisition replaced by *full* memory barriers, which on
> >       x86 is a lock; addl $0,0(%%rsp)
> >       time: 5159.306
> >    3) spinlocks replaced by read/write barriers as appropriate.
> >       time: 4687.610
> >
> >    By my understanding of shm_mq.c's logic, something like 3) ought to
> >    be doable in a correct manner. In normal circumstances there should
> >    only be one end modifying each of the protected variables. Doing so
> >    instead of using full-blown atomics with locked instructions is very
> >    likely to yield considerably better performance.
> I think the sticking point here will be the handling of the
> mq_detached flag.  I feel like I fixed a bug at some point where this
> had to be checked or set under the lock at the same time we were
> checking or setting mq_bytes_read and/or mq_bytes_written, but I don't
> remember the details.  I like the idea, though.

Hm. I'm a bit confused/surprised by that. What's the worst that can
happen if we don't immediately detect that the other side is detached?
At least if we only do so in the non-blocking paths, the worst that
could happen seems to be that we read/write at most one superfluous
queue entry, rather than reporting an error? Or is the concern that
detaching might be detected *too early*, before reading the last entry
from a queue?
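
To illustrate what option 3) above could look like, here's a minimal
single-producer/single-consumer ring sketch. Each position counter is
advanced by exactly one side, so C11 release stores and acquire loads
(the moral equivalent of pg_write_barrier()/pg_read_barrier()) suffice,
with no spinlock or locked instruction. All names here are invented for
the sketch and are not shm_mq.c's actual fields:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8             /* power of two, so wrapping can mask */

typedef struct
{
    _Atomic uint64_t bytes_written; /* advanced only by the sender */
    _Atomic uint64_t bytes_read;    /* advanced only by the receiver */
    unsigned char buf[RING_SIZE];
} sketch_mq;

static bool
sketch_send(sketch_mq *mq, unsigned char c)
{
    uint64_t wr = atomic_load_explicit(&mq->bytes_written, memory_order_relaxed);
    uint64_t rd = atomic_load_explicit(&mq->bytes_read, memory_order_acquire);

    if (wr - rd == RING_SIZE)
        return false;           /* queue full */
    mq->buf[wr & (RING_SIZE - 1)] = c;
    /* release: payload must be visible before the new write pointer */
    atomic_store_explicit(&mq->bytes_written, wr + 1, memory_order_release);
    return true;
}

static bool
sketch_recv(sketch_mq *mq, unsigned char *c)
{
    uint64_t rd = atomic_load_explicit(&mq->bytes_read, memory_order_relaxed);
    uint64_t wr = atomic_load_explicit(&mq->bytes_written, memory_order_acquire);

    if (wr == rd)
        return false;           /* queue empty */
    *c = mq->buf[rd & (RING_SIZE - 1)];
    /* release: free the slot only after the payload has been copied out */
    atomic_store_explicit(&mq->bytes_read, rd + 1, memory_order_release);
    return true;
}
```

The detached-flag question would sit on top of this: an occasional
relaxed read of a detached flag in the non-blocking paths would only
delay, not prevent, noticing the peer's departure.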

> Not sure what happened to #3 on your list... you went from #2 to #4.

Threes are boring.

> > 4) Modulo computations in shm_mq are expensive:
> >
> >    that we end up with a full blown div makes sense - the compiler can't
> >    know anything about ringsize, therefore it can't optimize to a mask.
> >    I think we should force the size of the ringbuffer to be a power of
> >    two, and use a mask instead of a size for the buffer.
> This seems like it would require some redesign.  Right now we let the
> caller choose any size they want and subtract off our header size to
> find the actual queue size.  We can waste up to MAXALIGN-1 bytes, but
> that's not much.  This would waste up to half the bytes provided,
> which is probably not cool.

Yea, I think it'd require a shm_mq_estimate_size(Size queuesize) that
returns the next power-of-two queue size plus overhead.
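
The rounding itself is cheap. A sketch of what such an estimate helper
and the resulting mask-based wrapping might look like (illustrative
names only, not PostgreSQL's actual API):

```c
#include <stddef.h>

/* Round a requested ring size up to the next power of two, so that the
 * expensive "pos % ringsize" div can become a "pos & (ringsize - 1)"
 * mask.  Hypothetical helper, not shm_mq's real interface. */
static size_t
next_power_of_two(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* With a power-of-two ring size, wrapping is a single AND. */
static size_t
ring_offset(size_t pos, size_t ringsize)
{
    return pos & (ringsize - 1);
}
```

The caller-visible cost is the one Robert mentions: up to nearly half
the requested bytes can go unused once the size is rounded.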

> > 5) There's a *lot* of latch interactions. The biggest issue actually is
> >    the memory barrier implied by a SetLatch, waiting for the latch
> >    barely shows up.
> >
> >    Commenting out the memory barrier - which is NOT CORRECT - improves
> >    timing:
> >    before: 4675.626
> >    after: 4125.587
> >
> >    The correct fix obviously is not to break latch correctness. I think
> >    the big problem here is that we perform a SetLatch() for every read
> >    from the latch.
> I think it's a little bit of an overstatement to say that commenting
> out the memory barrier is not correct.  When we added that code, we
> removed this comment:
> - * Presently, when using a shared latch for interprocess signalling, the
> - * flag variable(s) set by senders and inspected by the wait loop must
> - * be protected by spinlocks or LWLocks, else it is possible to miss events
> - * on machines with weak memory ordering (such as PPC).  This restriction
> - * will be lifted in future by inserting suitable memory barriers into
> - * SetLatch and ResetLatch.
> It seems to me that in any case where the data is protected by an
> LWLock, the barriers we've added to SetLatch and ResetLatch are
> redundant. I'm not sure if that's entirely true in the spinlock case,
> because S_UNLOCK() is only documented to have release semantics, so
> maybe the load of latch->is_set could be speculated before the lock is
> dropped.  But I do wonder if we're just multiplying barriers endlessly
> instead of trying to think of ways to minimize them (e.g. have a
> variant of SpinLockRelease that acts as a full barrier instead of a
> release barrier, and then avoid a second barrier when checking the
> latch state).

I'm not convinced by this. IMO the multiplying largely comes from
superfluous actions, like performing SetLatch() per entry here rather
than per batch.

After all, I benchmarked this *after* an experimental conversion of
shm_mq to not use spinlocks - so there's actually no external barrier
providing these guarantees that could be combined with SetLatch()'s.

Presumably part of the cost here comes from the fact that the barriers
actually do have an influence over the ordering.

> All that having been said, a batch variant for reading tuples in bulk
> might make sense.  I think when I originally wrote this code I was
> thinking that one process might be filling the queue while another
> process was draining it, and that it might therefore be important to
> free up space as early as possible.  But maybe that's not a very good
> intuition.

I think that's a sensible goal, but I don't think it has to mean that
the draining has to happen after every entry. If you e.g. had a
shm_mq_receivev() with 16 iovecs, that'd commonly be only part of a
single tqueue mq (unless your tuples are > 4k).  I'll note that AFAICT
shm_mq_sendv() already batches its SetLatch() calls.

I think one important thing a batched drain can avoid is a worker
waking up just to put one new tuple into the queue and then sleeping
again. That's kinda expensive.
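
The shape of such a batched drain, reduced to a toy: pull up to 16
entries out in one go and signal the counterparty once per batch rather
than once per entry. The queue and "latch" below are plain arrays and
counters standing in for shm_mq and SetLatch(); every name is invented
for this sketch:

```c
#include <stddef.h>

#define BATCH 16

typedef struct
{
    const int *entries;         /* stand-in for queued tuples */
    size_t nentries;
    size_t read_pos;
    int set_latch_calls;        /* how often we'd have called SetLatch() */
} toy_queue;

/* Drain up to BATCH entries into out[]; returns how many were copied.
 * Only one wakeup is issued for the whole batch. */
static size_t
drain_batch(toy_queue *q, int *out)
{
    size_t n = 0;

    while (n < BATCH && q->read_pos < q->nentries)
        out[n++] = q->entries[q->read_pos++];
    if (n > 0)
        q->set_latch_calls++;
    return n;
}
```

For 40 queued entries that's three wakeups (16 + 16 + 8) instead of 40,
which is exactly the "wake up, enqueue one tuple, sleep" pattern being
avoided.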

> >    b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
> >       queue whenever it's not empty when a tuple is ready.
> Batching them with what?  I hate to postpone sending tuples we've got;
> that sounds nice in the case where we're sending tons of tuples at
> high speed, but there might be other cases where it makes the leader
> wait.

Yea, that'd need some smarts. How about doing something like batching up
locally as long as the queue contains less than one average-sized batch?

> > 6) I've observed, using strace, debug outputs with timings, and top with
> >    a short interval, that quite often only one backend has sufficient
> >    work, while other backends are largely idle.
> Doesn't that just mean we're bad at choosing how many workers to use?
> If one worker can't outrun the leader, it's better to have the other
> workers sleep and keep the one lucky worker on CPU than to keep
> context switching.  Or so I would assume.

No, I don't think that's necessarily true. If that worker's queue is
full every time, then yes. But I think a common scenario is that the
"current" worker only has a small portion of the queue filled. Draining
that immediately just leads to increased cacheline bouncing.

I'd not previously thought about this, but won't staying sticky to the
current worker potentially cause pins on individual tuples to be held
for a long time by workers not making any progress?

> >    It might also be a good idea to use a more directed form of wakeups,
> >    e.g. using a per-worker latch + a wait event set, to avoid iterating
> >    over all workers.
> I don't follow.

Well, right now we're busily checking each worker's queue. That's fine
with a handful of workers, but starts to become not that cheap pretty
soon afterwards. In the surely common case where the workers are the
bottleneck (because that's when parallelism is worthwhile), we'll check
each worker's queue once one of them has returned a single tuple. It'd
not be a stupid idea to have a per-worker latch and wait for the latches
of all workers at once, then drain, in a targeted fashion, only the
queues of the workers that WaitEventSetWait(nevents = nworkers)
signalled as ready.
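
The essential difference, stripped of the real WaitEventSet machinery
(which gets the ready set from the kernel rather than by scanning): the
workers register themselves on a ready list when they publish tuples -
what setting a per-worker latch registered in a wait event set would
achieve - and the leader visits only those entries instead of probing
every queue. All names below are invented for the sketch:

```c
#include <stddef.h>

#define MAX_WORKERS 8

typedef struct
{
    int ready[MAX_WORKERS];     /* indexes of workers with new tuples */
    size_t nready;
    int probes;                 /* queues the leader actually touched */
} toy_waitset;

/* Stand-in for a worker setting its per-worker latch. */
static void
toy_notify(toy_waitset *ws, int worker)
{
    ws->ready[ws->nready++] = worker;
}

/* Stand-in for the leader acting on WaitEventSetWait()'s result:
 * copies out the ready workers and touches only their queues. */
static size_t
toy_drain_ready(toy_waitset *ws, int *drained)
{
    size_t n = ws->nready;

    for (size_t i = 0; i < n; i++)
    {
        drained[i] = ws->ready[i];
        ws->probes++;
    }
    ws->nready = 0;
    return n;
}
```

With two of eight workers ready, the leader touches two queues, not
eight - the per-wakeup cost stops scaling with nworkers.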


Andres Freund

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)