On Tue, Oct 17, 2017 at 5:39 PM, Andres Freund <and...@anarazel.de> wrote:
> The precise query doesn't really matter, the observations here are more
> general, I hope.
>
> 1) nodeGather.c re-projects every row from workers. As far as I can tell
>    that's now always exactly the same targetlist as it's coming from the
>    worker. Projection capability was added in 8538a6307049590 (without
>    checking whether it's needed afaict), but I think it was in turn often
>    obsoleted by 992b5ba30dcafdc222341505b072a6b009b248a7. My
>    measurement shows that removing the projection yields quite massive
>    speedups in queries like yours, which is not too surprising.
That seems like an easy and worthwhile optimization.

> I suspect this just needs a tlist_matches_tupdesc check + an if
> around ExecProject(). And a test, right now tests pass unless
> force_parallel_mode is used even if just commenting out the
> projection unconditionally.

So, for this to fail, we'd need a query that uses parallelism but where
the target list contains a parallel-restricted function. Also, the
function should really be such that we'll reliably get a failure rather
than only with some small probability. I'm not quite sure how to put
together such a test case, but there's probably some way.

> 2) The spinlocks on both the sending and receiving side are quite hot:
>
> rafia query leader:
> +   36.16%  postgres  postgres          [.] shm_mq_receive
> +   19.49%  postgres  postgres          [.] s_lock
> +   13.24%  postgres  postgres          [.] SetLatch
>
> To test that theory, here are the timings for
> 1) spinlocks present
>    time: 6593.045
> 2) spinlock acquisition replaced by *full* memory barriers, which on x86
>    is a lock; addl $0,0(%%rsp)
>    time: 5159.306
> 3) spinlocks replaced by read/write barriers as appropriate.
>    time: 4687.610
>
> By my understanding of shm_mq.c's logic, something like 3) ought to
> be doable in a correct manner. There should be, in normal
> circumstances, only one end modifying each of the protected
> variables. Doing so instead of using full blown atomics with locked
> instructions is very likely to yield considerably better performance.

I think the sticking point here will be the handling of the mq_detached
flag. I feel like I fixed a bug at some point where this had to be
checked or set under the lock at the same time we were checking or
setting mq_bytes_read and/or mq_bytes_written, but I don't remember the
details. I like the idea, though.

Not sure what happened to #3 on your list... you went from #2 to #4.
> 4) Modulo computations in shm_mq are expensive:
>
>    that we end up with a full blown div makes sense - the compiler can't
>    know anything about ringsize, therefore it can't optimize to a mask.
>    I think we should force the size of the ringbuffer to be a power of
>    two, and use a mask instead of a size for the buffer.

This seems like it would require some redesign. Right now we let the
caller choose any size they want and subtract off our header size to
find the actual queue size. We can waste up to MAXALIGN-1 bytes, but
that's not much. This would waste up to half the bytes provided, which
is probably not cool.

> 5) There's a *lot* of latch interactions. The biggest issue actually is
>    the memory barrier implied by a SetLatch, waiting for the latch
>    barely shows up.
>
>    Commenting out the memory barrier - which is NOT CORRECT - improves
>    timing:
>    before: 4675.626
>    after: 4125.587
>
>    The correct fix obviously is not to break latch correctness. I think
>    the big problem here is that we perform a SetLatch() for every read
>    from the queue.

I think it's a little bit of an overstatement to say that commenting out
the memory barrier is not correct. When we added that code, we removed
this comment:

- * Presently, when using a shared latch for interprocess signalling, the
- * flag variable(s) set by senders and inspected by the wait loop must
- * be protected by spinlocks or LWLocks, else it is possible to miss events
- * on machines with weak memory ordering (such as PPC). This restriction
- * will be lifted in future by inserting suitable memory barriers into
- * SetLatch and ResetLatch.

It seems to me that in any case where the data is protected by an
LWLock, the barriers we've added to SetLatch and ResetLatch are
redundant. I'm not sure if that's entirely true in the spinlock case,
because S_UNLOCK() is only documented to have release semantics, so
maybe the load of latch->is_set could be speculated before the lock is
dropped.
But I do wonder if we're just multiplying barriers endlessly instead of
trying to think of ways to minimize them (e.g. have a variant of
SpinLockRelease that acts as a full barrier instead of a release
barrier, and then avoid a second barrier when checking the latch state).

All that having been said, a batch variant for reading tuples in bulk
might make sense. I think when I originally wrote this code I was
thinking that one process might be filling the queue while another
process was draining it, and that it might therefore be important to
free up space as early as possible. But maybe that's not a very good
intuition.

> b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
>    queue whenever it's not empty when a tuple is ready.

Batching them with what? I hate to postpone sending tuples we've got;
that sounds nice in the case where we're sending tons of tuples at high
speed, but there might be other cases where it makes the leader wait.

> 6) I've observed, using strace, debug outputs with timings, and top with
>    a short interval, that quite often only one backend has sufficient
>    work, while other backends are largely idle.

Doesn't that just mean we're bad at choosing how many workers to use? If
one worker can't outrun the leader, it's better to have the other
workers sleep and keep the one lucky worker on CPU than to keep context
switching. Or so I would assume.

> One fairly drastic solution would be to move away from a
> single-producer-single-consumer style, per-worker, queue, to a global
> queue. That'd avoid fairness issues between the individual workers,
> at the price of potential added contention. One disadvantage is that
> such a combined queue approach is not easily applicable for gather
> merge.

It might also lead to more contention.

> One less drastic approach would be to move to try to drain the queue
> fully in one batch, and then move to the next queue. That'd avoid
> triggering small wakeups for each individual tuple, as one
> currently would get without the 'stickyness'.

I don't know if that is better, but it seems worth a try.

> It might also be a good idea to use a more directed form of wakeups,
> e.g. using a per-worker latch + a wait event set, to avoid iterating
> over all workers.

I don't follow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers