On Thu, May 4, 2017 at 9:37 PM, Andres Freund <and...@anarazel.de> wrote:
> Have those benchmarks, even in a very informal form, been shared /
> collected / referenced centrally? I'd be very interested to know where
> the different contention points are. Possibilities:
>
> - in non-resident workloads: too many concurrent IOs submitted, leading
>   to too much random IO
> - internal contention in the parallel node, e.g. parallel seqscan
> - contention on PG components like buffer mapping, procarray, clog
> - contention on individual buffers, e.g. btree root pages, or small
>   tables in nestloop joins
> - just too many context switches, due to ineffective parallelization
>
> Probably multiple of those are a problem, but without trying to figure
> them out, we're going to have a hard time developing a better costing
> model...
It's pretty easy (but IMHO not very interesting) to measure internal
contention in the Parallel Seq Scan node. As David points out
downthread, that problem isn't trivial to fix, but it's not that hard,
either.

I do believe that there is a problem with too much concurrent I/O on
things like:

Gather
  -> Hash Join
       -> Parallel Seq Scan on lineitem
       -> Seq Scan on lineitem

If that goes to multiple batches, you're probably wrenching the disk
head all over the place - multiple processes are now reading and
writing batch files at exactly the same time.

I also strongly suspect that contention on individual buffers can turn
into a problem on queries like this:

Gather (Merge)
  -> Merge Join
       -> Parallel Index Scan
       -> Index Scan

The parallel index scan surely has some upper limit on concurrency,
but if that is exceeded, what will tend to happen is that processes
will sleep. On the inner side, though, every process is doing a full
index scan, and chances are good that they are doing it more or less
in lock step, hitting the same buffers at the same time.

Also consider:

Finalize HashAggregate
  -> Gather
       -> Partial HashAggregate
            -> Parallel Seq Scan

Suppose that the average group contains ten items, which will tend to
be widely spaced across the table. As you add workers, the number of
workers across which any given group gets spread increases. There's
probably a "sweet spot" here. Up to a certain point, adding workers
makes it faster, because the work of the Seq Scan and the Partial
HashAggregate gets spread across more processes. However, adding
workers also increases the number of rows that pass through the Gather
node, because as you add workers, more groups end up being split
across workers, or across more workers. That means more and more of
the aggregation starts happening in the Finalize HashAggregate rather
than the Partial HashAggregate. If you had, for example, 20 workers
here, almost nothing would be happening in the Partial HashAggregate,
because chances are good that each of the 10 rows in each group would
be encountered by a different worker, so that'd probably be
counterproductive.

I think there are two separate questions here:

1. How do we reduce or eliminate contention during parallel query
execution?

2. How do we make the query planner smarter about picking the optimal
number of workers?

I think the second problem is both more difficult and more
interesting. I think that no matter how much work we do on #1, there
are always going to be cases where the amount of effective parallelism
has some finite limit - and I think the limit will vary substantially
from query to query. So, without #2, we'll either leave a lot of money
on the table for queries that can benefit from using a large number of
workers, or we'll throw extra workers uselessly at queries where they
don't help (or even make things worse).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
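
[Editor's sketch] To see the multi-batch I/O behavior described for the
hash-join plan above, one way is to force the hash to spill with a small
work_mem on a TPC-H-style schema. The self-join and settings below are
illustrative assumptions, not taken from any benchmark in this thread, and
whether the planner actually picks the parallel hash-join shape depends on
the data:

-- Illustrative only: keep work_mem small so the Hash node spills to disk.
SET work_mem = '4MB';
SET max_parallel_workers_per_gather = 4;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM lineitem l1
JOIN lineitem l2 ON l1.l_orderkey = l2.l_orderkey;
-- In the Hash node's output, "Batches: N" with N > 1 means each process is
-- writing and re-reading its own batch files at the same time as the others.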
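
[Editor's sketch] The group-splitting argument for the Finalize/Partial
HashAggregate case can be put in rough numbers. Assuming, as a
simplification, that each row of a g-row group lands on one of w workers
independently and uniformly at random (in reality a Parallel Seq Scan hands
out blocks, so physical clustering of a group would reduce the splitting),
the expected number of distinct workers that see a given group is
w * (1 - (1 - 1/w)^g). This throwaway query just evaluates that formula for
a 10-row group:

-- Expected number of workers a 10-row group is split across, for various
-- worker counts, under the uniform-assignment simplification.
SELECT w,
       round(w * (1 - power(1 - 1.0/w, 10)), 2) AS expected_workers_per_group
FROM unnest(ARRAY[1, 2, 4, 8, 20]) AS w;

Under that model, a 10-row group is already spread across about 3.8 workers
at 4 workers and about 8 workers at 20, so with many workers the Partial
HashAggregate barely reduces the row count and nearly all of the combining
shifts into the single-process Finalize HashAggregate, which is what makes
adding workers counterproductive past the sweet spot.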