On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmh...@gmail.com> wrote:
>
>> On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
>> > Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
>> > 1) Non-parallel (default)
>> > Execution Time: 31627.492 ms
>> >
>> > 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
>> > Execution Time: 37498.672 ms
>> >
>> > 3) Same as above (2) but with the patch.
>> > Execution Time: 23649.287 ms
>>
>> This strikes me as an amazingly good result. I guess before seeing
>> these results, I would have said that you can't reasonably expect
>> parallel query to win on a query like this because there isn't enough
>> for the workers to do. It's not like they are spending time evaluating
>> filter conditions or anything like that - they're just fetching tuples
>> off of disk pages and sticking them into a queue. And it's unclear to
>> me why it should be better to have a bunch of processes doing that
>> instead of just one. I would have thought, looking at just (1) and
>> (2), that parallelism gained nothing and communication overhead lost
>> 6 seconds.
>>
>> But what this suggests is that parallelism gained at least 8 seconds,
>> and communication overhead lost at least 14 seconds. In fact...
>
> Right, good observation.
>
>> > - If I apply both Experiment#1 and Experiment#2 patches together then,
>> > we can further reduce the execution time to 20963.539 ms (with 4
>> > workers and 4MB tuple queue size)
>>
>> ...this suggests that parallelism actually gained at least 10-11
>> seconds, and the communication overhead lost at least 15-16 seconds.
>
> Yes
>
>> If that's accurate, it's pretty crazy. We might need to drastically
>> reduce the value of parallel_tuple_cost if these results hold up and
>> this patch gets committed.
>
> In one of my experiments [Test1] I have noticed that even on head the
> forced parallel plan is significantly faster than the non-parallel plan,
> and with the patch it is even better. The point is that even now there
> may be cases where forced parallel plans are faster, but we cannot be
> sure whether that alone justifies reducing parallel_tuple_cost. With the
> patch, however, the parallel tuple queue is clearly faster than what we
> have now, so I agree we should consider reducing parallel_tuple_cost
> once this patch is committed.
>
> Additionally, I've done some more experiments with artificial workloads,
> as well as workloads where the parallel plan is selected by default, and
> in all cases I've seen a significant improvement. The gain is directly
> proportional to the load on the tuple queue, as expected.
>
> Test1: (Workers return all tuples, but only a few tuples are returned
> to the client)
> ----------------------------------------------------
> INSERT INTO t SELECT i%10, repeat('a', 200) from
> generate_series(1,200000000) as i;
> set max_parallel_workers_per_gather=4;
>
> Target Query: SELECT random() FROM t GROUP BY a;
>
> Non-parallel (default plan): 77170.421 ms
> Parallel (parallel_tuple_cost=0): 53794.324 ms
> Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
>
> 20% gain compared to forced parallel, 45% gain compared to the default
> plan.
>
> Test2: (Parallel case with default parallel_tuple_cost)
> ----------------------------------------------
> INSERT INTO t SELECT i, repeat('a', 200) from
> generate_series(1,200000000) as i;
>
> set max_parallel_workers_per_gather=4;
> SELECT * from t WHERE a < 17500000;
> Parallel (default plan): 23730.054 ms
> Parallel with patch (default plan): 21614.251 ms
>
> 8 to 10% gain compared to the default parallel plan.
>
> I have done cleanup in the patch and I will add this to the September
> commitfest.
>
> I am planning to do further testing to identify the optimal batch size
> for different workloads. With the above workload I am seeing similar
> results with batch sizes from 4k to 16k (1/4 of the ring size), so in
> the attached patch I have kept it at 1/4 of the ring size. We might
> change that based on more analysis and testing.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>

Hi,
Some minor comments.

For shm_mq.c, the existing comment says:

 * mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete

+	Size		mqh_send_pending;
 	bool		mqh_length_word_complete;
 	bool		mqh_counterparty_attached;

I wonder if mqh_send_pending should be declared after
mqh_length_word_complete - that way, the order of the fields matches
the order in which the comment explains them.

+	if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)

The above can be written as:

+	if (force_flush || mqh->mqh_send_pending > mq->mq_ring_size / 4)

so that when force_flush is true, the other condition is not evaluated.

Cheers
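P.S. To make the two points above concrete, here is a minimal standalone
sketch. It is not the actual shm_mq.c code: the real struct has more
fields and the flush decision happens inside the send path; the
need_flush() helper and the simplified structs below are purely
illustrative. It shows mqh_send_pending declared right after
mqh_length_word_complete, and the flush check with force_flush tested
first so the threshold comparison is skipped when a flush is forced.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef size_t Size;

/* Stand-in for the shared queue; only the field used below is kept. */
typedef struct shm_mq
{
    Size        mq_ring_size;
} shm_mq;

/* Simplified handle; field order follows the suggestion above. */
typedef struct shm_mq_handle
{
    shm_mq     *mqh_queue;
    Size        mqh_partial_bytes;
    Size        mqh_expected_bytes;
    bool        mqh_length_word_complete;
    Size        mqh_send_pending;   /* bytes written locally but not yet
                                     * advertised to the receiver */
    bool        mqh_counterparty_attached;
} shm_mq_handle;

/*
 * Decide whether the locally accumulated bytes should be flushed, i.e.
 * made visible to the receiver.  force_flush comes first so that when
 * the caller demands a flush, the threshold (1/4 of the ring size, the
 * batch size discussed above) is not evaluated at all.
 */
static bool
need_flush(shm_mq_handle *mqh, bool force_flush)
{
    shm_mq     *mq = mqh->mqh_queue;

    return force_flush || mqh->mqh_send_pending > mq->mq_ring_size / 4;
}

int
main(void)
{
    shm_mq      mq = {.mq_ring_size = 65536};
    shm_mq_handle mqh = {.mqh_queue = &mq, .mqh_send_pending = 4096};

    printf("flush? %d\n", need_flush(&mqh, false)); /* 0: below threshold */
    printf("flush? %d\n", need_flush(&mqh, true));  /* 1: forced */
    return 0;
}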