On Wed, 10 Jun 2020 at 17:39, David Rowley <dgrowle...@gmail.com> wrote:
>
> On Wed, 10 Jun 2020 at 17:21, Thomas Munro <thomas.mu...@gmail.com> wrote:
> > I also heard from Andres that he likes this patch with his AIO
> > prototype, because of the way request merging works. So it seems like
> > there are several reasons to want it.
> >
> > But ... where should we get the maximum step size from? A GUC?
>
> I guess we'd need to determine if other step sizes were better under
> any conditions. I guess one condition would be if there was a LIMIT
> clause. I could check if setting it to 1024 makes any difference, but
> I'm thinking it won't since I got fairly consistent results on all
> worker settings with the patched version.
I did another round of testing on the same machine, trying some step
sizes larger than 64 blocks. I can confirm that it does improve the
situation further going bigger than 64. I got up as far as 16384, but
made a couple of additional changes for that run only: instead of
increasing the ramp-up 1 block at a time, I initialised phsw_step_size
to 1 and multiplied it by 2 until I reached the chosen step size. With
numbers that big, ramping up 1 block at a time was slow enough that I'd
never have reached the target step size. (A rough C sketch of that
ramp-up is below.)

Here are the results of the testing:

Master:
max_parallel_workers_per_gather = 0: Time: 148481.244 ms (02:28.481) (706.2MB/sec)
max_parallel_workers_per_gather = 1: Time: 327556.121 ms (05:27.556) (320.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 329055.530 ms (05:29.056) (318.6MB/sec)

Patched stepsize = 64:
max_parallel_workers_per_gather = 0: Time: 141363.991 ms (02:21.364) (741.7MB/sec)
max_parallel_workers_per_gather = 1: Time: 144982.202 ms (02:24.982) (723.2MB/sec)
max_parallel_workers_per_gather = 2: Time: 143355.656 ms (02:23.356) (731.4MB/sec)

Patched stepsize = 1024:
max_parallel_workers_per_gather = 0: Time: 152599.159 ms (02:32.599) (687.1MB/sec)
max_parallel_workers_per_gather = 1: Time: 104227.232 ms (01:44.227) (1006.04MB/sec)
max_parallel_workers_per_gather = 2: Time: 97149.343 ms (01:37.149) (1079.3MB/sec)

Patched stepsize = 8192:
max_parallel_workers_per_gather = 0: Time: 143524.038 ms (02:23.524) (730.59MB/sec)
max_parallel_workers_per_gather = 1: Time: 102899.288 ms (01:42.899) (1019.0MB/sec)
max_parallel_workers_per_gather = 2: Time: 91148.340 ms (01:31.148) (1150.4MB/sec)

Patched stepsize = 16384 (power-of-2 ramp-up):
max_parallel_workers_per_gather = 0: Time: 144598.502 ms (02:24.599) (725.16MB/sec)
max_parallel_workers_per_gather = 1: Time: 97344.160 ms (01:37.344) (1077.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 88025.420 ms (01:28.025) (1191.2MB/sec)

I thought about what you mentioned about a GUC, and I think that's a
bad idea. I think it would be better to choose based on the relation
size. For smaller relations, we want to keep the step size small:
someone may enable parallel query on a small relation if they're doing
something like calling an expensive function on the results, so we do
need to avoid going large for small relations.

I considered something like:

create function nextpower2(a bigint) returns bigint as $$
declare n bigint := 1;
begin
    while n < a loop
        n := n * 2;
    end loop;
    return n;
end;
$$ language plpgsql;

select pg_size_pretty(power(2,p)::numeric * 8192) rel_size,
       nextpower2(power(2,p)::bigint / 1024) as stepsize
from generate_series(1,32) p;

 rel_size | stepsize
----------+----------
 16 kB    |        1
 32 kB    |        1
 64 kB    |        1
 128 kB   |        1
 256 kB   |        1
 512 kB   |        1
 1024 kB  |        1
 2048 kB  |        1
 4096 kB  |        1
 8192 kB  |        1
 16 MB    |        2
 32 MB    |        4
 64 MB    |        8
 128 MB   |       16
 256 MB   |       32
 512 MB   |       64
 1024 MB  |      128
 2048 MB  |      256
 4096 MB  |      512
 8192 MB  |     1024
 16 GB    |     2048
 32 GB    |     4096
 64 GB    |     8192
 128 GB   |    16384
 256 GB   |    32768
 512 GB   |    65536
 1024 GB  |   131072
 2048 GB  |   262144
 4096 GB  |   524288
 8192 GB  |  1048576
 16 TB    |  2097152
 32 TB    |  4194304

So with that algorithm, the 100GB table I've been using in my tests
would get a step size of 16384. I think we'd want to avoid going any
higher than that. The above code means we'll do between just below 0.1%
and 0.2% of the relation per step. If I divided the number of blocks by,
say, 128 instead of 1024, then each step would be about 0.78% to 1.56%
of the relation. (A C version of the same calculation is sketched
below.)
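To make the ramp-up change clearer, here's a rough standalone C sketch
of the doubling behaviour. It's only an illustration, not the patch
itself: apart from phsw_step_size, the type and function names here are
invented.

#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint64_t phsw_step_size;    /* current chunk size, in blocks */
    uint64_t target_step_size;  /* chosen maximum chunk size */
} StepState;

/* Hand out the current step size, then double it until the target is hit. */
static uint64_t
next_step_size(StepState *state)
{
    uint64_t this_step = state->phsw_step_size;

    if (state->phsw_step_size < state->target_step_size)
    {
        state->phsw_step_size *= 2;
        if (state->phsw_step_size > state->target_step_size)
            state->phsw_step_size = state->target_step_size;
    }
    return this_step;
}

int
main(void)
{
    StepState state = {1, 16384};
    int       i;

    /* prints 1, 2, 4, ..., 8192, 16384, then stays at 16384 */
    for (i = 0; i < 18; i++)
        printf("chunk %d: %llu blocks\n", i,
               (unsigned long long) next_step_size(&state));
    return 0;
}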
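And here's roughly what the divide-by-1024-and-round-up idea could look
like on the C side. Again, just a sketch: choose_step_size and
next_power_of_2 are names I've made up here, mirroring the plpgsql
nextpower2() above.

#include <stdint.h>
#include <stdio.h>

/* round up to the next power of 2; returns 1 for inputs of 0 or 1 */
static uint64_t
next_power_of_2(uint64_t v)
{
    uint64_t n = 1;

    while (n < v)
        n *= 2;
    return n;
}

/*
 * Pick the maximum chunk size, in blocks, for a parallel scan of a
 * relation of 'nblocks' blocks: about 1/1024th of the relation, rounded
 * up to a power of 2, and never less than one block.
 */
static uint64_t
choose_step_size(uint64_t nblocks)
{
    return next_power_of_2(nblocks / 1024);
}

int
main(void)
{
    /* a ~100GB table is 13107200 8kB blocks; this prints 16384 */
    printf("%llu\n", (unsigned long long) choose_step_size(13107200));
    return 0;
}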
It's not unrealistic today that someone might throw that many workers
at a job, so I'd say dividing by 1024 or even 2048 would likely be
about right.

David