On Wed, 10 Jun 2020 at 17:39, David Rowley <dgrowle...@gmail.com> wrote:
>
> On Wed, 10 Jun 2020 at 17:21, Thomas Munro <thomas.mu...@gmail.com> wrote:
> > I also heard from Andres that he likes this patch with his AIO
> > prototype, because of the way request merging works. So it seems like
> > there are several reasons to want it.
> >
> > But ... where should we get the maximum step size from? A GUC?
>
> I guess we'd need to determine if other step sizes were better under
> any conditions. I guess one condition would be if there was a LIMIT
> clause. I could check if setting it to 1024 makes any difference, but
> I'm thinking it won't since I got fairly consistent results on all
> worker settings with the patched version.
I did another round of testing on the same machine, trying some step
sizes larger than 64 blocks. I can confirm that it does improve the
situation further going bigger than 64. I got up as far as 16384, but
made a couple of additional changes for that run only: instead of
increasing the ramp-up 1 block at a time, I initialised phsw_step_size
to 1 and multiplied it by 2 until I reached the chosen step size. With
numbers that big, ramping up 1 block at a time was slow enough that I'd
never have reached the target step size. (A rough C sketch of that
ramp-up is below.)

Here are the results of the testing:

Master:
max_parallel_workers_per_gather = 0: Time: 148481.244 ms (02:28.481) (706.2MB/sec)
max_parallel_workers_per_gather = 1: Time: 327556.121 ms (05:27.556) (320.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 329055.530 ms (05:29.056) (318.6MB/sec)

Patched stepsize = 64:
max_parallel_workers_per_gather = 0: Time: 141363.991 ms (02:21.364) (741.7MB/sec)
max_parallel_workers_per_gather = 1: Time: 144982.202 ms (02:24.982) (723.2MB/sec)
max_parallel_workers_per_gather = 2: Time: 143355.656 ms (02:23.356) (731.4MB/sec)

Patched stepsize = 1024:
max_parallel_workers_per_gather = 0: Time: 152599.159 ms (02:32.599) (687.1MB/sec)
max_parallel_workers_per_gather = 1: Time: 104227.232 ms (01:44.227) (1006.04MB/sec)
max_parallel_workers_per_gather = 2: Time: 97149.343 ms (01:37.149) (1079.3MB/sec)

Patched stepsize = 8192:
max_parallel_workers_per_gather = 0: Time: 143524.038 ms (02:23.524) (730.59MB/sec)
max_parallel_workers_per_gather = 1: Time: 102899.288 ms (01:42.899) (1019.0MB/sec)
max_parallel_workers_per_gather = 2: Time: 91148.340 ms (01:31.148) (1150.4MB/sec)

Patched stepsize = 16384 (power-of-2 ramp-up):
max_parallel_workers_per_gather = 0: Time: 144598.502 ms (02:24.599) (725.16MB/sec)
max_parallel_workers_per_gather = 1: Time: 97344.160 ms (01:37.344) (1077.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 88025.420 ms (01:28.025) (1191.2MB/sec)

I thought about what you mentioned about a GUC, and I think that's a
bad idea. I think it would be better to choose based on the relation
size. For smaller relations, we want to keep the step size small:
someone may enable parallel query on a small relation if they're doing
something like calling an expensive function on the results, so we do
need to avoid going large for small relations.

I considered something like:

create function nextpower2(a bigint) returns bigint as $$
declare n bigint := 1;
begin
    while n < a loop
        n := n * 2;
    end loop;
    return n;
end;
$$ language plpgsql;

select pg_size_pretty(power(2,p)::numeric * 8192) rel_size,
       nextpower2(power(2,p)::bigint / 1024) as stepsize
from generate_series(1,32) p;

 rel_size | stepsize
----------+----------
 16 kB    |        1
 32 kB    |        1
 64 kB    |        1
 128 kB   |        1
 256 kB   |        1
 512 kB   |        1
 1024 kB  |        1
 2048 kB  |        1
 4096 kB  |        1
 8192 kB  |        1
 16 MB    |        2
 32 MB    |        4
 64 MB    |        8
 128 MB   |       16
 256 MB   |       32
 512 MB   |       64
 1024 MB  |      128
 2048 MB  |      256
 4096 MB  |      512
 8192 MB  |     1024
 16 GB    |     2048
 32 GB    |     4096
 64 GB    |     8192
 128 GB   |    16384
 256 GB   |    32768
 512 GB   |    65536
 1024 GB  |   131072
 2048 GB  |   262144
 4096 GB  |   524288
 8192 GB  |  1048576
 16 TB    |  2097152
 32 TB    |  4194304

So with that algorithm, the 100GB table I've been using in my tests
would get a step size of 16384. I think we'd want to avoid going any
higher than that. The above code means we'll do between just below 0.1%
and 0.2% of the relation per step. If I divided the number of blocks by,
say, 128 instead of 1024, then each step would be about 0.78% to 1.56%
of the relation. (A C version of the same calculation is sketched
below.)
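To make the ramp-up change clearer, here's a rough standalone C sketch
of the doubling behaviour. It's only an illustration, not the patch
itself: apart from phsw_step_size, the type and function names here are
invented.

#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint64_t phsw_step_size;    /* current chunk size, in blocks */
    uint64_t target_step_size;  /* chosen maximum chunk size */
} StepState;

/* Hand out the current step size, then double it until the target is hit. */
static uint64_t
next_step_size(StepState *state)
{
    uint64_t this_step = state->phsw_step_size;

    if (state->phsw_step_size < state->target_step_size)
    {
        state->phsw_step_size *= 2;
        if (state->phsw_step_size > state->target_step_size)
            state->phsw_step_size = state->target_step_size;
    }
    return this_step;
}

int
main(void)
{
    StepState state = {1, 16384};
    int       i;

    /* prints 1, 2, 4, ..., 8192, 16384, then stays at 16384 */
    for (i = 0; i < 18; i++)
        printf("chunk %d: %llu blocks\n", i,
               (unsigned long long) next_step_size(&state));
    return 0;
}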
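And here's roughly what the divide-by-1024-and-round-up idea could look
like on the C side. Again, just a sketch: choose_step_size and
next_power_of_2 are names I've made up here, mirroring the plpgsql
nextpower2() above.

#include <stdint.h>
#include <stdio.h>

/* round up to the next power of 2; returns 1 for inputs of 0 or 1 */
static uint64_t
next_power_of_2(uint64_t v)
{
    uint64_t n = 1;

    while (n < v)
        n *= 2;
    return n;
}

/*
 * Pick the maximum chunk size, in blocks, for a parallel scan of a
 * relation of 'nblocks' blocks: about 1/1024th of the relation, rounded
 * up to a power of 2, and never less than one block.
 */
static uint64_t
choose_step_size(uint64_t nblocks)
{
    return next_power_of_2(nblocks / 1024);
}

int
main(void)
{
    /* a ~100GB table is 13107200 8kB blocks; this prints 16384 */
    printf("%llu\n", (unsigned long long) choose_step_size(13107200));
    return 0;
}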
It's not unrealistic today that someone might throw that many workers
at a job, so I'd say dividing by 1024 or even 2048 would likely be
about right.

David