On Thu, Jun 18, 2020 at 6:05 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> Here's a version that adds some documentation.
I jumped on a dual socket machine with 36 cores/72 threads and 144GB of RAM (Azure F72s_v2) running Linux, configured with 50GB of huge pages available, and I ran a very simple test: select count(*) from t t1 join t t2 using (i), where the table was created with create table t as select generate_series(1, 400000000)::int i, and then prewarmed into 20GB of shared_buffers. I compared the default behaviour to preallocate_dynamic_shared_memory=20GB, with work_mem set sky high so that there would be no batching (you get a hash table of around 16GB). I set things up so that I could test with a range of worker processes, and computed the speedup compared to a serial hash join. Here's what I got:

 Processes   Default            Preallocated
  1          627.6s
  9          101.3s =  6.1x     68.1s =  9.2x
 18           56.1s = 11.1x     34.9s = 17.9x
 27           42.5s = 14.7x     23.5s = 26.7x
 36           36.0s = 17.4x     18.2s = 34.4x
 45           33.5s = 18.7x     15.5s = 40.5x
 54           35.6s = 17.6x     13.6s = 46.1x
 63           35.4s = 17.7x     12.2s = 51.4x
 72           33.8s = 18.5x     11.3s = 55.5x

It scaled nearly perfectly up to somewhere just under 36 threads, and then the slope tapered off a bit so that each extra process was contributing a bit over half of its potential. I can improve the slope after the halfway point a bit by cranking HASH_CHUNK_SIZE up to 128KB (and it doesn't get much better after that):

 Processes   Default            Preallocated
  1          627.6s
  9          102.7s =  6.1x     67.7s =  9.2x
 18           56.8s = 11.1x     34.8s = 18.0x
 27           41.0s = 15.3x     23.4s = 26.8x
 36           33.9s = 18.5x     18.2s = 34.4x
 45           30.1s = 20.8x     15.4s = 40.7x
 54           27.2s = 23.0x     13.3s = 47.1x
 63           25.1s = 25.0x     11.9s = 52.7x
 72           23.8s = 26.3x     10.8s = 58.1x

I don't claim that this is representative of any particular workload or server configuration, but it's a good way to show that bottleneck, and it's pretty cool to be able to run a query that previously took over 10 minutes in 10 seconds.
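For anyone who wants to repeat this, the setup described above amounts to something like the following sketch. The GUC settings come from this message (preallocate_dynamic_shared_memory is the patched GUC under test); the exact work_mem value and worker counts are assumptions, and pg_prewarm is assumed to be available:

```sql
-- postgresql.conf (sketch of the preallocated configuration):
--   shared_buffers = '20GB'
--   huge_pages = on                            -- 50GB of huge pages set up in the OS
--   preallocate_dynamic_shared_memory = '20GB' -- patched GUC being tested
--   work_mem = '32GB'                          -- assumed value; just "sky high" so no batching

-- Build the 400 million row test table.
create table t as select generate_series(1, 400000000)::int i;

-- Prewarm it into shared_buffers.
create extension if not exists pg_prewarm;
select pg_prewarm('t');

-- The benchmark query: a self-join that builds a ~16GB shared hash table.
select count(*) from t t1 join t t2 using (i);
```

The number of participating processes would then be varied per run via the parallel-worker GUCs (e.g. max_parallel_workers_per_gather), timing the same query at each setting.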
(I can shave a further 10% off these times with my experimental hash join prefetching patch, but I'll probably write about that separately when I've figured out why it's not doing better than that...).