On Thu, Jun 18, 2020 at 6:05 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> Here's a version that adds some documentation.

I jumped on a dual-socket machine with 36 cores/72 threads and 144GB
of RAM (an Azure F72s_v2) running Linux, configured with 50GB of huge
pages available, and ran a very simple test: select count(*) from t
t1 join t t2 using (i).  The table was created with create table t as
select generate_series(1, 400000000)::int i and prewarmed into 20GB
of shared_buffers.  I compared the default behaviour against
preallocate_dynamic_shared_memory=20GB, with work_mem set sky high so
that there would be no batching (you get a hash table of around
16GB), ran the query with a range of worker-process counts, and
computed the speedup relative to a serial hash join.
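For anyone who wants to reproduce this, here's a rough sketch of the
recipe.  The pg_prewarm call, the exact work_mem value, and the
worker-count settings are reconstructions rather than the literal
commands I ran; preallocate_dynamic_shared_memory is the setting
added by the patch:

  -- postgresql.conf on the patched server (OS has 50GB of huge pages reserved):
  --   shared_buffers = '20GB'
  --   huge_pages = on
  --   preallocate_dynamic_shared_memory = '20GB'  -- the patch's GUC; 0 for "Default" runs
  --   max_worker_processes = 80
  --   max_parallel_workers = 72

  create table t as select generate_series(1, 400000000)::int i;
  analyze t;

  create extension if not exists pg_prewarm;
  select pg_prewarm('t');                    -- pull the table into shared_buffers

  set work_mem = '32GB';                     -- sky high, so the join never batches
  alter table t set (parallel_workers = 8);  -- with the leader, 9 processes; varied per run
  set max_parallel_workers_per_gather = 8;
  select count(*) from t t1 join t t2 using (i);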

Here's what I got (elapsed time, with the speedup over the serial run):

Processes   Default           Preallocated
1           627.6s
9           101.3s = 6.1x     68.1s = 9.2x
18           56.1s = 11.1x    34.9s = 17.9x
27           42.5s = 14.7x    23.5s = 26.7x
36           36.0s = 17.4x    18.2s = 34.4x
45           33.5s = 18.7x    15.5s = 40.5x
54           35.6s = 17.6x    13.6s = 46.1x
63           35.4s = 17.7x    12.2s = 51.4x
72           33.8s = 18.5x    11.3s = 55.5x
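
(The speedup column is just the serial time divided by the parallel
time, e.g. 627.6s / 11.3s ≈ 55.5x for the 72-process run.)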

It scaled nearly perfectly up to just under 36 processes, roughly the
machine's physical core count, and then the slope tapered off so that
each extra process contributed a bit over half of its potential.  I
can improve the slope past that point a bit by cranking
HASH_CHUNK_SIZE (32KB by default, in src/include/executor/hashjoin.h)
up to 128KB (it doesn't get much better beyond that):

Processes   Default           Preallocated
1           627.6s
9           102.7s = 6.1x     67.7s = 9.2x
18           56.8s = 11.1x    34.8s = 18.0x
27           41.0s = 15.3x    23.4s = 26.8x
36           33.9s = 18.5x    18.2s = 34.4x
45           30.1s = 20.8x    15.4s = 40.7x
54           27.2s = 23.0x    13.3s = 47.1x
63           25.1s = 25.0x    11.9s = 52.7x
72           23.8s = 26.3x    10.8s = 58.1x

I don't claim that this is representative of any particular workload
or server configuration, but it's a good way to show the bottleneck
this patch addresses, and it's pretty cool to run a query in 10
seconds that previously took over 10 minutes.  (I can shave a further
10% off these times with my experimental hash join prefetching patch,
but I'll probably write about that separately once I've figured out
why it isn't doing better than that...)
