Hello,

As discussed elsewhere[1][2], our algorithm for deciding when to give
up on repartitioning (AKA increasing the number of batches) tends to
keep going until it has a number of batches that is a function of the
number of distinct well-distributed keys.  I wanted to move this minor
issue away from Tomas Vondra's thread[2], since it's a mostly
independent problem.
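To see the mechanism without firing up a database, here is a throwaway
simulation of the rule in ExecHashIncreaseNumBatches
(src/backend/executor/nodeHash.c).  Everything in it (the hash
function, the tuple counts, the memory budget) is a made-up stand-in;
the only part modelled on the real code is the all-or-nothing give-up
test near the bottom of the loop.

/*
 * batch_sim.c: simulate hash join batch splitting with one heavily
 * duplicated key plus some uniform keys, tracking only the overflowing
 * batch (the one holding the duplicates).
 *
 *   cc -O2 -o batch_sim batch_sim.c && ./batch_sim
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NKEYS    1000        /* distinct well-distributed keys */
#define NDUPS    1000000     /* duplicates of the one skewed key */
#define SKEW_KEY 42
#define BUDGET   50000       /* stand-in for work_mem, in tuples */

/* stand-in for PostgreSQL's hash function; any decent mixer will do */
static uint32_t
hash_u32(uint32_t k)
{
    k ^= k >> 16;
    k *= 0x85ebca6b;
    k ^= k >> 13;
    k *= 0xc2b2ae35;
    k ^= k >> 16;
    return k;
}

int
main(void)
{
    uint32_t keys[NKEYS];    /* uniform keys sharing the dups' batch */
    uint32_t nbatch = 16;
    uint32_t batchno = hash_u32(SKEW_KEY) % nbatch;
    int      nshared = 0;
    bool     growEnabled = true;

    /* collect the uniform keys that start out in the skewed batch */
    for (int k = 1; k <= NKEYS; k++)
        if (hash_u32(k) % nbatch == batchno)
            keys[nshared++] = k;

    /* keep splitting while the skewed batch exceeds the budget */
    while (growEnabled && NDUPS + nshared > BUDGET && nbatch < (1U << 30))
    {
        uint32_t newnbatch = nbatch * 2;
        uint32_t newbatchno = hash_u32(SKEW_KEY) % newnbatch;
        int      ninmemory = NDUPS + nshared;
        int      nstayed = (newbatchno == batchno) ? NDUPS : 0;
        int      nfreed;
        int      nnew = 0;

        /*
         * Re-bucket the batch: every tuple either stays in batchno or
         * moves to batchno + nbatch; the duplicates all land together.
         */
        for (int i = 0; i < nshared; i++)
        {
            uint32_t b = hash_u32(keys[i]) % newnbatch;

            if (b == batchno)
                nstayed++;              /* stayed in the old batch */
            if (b == newbatchno)
                keys[nnew++] = keys[i]; /* still with the duplicates */
        }
        nfreed = ninmemory - nstayed;

        /*
         * The current give-up rule: disable further growth only if the
         * split moved none, or all, of the batch's tuples.
         */
        if (nfreed == 0 || nfreed == ninmemory)
            growEnabled = false;

        nbatch = newnbatch;
        batchno = newbatchno;
        nshared = nnew;
    }

    printf("gave up at nbatch = %u (%d uniform keys left with the dups)\n",
           nbatch, nshared);
    return 0;
}

Since that test can only fire once a split moves none or all of the
overflowing batch's tuples, the doubling should continue until the
duplicates have a batch almost to themselves, i.e. until nbatch is in
the thousands for 1,000 uniform keys, mirroring the EXPLAIN output
below.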
SET max_parallel_workers_per_gather = 0;
SET synchronize_seqscans = off;
SET work_mem = '4MB';

CREATE TABLE r AS SELECT generate_series(1, 10000000)::int i;
ANALYZE r;

-- 1k uniform keys + 1m duplicates
CREATE TABLE s1k (i int);
INSERT INTO s1k SELECT generate_series(1, 1000)::int i;
ALTER TABLE s1k SET (autovacuum_enabled = off);
ANALYZE s1k;
INSERT INTO s1k SELECT 42 FROM generate_series(1, 1000000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM r JOIN s1k USING (i);

  Buckets: 1048576 (originally 1048576)  Batches: 4096 (originally 16)  Memory Usage: 35157kB

-- 10k uniform keys + 1m duplicates
CREATE TABLE s10k (i int);
INSERT INTO s10k SELECT generate_series(1, 10000)::int i;
ALTER TABLE s10k SET (autovacuum_enabled = off);
ANALYZE s10k;
INSERT INTO s10k SELECT 42 FROM generate_series(1, 1000000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM r JOIN s10k USING (i);

  Buckets: 131072 (originally 131072)  Batches: 32768 (originally 16)  Memory Usage: 35157kB

See how the number of batches is determined by the number of uniform
keys in the hashed table (s1k or s10k)?  That's because the explosion
unfolds until there is *nothing left* in the problem batch but keys
that hash to the same value, which means the uniform keys have to keep
spreading out until there is something on the order of two batches per
key.  The point is that the batch count is bounded only by the input
data (or eventually by INT_MAX / 2 and MaxAllocSize), and, as Tomas
has illuminated, batches eat unmetered memory.  Ouch.

Attached is a quick hack (sketched below) to show that a 95% cut-off
fixes those examples.  I don't really know how to choose the number,
but I suspect it should be much closer to 100 than 50.

I think this is the easiest of three fundamental problems that need to
be solved in this area.  The others are accounting for per-partition
overheads, as Tomas pointed out, and providing an actual fallback
strategy that respects work_mem when extreme skew is detected or when
per-partition overheads dominate.  When time permits, I plan to
experiment with nested loop hash join (or whatever you want to call
it: the thing where you join every arbitrary fragment of the hash
table against the outer batch, and somehow deal with outer match
flags).
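For the archives, since the attachment may not render inline: the hack
amounts to loosening that all-or-nothing test.  A minimal sketch in
terms of the nfreed/ninmemory counters that ExecHashIncreaseNumBatches
already maintains (split_was_useful and EXTREME_SKEW_CUTOFF are names
invented here, and whether the cut-off should also apply at the high
end is part of the question of how to choose the number; the attached
fix.patch is the real thing):

#include <stdbool.h>
#include <stdio.h>

/*
 * nfreed is the number of tuples a split moved out of the overflowing
 * batch; ninmemory is the number it started with.  Give up on further
 * repartitioning when a split is (nearly) all-or-nothing.
 */
#define EXTREME_SKEW_CUTOFF 0.95

static bool
split_was_useful(int nfreed, int ninmemory)
{
    double frac = (double) nfreed / ninmemory;

    if (nfreed == 0 || nfreed == ninmemory)
        return false;           /* the existing rule */
    if (frac <= 1.0 - EXTREME_SKEW_CUTOFF || frac >= EXTREME_SKEW_CUTOFF)
        return false;           /* close enough to all-or-nothing */
    return true;
}

int
main(void)
{
    /*
     * Roughly the first split in the s1k example, if the duplicates
     * stay put: only ~31 of ~1,000,062 tuples leave the batch, so
     * this prints 0 (give up) rather than doubling onward.
     */
    printf("split useful? %d\n", split_was_useful(31, 1000062));
    return 0;
}

In the examples above the million duplicates dominate ninmemory, so the
very first split frees either well under 5% of the batch (duplicates
stayed put) or well over 95% (duplicates moved out), and growth stops
almost immediately instead of thousands of doublings later.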
[1] https://www.postgresql.org/message-id/flat/CAG_%3D8kBoWY4AXwW%3DCj44xe13VZnYohV9Yr-_hvZdx2xpiipr9w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh%40development

--
Thomas Munro
https://enterprisedb.com

Attachment: fix.patch (binary)