Hello,

As discussed elsewhere[1][2], our algorithm for deciding when to give
up on repartitioning (AKA increasing the number of batches) tends to
keep going until it has a number of batches that is a function of the
number of distinct well-distributed keys.  I wanted to move this minor
issue away from Tomas Vondra's thread[2], since it's a mostly
independent problem.
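To see the mechanism without firing up a database, here is a throwaway
simulation of the rule in ExecHashIncreaseNumBatches
(src/backend/executor/nodeHash.c).  Everything in it (the hash
function, the tuple counts, the memory budget) is a made-up stand-in;
the only part modelled on the real code is the all-or-nothing give-up
test near the bottom of the loop.

/*
 * batch_sim.c: simulate hash join batch splitting with one heavily
 * duplicated key plus some uniform keys, tracking only the overflowing
 * batch (the one holding the duplicates).
 *
 *   cc -O2 -o batch_sim batch_sim.c && ./batch_sim
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NKEYS    1000        /* distinct well-distributed keys */
#define NDUPS    1000000     /* duplicates of the one skewed key */
#define SKEW_KEY 42
#define BUDGET   50000       /* stand-in for work_mem, in tuples */

/* stand-in for PostgreSQL's hash function; any decent mixer will do */
static uint32_t
hash_u32(uint32_t k)
{
    k ^= k >> 16;
    k *= 0x85ebca6b;
    k ^= k >> 13;
    k *= 0xc2b2ae35;
    k ^= k >> 16;
    return k;
}

int
main(void)
{
    uint32_t keys[NKEYS];    /* uniform keys sharing the dups' batch */
    uint32_t nbatch = 16;
    uint32_t batchno = hash_u32(SKEW_KEY) % nbatch;
    int      nshared = 0;
    bool     growEnabled = true;

    /* collect the uniform keys that start out in the skewed batch */
    for (int k = 1; k <= NKEYS; k++)
        if (hash_u32(k) % nbatch == batchno)
            keys[nshared++] = k;

    /* keep splitting while the skewed batch exceeds the budget */
    while (growEnabled && NDUPS + nshared > BUDGET && nbatch < (1U << 30))
    {
        uint32_t newnbatch = nbatch * 2;
        uint32_t newbatchno = hash_u32(SKEW_KEY) % newnbatch;
        int      ninmemory = NDUPS + nshared;
        int      nstayed = (newbatchno == batchno) ? NDUPS : 0;
        int      nfreed;
        int      nnew = 0;

        /*
         * Re-bucket the batch: every tuple either stays in batchno or
         * moves to batchno + nbatch; the duplicates all land together.
         */
        for (int i = 0; i < nshared; i++)
        {
            uint32_t b = hash_u32(keys[i]) % newnbatch;

            if (b == batchno)
                nstayed++;              /* stayed in the old batch */
            if (b == newbatchno)
                keys[nnew++] = keys[i]; /* still with the duplicates */
        }
        nfreed = ninmemory - nstayed;

        /*
         * The current give-up rule: disable further growth only if the
         * split moved none, or all, of the batch's tuples.
         */
        if (nfreed == 0 || nfreed == ninmemory)
            growEnabled = false;

        nbatch = newnbatch;
        batchno = newbatchno;
        nshared = nnew;
    }

    printf("gave up at nbatch = %u (%d uniform keys left with the dups)\n",
           nbatch, nshared);
    return 0;
}

Since that test can only fire once a split moves none or all of the
overflowing batch's tuples, the doubling should continue until the
duplicates have a batch almost to themselves, i.e. until nbatch is in
the thousands for 1,000 uniform keys, mirroring the EXPLAIN output
below.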
SET max_parallel_workers_per_gather = 0;
SET synchronize_seqscans = off;
SET work_mem = '4MB';

CREATE TABLE r AS SELECT generate_series(1, 10000000)::int i;
ANALYZE r;

-- 1k uniform keys + 1m duplicates
CREATE TABLE s1k (i int);
INSERT INTO s1k SELECT generate_series(1, 1000)::int i;
ALTER TABLE s1k SET (autovacuum_enabled = off);
ANALYZE s1k;
INSERT INTO s1k SELECT 42 FROM generate_series(1, 1000000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM r JOIN s1k USING (i);

  Buckets: 1048576 (originally 1048576)  Batches: 4096 (originally 16)  Memory Usage: 35157kB

-- 10k uniform keys + 1m duplicates
CREATE TABLE s10k (i int);
INSERT INTO s10k SELECT generate_series(1, 10000)::int i;
ALTER TABLE s10k SET (autovacuum_enabled = off);
ANALYZE s10k;
INSERT INTO s10k SELECT 42 FROM generate_series(1, 1000000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM r JOIN s10k USING (i);

  Buckets: 131072 (originally 131072)  Batches: 32768 (originally 16)  Memory Usage: 35157kB

See how the number of batches is determined by the number of uniform
keys in the hashed table (s1k or s10k)?  That's because the explosion
unfolds until there is *nothing left* in the problem batch but keys
that hash to the same value, which means the uniform keys have to keep
spreading out until there is something on the order of two batches per
key.  The point is that the batch count is bounded only by the input
data (or eventually by INT_MAX / 2 and MaxAllocSize), and, as Tomas
has illuminated, batches eat unmetered memory.  Ouch.

Attached is a quick hack (sketched below) to show that a 95% cut-off
fixes those examples.  I don't really know how to choose the number,
but I suspect it should be much closer to 100 than 50.

I think this is the easiest of three fundamental problems that need to
be solved in this area.  The others are accounting for per-partition
overheads, as Tomas pointed out, and providing an actual fallback
strategy that respects work_mem when extreme skew is detected or when
per-partition overheads dominate.  When time permits, I plan to
experiment with nested loop hash join (or whatever you want to call
it: the thing where you join every arbitrary fragment of the hash
table against the outer batch, and somehow deal with outer match
flags).
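For the archives, since the attachment may not render inline: the hack
amounts to loosening that all-or-nothing test.  A minimal sketch in
terms of the nfreed/ninmemory counters that ExecHashIncreaseNumBatches
already maintains (split_was_useful and EXTREME_SKEW_CUTOFF are names
invented here, and whether the cut-off should also apply at the high
end is part of the question of how to choose the number; the attached
fix.patch is the real thing):

#include <stdbool.h>
#include <stdio.h>

/*
 * nfreed is the number of tuples a split moved out of the overflowing
 * batch; ninmemory is the number it started with.  Give up on further
 * repartitioning when a split is (nearly) all-or-nothing.
 */
#define EXTREME_SKEW_CUTOFF 0.95

static bool
split_was_useful(int nfreed, int ninmemory)
{
    double frac = (double) nfreed / ninmemory;

    if (nfreed == 0 || nfreed == ninmemory)
        return false;           /* the existing rule */
    if (frac <= 1.0 - EXTREME_SKEW_CUTOFF || frac >= EXTREME_SKEW_CUTOFF)
        return false;           /* close enough to all-or-nothing */
    return true;
}

int
main(void)
{
    /*
     * Roughly the first split in the s1k example, if the duplicates
     * stay put: only ~31 of ~1,000,062 tuples leave the batch, so
     * this prints 0 (give up) rather than doubling onward.
     */
    printf("split useful? %d\n", split_was_useful(31, 1000062));
    return 0;
}

In the examples above the million duplicates dominate ninmemory, so the
very first split frees either well under 5% of the batch (duplicates
stayed put) or well over 95% (duplicates moved out), and growth stops
almost immediately instead of thousands of doublings later.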
[1] https://www.postgresql.org/message-id/flat/CAG_%3D8kBoWY4AXwW%3DCj44xe13VZnYohV9Yr-_hvZdx2xpiipr9w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh%40development

--
Thomas Munro
https://enterprisedb.com

Attachment: fix.patch (binary)