Hi,
Here are a couple of draft patches fixing the bug:
- 0001 adds the missing size_t cast, to fix the overflow
- 0002 fixes the balancing, by adjusting the hash table size limit
- 0003 adds the missing overflow protection for nbuckets and the hash
table limit
- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple of months, I found it overly verbose / not very clear.
I'm sure it could be improved even further.
0001 and 0002 are pretty trivial, 0003 is a bit bigger, but most of the
code is simply how we clamp nbuckets elsewhere (more or less).
At some point I started wondering whether it would have been simpler to
use the algebraic solution posted by James Hunter back in February [1].
It would not need the loop, but it would still need all this new
overflow protection etc.
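For reference, this is how I understand that approach (a sketch, not
necessarily exactly what James proposed): the total memory usage
    total(nbatch) = (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
is minimal where the two terms are equal, i.e. at
    nbatch = sqrt(inner_rel_bytes / (2 * BLCKSZ))
so nbatch could be computed directly (and then rounded to a power of two
and clamped the usual way), instead of walking back in a loop.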
I wanted to make sure the patches actually make it work correctly, so I
created a table with 4B rows:
create table t (a bigint, b text);
insert into t select i, md5(i::text)
from generate_series(1,4000000000) s(i);
and I added this log message at the end of ExecChooseHashTableSize:
elog(WARNING, "wm %d nbatch %d nbucket %d hash %ld total %ld",
     work_mem, nbatch, nbuckets, (*space_allowed)/1024,
     (*space_allowed + 2 * nbatch * (Size) BLCKSZ)/1024);
and I ran an explain on a self-join
set enable_mergejoin = off;
set max_parallel_workers_per_gather = 0;
set work_mem = '...';
explain select * from t t1 join t t2 on (t1.a = t2.a);
with work_mem set to values between 64kB and 1GB.
On 18.0 I got this:
wm 64 nbatch 8 nbucket 2097152 hash 131072 total 131200
wm 128 nbatch 16 nbucket 4194304 hash 262144 total 262400
wm 256 nbatch 32 nbucket 8388608 hash 524288 total 524800
wm 512 nbatch 64 nbucket 16777216 hash 1048576 total 1049600
wm 1024 nbatch 128 nbucket 33554432 hash 2097152 total 2099200
wm 2048 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
wm 4096 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 8192 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 16384 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
I wanted to know how serious the issue is, compared to what would
happen without the balancing. So I disabled the balancing (by skipping
the loop), and got this:
wm 64 nbatch 8192 nbucket 2048 hash 128 total 131200
wm 128 nbatch 16384 nbucket 4096 hash 256 total 262400
wm 256 nbatch 32768 nbucket 8192 hash 512 total 524800
wm 512 nbatch 65536 nbucket 16384 hash 1024 total 1049600
wm 1024 nbatch 131072 nbucket 32768 hash 2048 total 2099200
wm 2048 nbatch 131072 nbucket 65536 hash 4096 total 2101248
wm 4096 nbatch 65536 nbucket 131072 hash 8192 total 1056768
wm 8192 nbatch 32768 nbucket 262144 hash 16384 total 540672
wm 16384 nbatch 16384 nbucket 524288 hash 32768 total 294912
wm 32768 nbatch 8192 nbucket 1048576 hash 65536 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
The interesting bit is that the expected total memory usage (the last
number in the log line) is exactly the same as on 18.0, i.e. the totals
are identical with and without the balancing. IIUC that's because the
"stop" condition uses the initial hash table size. It makes me a bit
less worried about this triggering OOM crashes - the balancing does not
improve the behavior, but it doesn't use more memory than before either.
Still an embarrassing bug, though.
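For example, at wm=64 the no-balancing case splits that into 128kB of
hash table + 2 * 8192 * 8kB = 131072kB of batch buffers, while 18.0 ends
up with 131072kB of hash table + 2 * 8 * 8kB = 128kB of buffers - the
same 131200kB total, just distributed differently.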
With the attached patches, it looks like this:
wm 64 nbatch 256 nbucket 65536 hash 4096 total 8192
wm 128 nbatch 512 nbucket 131072 hash 8192 total 16384
wm 256 nbatch 1024 nbucket 262144 hash 16384 total 32768
wm 512 nbatch 2048 nbucket 524288 hash 32768 total 65536
wm 1024 nbatch 4096 nbucket 1048576 hash 65536 total 131072
wm 2048 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 4096 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 8192 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 16384 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
So, this time it actually seems to work correctly and significantly
reduces the memory usage ...
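To double-check this is really just about the loop, here's a trivial
standalone simulation of the balancing loop alone (a sketch - it ignores
nbuckets, the overflow checks from 0003 and everything else, and assumes
the default 8kB BLCKSZ), seeded with the wm=64kB starting point from the
no-balancing output (nbatch 8192, 128kB hash table). If I did the
arithmetic right, without the 0002 fix it ends at nbatch 8 / 131072kB,
and with the fix at nbatch 256 / 4096kB, matching the first rows above:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BLCKSZ 8192

/* simulate just the balancing loop, with/without the 0002 fix */
static void
simulate(int nbatch, size_t hash_table_bytes, bool fixed)
{
	size_t		space_allowed = hash_table_bytes;

	while (nbatch > 0)
	{
		size_t		current_space = hash_table_bytes + (2 * nbatch * (size_t) BLCKSZ);
		size_t		new_space = hash_table_bytes * 2 + (nbatch * (size_t) BLCKSZ);

		if (current_space < new_space)
			break;

		nbatch /= 2;
		if (fixed)
			hash_table_bytes *= 2;	/* the one-line fix from 0002 */
		space_allowed *= 2;
	}

	printf("%s nbatch %d hash %zu total %zu\n",
		   fixed ? "patched:" : "18.0:   ",
		   nbatch, space_allowed / 1024,
		   (space_allowed + 2 * nbatch * (size_t) BLCKSZ) / 1024);
}

int
main(void)
{
	simulate(8192, 128 * 1024, false);	/* 18.0 behavior */
	simulate(8192, 128 * 1024, true);	/* with 0002 applied */
	return 0;
}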
There's one weird thing remaining - if you look at nbatch, it actually
increases for the first couple of work_mem steps. That's weird, because
after increasing work_mem we should need *fewer* batches. But this has
nothing to do with the balancing, it happens even with it disabled.
The reason is that when calculating nbatch we do this:
dbatch = Min(dbatch, max_pointers);
and max_pointers is calculated from work_mem (among other things). It's
a bit funny that the logic worries about how many batch pointers we have
and refuses to allow more, but at the same time it ignores the BufFiles.
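To illustrate the effect of that cap, here's a toy model (made-up
numbers - it ignores the skew reservation, bucket overhead and the
rounding to powers of two, it assumes the default hash_mem_multiplier
of 2 and 8-byte pointers, and the ~250GB inner relation size is just a
guess, so the values won't match the log above). It shows the same
pattern, though: nbatch grows with work_mem for as long as the cap is
the binding limit, and only starts decreasing once the "real" estimate
takes over.

#include <stdio.h>

int
main(void)
{
	const double inner_rel_bytes = 250.0 * 1024 * 1024 * 1024;	/* a guess */

	for (long wm_kb = 64; wm_kb <= 8192; wm_kb *= 2)
	{
		double		hash_table_bytes = wm_kb * 1024.0 * 2;	/* hash_mem_multiplier = 2 */
		double		max_pointers = hash_table_bytes / 8;	/* sizeof(HashJoinTuple) */
		double		dbatch = inner_rel_bytes / hash_table_bytes;

		if (dbatch > max_pointers)
			dbatch = max_pointers;	/* the clamp in question */

		printf("wm %6ldkB  nbatch ~%.0f%s\n", wm_kb, dbatch,
			   (dbatch == max_pointers) ? "  (capped)" : "");
	}

	return 0;
}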
AFAICS it's harmless - we may pick a low number of batches initially, but
then later we'll ramp it up (and the balancing will work too). And if
you choose to run huge hash joins with tiny work_mem, I guess you're in
for the suffering anyway. In any case, it's unrelated to balancing.
regards
[1]
https://www.postgresql.org/message-id/CAJVSvF6290rJF2MtgSx_SuT9Kn2amZ_%2BzecoZYMU%2Bdn3BVVaZg%40mail.gmail.com
--
Tomas Vondra
From d1f0bdc6c4a666fa9e7d4bb1836d656bfec6b4a3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Wed, 24 Sep 2025 22:03:30 +0200
Subject: [PATCH vfix 4/4] reword balancing comment
---
src/backend/executor/nodeHash.c | 64 +++++++++++----------------------
1 file changed, 21 insertions(+), 43 deletions(-)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index b2be2091fb2..145c19db8f8 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -851,59 +851,37 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
/*
* Optimize the total amount of memory consumed by the hash node.
*
- * The nbatch calculation above focuses on the size of the in-memory hash
- * table, assuming no per-batch overhead. Now adjust the number of batches
- * and the size of the hash table to minimize total memory consumed by the
- * hash node.
- *
- * Each batch file has a BLCKSZ buffer, and we may need two files per
- * batch (inner and outer side). So with enough batches this can be
- * significantly more memory than the hashtable itself.
+ * The nbatch calculation above focuses on the in-memory hash table,
+ * assuming no per-batch overhead. But each batch may have two files, each
+ * with a BLCKSZ buffer. With enough batches these buffers may use
+ * significantly more memory than the hash table.
*
* The total memory usage may be expressed by this formula:
*
- * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+ * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
*
* where (inner_rel_bytes / nbatch) is the size of the in-memory hash
* table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
- * buffers. But for sufficiently large values of inner_rel_bytes value
- * there may not be a nbatch value that would make both parts fit into
- * hash_table_bytes.
- *
- * In this case we can't enforce the memory limit - we're going to exceed
- * it. We can however minimize the impact and use as little memory as
- * possible. (We haven't really enforced it before either, as we simply
- * ignored the batch files.)
+ * buffers. We can't optimize both parts at the same time, and for large
+ * inner_rel_bytes values there may not be a nbatch value that would keep
+ * memory usage below the memory limit (work_mem * hash_mem_multiplier).
*
- * The formula for total memory usage says that given an inner relation of
- * size inner_rel_bytes, we may divide it into an arbitrary number of
- * batches. This determines both the size of the in-memory hash table and
- * the amount of memory needed for batch files. These two terms work in
- * opposite ways - when one decreases, the other increases.
+ * In this case we're going to exceed the memory limit no matter what. We
+ * can at least minimize the impact and use as little memory as possible.
+ * (We haven't enforced the limit before either, we simply ignored the
+ * batch files.)
*
- * For low nbatch values, the hash table takes most of the memory, but at
- * some point the batch files start to dominate. If you combine these two
- * terms, the memory consumption (for a fixed size of the inner relation)
- * has a u-shape, with a minimum at some nbatch value.
- *
- * Our goal is to find this nbatch value, minimizing the memory usage. We
- * calculate the memory usage with half the batches (i.e. nbatch/2), and
- * if it's lower than the current memory usage we know it's better to use
- * fewer batches. We repeat this until reducing the number of batches does
- * not reduce the memory usage - we found the optimum. We know the optimum
- * exists, thanks to the u-shape.
+ * These two terms in the formula work in opposite ways - one decreases,
+ * the other increases. The plot has a u-shape, with a minimum at some
+ * nbatch value. We find the minimum by "walking back" - check if using
+ * (nbatch/2) batches would lower the memory usage. We stop when the
+ * memory usage would not improve.
*
* We only want to do this when exceeding the memory limit, not every
- * time. The goal is not to minimize memory usage in every case, but to
- * minimize the memory usage when we can't stay within the memory limit.
- *
- * For this reason we only consider reducing the number of batches. We
- * could try the opposite direction too, but that would save memory only
- * when most of the memory is used by the hash table. And the hash table
- * was used for the initial sizing, so we shouldn't be exceeding the
- * memory limit too much. We might save memory by using more batches, but
- * it would result in spilling more batch files, which does not seem like
- * a great trade off.
+ * time. This is why we only try reducing the number of batches. We could
+ * try increasing nbatch too, but that would lower memory usage only when
+ * most of the memory is used by the hash table. It might even force
+ * batching when not needed. We don't want that.
*
* While growing the hashtable, we also adjust the number of buckets, to
* not have more than one tuple per bucket (load factor 1). We can only do
--
2.51.0
From 7f6d163d99651406d36bef712f6fc87f1e3e801c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Wed, 24 Sep 2025 14:56:14 +0200
Subject: [PATCH vfix 3/4] nbuckets overflow protection
---
src/backend/executor/nodeHash.c | 44 ++++++++++++++++++++++++++++++---
1 file changed, 40 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 972b5b58583..b2be2091fb2 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -925,14 +925,50 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
break;
/*
- * It's better to use half the batches, so do that and adjust the
- * nbucket in the opposite direction, and double the allowance.
+ * Make sure hash_table_bytes doesn't overflow. If it would, stop and
+ * use the current nbatch value.
+ */
+ if (hash_table_bytes > SIZE_MAX / 2)
+ break;
+
+ /*
+ * It's better to use half the batches, so do that, and adjust the
+ * allowance in the opposite direction. Then adjust the nbuckets to
+ * reflect the higher allowance, but be careful about overflow.
*/
nbatch /= 2;
- nbuckets *= 2;
hash_table_bytes *= 2;
-
*space_allowed = (*space_allowed) * 2;
+
+ /*
+ * Adjust nbuckets, with overflow protection.
+ *
+ * Recalculate max_pointers, because the earlier value was based on a
+ * lower hash_table_bytes, before we adjusted that.
+ *
+ * XXX Maybe this is too much, and we need to care only about
+ * MaxAllocSize? Then we'd not need to recalculate this over and over
+ * in every loop.
+ *
+ * XXX An alternative would be to not grow nbuckets, and stick to the
+ * originally planned value. The cost is we'll increase the load
+ * factor (tuples per bucket).
+ */
+ max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
+ max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
+ /* If max_pointers isn't a power of 2, must round it down to one */
+ max_pointers = pg_prevpower2_size_t(max_pointers);
+
+ /* Also ensure we avoid integer overflow in nbatch and nbuckets */
+ /* (this step is redundant given the current value of MaxAllocSize) */
+ max_pointers = Min(max_pointers, INT_MAX / 2 + 1);
+
+ /* don't break, just cap the number of buckets */
+ dbuckets = (double) nbuckets * 2;
+ dbuckets = Min(dbuckets, max_pointers);
+
+ nbuckets = (int) dbuckets;
+ nbuckets = pg_nextpower2_32(nbuckets);
}
Assert(nbuckets > 0);
--
2.51.0
From a7995c8e6a149e3444e7e48b618a117cda588770 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Wed, 24 Sep 2025 14:32:29 +0200
Subject: [PATCH vfix 2/4] adjust hash_table_bytes for balancing
---
src/backend/executor/nodeHash.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index ef589b90339..972b5b58583 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -930,6 +930,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
*/
nbatch /= 2;
nbuckets *= 2;
+ hash_table_bytes *= 2;
*space_allowed = (*space_allowed) * 2;
}
--
2.51.0
From 4f128c5cbdb11c4d9b1882d39b75338c854ba30e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Wed, 24 Sep 2025 14:30:31 +0200
Subject: [PATCH vfix 1/4] fix size overflow
---
src/backend/executor/nodeHash.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..ef589b90339 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -913,10 +913,12 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
while (nbatch > 0)
{
/* how much memory are we using with current nbatch value */
- size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+ size_t current_space =
+ hash_table_bytes + (2 * nbatch * (size_t) BLCKSZ);
/* how much memory would we use with half the batches */
- size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+ size_t new_space =
+ hash_table_bytes * 2 + (nbatch * (size_t) BLCKSZ);
/* If the memory usage would not decrease, we found the optimum. */
if (current_space < new_space)
@@ -995,7 +997,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
* How much additional memory would doubling nbatch use? Each batch may
* require two buffered files (inner/outer), with a BLCKSZ buffer.
*/
- size_t batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+ size_t batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
/*
* Compare the new space needed for doubling nbatch and for enlarging the
--
2.51.0