On 10/9/25 16:30, Tomas Vondra wrote:
> On 10/9/25 16:16, Melanie Plageman wrote:
>> On Wed, Oct 8, 2025 at 4:46 PM Tomas Vondra <[email protected]> wrote:
>>>
>>> On 10/8/25 19:37, Melanie Plageman wrote:
>>>
>>>> Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:
>>>>
>>>> hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
>>>> sizeof(HashJoinTuple) / 2
>>>
>>> But the hash table is not allocated as a single chunk of memory, so I
>>> think MaxAllocSize would be the wrong thing to use here, no? Hash table
>>> is a collection of tuples (allocated one by one), that's why
>>> get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
>>> nbuckets, because that indeed is allocated as a contiguous array, ofc.
>>>
>>> Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
>>> we get
>>>
>>> hash_table_bytes > MaxAllocSize / 2
>>>
>>> but again, that doesn't make much sense - the hash table can be larger,
>>> it's not a single palloc.
>>
>> It came from the earlier clamping of nbuckets:
>>
>> max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>> max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
>> dbuckets = Min(dbuckets, max_pointers);
>>
>> I don't really get why this divides hash_table_bytes by
>> sizeof(HashJoinTuple) since, as you say, hash_table_bytes is supposed
>> to accommodate both the bucket_bytes and inner_rel_bytes.
>>
>
> I think that's simply to allocate enough buckets for the expected
> number of tuples, not more. And cap it to MaxAllocSize. Some of this may
> be redundant, I suppose.
>
>>> And I think it should be just
>>>
>>> if (hash_table_bytes > SIZE_MAX / 2)
>>> break;
>>>
>>> as a protection against hash_table_bytes overflowing SIZE_MAX (which on
>>> 64-bits seems implausible, but could happen on 32-bit builds).
>>
>> That's roughly the check I ended up with -- except I used
>> space_allowed because it will be larger than hash_table_bytes if there
>> is a skew hashtable. I've updated the comment and such in attached v3.
>>
>
> Ah, thanks. I was just hacking on this too, but I'll switch to your v3.
>
Here's a v4, which is your v3 with a couple minor adjustments:
1) A couple of comments adjusted. It feels a bit too audacious to correct
comments written by a native speaker, but it seems cleaner to me this way.
2) I think the (nbatch < hash_table_bytes / BLCKSZ) condition really
needs to check space_allowed, because that's the whole space, before
subtracting the skew buckets. And we also check the overflow for
space_allowed earlier, not hash_table_bytes.
3) Removed the hash_table_bytes doubling, because it's no longer needed
by the loop (or by anything after it).
4) Added the doubling of num_skew_mcvs, along with overflow protection
for it.
You suggested this in another message:
> We do need to recalculate num_skew_mcvs once at the end before
> returning.
But I think the doubling has the same effect, right? I don't want to
redo the whole "if (useskew) { ... }" block at the end.
I wonder if maybe it'd be better to just merge all the overflow checks
into a single "if (a || b || c)" check. These separate checks seem quite
verbose. I'll see tomorrow.
I've done a bit more testing today. I went a bit overboard and created a
table with 20B rows, which actually pushes nbatch high enough to hit the
initial issue (overflow in the size calculation). And it works sanely
with the v4 patch (and v3 too).
I guess I could have tweaked the table stats, or just manipulated the
values at the beginning of the function. But that wouldn't be as much fun.
regards
--
Tomas Vondra
From d3c67d91f68cadec7c395da933d6e56ba600b94d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Fri, 10 Oct 2025 00:38:05 +0200
Subject: [PATCH v4 2/2] adjustments
---
src/backend/executor/nodeHash.c | 43 ++++++++++++++++++++++-----------
1 file changed, 29 insertions(+), 14 deletions(-)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 0f815674c9f..264c5b7ed40 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -883,37 +883,51 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
*
* Note that we can only change nbuckets during initial hashtable sizing.
* Once we start building the hash, nbuckets is fixed.
+ *
+ * We double a number of parameters (space_allowed, nbuckets and
+ * num_skew_mcvs), which introduces a risk of overflow. We simply stop
+ * the loop if that would happen. We could do something smarter (e.g. cap
+ * nbuckets and continue), but the complexity is not worth it. Overflow is
+ * extremely rare, and this is a best-effort attempt to reduce memory usage.
*/
while (nbatch > 1)
{
/*
- * Ensure that nbuckets can be allocated. nbuckets may have been
- * increased above hash_table_size / sizeof(HashJoinTuple) if work_mem
- * was very small, so checking if MaxAllocSize is enough for the
- * buckets is the safest check.
+ * Check that nbuckets won't overflow MaxAllocSize.
+ *
+ * With extremely low work_mem values, nbuckets may have been set
+ * higher than hash_table_bytes / sizeof(HashJoinTuple). We don't try
+ * to correct that here; we accept that nbuckets may be oversized.
*/
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;
/*
- * We check that space_allowed wouldn't overflow because it will be
- * larger than hash_table_bytes if there is a skew hashtable.
+ * Check that space_allowed won't overflow.
+ *
+ * We don't check hash_table_bytes, because we subtracted space for
+ * skew hashtable from it (so it may not be equal to space_allowed).
*/
if ((*space_allowed) > SIZE_MAX / 2)
break;
/*
- * This is the same as:
+ * Check that num_skew_mcvs won't overflow.
+ */
+ if ((*num_skew_mcvs) > INT_MAX / 2)
+ break;
+
+ /*
+ * Will halving the number of batches and doubling the size of the
+ * hashtable reduce overall memory usage?
*
- * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
- * nbatch * BLCKSZ
+ * This is the same as (S = space_allowed):
*
- * avoiding intermediate overflow.
+ * (S + 2 * nbatch * BLCKSZ) < (S * 2 + nbatch * BLCKSZ)
*
- * which is to say: will halving the number of batches and doubling
- * the size of the hashtable reduce overall memory usage?
+ * but avoiding intermediate overflow.
*/
- if (nbatch < hash_table_bytes / BLCKSZ)
+ if (nbatch < *space_allowed / BLCKSZ)
break;
/*
@@ -921,7 +935,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
* overflowing nbuckets.
*/
nbuckets *= 2;
- hash_table_bytes *= 2;
+
+ *num_skew_mcvs = (*num_skew_mcvs) * 2;
*space_allowed = (*space_allowed) * 2;
nbatch /= 2;
--
2.51.0
From aee3da1fe907ba837a45825a158ab2e2507d3ed2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 7 Oct 2025 13:21:14 -0400
Subject: [PATCH v4 1/2] Fix overflow risk in hashtable size calculation
ci-os-only:
---
src/backend/executor/nodeHash.c | 115 +++++++++++++++-----------------
1 file changed, 55 insertions(+), 60 deletions(-)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..0f815674c9f 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -850,90 +850,85 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
/*
* Optimize the total amount of memory consumed by the hash node.
*
- * The nbatch calculation above focuses on the size of the in-memory hash
- * table, assuming no per-batch overhead. Now adjust the number of batches
- * and the size of the hash table to minimize total memory consumed by the
- * hash node.
- *
- * Each batch file has a BLCKSZ buffer, and we may need two files per
- * batch (inner and outer side). So with enough batches this can be
- * significantly more memory than the hashtable itself.
+ * The nbatch calculation above focuses on the in-memory hash table,
+ * assuming no per-batch overhead. But each batch may have two files, each
+ * with a BLCKSZ buffer. With enough batches these buffers may use
+ * significantly more memory than the hash table.
*
* The total memory usage may be expressed by this formula:
*
- * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+ * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
*
* where (inner_rel_bytes / nbatch) is the size of the in-memory hash
* table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
- * buffers. But for sufficiently large values of inner_rel_bytes value
- * there may not be a nbatch value that would make both parts fit into
- * hash_table_bytes.
- *
- * In this case we can't enforce the memory limit - we're going to exceed
- * it. We can however minimize the impact and use as little memory as
- * possible. (We haven't really enforced it before either, as we simply
- * ignored the batch files.)
- *
- * The formula for total memory usage says that given an inner relation of
- * size inner_rel_bytes, we may divide it into an arbitrary number of
- * batches. This determines both the size of the in-memory hash table and
- * the amount of memory needed for batch files. These two terms work in
- * opposite ways - when one decreases, the other increases.
+ * buffers.
*
- * For low nbatch values, the hash table takes most of the memory, but at
- * some point the batch files start to dominate. If you combine these two
- * terms, the memory consumption (for a fixed size of the inner relation)
- * has a u-shape, with a minimum at some nbatch value.
+ * For very large inner_rel_bytes, there may be no nbatch that keeps total
+ * memory usage under the budget (work_mem * hash_mem_multiplier). In that
+ * case, we choose nbatch to minimize total memory consumption across both
+ * the hashtable and file buffers.
*
- * Our goal is to find this nbatch value, minimizing the memory usage. We
- * calculate the memory usage with half the batches (i.e. nbatch/2), and
- * if it's lower than the current memory usage we know it's better to use
- * fewer batches. We repeat this until reducing the number of batches does
- * not reduce the memory usage - we found the optimum. We know the optimum
- * exists, thanks to the u-shape.
+ * As we increase the size of the hashtable and decrease the number of
+ * batches, the total memory usage follows a U-shaped curve. We find the
+ * minimum nbatch by "walking back" -- checking if halving nbatch would
+ * lower total memory usage. We stop when it no longer helps.
*
- * We only want to do this when exceeding the memory limit, not every
- * time. The goal is not to minimize memory usage in every case, but to
- * minimize the memory usage when we can't stay within the memory limit.
+ * We only want to do this when we are already over the memory budget. We
+ * don't explore increasing nbatch since that could force batching when it
+ * is not needed.
*
- * For this reason we only consider reducing the number of batches. We
- * could try the opposite direction too, but that would save memory only
- * when most of the memory is used by the hash table. And the hash table
- * was used for the initial sizing, so we shouldn't be exceeding the
- * memory limit too much. We might save memory by using more batches, but
- * it would result in spilling more batch files, which does not seem like
- * a great trade off.
+ * While growing the hashtable, we also adjust the number of buckets to
+ * maintain a load factor of NTUP_PER_BUCKET while squeezing tuples back
+ * from batches into the hashtable.
*
- * While growing the hashtable, we also adjust the number of buckets, to
- * not have more than one tuple per bucket (load factor 1). We can only do
- * this during the initial sizing - once we start building the hash,
- * nbucket is fixed.
+ * Note that we can only change nbuckets during initial hashtable sizing.
+ * Once we start building the hash, nbuckets is fixed.
*/
- while (nbatch > 0)
+ while (nbatch > 1)
{
- /* how much memory are we using with current nbatch value */
- size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+ /*
+ * Ensure that nbuckets can be allocated. nbuckets may have been
+ * increased above hash_table_size / sizeof(HashJoinTuple) if work_mem
+ * was very small, so checking if MaxAllocSize is enough for the
+ * buckets is the safest check.
+ */
+ if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
+ break;
- /* how much memory would we use with half the batches */
- size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+ /*
+ * We check that space_allowed wouldn't overflow because it will be
+ * larger than hash_table_bytes if there is a skew hashtable.
+ */
+ if ((*space_allowed) > SIZE_MAX / 2)
+ break;
- /* If the memory usage would not decrease, we found the optimum. */
- if (current_space < new_space)
+ /*
+ * This is the same as:
+ *
+ * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
+ * nbatch * BLCKSZ
+ *
+ * avoiding intermediate overflow.
+ *
+ * which is to say: will halving the number of batches and doubling
+ * the size of the hashtable reduce overall memory usage?
+ */
+ if (nbatch < hash_table_bytes / BLCKSZ)
break;
/*
- * It's better to use half the batches, so do that and adjust the
- * nbucket in the opposite direction, and double the allowance.
+ * MaxAllocSize is sufficiently small that we are not worried about
+ * overflowing nbuckets.
*/
- nbatch /= 2;
nbuckets *= 2;
-
+ hash_table_bytes *= 2;
*space_allowed = (*space_allowed) * 2;
+
+ nbatch /= 2;
}
Assert(nbuckets > 0);
Assert(nbatch > 0);
-
*numbuckets = nbuckets;
*numbatches = nbatch;
}
@@ -994,7 +989,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
* How much additional memory would doubling nbatch use? Each batch may
* require two buffered files (inner/outer), with a BLCKSZ buffer.
*/
- size_t batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+ size_t batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
/*
* Compare the new space needed for doubling nbatch and for enlarging the
--
2.51.0