On Thu, Oct 9, 2025 at 7:36 PM Tomas Vondra <[email protected]> wrote:
>
> 1) A couple comments adjusted. It feels a bit too audacious to correct
> comments written by native speaker, but it seems cleaner to me like this.

I attached a patch with a few more suggested adjustments (0003). The
more substantive tweaks are:

I don't really like that this comment says it is about nbuckets
overflowing MaxAllocSize. "Overflow" means something specific, and this
makes it sound like the nbuckets variable itself will overflow
MaxAllocSize, when what we mean is that nbuckets worth of HashJoinTuples
could exceed MaxAllocSize. You don't have to use my wording, but I'm not
sure about this wording either.

/* Check that nbuckets won't overflow MaxAllocSize */
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
        break;
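
For what it's worth, the condition we are really guarding against is,
conceptually, the (about-to-be-doubled) bucket array exceeding
MaxAllocSize. Just a sketch of the meaning, not proposed code -- the
division form in the patch is presumably written that way so the
multiplication can't overflow on 32-bit builds:

if ((size_t) nbuckets * 2 * sizeof(HashJoinTuple) > MaxAllocSize)
        break;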

Also, upon re-reading the comment we wrote together:

        * With extremely low work_mem values, nbuckets may have been set
        * higher than hash_table_bytes / sizeof(HashJoinTuple). We don't try
        * to correct that here, we accept nbuckets to be oversized.

I'm not so sure it belongs above the nbuckets allocation check.
Perhaps we should move it to where we are doubling nbuckets. Or even
above where we clamp it to 1024.

I'm actually wondering if we want this comment at all. Does it
emphasize an edge case to the point of being confusing? I can imagine
studying it in the future and having no idea what it means.

I've kept it in but moved it in 0003.

> 4) added the doubling of num_skew_mcvs, and also the overflow protection
> for that

Do we really need to check if num_skew_mcvs will overflow? Shouldn't
it always be smaller than nbuckets? Maybe it can be an assert.
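
Something like this is what I have in mind (just a sketch, and it
assumes num_skew_mcvs really is always derived from, and bounded by,
nbuckets at this point):

/* sketch: num_skew_mcvs should always be smaller than nbuckets */
Assert(*num_skew_mcvs < nbuckets);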

> You suggested this in another message:
>
> > We do need to recalculate num_skew_mcvs once at the end before
> > returning.
>
> But I think the doubling has the same effect, right? I don't want to
> redo the whole "if (useskew) { ... }" block at the end.

Yea, it would have to be some kind of helper or something. I worried
that just doubling num_skew_mcvs would drift from a full recalculation
because of integer truncation -- perhaps by a meaningful amount. But
that was just an intuition -- I didn't plug in any numbers and try.
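
To illustrate the kind of truncation I mean (hypothetical numbers, not
the real divisor from ExecChooseHashTableSize):

size_t budget  = 100003;    /* stand-in for the skew hashtable budget */
size_t per_mcv = 96;        /* stand-in for the per-MCV cost */

int    base       = budget / per_mcv;         /* 1041 */
int    doubled    = base * 2;                 /* 2082 */
int    recomputed = (budget * 2) / per_mcv;   /* 2083 */

Whether that gap ever adds up to something meaningful after several
halvings of nbatch is exactly what I didn't check.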

- Melanie
From 516de50b0fe2075955dab2168a7e46f0f9222278 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 7 Oct 2025 13:21:14 -0400
Subject: [PATCH v5 1/3] Fix overflow risk in hashtable size calculation

ci-os-only:
---
 src/backend/executor/nodeHash.c | 115 +++++++++++++++-----------------
 1 file changed, 55 insertions(+), 60 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..0f815674c9f 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -850,90 +850,85 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Optimize the total amount of memory consumed by the hash node.
 	 *
-	 * The nbatch calculation above focuses on the size of the in-memory hash
-	 * table, assuming no per-batch overhead. Now adjust the number of batches
-	 * and the size of the hash table to minimize total memory consumed by the
-	 * hash node.
-	 *
-	 * Each batch file has a BLCKSZ buffer, and we may need two files per
-	 * batch (inner and outer side). So with enough batches this can be
-	 * significantly more memory than the hashtable itself.
+	 * The nbatch calculation above focuses on the in-memory hash table,
+	 * assuming no per-batch overhead. But each batch may have two files, each
+	 * with a BLCKSZ buffer. With enough batches these buffers may use
+	 * significantly more memory than the hash table.
 	 *
 	 * The total memory usage may be expressed by this formula:
 	 *
-	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
 	 *
 	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
 	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
-	 * buffers. But for sufficiently large values of inner_rel_bytes value
-	 * there may not be a nbatch value that would make both parts fit into
-	 * hash_table_bytes.
-	 *
-	 * In this case we can't enforce the memory limit - we're going to exceed
-	 * it. We can however minimize the impact and use as little memory as
-	 * possible. (We haven't really enforced it before either, as we simply
-	 * ignored the batch files.)
-	 *
-	 * The formula for total memory usage says that given an inner relation of
-	 * size inner_rel_bytes, we may divide it into an arbitrary number of
-	 * batches. This determines both the size of the in-memory hash table and
-	 * the amount of memory needed for batch files. These two terms work in
-	 * opposite ways - when one decreases, the other increases.
+	 * buffers.
 	 *
-	 * For low nbatch values, the hash table takes most of the memory, but at
-	 * some point the batch files start to dominate. If you combine these two
-	 * terms, the memory consumption (for a fixed size of the inner relation)
-	 * has a u-shape, with a minimum at some nbatch value.
+	 * For very large inner_rel_bytes, there may be no nbatch that keeps total
+	 * memory usage under the budget (work_mem * hash_mem_multiplier). In that
+	 * case, we choose nbatch to minimize total memory consumption across both
+	 * the hashtable and file buffers.
 	 *
-	 * Our goal is to find this nbatch value, minimizing the memory usage. We
-	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
-	 * if it's lower than the current memory usage we know it's better to use
-	 * fewer batches. We repeat this until reducing the number of batches does
-	 * not reduce the memory usage - we found the optimum. We know the optimum
-	 * exists, thanks to the u-shape.
+	 * As we increase the size of the hashtable and decrease the number of
+	 * batches, the total memory usage follows a U-shaped curve. We find the
+	 * minimum nbatch by "walking back" -- checking if halving nbatch would
+	 * lower total memory usage. We stop when it no longer helps.
 	 *
-	 * We only want to do this when exceeding the memory limit, not every
-	 * time. The goal is not to minimize memory usage in every case, but to
-	 * minimize the memory usage when we can't stay within the memory limit.
+	 * We only want to do this when we are already over the memory budget. We
+	 * don't explore increasing nbatch since that could force batching when it
+	 * is not needed.
 	 *
-	 * For this reason we only consider reducing the number of batches. We
-	 * could try the opposite direction too, but that would save memory only
-	 * when most of the memory is used by the hash table. And the hash table
-	 * was used for the initial sizing, so we shouldn't be exceeding the
-	 * memory limit too much. We might save memory by using more batches, but
-	 * it would result in spilling more batch files, which does not seem like
-	 * a great trade off.
+	 * While growing the hashtable, we also adjust the number of buckets to
+	 * maintain a load factor of NTUP_PER_BUCKET while squeezing tuples back
+	 * from batches into the hashtable.
 	 *
-	 * While growing the hashtable, we also adjust the number of buckets, to
-	 * not have more than one tuple per bucket (load factor 1). We can only do
-	 * this during the initial sizing - once we start building the hash,
-	 * nbucket is fixed.
+	 * Note that we can only change nbuckets during initial hashtable sizing.
+	 * Once we start building the hash, nbuckets is fixed.
 	 */
-	while (nbatch > 0)
+	while (nbatch > 1)
 	{
-		/* how much memory are we using with current nbatch value */
-		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+		/*
+		 * Ensure that nbuckets can be allocated. nbuckets may have been
+		 * increased above hash_table_bytes / sizeof(HashJoinTuple) if work_mem
+		 * was very small, so checking if MaxAllocSize is enough for the
+		 * buckets is the safest check.
+		 */
+		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
+			break;
 
-		/* how much memory would we use with half the batches */
-		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+		/*
+		 * We check that space_allowed wouldn't overflow because it will be
+		 * larger than hash_table_bytes if there is a skew hashtable.
+		 */
+		if ((*space_allowed) > SIZE_MAX / 2)
+			break;
 
-		/* If the memory usage would not decrease, we found the optimum. */
-		if (current_space < new_space)
+		/*
+		 * This is the same as:
+		 *
+		 * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
+		 * nbatch * BLCKSZ
+		 *
+		 * avoiding intermediate overflow.
+		 *
+		 * which is to say: will halving the number of batches and doubling
+		 * the size of the hashtable reduce overall memory usage?
+		 */
+		if (nbatch < hash_table_bytes / BLCKSZ)
 			break;
 
 		/*
-		 * It's better to use half the batches, so do that and adjust the
-		 * nbucket in the opposite direction, and double the allowance.
+		 * MaxAllocSize is sufficiently small that we are not worried about
+		 * overflowing nbuckets.
 		 */
-		nbatch /= 2;
 		nbuckets *= 2;
-
+		hash_table_bytes *= 2;
 		*space_allowed = (*space_allowed) * 2;
+
+		nbatch /= 2;
 	}
 
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);
-
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
 }
@@ -994,7 +989,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
 	 * How much additional memory would doubling nbatch use? Each batch may
 	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
 	 */
-	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+	size_t		batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
 
 	/*
 	 * Compare the new space needed for doubling nbatch and for enlarging the
-- 
2.43.0

From 27544e1504a0f014165e1120e5230e22a424b8b4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Fri, 10 Oct 2025 00:38:05 +0200
Subject: [PATCH v5 2/3] adjustments

---
 src/backend/executor/nodeHash.c | 43 ++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 0f815674c9f..264c5b7ed40 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -883,37 +883,51 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 *
 	 * Note that we can only change nbuckets during initial hashtable sizing.
 	 * Once we start building the hash, nbuckets is fixed.
+	 *
+	 * We will double a number of parameters (space_allowed, nbuckets and
+	 * num_skew_mcvs), which brings the overflow risk. We simply stop the loop
+	 * if that happens. We could do something smarter (e.g. cap nbuckets and
+	 * continue), but the complexity is not worth it. It's extremely rare and
+	 * this is best-effort attempt to reduce memory usage.
 	 */
 	while (nbatch > 1)
 	{
 		/*
-		 * Ensure that nbuckets can be allocated. nbuckets may have been
-		 * increased above hash_table_bytes / sizeof(HashJoinTuple) if work_mem
-		 * was very small, so checking if MaxAllocSize is enough for the
-		 * buckets is the safest check.
+		 * Check that nbuckets won't overflow MaxAllocSize.
+		 *
+		 * With extremely low work_mem values, nbuckets may have been set
+		 * higher than hash_table_bytes / sizeof(HashJoinTuple). We don't try
+		 * to correct that here, we accept nbuckets to be oversized.
 		 */
 		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
 			break;
 
 		/*
-		 * We check that space_allowed wouldn't overflow because it will be
-		 * larger than hash_table_bytes if there is a skew hashtable.
+		 * Check that space_allowed won't overflow.
+		 *
+		 * We don't check hash_table_bytes, because we subtracted space for
+		 * skew hashtable from it (so it may not be equal to space_allowed).
 		 */
 		if ((*space_allowed) > SIZE_MAX / 2)
 			break;
 
 		/*
-		 * This is the same as:
+		 * Check that num_skew_mcvs won't overflow.
+		 */
+		if ((*num_skew_mcvs) > INT_MAX / 2)
+			break;
+
+		/*
+		 * Will halving the number of batches and doubling the size of the
+		 * hashtable reduce overall memory usage?
 		 *
-		 * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
-		 * nbatch * BLCKSZ
+		 * This is the same as (S = space_allowed):
 		 *
-		 * avoiding intermediate overflow.
+		 * (S + 2 * nbatch * BLCKSZ) < (S * 2 + nbatch * BLCKSZ)
 		 *
-		 * which is to say: will halving the number of batches and doubling
-		 * the size of the hashtable reduce overall memory usage?
+		 * but avoiding intermediate overflow.
 		 */
-		if (nbatch < hash_table_bytes / BLCKSZ)
+		if (nbatch < *space_allowed / BLCKSZ)
 			break;
 
 		/*
@@ -921,7 +935,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		 * overflowing nbuckets.
 		 */
 		nbuckets *= 2;
-		hash_table_bytes *= 2;
+
+		*num_skew_mcvs = (*num_skew_mcvs) * 2;
 		*space_allowed = (*space_allowed) * 2;
 
 		nbatch /= 2;
-- 
2.43.0

From 882905cabe42d061aad6b895602710beb5d81297 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Mon, 13 Oct 2025 12:03:54 -0400
Subject: [PATCH v5 3/3] more tweaks

---
 src/backend/executor/nodeHash.c | 39 ++++++++++++++-------------------
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 264c5b7ed40..f562077df5e 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -884,37 +884,27 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * Note that we can only change nbuckets during initial hashtable sizing.
 	 * Once we start building the hash, nbuckets is fixed.
 	 *
-	 * We will double a number of parameters (space_allowed, nbuckets and
-	 * num_skew_mcvs), which brings the overflow risk. We simply stop the loop
-	 * if that happens. We could do something smarter (e.g. cap nbuckets and
-	 * continue), but the complexity is not worth it. It's extremely rare and
-	 * this is best-effort attempt to reduce memory usage.
+	 * We double several parameters (space_allowed, nbuckets, and
+	 * num_skew_mcvs), which introduces a risk of overflow. We avoid this by
+	 * exiting the loop. We could do something smarter (e.g. capping nbuckets
+	 * and continuing), but the complexity is not worth it. Such cases are
+	 * extremely rare, and this is a best-effort attempt to reduce memory
+	 * usage.
 	 */
 	while (nbatch > 1)
 	{
-		/*
-		 * Check that nbuckets won't overflow MaxAllocSize.
-		 *
-		 * With extremely low work_mem values, nbuckets may have been set
-		 * higher than hash_table_bytes / sizeof(HashJoinTuple). We don't try
-		 * to correct that here, we accept nbuckets to be oversized.
-		 */
+		/* Check that nbuckets won't overflow MaxAllocSize */
 		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
 			break;
 
-		/*
-		 * Check that space_allowed won't overflow.
-		 *
-		 * We don't check hash_table_bytes, because we subtracted space for
-		 * skew hashtable from it (so it may not be equal to space_allowed).
-		 */
-		if ((*space_allowed) > SIZE_MAX / 2)
-			break;
+		/* num_skew_mcvs should be less than nbuckets */
+		Assert((*num_skew_mcvs) < INT_MAX / 2);
 
 		/*
-		 * Check that num_skew_mcvs won't overflow.
+		 * space_allowed will exceed hash_table_bytes when there is a
+		 * skew hashtable. Just exit if doubling it would overflow.
 		 */
-		if ((*num_skew_mcvs) > INT_MAX / 2)
+		if ((*space_allowed) > SIZE_MAX / 2)
 			break;
 
 		/*
@@ -933,6 +923,10 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		/*
 		 * MaxAllocSize is sufficiently small that we are not worried about
 		 * overflowing nbuckets.
+		 *
+		 * With extremely low work_mem values, nbuckets may have been set
+		 * higher than hash_table_bytes / sizeof(HashJoinTuple). We don't try
+		 * to correct that here.
 		 */
 		nbuckets *= 2;
 
@@ -944,6 +938,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);
+
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
 }
-- 
2.43.0
