On Tue, Oct 7, 2025 at 7:51 PM Tomas Vondra <[email protected]> wrote:
>
> On 10/7/25 23:10, Melanie Plageman wrote:
>
> Hmm, so if I understand correctly you suggest stopping when nbuckets gets
> too high, while my code allowed reducing nbatches further (and just
> capped nbatches). I'm fine with this change if it makes the code
> simpler; that means we allow ~130M buckets, which seems rather unlikely
> to be a problem in practice. That means 1GB for buckets alone, and
> tuples tend to be a multiple of that. With 4GB total, that's ~256k
> batches. And multiplied by the 130M that would be 3.5e13 tuples ...
>
> However, I don't understand why the condition bothers about INT_MAX?
>
>     /* Ensure that nbuckets * 2 doesn't overflow an int */
>     if (nbuckets > INT_MAX / 2)
>         break;

You're right. This can just be
        if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
            break;
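
For a bit of intuition, here is a rough standalone calculation of what
that guard allows (my numbers, assuming sizeof(HashJoinTuple) == 8 and
MaxAllocSize of 1GB - 1; just a sketch, not executor code):

    #include <stdio.h>

    /* Back-of-the-envelope check of the guard above, not executor code:
     * assumes sizeof(HashJoinTuple) == 8 (it is a pointer type) and
     * MaxAllocSize == 1GB - 1, as on common 64-bit builds. */
    int
    main(void)
    {
        size_t max_alloc = ((size_t) 1 << 30) - 1;   /* MaxAllocSize */
        size_t ptr_size = 8;                         /* sizeof(HashJoinTuple) */
        size_t threshold = max_alloc / ptr_size / 2; /* where we break */

        /* ~67M; after the doubling in the loop body nbuckets tops out
         * around ~134M, so the bucket array (nbuckets * 8 bytes) stays
         * allocatable. */
        printf("break once nbuckets exceeds %zu buckets\n", threshold);
        printf("worst-case nbuckets after doubling: %zu\n", 2 * threshold);
        return 0;
    }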

> It's however true that if we double nbuckets and nbatch at the same
> time, we don't need to bother doing this:
>
>     max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>
> because we can never break. Although, is that really true? When picking
> the initial nbuckets value we do this:
>
>     nbuckets = Max(nbuckets, 1024);
>
> What if this increased the nbuckets (it'd require extremely low work_mem
> of course)? Could it lead to unexpectedly high nbuckets in the loop?

If we are worried about nbuckets exceeding what can be allocated, then
I think the proposed condition above takes care of that:

        if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
            break;

As for whether bumping nbuckets to 1024 before the loop means we can
end up with very high values at low work_mem: it seems like even with
the minimum work_mem, the computed number of buckets is already larger
than 1024, so the clamp shouldn't kick in. When could we hit this?
Maybe if the number of skew buckets is very large?
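
To make that concrete, a quick back-of-the-envelope sketch (mine,
assuming 8-byte bucket pointers and the 64kB work_mem floor with
hash_mem_multiplier = 1; not the real sizing code):

    #include <stdio.h>

    /* Rough sketch, not the real sizing code: even at the 64kB work_mem
     * minimum (hash_mem_multiplier = 1), the bucket count implied by the
     * memory budget is already well above the 1024 floor, so the
     * Max(nbuckets, 1024) clamp shouldn't inflate nbuckets on its own. */
    int
    main(void)
    {
        size_t hash_table_bytes = 64 * 1024;   /* minimum work_mem */
        size_t ptr_size = 8;                   /* sizeof(HashJoinTuple) */

        /* prints 8192, comfortably above the 1024 floor */
        printf("buckets that fit in the minimum budget: %zu\n",
               hash_table_bytes / ptr_size);
        return 0;
    }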

> Similarly, why does this check care about number of buckets?
>
>     if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>         break;
>
> I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
> (hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
> size of the hash table in bytes, not in number of items. Also, see how
> get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
> continue doing that.

Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:

        hash_table_bytes / sizeof(HashJoinTuple) >
                MaxAllocSize / sizeof(HashJoinTuple) / 2

but I don't think we need this, because nbuckets should always be at
least as large as hash_table_bytes / sizeof(HashJoinTuple): to start
with, it was clamped to Max(1024, hash_table_bytes /
sizeof(HashJoinTuple)).

The only exception would be if MaxAllocSize / sizeof(HashJoinTuple)
were smaller than hash_table_bytes / sizeof(HashJoinTuple), in which
case checking nbuckets > MaxAllocSize / sizeof(HashJoinTuple) should
cover that anyway.

> I like the idea of simplifying the stop condition, instead of
> calculating the old/new size, and comparing those. But is this actually
> correct?
>
>     if (nbatch / 4 < hash_table_bytes / BLCKSZ)
>         break;
>
> If we start with the "full" condition
>
>  hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)
>
> subtract hash_bytes from both sides
>
>   (2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)
>
> subtract (nbatch * BLCKSZ)
>
>   nbatch * BLCKSZ < hash_bytes
>
> that gives us
>
>   nbatch < (hash_bytes / BLCKSZ)
>
> So where did the /4 come from?

Yes, I made the mistake of doubling the batches and hash_table_bytes
again, forgetting that the original formula was already comparing the
hypothetical space to the current space.

I think it should be

        if (nbatch < hash_table_bytes / BLCKSZ)

as you say.

> Also, maybe it'd be easier to just do
>   (nbatch * BLCKSZ) < hash_bytes
>
> i.e. without the division.

I prefer the division, to avoid as many potentially overflow-prone
operations as possible. Otherwise we would first have to check that
nbatch * BLCKSZ doesn't overflow.
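
To illustrate the trade-off, here is a sketch with hypothetical
stop_multiply()/stop_divide() helpers (mine, not from the patch). Note
the two forms can disagree by one iteration when hash_table_bytes isn't
a multiple of BLCKSZ, because the division truncates:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MY_BLCKSZ 8192          /* stand-in for BLCKSZ */

    /* Multiplication form: needs a guard so nbatch * BLCKSZ can't
     * overflow. If the product would overflow it is certainly not below
     * hash_table_bytes, so the condition is false. */
    static bool
    stop_multiply(size_t nbatch, size_t hash_table_bytes)
    {
        if (nbatch > SIZE_MAX / MY_BLCKSZ)
            return false;
        return nbatch * MY_BLCKSZ < hash_table_bytes;
    }

    /* Division form: no extra guard needed since both operands stay in
     * range, but the truncating division can disagree with the multiply
     * form when hash_table_bytes is not a multiple of BLCKSZ. */
    static bool
    stop_divide(size_t nbatch, size_t hash_table_bytes)
    {
        return nbatch < hash_table_bytes / MY_BLCKSZ;
    }

    int
    main(void)
    {
        /* Boundary example: 8 full blocks plus a bit. */
        size_t bytes = 8 * (size_t) MY_BLCKSZ + 100;

        /* prints "multiply form stops: 1, divide form stops: 0" */
        printf("multiply form stops: %d, divide form stops: %d\n",
               (int) stop_multiply(8, bytes), (int) stop_divide(8, bytes));
        return 0;
    }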

> > I'm still waiting for the 4 billion row table to be created on my
> > machine, so I haven't verified that I get the same results as you yet.

I have updated my patch to fix the mistakes above. I also noticed that
I wasn't doubling space_allowed in the loop but was instead setting it
to hash_table_bytes at the end. That doesn't produce a power of 2,
because we subtract skew_mcvs from hash_table_bytes. So we have to
keep doubling space_allowed if we want a power of 2 in the end.

I've changed my patch to do this, but it made me wonder whether we
actually want that, or whether we should instead take hash_table_bytes
at the end, round it up to a power of 2, and set space_allowed to that.
If the skew hashtable is large, we may be allocating far more
space_allowed than we need for the new hash_table_bytes plus the skew
hashtable buckets.
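
Roughly what I have in mind, as a standalone sketch with made-up values
(next_pow2() is just a stand-in; in-tree it would presumably be
pg_nextpower2_size_t() from pg_bitutils.h):

    #include <stdio.h>

    /* Standalone illustration of the alternative above (made-up numbers):
     * take the final hash_table_bytes plus the skew hashtable space and
     * round it up to a power of 2 for space_allowed, instead of doubling
     * space_allowed in lockstep with the loop. */
    static size_t
    next_pow2(size_t v)
    {
        size_t p = 1;

        while (p < v)
            p <<= 1;
        return p;
    }

    int
    main(void)
    {
        size_t hash_table_bytes = 3 * 1024 * 1024;  /* hypothetical final value */
        size_t skew_table_bytes = 256 * 1024;       /* hypothetical skew space */

        /* prints 4194304 (4MB), rather than whatever the doubled
         * space_allowed happened to reach */
        printf("space_allowed = %zu\n",
               next_pow2(hash_table_bytes + skew_table_bytes));
        return 0;
    }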

Oh, and I get the same logging output as your patch with the attached v2.

- Melanie
From 716b7c939019294c2e006e55da44180f61072fe2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 7 Oct 2025 13:21:14 -0400
Subject: [PATCH v2] Fix overflow risk in hashtable size calculation

ci-os-only:
---
 src/backend/executor/nodeHash.c | 115 +++++++++++++++-----------------
 1 file changed, 55 insertions(+), 60 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a3415db4e20..e8be61e40ed 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -850,90 +850,85 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Optimize the total amount of memory consumed by the hash node.
 	 *
-	 * The nbatch calculation above focuses on the size of the in-memory hash
-	 * table, assuming no per-batch overhead. Now adjust the number of batches
-	 * and the size of the hash table to minimize total memory consumed by the
-	 * hash node.
-	 *
-	 * Each batch file has a BLCKSZ buffer, and we may need two files per
-	 * batch (inner and outer side). So with enough batches this can be
-	 * significantly more memory than the hashtable itself.
+	 * The nbatch calculation above focuses on the in-memory hash table,
+	 * assuming no per-batch overhead. But each batch may have two files, each
+	 * with a BLCKSZ buffer. With enough batches these buffers may use
+	 * significantly more memory than the hash table.
 	 *
 	 * The total memory usage may be expressed by this formula:
 	 *
-	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ)
 	 *
 	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
 	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
-	 * buffers. But for sufficiently large values of inner_rel_bytes value
-	 * there may not be a nbatch value that would make both parts fit into
-	 * hash_table_bytes.
-	 *
-	 * In this case we can't enforce the memory limit - we're going to exceed
-	 * it. We can however minimize the impact and use as little memory as
-	 * possible. (We haven't really enforced it before either, as we simply
-	 * ignored the batch files.)
-	 *
-	 * The formula for total memory usage says that given an inner relation of
-	 * size inner_rel_bytes, we may divide it into an arbitrary number of
-	 * batches. This determines both the size of the in-memory hash table and
-	 * the amount of memory needed for batch files. These two terms work in
-	 * opposite ways - when one decreases, the other increases.
+	 * buffers.
 	 *
-	 * For low nbatch values, the hash table takes most of the memory, but at
-	 * some point the batch files start to dominate. If you combine these two
-	 * terms, the memory consumption (for a fixed size of the inner relation)
-	 * has a u-shape, with a minimum at some nbatch value.
+	 * For very large inner_rel_bytes, there may be no nbatch that keeps total
+	 * memory usage under the budget (work_mem * hash_mem_multiplier). In that
+	 * case, we choose nbatch to minimize total memory consumption across both
+	 * the hashtable and file buffers.
 	 *
-	 * Our goal is to find this nbatch value, minimizing the memory usage. We
-	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
-	 * if it's lower than the current memory usage we know it's better to use
-	 * fewer batches. We repeat this until reducing the number of batches does
-	 * not reduce the memory usage - we found the optimum. We know the optimum
-	 * exists, thanks to the u-shape.
+	 * As we increase the size of the hashtable and decrease the number of
+	 * batches, the total memory usage follows a U-shaped curve. We find the
+	 * minimum nbatch by "walking back" -- checking if halving nbatch would
+	 * lower total memory usage. We stop when it no longer helps.
 	 *
-	 * We only want to do this when exceeding the memory limit, not every
-	 * time. The goal is not to minimize memory usage in every case, but to
-	 * minimize the memory usage when we can't stay within the memory limit.
+	 * We only want to do this when we are already over the memory budget. We
+	 * don't explore increasing nbatch since that could force batching when it
+	 * is not needed.
 	 *
-	 * For this reason we only consider reducing the number of batches. We
-	 * could try the opposite direction too, but that would save memory only
-	 * when most of the memory is used by the hash table. And the hash table
-	 * was used for the initial sizing, so we shouldn't be exceeding the
-	 * memory limit too much. We might save memory by using more batches, but
-	 * it would result in spilling more batch files, which does not seem like
-	 * a great trade off.
+	 * While growing the hashtable, we also adjust the number of buckets to
+	 * maintain a load factor of NTUP_PER_BUCKET while squeezing tuples back
+	 * from batches into the hashtable.
 	 *
-	 * While growing the hashtable, we also adjust the number of buckets, to
-	 * not have more than one tuple per bucket (load factor 1). We can only do
-	 * this during the initial sizing - once we start building the hash,
-	 * nbucket is fixed.
+	 * Note that we can only change nbuckets during initial hashtable sizing.
+	 * Once we start building the hash, nbuckets is fixed.
 	 */
-	while (nbatch > 0)
+	while (nbatch > 1)
 	{
-		/* how much memory are we using with current nbatch value */
-		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+		/*
+		 * Ensure that nbuckets can be allocated. nbuckets may have been
+		 * increased above hash_table_bytes / sizeof(HashJoinTuple) if work_mem
+		 * was very small, so checking whether the bucket array still fits in
+		 * MaxAllocSize is the safest approach.
+		 */
+		if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
+			break;
 
-		/* how much memory would we use with half the batches */
-		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+		/*
+		 * Doubling can't overflow size_t, because we break well before
+		 * hash_table_bytes approaches MaxAllocSize. Assert on space_allowed,
+		 * which exceeds hash_table_bytes when there is a skew hashtable.
+		 */
+		Assert((*space_allowed) < SIZE_MAX / 2);
 
-		/* If the memory usage would not decrease, we found the optimum. */
-		if (current_space < new_space)
+		/*
+		 * This is the overflow-safe equivalent of:
+		 *
+		 * hash_table_bytes + 2 * nbatch * BLCKSZ < hash_table_bytes * 2 +
+		 * nbatch * BLCKSZ
+		 *
+		 * i.e. it avoids the intermediate products that could overflow.
+		 *
+		 * Which is to say: stop as soon as halving the number of batches and
+		 * doubling the hashtable would increase overall memory usage.
+		 */
+		if (nbatch < hash_table_bytes / BLCKSZ)
 			break;
 
 		/*
-		 * It's better to use half the batches, so do that and adjust the
-		 * nbucket in the opposite direction, and double the allowance.
+		 * MaxAllocSize is sufficiently small that we are not worried about
+		 * overflowing nbuckets.
 		 */
-		nbatch /= 2;
 		nbuckets *= 2;
-
+		hash_table_bytes *= 2;
 		*space_allowed = (*space_allowed) * 2;
+
+		nbatch /= 2;
 	}
 
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);
-
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
 }
@@ -994,7 +989,7 @@ ExecHashIncreaseBatchSize(HashJoinTable hashtable)
 	 * How much additional memory would doubling nbatch use? Each batch may
 	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
 	 */
-	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+	size_t		batchSpace = (hashtable->nbatch * 2 * (size_t) BLCKSZ);
 
 	/*
 	 * Compare the new space needed for doubling nbatch and for enlarging the
-- 
2.43.0
