On Sat, 28 Feb 2026 at 22:03, Andres Freund <[email protected]> wrote:
> > This problem should only affect cases where multiple hash tables worth
> > of tuples are trying to fit into one.
>
> Is that entirely true? If we are feeding one hash table from another hash
> table, 1:1, and the initial size of the target table is wrong, we probably
> also won't handle it in the best way.
>
> That could happen with a group by over a group by (with perhaps some joins
> above the first group by). Probably not that common, but ...
>
> I guess another case could be a CTAS of a group by query that is then grouped
> by again.

With 1:1 tables, if the target table is smaller there are going to be a
ton of collisions, which should quite quickly trigger growth via
SH_GROW_MAX_MOVE (150). But once the target table is the same size as
the source, the growth stops. I think this ends up being slightly more
efficient in the correlated case because the table will be growing at a
10% fill factor, with fewer entries to move around. And the accesses to
the target hash table will be more correlated over time, for better
cache efficiency.

However, I think it might be possible to have different hash table size
limits in a single plan. At least setting hash_mem_multiplier on a
function would do it. But I wasn't able to trigger any bad behavior
with this for some reason; I will experiment some more.

> > I think right now that limits it to hash aggregates with parallel
> > query. ExecBuildHash32FromAttrs comment says that we should be using
> > non-zero init_value only when needed for a marginal performance gain. So I
> > limited the plan node id inclusion to use_variable_hash_iv, but made it
> > unconditional for aggregates.
>
> That's an argument...  I guess we could mostly avoid that though, by just
> always combining in EEOP_HASHDATUM_FIRST.

I think that would make it exactly the same as EEOP_HASHDATUM_NEXT32,
making the special opcode pointless. I assume David observed some
speedup there, given the effort that went into including it.

> > I used hash_combine to smear together the bits of the two. I think
> > that should be good enough. Alternatively a xor of murmurhash32 of the
> > two should be better.
>
> Probably it's good enough, but I'd probably still go for
>   hash_combine(murmurhash32(ParallelWorkerNumber),
>                murmurhash32(parent->plan->plan_node_id))
> or such.

Used this approach.

> > There is also a call to ExecBuildHash32FromAttrs from ExecInitSubPlan
> > that specifies a hash_iv for a hashed sub plan execution.
>
> When you're saying it specifies a hash IV you mean the 0 it passes, not some
> more complicated value, right?

Right.

>
> > This must match the value BuildTupleHashTable decides on. Worth adding a
> > comment?
>
> Yes, I think so.

Done.

> >        */
> >       if (use_variable_hash_iv)
> > -             hash_iv = murmurhash32(ParallelWorkerNumber);
> > +             hash_iv = hash_combine(ParallelWorkerNumber, parent->plan->plan_node_id);
> >
> >       hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable);
>
> I think some of the reasoning from your commit messages needs to be in a
> comment somewhere too, otherwise the next person tackling this code will have
> a hell of a time.

I gave it a shot.

> > diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
> > index 7d487a165fa..5c947fd151f 100644
> > --- a/src/backend/executor/nodeAgg.c
> > +++ b/src/backend/executor/nodeAgg.c
> > @@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, double nbuckets)
> >                                              metacxt,
> >                                              tuplescxt,
> >                                              tmpcxt,
> > -                                            DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
> > +                                            true);
> >  }
>
> Hm, won't that mean we'll now always use the slower path?  I think there are
> non-partial cases that could be problematic, but if we do this we probably
> should do the optimization of the hash expression stuff I mentioned above.

If you mean EEOP_HASHDATUM_FIRST, then I think that would just be
removing an optimization. I think losing the quicker growth and the
correlated table filling in the 1:1 case might be more significant
than the slightly slower hash calculation. However, the downside of the
bad case is pretty nasty. Do you think it's worth the effort of trying
to figure out whether we are in a dangerous-looking plan structure?

Decorrelating the hash tables still doesn't fix the terrible edge case
of a hash aggregate exceeding the memory limit with the hash table
alone; it just makes that exceedingly improbable to hit with
non-antagonistic workloads. Would it be a good idea to fix that too? A
simple answer would be to always allow the hash aggregate to use some
fraction, say 25%, of the allowed memory for tuples, even if the hash
table is too big.

Regards,
Ants Aasma
From 7a2ac9df4717e95475e10e660264dba550d5bee6 Mon Sep 17 00:00:00 2001
From: Ants Aasma <[email protected]>
Date: Sat, 28 Feb 2026 15:08:50 +0200
Subject: [PATCH v2] Decorrelate nested hash tables

Because hash tables are iterated in hash order, using the same hash
function in nested hash tables can lead to excessive collisions. If the
parent hash table can be the same size as the child hash tables it is
not a problem, as the parent will quickly grow to the same size as the
child, eliminating further collisions. But if there is more than one
child, the collisions can cause the table to quickly grow until it is
below 10% fill factor.

The problem is made worse by nodeAgg devolving into building
single-entry batches once the hash table size exceeds working memory.

To hit the problem, an aggregate node without a final function is
needed above multiple other aggregate nodes. Because the hash IV is
already initialized using the parallel worker number when there is no
final fn, the typical cases do not hit this problem. A hash aggregate
implementing DISTINCT above multiple parallel workers, with more groups
than fit into memory, is needed.

Initializing the hash function based on the plan node id decorrelates
hash tables within a plan while still keeping behavior deterministic.

Author: Ants Aasma <[email protected]>
Discussion: https://postgr.es/m/CANwKhkPOZupu3PYQVdkMmYjquYVqG2v8XmCAuuVM9Eu13-Zw3g%40mail.gmail.com
---
 src/backend/executor/execGrouping.c | 16 +++++++++++-----
 src/backend/executor/nodeAgg.c      |  2 +-
 src/test/regress/expected/union.out |  2 +-
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index c107514a85d..bfdcaca4e7d 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -199,6 +199,10 @@ BuildTupleHashTable(PlanState *parent,
 	TupleHashTable hashtable;
 	uint32		nbuckets;
 	MemoryContext oldcontext;
+	/*
+	 * The value here must match what ExecInitSubPlan passes to
+	 * ExecBuildHash32FromAttrs.
+	 */
 	uint32		hash_iv = 0;
 
 	/*
@@ -240,13 +244,15 @@ BuildTupleHashTable(PlanState *parent,
 	/*
 	 * If parallelism is in use, even if the leader backend is performing the
 	 * scan itself, we don't want to create the hashtable exactly the same way
-	 * in all workers. As hashtables are iterated over in keyspace-order,
-	 * doing so in all processes in the same way is likely to lead to
-	 * "unbalanced" hashtables when the table size initially is
-	 * underestimated.
+	 * in all workers. As hashtables are iterated over in hash value order,
+	 * inserting into a smaller table or from multiple tables is going to
+	 * cause collisions which trigger hash table growth. In hash aggregates
+	 * an overly large hash table causes excessive spilling as there is no
+	 * memory left over for tuples.
 	 */
 	if (use_variable_hash_iv)
-		hash_iv = murmurhash32(ParallelWorkerNumber);
+		hash_iv = hash_combine(murmurhash32(ParallelWorkerNumber),
+							   murmurhash32(parent->plan->plan_node_id));
 
 	hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable);
 
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7d487a165fa..5c947fd151f 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, double nbuckets)
 											 metacxt,
 											 tuplescxt,
 											 tmpcxt,
-											 DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+											 true);
 }
 
 /*
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 709c85f2294..fad22c0ed2d 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -551,8 +551,8 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va
    x   
 -------
  {1,4}
- {1,2}
  {1,3}
+ {1,2}
 (3 rows)
 
 explain (costs off)
-- 
2.51.0
