Hi,
On 2026-02-28 17:21:28 +0200, Ants Aasma wrote:
> On Thu, 19 Feb 2026 at 21:01, Andres Freund <[email protected]> wrote:
> > On 2026-02-19 20:24:17 +0200, Ants Aasma wrote:
> > > On Thu, 19 Feb 2026 at 20:07, Ants Aasma <[email protected]> wrote:
> > > > I was thinking more along the lines of hashing together the pointer
> > > > value and worker number. But something more deterministic would indeed
> > > > be better. How about this?
> > > >
> > > > --- a/src/backend/executor/execGrouping.c
> > > > +++ b/src/backend/executor/execGrouping.c
> > > > @@ -201,3 +201,3 @@ BuildTupleHashTable(PlanState *parent,
> > > >  	MemoryContext oldcontext;
> > > > -	uint32		hash_iv = 0;
> > > > +	uint32		hash_iv = parent->plan->plan_node_id;
> > >
> > > I can confirm that this fixes the issue. A standalone reproducer is here:
> >
> > I think this should use something that smears the bits from the plan_id
> > more widely. The hash functions of some types are ... peculiar (partially
> > due to cross-type hashjoin support).
> >
> > I'd also combine it with use_variable_hash_iv, rather than just have
> > use_variable_hash_iv overwrite it.
>
> This problem should only affect cases where multiple hash tables worth
> of tuples are trying to fit into one.
Is that entirely true? If we are feeding one hash table from another hash
table, 1:1, and the initial size of the target table is wrong, we probably
also won't handle it in the best way.
That could happen with a group by over a group by (with perhaps some joins
above the first group by). Probably not that common, but ...
I guess another case could be a CTAS of a group by query that is then grouped
by again.
> I think right now that limits it to hash aggregates with parallel
> query. ExecBuildHash32FromAttrs comment says that we should be using
> non-zero init_value only when needed for a marginal performance gain. So I
> limited the plan node id inclusion to use_variable_hash_iv, but made it
> unconditional for aggregates.
That's an argument... I guess we could mostly avoid that though, by just
always combining in EEOP_HASHDATUM_FIRST.
> I used hash_combine to smear together the bits of the two. I think
> that should be good enough. Alternatively a xor of murmurhash32 of the
> two should be better.
It's probably good enough, but I'd still go for
hash_combine(murmurhash32(ParallelWorkerNumber),
murmurhash32(parent->plan->plan_node_id))
or such.
> There is also a call to ExecBuildHash32FromAttrs from ExecInitSubPlan
> that specifies a hash_iv for a hashed sub plan execution.
When you're saying it specifies a hash IV you mean the 0 it passes, not some
more complicated value, right?
> This must match the value BuildTupleHashTable decides on. Worth adding a
> comment?
Yes, I think so.
> From 624b4827e0d480fc16e016c1ad7c5b26f358b6f3 Mon Sep 17 00:00:00 2001
> From: Ants Aasma <[email protected]>
> Date: Sat, 28 Feb 2026 15:08:50 +0200
> Subject: [PATCH v1] Decorrelate nested hash tables
>
> Because hash tables are iterated in hash order, using the same hash
> function in nested hash tables can lead to excessive collisions. If the
> parent hash table can be the same size as the child hash tables it is
> not a problem as the parent will quickly grow to the same size as the
> child, eliminating further collisions. But if there is more than one
> child, the collisions can cause the table to quickly grow until it is
> below 10% fillfactor.
>
> The problem is made worse by nodeAgg devolving into building single
> entry batches once hash table size exceeds working memory.
>
> To hit the problem, an aggregate node without a final function above
> multiple other aggregate nodes is needed. Because the hash iv is already
> initialized using the parallel worker number when there is no final
> function, the typical cases do not hit this problem. A hash aggregate
> implementing DISTINCT above multiple parallel workers with more groups
> than fit into memory is needed.
>
> Initializing the hash function based on plan node id decorrelates hash
> tables within a plan while still keeping behavior deterministic.
> Author: Ants Aasma <[email protected]>
> Discussion: https://postgr.es/m/CANwKhkPOZupu3PYQVdkMmYjquYVqG2v8XmCAuuVM9Eu13-Zw3g%40mail.gmail.com
> ---
> src/backend/executor/execGrouping.c | 2 +-
> src/backend/executor/nodeAgg.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
> index c107514a85d..21f5e4cabcc 100644
> --- a/src/backend/executor/execGrouping.c
> +++ b/src/backend/executor/execGrouping.c
> @@ -246,7 +246,7 @@ BuildTupleHashTable(PlanState *parent,
>  	 * underestimated.
>  	 */
>  	if (use_variable_hash_iv)
> -		hash_iv = murmurhash32(ParallelWorkerNumber);
> +		hash_iv = hash_combine(ParallelWorkerNumber, parent->plan->plan_node_id);
>
>  	hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable);
I think some of the reasoning from your commit message needs to be in a
comment somewhere too, otherwise the next person tackling this code will have
a hell of a time.
> diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
> index 7d487a165fa..5c947fd151f 100644
> --- a/src/backend/executor/nodeAgg.c
> +++ b/src/backend/executor/nodeAgg.c
> @@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, double nbuckets)
>  											   metacxt,
>  											   tuplescxt,
>  											   tmpcxt,
> -											   DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
> +											   true);
>  }
Hm, won't that mean we'll now always use the slower path? I think there are
non-partial cases that could be problematic, but if we do this we probably
should do the optimization of the hash expression stuff I mentioned above.
Greetings,
Andres Freund