Hi,
On 2026-02-28 17:21:28 +0200, Ants Aasma wrote:
> On Thu, 19 Feb 2026 at 21:01, Andres Freund <[email protected]> wrote:
> > On 2026-02-19 20:24:17 +0200, Ants Aasma wrote:
> > > On Thu, 19 Feb 2026 at 20:07, Ants Aasma <[email protected]> wrote:
> > > > I was thinking more along the lines of hashing together the pointer
> > > > value and worker number. But something more deterministic would indeed
> > > > be better. How about this?
> > > >
> > > > --- a/src/backend/executor/execGrouping.c
> > > > +++ b/src/backend/executor/execGrouping.c
> > > > @@ -201,3 +201,3 @@ BuildTupleHashTable(PlanState *parent,
> > > >  	MemoryContext oldcontext;
> > > > -	uint32		hash_iv = 0;
> > > > +	uint32		hash_iv = parent->plan->plan_node_id;
> > >
> > > I can confirm that this fixes the issue. A standalone reproducer is here:
> >
> > I think this should use something that smears the bits from the plan_id
> > more widely. The hash functions of some types are ... peculiar (partially
> > due to cross-type hashjoin support).
> >
> > I'd also combine it with use_variable_hash_iv, rather than just have
> > use_variable_hash_iv overwrite it.
>
> This problem should only affect cases where multiple hash tables worth
> of tuples are trying to fit into one.
Is that entirely true? If we are feeding one hash table from another hash
table, 1:1, and the initial size of the target table is wrong, we probably
also won't handle it in the best way.
That could happen with a group by over a group by (with perhaps some joins
above the first group by). Probably not that common, but ...
I guess another case could be a CTAS of a group by query that is then grouped
by again.
> I think right now that limits it to hash aggregates with parallel
> query. ExecBuildHash32FromAttrs comment says that we should be using
> non-zero init_value only when needed for a marginal performance gain. So I
> limited the plan node id inclusion to use_variable_hash_iv, but made it
> unconditional for aggregates.
That's an argument... I guess we could mostly avoid that though, by just
always combining in EEOP_HASHDATUM_FIRST.
> I used hash_combine to smear together the bits of the two. I think
> that should be good enough. Alternatively a xor of murmurhash32 of the
> two should be better.
It's probably good enough, but I'd still go for
hash_combine(murmurhash32(ParallelWorkerNumber),
murmurhash32(parent->plan->plan_node_id))
or such.
> There is also a call to ExecBuildHash32FromAttrs from ExecInitSubPlan
> that specifies a hash_iv for a hashed sub plan execution.
When you're saying it specifies a hash IV you mean the 0 it passes, not some
more complicated value, right?
> This must match the value BuildTupleHashTable decides on. Worth adding a
> comment?
Yes, I think so.
> From 624b4827e0d480fc16e016c1ad7c5b26f358b6f3 Mon Sep 17 00:00:00 2001
> From: Ants Aasma <[email protected]>
> Date: Sat, 28 Feb 2026 15:08:50 +0200
> Subject: [PATCH v1] Decorrelate nested hash tables
>
> Because hash tables are iterated in hash order, using the same hash
> function in nested hash tables can lead to excessive collisions. If the
> parent hash table can be the same size as the child hash tables it is
> not a problem as the parent will quickly grow to the same size as the
> child, eliminating further collisions. But if there is more than one
> child, the collisions can cause the table to quickly grow until it is
> below 10% fillfactor.
>
> The problem is made worse by nodeAgg devolving into building single
> entry batches once hash table size exceeds working memory.
>
> To hit the problem, an aggregate node without a final function above
> multiple other aggregate nodes is needed. Because the hash iv is already
> initialized using the parallel worker number when there is no final
> function, the typical cases do not hit this problem. A hash aggregate
> implementing DISTINCT above multiple parallel workers with more groups
> than fit into memory is needed.
>
> Initializing the hash function based on plan node id decorrelates hash
> tables within a plan while still keeping behavior deterministic.
> Author: Ants Aasma <[email protected]>
> Discussion: https://postgr.es/m/CANwKhkPOZupu3PYQVdkMmYjquYVqG2v8XmCAuuVM9Eu13-Zw3g%40mail.gmail.com
> ---
> src/backend/executor/execGrouping.c | 2 +-
> src/backend/executor/nodeAgg.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
> index c107514a85d..21f5e4cabcc 100644
> --- a/src/backend/executor/execGrouping.c
> +++ b/src/backend/executor/execGrouping.c
> @@ -246,7 +246,7 @@ BuildTupleHashTable(PlanState *parent,
>  	 * underestimated.
>  	 */
>  	if (use_variable_hash_iv)
> -		hash_iv = murmurhash32(ParallelWorkerNumber);
> +		hash_iv = hash_combine(ParallelWorkerNumber, parent->plan->plan_node_id);
>
>  	hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable);
I think some of the reasoning from your commit message needs to be in a
comment somewhere too, otherwise the next person tackling this code will have
a hell of a time.
> diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
> index 7d487a165fa..5c947fd151f 100644
> --- a/src/backend/executor/nodeAgg.c
> +++ b/src/backend/executor/nodeAgg.c
> @@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, double nbuckets)
>  											   metacxt,
>  											   tuplescxt,
>  											   tmpcxt,
> -											   DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
> +											   true);
>  }
Hm, won't that mean we'll now always use the slower path? I think there are
non-partial cases that could be problematic, but if we do this we probably
should do the optimization of the hash expression stuff I mentioned above.
Greetings,
Andres Freund