On 07.02.2026 16:39, Chengpeng Yan wrote:
> On Feb 3, 2026, at 13:39, John Naylor <[email protected]> wrote:
>>
>> To evaluate that, it would be easier if the patch were split into two
>> commits, one with the simple step approach, and one that changes to
>> the other approach.
>>
>> Also, we have some places that switch from e.g. an array to a hash
>> table once a collection gets above a certain threshold. In this case,
>> however, having less than 100 stat target is unheard of, in my
>> experience. It's only a tiny amount of code, but it doesn't seem worth
>> special-casing. Others may have a different opinion, and I'm not sure
>> we don't already have this behavior elsewhere.
> Thanks for the suggestion: I've applied it in v4 by splitting the
> changes into two commits. I think it is beneficial to split it into
> two logically independent parts, since the patch currently introduces
> two different changes at once:
>
> 1. Replacing the original shift-based singleton handling with a
>    cursor-based eviction mechanism.
> 2. Adding the hash-based lookup.
>
> Splitting the patch makes review easier, because one can first focus
> purely on the algorithmic transformation (shift -> cursor) and verify
> semantic equivalence before looking at the lookup change.
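
To make sure we mean the same transformation, here is a minimal sketch
of the two eviction styles. This is deliberately not the actual
analyze.c code; TrackItem, first_singleton and both function names are
invented for illustration:

#include <string.h>
#include <stdint.h>

typedef uintptr_t Datum;            /* stand-in for PostgreSQL's Datum */

typedef struct TrackItem
{
    Datum       value;
    int         count;
} TrackItem;

/*
 * Shift-based eviction: drop the oldest singleton and slide the tail of
 * the array left, i.e. an O(n) memmove on every miss once the array is
 * full.
 */
static void
evict_by_shift(TrackItem *track, int track_max, int first_singleton,
               Datum newval)
{
    memmove(&track[first_singleton], &track[first_singleton + 1],
            (track_max - first_singleton - 1) * sizeof(TrackItem));
    track[track_max - 1].value = newval;
    track[track_max - 1].count = 1;
}

/*
 * Cursor-based eviction: overwrite one singleton slot in place and
 * advance a rotating cursor, so every miss is O(1).  The caller starts
 * the cursor at first_singleton.
 */
static void
evict_by_cursor(TrackItem *track, int track_max, int first_singleton,
                int *cursor, Datum newval)
{
    track[*cursor].value = newval;
    track[*cursor].count = 1;
    if (++(*cursor) >= track_max)
        *cursor = first_singleton;
}

As I understand the intent, singleton slots are interchangeable for the
ndistinct computation, so overwriting one in place preserves the
semantics while avoiding the per-miss memmove. Verifying exactly that
equivalence is what the first commit should make easy.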
> Regarding the hash threshold: my initial approach followed existing
> Postgres precedent for hash-based MCV matching. For example, commit
> 057012b205a ("Speed up eqjoinsel() with lots of MCV entries") introduces
> EQJOINSEL_MCV_HASH_THRESHOLD in selfuncs.c, only enabling hashing once
> the MCV set is large enough to amortize the setup cost. The exact
> threshold differs, but the pattern is similar.
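
To be concrete about the pattern under discussion, here is a
self-contained toy sketch of size-gated hashing. All names and the
constant are invented for illustration; this is neither your patch nor
the selfuncs.c code:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DISTINCT_HASH_THRESHOLD 64
#define HASH_SLOTS 1024              /* power of two, > 2x capacity */

typedef struct
{
    uintptr_t   values[256];         /* tracked values */
    int         nvalues;
    uintptr_t   slots[HASH_SLOTS];   /* 0 = empty; toy open addressing */
    bool        hash_built;
} TrackState;

static bool
linear_lookup(const TrackState *ts, uintptr_t value)
{
    for (int i = 0; i < ts->nvalues; i++)
        if (ts->values[i] == value)
            return true;
    return false;
}

static void
build_hash_table(TrackState *ts)
{
    memset(ts->slots, 0, sizeof(ts->slots));
    for (int i = 0; i < ts->nvalues; i++)
    {
        uintptr_t h = (ts->values[i] * 2654435761u) & (HASH_SLOTS - 1);

        while (ts->slots[h] != 0)
            h = (h + 1) & (HASH_SLOTS - 1);
        ts->slots[h] = ts->values[i];
    }
    ts->hash_built = true;
}

static bool
hash_lookup(const TrackState *ts, uintptr_t value)
{
    uintptr_t h = (value * 2654435761u) & (HASH_SLOTS - 1);

    while (ts->slots[h] != 0)
    {
        if (ts->slots[h] == value)
            return true;
        h = (h + 1) & (HASH_SLOTS - 1);
    }
    return false;
}

/*
 * The gating pattern: keep the O(n) scan for small collections, and pay
 * the one-time table build only once it can be amortized.  NB: a real
 * implementation must also add later insertions to the table; omitted
 * for brevity.
 */
static bool
track_contains(TrackState *ts, uintptr_t value)
{
    if (ts->nvalues < DISTINCT_HASH_THRESHOLD)
        return linear_lookup(ts, value);
    if (!ts->hash_built)
        build_hash_table(ts);
    return hash_lookup(ts, value);
}

The essential property is that build_hash_table() runs at most once,
and only after the collection is large enough that the per-probe
savings repay it.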
> Also, while 100 is the default statistics target, there are internal
> cases that use smaller values: e.g., vacuumdb's staged analyze
> temporarily sets default_statistics_target to 1 and 10 in vacuuming.c.
> That was part of the motivation for keeping the thresholded behavior.
>
> That said, I'm happy to adjust or remove the threshold if the consensus
> is to simplify this further.
I would probably not rely too heavily on EQJOINSEL_MCV_HASH_THRESHOLD as
a direct reference point here. That threshold is itself somewhat
heuristic, and the cost trade-offs in eqjoinsel() are not identical to
those in compute_distinct_stats(). In particular, the break-even point
here will depend strongly on the data type and value width: with wider
datums (e.g. bytea), the comparison cost increases, and the point at
which hashing amortizes its setup cost may shift noticeably. We don't
need to compute the exact optimal threshold; getting the right order of
magnitude is enough.
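
To put a rough shape on that (a back-of-the-envelope model with
invented per-operation costs C_cmp for one comparison, C_hash for one
hash computation, and C_ins for one hash-table insertion): with N
sampled values and T tracked entries, the two strategies cost on the
order of

    linear scan:  N * (T / 2) * C_cmp
    hash lookup:  T * C_ins + N * (C_hash + C_cmp)

Setting these equal gives the crossover T*, which depends on the ratio
C_hash / C_cmp. For variable-width types such as bytea, both C_cmp and
C_hash grow with datum width, so the ratio, and with it the sensible
threshold, moves with the data. Measuring a couple of type/width
combinations should be enough to pick the right order of magnitude.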
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/