On 07.02.2026 16:39, Chengpeng Yan wrote:
> On Feb 3, 2026, at 13:39, John Naylor <[email protected]> wrote:
>>
>> To evaluate that, it would be easier if the patch were split into two
>> commits, one with the simple step approach, and one that changes to
>> the other approach.
>>
>> Also, we have some places that switch from e.g. an array to a hash
>> table once a collection gets above a certain threshold. In this case,
>> however, having less than 100 stat target is unheard of, in my
>> experience. It's only a tiny amount of code, but it doesn't seem worth
>> special-casing. Others may have a different opinion, and I'm not sure
>> we don't already have this behavior elsewhere.
> Thanks for the suggestion: I've applied it in v4 by splitting the
> changes into two commits. I think it is beneficial to split it into
> two logically independent parts, since the patch currently introduces
> two different changes at once:
>
> 1. Replacing the original shift-based singleton handling with a
>    cursor-based eviction mechanism.
> 2. Adding the hash-based lookup.
>
> Splitting the patch makes review easier, because one can first focus
> purely on the algorithmic transformation (shift -> cursor) and verify
> semantic equivalence before looking at the lookup change.
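
To make sure we mean the same transformation, here is a minimal sketch
of the two eviction styles. This is deliberately not the actual
analyze.c code; TrackItem, first_singleton and both function names are
invented for illustration:

#include <string.h>
#include <stdint.h>

typedef uintptr_t Datum;            /* stand-in for PostgreSQL's Datum */

typedef struct TrackItem
{
    Datum       value;
    int         count;
} TrackItem;

/*
 * Shift-based eviction: drop the oldest singleton and slide the tail of
 * the array left, i.e. an O(n) memmove on every miss once the array is
 * full.
 */
static void
evict_by_shift(TrackItem *track, int track_max, int first_singleton,
               Datum newval)
{
    memmove(&track[first_singleton], &track[first_singleton + 1],
            (track_max - first_singleton - 1) * sizeof(TrackItem));
    track[track_max - 1].value = newval;
    track[track_max - 1].count = 1;
}

/*
 * Cursor-based eviction: overwrite one singleton slot in place and
 * advance a rotating cursor, so every miss is O(1).  The caller starts
 * the cursor at first_singleton.
 */
static void
evict_by_cursor(TrackItem *track, int track_max, int first_singleton,
                int *cursor, Datum newval)
{
    track[*cursor].value = newval;
    track[*cursor].count = 1;
    if (++(*cursor) >= track_max)
        *cursor = first_singleton;
}

As I understand the intent, singleton slots are interchangeable for the
ndistinct computation, so overwriting one in place preserves the
semantics while avoiding the per-miss memmove. Verifying exactly that
equivalence is what the first commit should make easy.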
> Regarding the hash threshold: my initial approach followed existing
> Postgres precedent for hash-based MCV matching. For example, commit
> 057012b205a ("Speed up eqjoinsel() with lots of MCV entries") introduces
> EQJOINSEL_MCV_HASH_THRESHOLD in selfuncs.c, only enabling hashing once
> the MCV set is large enough to amortize the setup cost. The exact
> threshold differs, but the pattern is similar.
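
To be concrete about the pattern under discussion, here is a
self-contained toy sketch of size-gated hashing. All names and the
constant are invented for illustration; this is neither your patch nor
the selfuncs.c code:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DISTINCT_HASH_THRESHOLD 64
#define HASH_SLOTS 1024              /* power of two, > 2x capacity */

typedef struct
{
    uintptr_t   values[256];         /* tracked values */
    int         nvalues;
    uintptr_t   slots[HASH_SLOTS];   /* 0 = empty; toy open addressing */
    bool        hash_built;
} TrackState;

static bool
linear_lookup(const TrackState *ts, uintptr_t value)
{
    for (int i = 0; i < ts->nvalues; i++)
        if (ts->values[i] == value)
            return true;
    return false;
}

static void
build_hash_table(TrackState *ts)
{
    memset(ts->slots, 0, sizeof(ts->slots));
    for (int i = 0; i < ts->nvalues; i++)
    {
        uintptr_t h = (ts->values[i] * 2654435761u) & (HASH_SLOTS - 1);

        while (ts->slots[h] != 0)
            h = (h + 1) & (HASH_SLOTS - 1);
        ts->slots[h] = ts->values[i];
    }
    ts->hash_built = true;
}

static bool
hash_lookup(const TrackState *ts, uintptr_t value)
{
    uintptr_t h = (value * 2654435761u) & (HASH_SLOTS - 1);

    while (ts->slots[h] != 0)
    {
        if (ts->slots[h] == value)
            return true;
        h = (h + 1) & (HASH_SLOTS - 1);
    }
    return false;
}

/*
 * The gating pattern: keep the O(n) scan for small collections, and pay
 * the one-time table build only once it can be amortized.  NB: a real
 * implementation must also add later insertions to the table; omitted
 * for brevity.
 */
static bool
track_contains(TrackState *ts, uintptr_t value)
{
    if (ts->nvalues < DISTINCT_HASH_THRESHOLD)
        return linear_lookup(ts, value);
    if (!ts->hash_built)
        build_hash_table(ts);
    return hash_lookup(ts, value);
}

The essential property is that build_hash_table() runs at most once,
and only after the collection is large enough that the per-probe
savings repay it.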
> Also, while 100 is the default statistics target, there are internal
> cases that use smaller values: e.g., vacuumdb's staged analyze
> temporarily sets default_statistics_target to 1 and 10 in vacuuming.c.
> That was part of the motivation for keeping the thresholded behavior.
>
> That said, I'm happy to adjust or remove the threshold if the consensus
> is to simplify this further.
I would probably not rely too heavily on EQJOINSEL_MCV_HASH_THRESHOLD as
a direct reference point here. That threshold is itself somewhat
heuristic, and the cost trade-offs in eqjoinsel() are not identical to
those in compute_distinct_stats(). In particular, the break-even point
here will depend strongly on the data type and value width: with wider
datums (e.g. bytea), the comparison cost increases, and the point at
which hashing amortizes its setup cost may shift noticeably. We don't
need to compute the exact optimal threshold; getting the right order of
magnitude is enough.
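
To put a rough shape on that (a back-of-the-envelope model with
invented per-operation costs C_cmp for one comparison, C_hash for one
hash computation, and C_ins for one hash-table insertion): with N
sampled values and T tracked entries, the two strategies cost on the
order of

    linear scan:  N * (T / 2) * C_cmp
    hash lookup:  T * C_ins + N * (C_hash + C_cmp)

Setting these equal gives the crossover T*, which depends on the ratio
C_hash / C_cmp. For variable-width types such as bytea, both C_cmp and
C_hash grow with datum width, so the ratio, and with it the sensible
threshold, moves with the data. Measuring a couple of type/width
combinations should be enough to pick the right order of magnitude.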
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/