On 07.02.2026 16:39, Chengpeng Yan wrote:

>> On Feb 3, 2026, at 13:39, John Naylor <[email protected]> wrote:
>>
>> To evaluate that, it would be easier if the patch were split into two
>> commits, one with the simple step approach, and one that changes to
>> the other approach.
>>
>> Also, we have some places that switch from e.g. an array to a hash
>> table once a collection gets above a certain threshold. In this case,
>> however, having less than 100 stat target is unheard of, in my
>> experience. It's only a tiny amount of code, but it doesn't seem worth
>> special-casing. Others may have a different opinion, and I'm not sure
>> we don't already have this behavior elsewhere.

> Thanks for the suggestion — I’ve applied it in v4 by splitting the
> changes into two commits.

I agree that splitting it into two logically independent parts is beneficial. The patch introduces two distinct changes:

1. Replacing the original shift-based singleton handling with the cursor-based eviction mechanism.
2. Adding the hash-based lookup.

Keeping them separate makes review easier, because I can first focus purely on the algorithmic transformation (shift -> cursor) and verify semantic equivalence.



> Regarding the hash threshold: my initial approach followed existing
> postgres precedent for hash-based MCV matching. For example, commit
> 057012b205a (“Speed up eqjoinsel() with lots of MCV entries”) introduces
> EQJOINSEL_MCV_HASH_THRESHOLD in selfuncs.c, only enabling hashing once
> the MCV set is large enough to amortize the setup cost. The exact
> threshold differs, but the pattern is similar.
>
> Also, while 100 is the default statistics target, there are internal
> cases that use smaller values — e.g. vacuumdb staged analyze temporarily
> sets default_statistics_target to 1 and 10 in vacuuming.c — which was
> part of the motivation for keeping the thresholded behavior.
>
> That said, I’m happy to adjust or remove the threshold if the consensus
> is to simplify this further.

I would not rely too heavily on EQJOINSEL_MCV_HASH_THRESHOLD as a direct reference point here. That threshold is itself somewhat heuristic, and the cost trade-offs in eqjoinsel() are not identical to those in compute_distinct_stats(). In particular, the break-even point here will depend strongly on the data type and value width: with wider datums (e.g. bytea), comparisons become more expensive, and the point at which hashing amortizes its setup cost may shift noticeably.

We don't need to compute the exact optimal threshold; a reasonable order of magnitude is enough.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/
