On 1/30/26 08:29, Corey Huinker wrote:
> 
>     Current Limitations
>     -------------------
> 
>     This is a proof of concept. Known limitations include:
> 
> 
> I really like this proof of concept.
> 
>  
> 
>     1. The current catalog design is not ideal. It is asymmetric (a
>     "primary" and an "other" table), which is natural for FK-like joins,
>     but less intuitive for other joins.
> 
> 
> I think the asymmetry comes with the territory, and we will be creating
> the join statistics that prove useful. If that means we create one
> object ON b.c1, b.c2 FROM a JOIN b... and another ON a.c3, a.c4 FROM b
> JOIN a..., then so be it.
> 
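
Just for illustration (the table and column names below are entirely
hypothetical), the two objects described above would serve queries along
these lines:

    -- an object "ON b.c1, b.c2 FROM a JOIN b ..." helps estimate filters on b:
    SELECT count(*) FROM a JOIN b ON b.a_id = a.id
     WHERE b.c1 = 1 AND b.c2 = 2;

    -- an object "ON a.c3, a.c4 FROM b JOIN a ..." helps estimate filters on a:
    SELECT count(*) FROM a JOIN b ON b.a_id = a.id
     WHERE a.c3 = 3 AND a.c4 = 4;
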
>     2. Stats collection piggybacks on ANALYZE of the primary table and
>     uses its single-column MCV for the join key.  This can be inaccurate
>     when the MCV values on the "primary" side don't cover the important
>     values on the other side, or when the filter column isn't fully
>     dependent on the join key. A more accurate approach would execute the
>     actual join during collection, which could also decouple join stats
>     collection from single-table ANALYZE.
> 
> 
> Unfortunately, I think we will have to join the remote table_b to the
> row sample on table_a to get accurate join statistics, and the best time
> to do that is when we already have the row sample from table_a. We can
> further batch up statistics objects that happen to join table_a to
> table_b by the same join criteria to avoid rescans.
> 

IIRC the index-based sampling (described in the paper I mentioned) works
something like this - sample the leading table, then use indexes to look up
data from the other tables to build a statistically correct sample of
the whole join.
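
A rough SQL-level sketch of the shape of that computation (hypothetical
tables table_a(id, ...) and table_b(a_id, ...) with an index on
table_b(a_id); the real thing would of course reuse the ANALYZE sample
rather than TABLESAMPLE):

    -- sample the leading table, then use the index on the other table to
    -- fetch the matching rows, yielding a sample of the join result
    WITH a_sample AS (
        SELECT * FROM table_a TABLESAMPLE SYSTEM (1)
    )
    SELECT s.*, b.*
      FROM a_sample s
      JOIN table_b b ON b.a_id = s.id;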

> Will what you have work when we want to do an MCV on a mix of local and
> remote columns, or will that require more work?
>  
> 
>     3. Currently limited to: equality join clauses, equality and IN filter
>     clauses, simple Var stats objects (no expressions), inner joins only,
>     and two-way joins only. Some of these are easier to extend; others may
>     be harder or unnecessary (like n-way joins).
> 
> 
> I suspect that n-way joins would have very limited utility and could be
> adequately covered with multiple join-stats objects.
>  

I see no clear reason why that would be the case, and it's not that hard
to construct joins on 3+ tables where stats on 2-way joins won't be good
enough. Whether those cases are sufficiently common I don't know, but I
also don't see a good reason to not address them - it should not be much
more complex than 2-way joins (famous last words, I know).
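
For example (hypothetical schema), with only 2-way stats on
(orders, order_items) and (order_items, products), the planner still has
no way to see how the two filters correlate across the full join:

    SELECT count(*)
      FROM orders o
      JOIN order_items i ON i.order_id = o.id
      JOIN products p ON p.id = i.product_id
     WHERE o.region = 'EU'           -- filter on the first table
       AND p.category = 'ski gear';  -- filter on the third table
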

> 
> 
>     4. Patch 0002 (auto-creation from FK constraints) should probably be
>     gated behind a GUC. I'm not strongly attached to this patch, but kept
>     it because FK joins seem like a natural and common use case.
> 
> 
> I think this is like indexing: just because you can index all possible
> columns doesn't mean you should. Tooling will emerge to
> determine what join stats objects are worth their weight, and create
> only those objects.
> 

Agreed.

>     If there's interest, I'm happy to continue iterating on the design.
> 
>     In particular, I'd welcome feedback on:
>     - whether this is a direction worth pursuing,
> 
> 
> Yes. Very much so.
>  
> 
>     - the catalog design,
> 
> 
> I had somehow gotten the impression that you were going to take the
> extended statistics format, but store individual columns in a modified
> pg_statistic. That's working for now, but I wonder if that will still be
> the case when we start to try this for columns that are arrays, ranges,
> multiranges, tsvectors, etc. pg_statistic has the infrastructure to
> handle those, but there may be good reason to keep pg_statistic focused
> on local attributes and instead just keep adding new kinds to extended
> statistics and let it be the grab-bag it was perhaps always meant to be.
> 
> In my own musings on how to implement this (which you have far exceeded
> with this proof-of-concept), I had wondered how to get stats for
> k.keyword and k.phonetic_code individually from the definition of
> movie_keywords2_multi_stats, but looking at what you've done I think
> we're better off defining each statistics object very narrowly. If that
> means we define one object per remote column and one object per
> interesting combination of columns, then so be it. So long as we can
> calculate them all from the same join of the two tables, we'll avoid the
> nasty overhead.
>  

Not sure I understand this. Why would this use pg_statistic at all? I
imagined we'd just treat the join as a relation, and store the stats in
pg_statistic_ext_data. The only difference is that we need to store info
about the join itself (which rels / conditions), which would go into
pg_statistic_ext.
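
For reference, this is how the definition and the built data are split
for per-relation extended statistics today (reading pg_statistic_ext_data
may require elevated privileges); a join-stats object would presumably
add the join rels/conditions on the definition side:

    SELECT s.stxname, s.stxkind,
           d.stxdndistinct IS NOT NULL AS has_ndistinct,
           d.stxdmcv IS NOT NULL AS has_mcv
      FROM pg_statistic_ext s
      JOIN pg_statistic_ext_data d ON d.stxoid = s.oid;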

> 
>     - and scope (what kinds of joins / predicates are worth supporting).
> 
> 
> Between-ish clauses (x >= y AND x < z), etc., are what immediately come
> to mind.
> 
> element_mcv for array types might be next, but that's well down the road.
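
That is, for the between-ish case, a range filter on the joined table
instead of equality/IN, something like (hypothetical tables again):

    SELECT count(*)
      FROM a JOIN b ON b.a_id = a.id
     WHERE b.x >= 10 AND b.x < 20;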

Maybe. In v1 we should focus on mimicking what we have for per-relation
stats, and only then try doing something more.

regards

-- 
Tomas Vondra


