On 1/30/26 08:29, Corey Huinker wrote:
>> Current Limitations
>> -------------------
>>
>> This is a proof of concept. Known limitations include:
>
> I really like this proof of concept.
>
>> 1. The current catalog design is not ideal. It is asymmetric (a
>> "primary" and an "other" table), which is natural for FK-like joins,
>> but less intuitive for other joins.
>
> I think the asymmetry comes with the territory, and we will be creating
> the join statistics that prove useful. If that means that we create one
> object ON b.c1, b.c2 FROM a JOIN b... and another ON a.c3, a.c4 FROM b
> JOIN a... then so be it.
>
>> 2. Stats collection piggybacks on ANALYZE of the primary table and
>> uses its single-column MCV for the join key. This can be inaccurate
>> when the MCV values on the "primary" side don't cover the important
>> values on the other side, or when the filter column isn't fully
>> dependent on the join key. A more accurate approach would execute the
>> actual join during collection, which could also decouple join stats
>> collection from single-table ANALYZE.
>
> Unfortunately, I think we will have to join the remote table_b to the
> row sample on table_a to get accurate join statistics, and the best
> time to do that is when we already have the row sample from table_a.
> We can further batch up statistics objects that happen to join table_a
> to table_b by the same join criteria to avoid rescans.

IIRC the index-based sampling (described in the paper I mentioned) works
something like this - sample the leading table, then use indexes to look
up data from the other tables to build a statistically correct sample of
the whole join.
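Roughly, in SQL terms (hypothetical table and column names - the real
thing would presumably run inside ANALYZE in C, not as a query):

  -- stand-in for the ANALYZE row sample of table_a
  CREATE TEMP TABLE a_sample AS
  SELECT * FROM table_a TABLESAMPLE SYSTEM (1);

  -- one join against table_b can feed every join-stats object defined
  -- on (table_a JOIN table_b) with this join clause, ideally driven by
  -- an index on table_b(a_id)
  SELECT s.filter_col, b.c1, b.c2
    FROM a_sample s
    JOIN table_b b ON b.a_id = s.id;

Which also matches your point about batching - all objects sharing the
same join criteria can be computed from that one joined sample.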
> Will what you have work when we want to do an MCV on a mix of local and
> remote columns, or will that require more work?
>
>> 3. Currently limited to: equality join clauses, equality and IN filter
>> clauses, simple Var stats objects (no expressions), inner joins only,
>> and two-way joins only. Some of these are easier to extend; others may
>> be harder or unnecessary (like n-way joins).
>
> I suspect that n-way joins would have very limited utility and could be
> adequately covered with multiple join-stats objects.

I see no clear reason why that would be the case, and it's not that hard
to construct joins on 3+ tables where stats on 2-way joins won't be good
enough. Whether those cases are sufficiently common I don't know, but I
also don't see a good reason not to address them - it should not be much
more complex than 2-way joins (famous last words, I know).

>> 4. Patch 0002 (auto-creation from FK constraints) should probably be
>> gated behind a GUC. I'm not strongly attached to this patch, but kept
>> it because FK joins seem like a natural and common use case.
>
> I think this is like indexing: just because you can make all possible
> columns indexed doesn't mean you should. Tooling will emerge to
> determine which join stats objects are worth their weight, and create
> only those objects.

Agreed.

>> If there's interest, I'm happy to continue iterating on the design.
>>
>> In particular, I'd welcome feedback on:
>> - whether this is a direction worth pursuing,
>
> Yes. Very much so.
>
>> - the catalog design,
>
> I had somehow gotten the impression that you were going to take the
> extended statistics format, but store individual columns in a modified
> pg_statistic. That's working for now, but I wonder if that will still
> be the case when we start to try this for columns that are arrays,
> ranges, multiranges, tsvectors, etc. pg_statistic has the
> infrastructure to handle those, but there may be good reason to keep
> pg_statistic focused on local attributes and instead just keep adding
> new kinds to extended statistics and let it be the grab-bag it was
> perhaps always meant to be.
>
> In my own musings on how to implement this (which you have far exceeded
> with this proof-of-concept), I had wondered how to compute stats for
> k.keyword and k.phonetic_code individually from the definition of
> movie_keywords2_multi_stats, but looking at what you've done I think
> we're better off defining each statistics object very narrowly. If that
> means we define one object per remote column and one object per
> interesting combination of columns, then so be it. So long as we can
> calculate them all from the same join of the two tables, we'll avoid
> the nasty overhead.

Not sure I understand this. Why would this use pg_statistic at all? I
imagined we'd just treat the join as a relation, and store the stats in
pg_statistic_ext_data. The only difference is we need to store info
about the join itself (which rels / conditions), which would go into
pg_statistic_ext.
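For comparison, today's extended statistics are tied to a single
relation, with the definition in pg_statistic_ext and the built data in
pg_statistic_ext_data:

  CREATE STATISTICS t_ab_stats (ndistinct, mcv) ON a, b FROM t;

A join-stats object would need mostly the same thing, plus the join
definition. Purely as an illustration (made-up names, not necessarily
the syntax in the POC), something like:

  CREATE STATISTICS fact_dim_stats (mcv)
      ON f.filter_col, d.category
      FROM fact f JOIN dim d ON d.id = f.dim_id;

where the "fact JOIN dim ON ..." part is what pg_statistic_ext would
have to record, and the built stats would go into pg_statistic_ext_data
just like for regular extended statistics.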
>> - and scope (what kinds of joins / predicates are worth supporting).
>
> Between-ish clauses (x >= y AND x < z), etc. is what immediately comes
> to mind.
>
> element_mcv for array types might be next, but that's well down the
> road.

Maybe. In v1 we should focus on mimicking what we have for per-relation
stats, and only then try doing something more.


regards

--
Tomas Vondra
