> > > To my mind, having stats on certain views would be extremely simple for > > export/import, we'd simply have one more relkind that makes it into the > > system views pg_stats and pg_stats_ext, and the statistics for that new > > relation would plug into pg_statistic with no catalog change to > > pg_statistic whatsoever. Additionally, allowing certain kinds of views > > (or a relation relkind that's functionally equivalent to a view) to have > > statistics makes it easy to define extended statistics on those views, > > with no catalog change to pg_statistic_ext. > > > > Ah, so you're proposing supporting CREATE STATISTICS on some views? I > guess that's one way to support stats on joins, without having to rework > the schema. I'm still not a huge fan of it, because it just uses views > as a workaround to store the join definition, nothing else. To me it > seems a weird to require creating a new relation just for this, and we > already envisioned CREATE STATISTICS would cover joins. My guess is that > may be why DB2 did it this way, as they probably didn't have anything > like extended stats at that point. >
They did have extended stats. Here are some example RUNSTATS calls from https://www.ibm.com/docs/en/db2/11.5.x?topic=commands-runstats RUNSTATS ON TABLE employee ON ALL COLUMNS AND COLUMNS ((JOB, WORKDEPT, SEX)) WITH DISTRIBUTION In our lingo, this is roughly: CREATE STATISTICS foo ON (job, workdept, sex) FROM employee; ANALYZE employee; And you can tweak individual columns RUNSTATS ON VIEW product_sales_view WITH DISTRIBUTION ON COLUMNS (category NUM_FREQVALUES 100 NUM_QUANTILES 100, type, product_key) DEFAULT NUM_FREQVALUES 50 NUM_QUANTILES 50 But the lesson I took away from this doc and corroborated by meeting with someone who used to work on DB2 planner is that they have an equivalent of extended stats. > > > Having something in pg_class to anchor existing per-attribute and multi- > > attribute stats off of seems like a big win to me. We'd get all pre- > > existing statistics types for free, statistics kinds provided by > > extensions for free, extended stats for free, import and export for > > free. Well, not free, we just have to remove the exclusion that views > > (or whatever relkind we create) can't have stats. > > > > TBH the grammar / catalog stuff seems like a relatively minor part of > this patch. To me, the difficult part seems to be the sampling / analyze > part, and then matching it to the query during planning. And all of this > seems exactly the same no matter how the stats are defined. > I agree that the catalog stuff is minor relative to the work matching queries to the join node trees associated with a set of join stats, wherever that resides. The sampling analyze part doesn't strike me as too hard. The worst case scenario is that we take the columns+join definition, swap the "anchor" table out replacing it with the EphemeralNamedRelation of the fetched row sample from the anchor table, and just run that, letting SPI figure out which indexes can be used. The question of _when_ to analyze is a bit trickier, as was pointed out by Chenpeng Yan, and I suspect that any "anchor" table with join stats on it will need to be analyzed once any one of the tables it joins to has hit the threshold. > OTOH maybe allowing CREATE STATISTICS on views would be independently > useful too, once we have the later parts. Not sure. > I think so as well, which is why I'd like to leave that option open. > > - we'd want a way to express stats for all the individual columns/ > > expressions defined in the join stats object, i.e. "what is the MCV of > > B.name for rows joined by A on A.b_id"? > > > > Maybe I misunderstand, but don't we already do this for statistics on > expressions? > We can CREATE STATISTICS foo on (upper(name)) FROM table, if that's what you're asking. But that isn't what I'm saying. Given tables A, B, and C which can join on A.b_id = B.id and A.c_id = c.id, we already have pg_statistic stats for B.name, but that's across B in a vacuum. I'm assuming that there's also value in having the stats for B.name filtered and weighted by how often (if at all) the B row is joined to A, and those stats would be shaped exactly like the pg_statistic row for B.name. We could view that as correlating B.name against the primary key of A, but that seems odd to me. Additionally, we might want the ability to have correlative stats of B.name with c.purchase_date. With the views idea, that would just be: CREATE STATISTICS foo ON (b_name, c_purchase_date) FROM my_statistics_view. So we'd have regular stats to draw upon from my_statistics view, and those would have our cardinality estimates given the precondition of the row surviving the join criteria, and we'd further have the correlative statistics of two columns each from tabled joined to A. And that's something not even DB2 can do now given their 2-table join limitation. > I don't follow. What 5-factorial combinations? We only ever build a > single MCV for a given statistics object. > Sorry, I was thinking about pg_ndistinct and pg_dependencies. For MCV that would give us the most common tuples of (B.name, C.purchase_date, D.promotion_code, a.quantity, a.amount), but that only helps when we need all of those columns, not a subset of them. > I may be missing something, but I honestly the only benefit of views I > can think of is already having the join definition in a catalog. > That's the biggest benefit for sure, but the other benefit is we can have the per-column stats in pg_statistic as I detailed above, and further do extended statistics, as we both addressed above. > I don't follow. Why should statistics object be pg_class objects? In my > mind pg_class is meant for "relations" (as in, table-like things), and > statistics objects are not like that. I suppose this is related to your > earlier suggestion > > Having something in pg_class to anchor existing per-attribute and > multi-attribute stats off of seems like a big win to me. > It is, but it would also allow us to avoid having a declared view, and then altering that view to allow statistics. It would make it more complicated to express which elements of the join were worthy of correlative stats, so in that sense having a view with a name is conceptually simpler. > but it's not clear to me why would that be? All statistics are tied to a > relation (or multiple relations) in the end, and I don't see why would > the view make anything simpler. What are the wins? > The view would allow us to put a wider range of stats on all of the columns that could be found through that defined join, using the mechanisms we already have available to us. It would further allow us to declare that multiple subsets of those columns are interesting enough to warrant correlative (extended) statistics. > Could you elaborate? It's entirely possible I just don't see something > obvious, or maybe you explained this in Vancouver and I managed to > forget the details. Sorry about that. No worries, Vancouver was a whirlwind of ideas around this topic. My head is still swimming. Another way of thinking about this is that it would give us the stats that we could get from a materialized view, with the ability to define extended stats on top of that materialized view, but without the costs of maintaining a materialized view, and without the restriction that a query has to directly reference the materialized view.
