Re: Is there value in having optimizer stats for joins/foreignkeys?

Corey Huinker Thu, 28 May 2026 12:03:54 -0700

>
> > To my mind, having stats on certain views would be extremely simple for
> > export/import, we'd simply have one more relkind that makes it into the
> > system views pg_stats and pg_stats_ext, and the statistics for that new
> > relation would plug into pg_statistic with no catalog change to
> > pg_statistic whatsoever. Additionally, allowing certain kinds of views
> > (or a relation relkind that's functionally equivalent to a view) to have
> > statistics makes it easy to define extended statistics on those views,
> > with no catalog change to pg_statistic_ext.
> >
>
> Ah, so you're proposing supporting CREATE STATISTICS on some views? I
> guess that's one way to support stats on joins, without having to rework
> the schema. I'm still not a huge fan of it, because it just uses views
> as a workaround to store the join definition, nothing else. To me it
> seems a weird to require creating a new relation just for this, and we
> already envisioned CREATE STATISTICS would cover joins. My guess is that
> may be why DB2 did it this way, as they probably didn't have anything
> like extended stats at that point.
>


They did have extended stats.

Here are some example RUNSTATS calls from
https://www.ibm.com/docs/en/db2/11.5.x?topic=commands-runstats

    RUNSTATS ON TABLE employee ON ALL COLUMNS AND COLUMNS ((JOB, WORKDEPT,
SEX))
        WITH DISTRIBUTION

In our lingo, this is roughly:

    CREATE STATISTICS foo ON (job, workdept, sex) FROM employee;
    ANALYZE employee;

And you can tweak individual columns

    RUNSTATS ON VIEW product_sales_view
     WITH DISTRIBUTION ON COLUMNS (category NUM_FREQVALUES 100
NUM_QUANTILES 100,
    type, product_key) DEFAULT NUM_FREQVALUES 50 NUM_QUANTILES 50

But the lesson I took away from this doc and corroborated by meeting with
someone who used to work on DB2 planner is that they have an equivalent of
extended stats.

>
> > Having something in pg_class to anchor existing per-attribute and multi-
> > attribute stats off of seems like a big win to me. We'd get all pre-
> > existing statistics types for free, statistics kinds provided by
> > extensions for free, extended stats for free, import and export for
> > free. Well, not free, we just have to remove the exclusion that views
> > (or whatever relkind we create) can't have stats.
> >
>
> TBH the grammar / catalog stuff seems like a relatively minor part of
> this patch. To me, the difficult part seems to be the sampling / analyze
> part, and then matching it to the query during planning. And all of this
> seems exactly the same no matter how the stats are defined.
>

I agree that the catalog stuff is minor relative to the work matching
queries to the join node trees associated with a set of join stats,
wherever that resides.

The sampling analyze part doesn't strike me as too hard. The worst case
scenario is that we take the columns+join definition, swap the "anchor"
table out replacing it with the EphemeralNamedRelation of the fetched row
sample from the anchor table, and just run that, letting SPI figure out
which indexes can be used.

The question of _when_ to analyze is a bit trickier, as was pointed out by
Chenpeng Yan, and I suspect that any "anchor" table with join stats on it
will need to be analyzed once any one of the tables it joins to has hit the
threshold.


> OTOH maybe allowing CREATE STATISTICS on views would be independently
> useful too, once we have the later parts. Not sure.
>

I think so as well, which is why I'd like to leave that option open.


> > - we'd want a way to express stats for all the individual columns/
> > expressions defined in the join stats object, i.e. "what is the MCV of
> > B.name for rows joined by A on A.b_id"?
> >
>
> Maybe I misunderstand, but don't we already do this for statistics on
> expressions?
>

We can CREATE STATISTICS foo on (upper(name)) FROM table, if that's what
you're asking.

But that isn't what I'm saying. Given tables A, B, and C which can join on
A.b_id = B.id and A.c_id = c.id, we already have pg_statistic stats for
B.name, but that's across B in a vacuum. I'm assuming that there's also
value in having the stats for B.name filtered and weighted by how often (if
at all) the B row is joined to A, and those stats would be shaped exactly
like the pg_statistic row for B.name. We could view that as correlating
B.name against the primary key of A, but that seems odd to me.

Additionally, we might want the ability to have correlative stats of B.name
with c.purchase_date. With the views idea, that would just be:

      CREATE STATISTICS foo ON (b_name, c_purchase_date) FROM
my_statistics_view.

So we'd have regular stats to draw upon from my_statistics view, and those
would have our cardinality estimates given the precondition of the row
surviving the join criteria, and we'd further have the correlative
statistics of two columns each from tabled joined to A. And that's
something not even DB2 can do now given their 2-table join limitation.


> I don't follow. What 5-factorial combinations? We only ever build a
> single MCV for a given statistics object.
>

Sorry, I was thinking about pg_ndistinct and pg_dependencies.

For MCV that would give us the most common tuples of (B.name,
C.purchase_date, D.promotion_code, a.quantity, a.amount), but that only
helps when we need all of those columns, not a subset of them.


> I may be missing something, but I honestly the only benefit of views I
> can think of is already having the join definition in a catalog.
>

That's the biggest benefit for sure, but the other benefit is we can have
the per-column stats in pg_statistic as I detailed above, and further do
extended statistics, as we both addressed above.


> I don't follow. Why should statistics object be pg_class objects? In my
> mind pg_class is meant for "relations" (as in, table-like things), and
> statistics objects are not like that. I suppose this is related to your
> earlier suggestion
>
>   Having something in pg_class to anchor existing per-attribute and
>   multi-attribute stats off of seems like a big win to me.
>

It is, but it would also allow us to avoid having a declared view, and then
altering that view to allow statistics.

It would make it more complicated to express which elements of the join
were worthy of correlative stats, so in that sense having a view with a
name is conceptually simpler.


> but it's not clear to me why would that be? All statistics are tied to a
> relation (or multiple relations) in the end, and I don't see why would
> the view make anything simpler. What are the wins?
>

The view would allow us to put a wider range of stats on all of the columns
that could be found through that defined join, using the mechanisms we
already have available to us.

It would further allow us to declare that multiple subsets of those columns
are interesting enough to warrant correlative (extended) statistics.


> Could you elaborate? It's entirely possible I just don't see something
> obvious, or maybe you explained this in Vancouver and I managed to
> forget the details. Sorry about that.


No worries, Vancouver was a whirlwind of ideas around this topic. My head
is still swimming.

Another way of thinking about this is that it would give us the stats that
we could get from a materialized view, with the ability to define extended
stats on top of that materialized view, but without the costs of
maintaining a materialized view, and without the restriction that a query
has to directly reference the materialized view.

Re: Is there value in having optimizer stats for joins/foreignkeys?

Reply via email to