> Do I understand correctly you propose to collect such stats for every
> foreign key? I recall something like that was proposed in the past, and
> the argument against was that for many joins it'd be a waste because the
> estimates are good enough. And for OLTP systems that's probably true.
>

Not every foreign key, they'd be declared like CREATE STATISTICS, but would
be anchored to the constraint, not to the table.


> But I always assumed we'd have a way to explicitly enable such stats for
> certain joins only, and the extended stats were designed to make that
> possible.
>

That's the intention, but the stats stored don't quite "fit" in the buckets
that extended stats create. The attribute statistics seem much better
suited, as this isn't about combinations, there's only ever the one
combination, but rather about what can be known about the attributes in the
far table before doing the actual join.


> FWIW I'm not entirely sure what stats you propose to collect exactly. I
> mean, what does
>
>    ... associated with t1 and weighted according to the frequency of
>    that row being referenced, which means that values of unreferenced
>    rows are filtered out entirely.
>
> mean? Are you suggesting to "do the join" and build the regular stats as
> if that was a regular table? I think that'd work, and it's mostly how I
> envisioned to handle joins in extended stats, restricted to joins of two
> relations.
>

Right. We'd do the join from t1 to t2 as described earlier, and then we'd
judge the null_frac, mcv, etc for each column of t2 (as defined by the
scope of the stats declaration) according to the join. More commonly
referenced values would show up as more frequent, hence "weighted".

Just so I have an example to refer to later, say we have a table of colors:

CREATE TABLE color(id bigint primary key, color_name text unique,
color_family text null)

and there's hundreds of colors in the table that are color_family='red'
('fire engine red', 'candy apple red', 'popular muppet red', etc). Some
colors don't belong to any color_family.

And we have a table of toys:

CREATE TABLE toy(id bigint primary key, min_child_age integer, name text,
color_id bigint REFERENCES color)

And we declare a join stat on toy->color for the color_family attribute.
We'd sample rows from the toy table, left join those to color, and then
calculate the attribute stats of color_family as if it were a column in
toys. Some toys might not have a color_id, and some color_ids might not
belong to a color_family, so we'd want the null_frac to reflect those
combined conditions. For the values that do join, and the colors that do
belong to a family, we'd want to see regular MCV stats showing "red" as the
most common color_family.

But those stats aren't really a correlation or a dependency, they're just
plain old attribute stats.

I understand wanting to know the correlation between toys.min_child_age and
colors.color_family, so that makes perfect sense for extended statistics,
but color_family on its own just doesn't fit. Am I missing something?


> Combining independent per-table samples does not work, unless the
> samples are huge. There's a nice paper [1] on how to do index-based join
> sampling efficiently.
>

Thanks, now I've got some light reading for the flight home.


> I think adding joins to extended stats would not be all that hard
> (famous last words, I know). For me the main challenge was figuring out
> how to store the join definition in the catalog, I always procrastinated
> and never gave that a serious try.
>

I envisioned keying the stats off the foreign key constraint id, or adding
"starefrelid" (relation oid of the referencing table) to pg_statistic or a
table roughly the same shape as pg_statistic.


>
> FWIW I think we might start by actually using per-table extended stats
> on the joined tables. Just like we combine the scalar MCVs on joined
> columns, we could combine multicolumn MVCs.
>

That's the other half of this - if the stats existed, do we have an obvious
way to put them to use?

Reply via email to