On 5/21/26 22:25, Tom Lane wrote:
> Alexandra Wang <[email protected]> writes:
>> Here's v7, another attempt to fix the unstable tests.
> 
> Hi Alexandra,
> 
> I signed up for an in-person review of this at PGConf.dev, but
> the schedule doesn't seem to be working in favor of making that
> happen.  If you see this and happen to run into me in the
> hallway, I'm happy to chat, but in any case here are my
> rather-hasty review notes.
> 
> I think it's okay if v1 only handles 2-way joins, as long as the
> catalog representation is prepared for more.  Restricting to
> cases where we can do index-based sampling seems fine too.
> Those things could be relaxed later if it seems worthwhile,
> but we'd have a creditable feature even without.
> 

+1

My assumption is allowing larger joins would be a somewhat mechanical.
I'm not aware of additional problems on top of 2-way joins.

I think we should aim to support larger joins. At the unconference there
were suggestions maybe it would be enough to support 2-way joins, but
it's not hard to construct cases where that's no sufficient. Consider a
fact table, joining to two dimensions. It's common for the dimensions to
be correlated in various ways, and 2-way joins can't handle these cases.

> I didn't read the sampling code in any detail.  I think you will
> need to put more thought into what is user-friendly behavior
> in case the required index doesn't exist or doesn't have the
> right properties.  (I think the tests for that might not be
> strong enough, either.)
> 

What would be the most "user-friendly behavior" in this case?

I think we can either (a) refuse defining/building the join statistics
in this case, or (b) fallback to sampling not requiring an index (but
then it'll be way more expensive).

I think (a) should be fine for now, i.e. we should require an index.
Most of the joins will be on FK constraints, or something like that (the
FK may not be defined, but there will be a PK on one side).

Not sure if we should simply refuse building the stats, or if it's
enough to detect this while building the statistics. I'd say enforcing
this during DDL (CREATE STATISTICS, DROP INDEX, ...) is better,
otherwise users may not notice the statistics stopped building.

> I think you could simplify some code noticeably if you included the
> anchor rel's OID as the first element of stxjoinrels[].  Yeah,
> it'd be redundant with stxrelid, but so what?  It's not like 
> pg_statistic_ext rows are narrow enough that anyone would notice
> the extra 4 bytes.  I think this would simplify some of the
> relationships within the data structures, too, eg all varnos in
> the expressions could be considered to reference stxjoinrels[].
> 
> I don't love stxkeyrefs[].  I wonder if it's time to throw away
> stxkeys[], represent all the target columns as regular expression
> trees in stxexprs, and then special-case columns that are simple
> Vars where appropriate at execution.
> 

+1, I don't see a reason to not store the anchor rel separately.

> (In the same vein, I dislike the grammar's separation of plain
> columns from expressions; I'd like to replace stats_params
> with expr_list and sort it all out later.  But perhaps that's
> material for a separate patch.)
> 

FWIW the extended stats copied this from pg_index, which also stores
keys and expressions separately. I suppose there was a reason for that,
most likely performance - is cheaper to compare attnums than
expressions, and plain keys are much more common.

Maybe that's no longer true, or maybe it's not as important for extended
stats (there's likely fewer of those, compared to indexes).

> We will need to put more thought into permissions: I don't think
> requiring all the tables to have the same owner is workable.
> (What happens if someone tries to ALTER OWNER later?)  However,
> if they don't all have the same owner, there are potential security
> problems, so the right restriction is not obvious.  This is not
> necessary to solve now; there are bigger questions to worry about.
> But we'll need an answer before it's committable.
> 

I have not thought about this at all, but what can we do if the tables
have different owners? I suppose we could require the stxowner to have
SELECT privilege on the joined relations (instead of owning them).

> It's not too soon to write some user-facing documentation.
> CREATE STATISTICS man page obviously needs attention, but
> also the discussion of extended stats in perform.sgml.
> And catalogs.sgml.  I find that writing that sort of stuff
> helps to clarify where one's design is weak.
> 

+1


regards

-- 
Tomas Vondra



Reply via email to