Most importantly, it looks like this forgets to update catalog documentation for stxexprs and stxkind='e'
It seems like you're preferring to use pluralized "statistics" in a lot of
places that sound wrong to me.  For example:

> Currently the first statistics wins, which seems silly.

I can write more separately, but I think this is resolved and clarified if
you write "statistics object" and not just "statistics".

> +       Name of schema containing table

I don't know about the nearby descriptions, but this one sounds too much like
a "schema-containing" table.  Say "Name of the schema which contains the
table" ?

> +       Name of table

Say "name of table on which the extended statistics are defined"

> +       Name of extended statistics

"Name of the extended statistic object"

> +       Owner of the extended statistics

..object

> +       Expression the extended statistics is defined on

I think it should say "the extended statistic", or "the extended statistics
object".  Maybe "..on which the extended statistic is defined"

> +   of random access to the disk.  (This expression is null if the expression
> +   data type does not have a <literal><</literal> operator.)

expression's data type

> +   much-too-small row count estimate in the first two queries. Moreover, the

maybe say "dramatically underestimates the rowcount"

> +   planner has no information about relationship between the expressions, so it

the relationship

> +   assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
> +   conditions are independent, and multiplies their selectivities together to
> +   arrive at a much-too-high group count estimate in the aggregate query.

severe overestimate ?

> +   This is further exacerbated by the lack of accurate statistics for the
> +   expressions, forcing the planner to use default ndistinct estimate for the

use *a default

> +   expression derived from ndistinct for the column. With such statistics, the
> +   planner recognizes that the conditions are correlated and arrives at much
> +   more accurate estimates.

are correlated comma

> +             if (type->lt_opr == InvalidOid)

These could be !OidIsValid
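i.e. something like this (hand-written, just to show the idiom; the
surrounding lookup_type_cache() line is my guess at the context, not copied
from the patch):

    TypeCacheEntry *type = lookup_type_cache(exprType(expr), TYPECACHE_LT_OPR);

    /* expression's data type has no default btree "<" operator */
    if (!OidIsValid(type->lt_opr))
        continue;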
> +  * expressions. It's either expensive or very easy to defeat for
> +  * determined used, and there's no risk if we allow such statistics (the
> +  * statistics is useless, but harmless).

I think it's meant to say "for a determined user" ?

> +  * If there are no simply-referenced columns, give the statistics an auto
> +  * dependency on the whole table.  In most cases, this will be redundant,
> +  * but it might not be if the statistics expressions contain no Vars
> +  * (which might seem strange but possible).
> +  */
> + if (!nattnums)
> + {
> +     ObjectAddressSet(parentobject, RelationRelationId, relid);
> +     recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
> + }

Can this be unconditional ?

> +  * Translate the array of indexs to regular attnums for the dependency (we

sp: indexes

> +  * Not found a matching expression, so we can simply skip

Found no matching expr

> + /* if found a matching, */

matching ..

> +examine_attribute(Node *expr)

Maybe you should rename this to something distinct ?  So it's easy to add a
breakpoint there, for example.

> +     stats->anl_context = CurrentMemoryContext;     /* XXX should be using
> +                                                      * something else? */

> +     bool            nulls[Natts_pg_statistic];
...
> +      * Construct a new pg_statistic tuple
> +      */
> +     for (i = 0; i < Natts_pg_statistic; ++i)
> +     {
> +             nulls[i] = false;
> +     }

Shouldn't you just write nulls[Natts_pg_statistic] = {false};
or at least: memset(nulls, 0, sizeof(nulls));

> +              * We don't store collations used to build the statistics, but
> +              * we can use the collation for the attribute itself, as
> +              * stored in varcollid. We do reset the statistics after a
> +              * type change (including collation change), so this is OK. We
> +              * may need to relax this after allowing extended statistics
> +              * on expressions.

This text should be updated or removed ?

> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
>       }
>
>       /* print any extended statistics */
> -     if (pset.sversion >= 100000)
> +     if (pset.sversion >= 140000)
> +     {
> +             printfPQExpBuffer(&buf,
> +                               "SELECT oid, "
> +                               "stxrelid::pg_catalog.regclass, "
> +                               "stxnamespace::pg_catalog.regnamespace AS nsp, "
> +                               "stxname,\n"
> +                               "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
> +                               "  'd' = any(stxkind) AS ndist_enabled,\n"
> +                               "  'f' = any(stxkind) AS deps_enabled,\n"
> +                               "  'm' = any(stxkind) AS mcv_enabled,\n");
> +
> +             if (pset.sversion >= 130000)
> +                     appendPQExpBufferStr(&buf, "  stxstattarget\n");
> +             else
> +                     appendPQExpBufferStr(&buf, "  -1 AS stxstattarget\n");

>= 130000 is fully determined by >= 140000 :)

> +  * type of the opclass, which is not interesting for our purposes.  (Note:
> +  * if we did anything with non-expression index columns, we'd need to

index is wrong ?

I mentioned a bunch of other references to "index" and "predicate" which are
still around:

On Thu, Jan 07, 2021 at 08:35:37PM -0600, Justin Pryzby wrote:
> There's some remaining copy/paste stuff from index expressions:
>
> errmsg("statistics expressions and predicates can refer only to the table being indexed")));
> left behind by evaluating the predicate or index expressions.
> Set up for predicate or expression evaluation
> Need an EState for evaluation of index expressions and
> partial-index predicates.  Create it in the per-index context to be
> Fetch function for analyzing index expressions.
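Also, to be explicit about the describe.c hunk above: since that whole branch
already requires pset.sversion >= 140000, the inner version test can simply go
away, something like (untested):

		/* stxstattarget exists on every server new enough to reach here */
		appendPQExpBufferStr(&buf, "  stxstattarget\n");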