Re: [HACKERS] multivariate statistics (v19)

Amit Langote Mon, 12 Dec 2016 03:28:48 -0800

Hi Tomas,

On 2016/10/30 4:23, Tomas Vondra wrote:
> Hi,
> 
> Attached is v20 of the multivariate statistics patch series, doing mostly
> the changes outlined in the preceding e-mail from October 11.
> 
> The patch series currently has these parts:
> 
> * 0001 : (FIX) teach pull_varno about RestrictInfo
> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
> * 0003 : (PATCH) functional dependencies (only the ANALYZE part)
> * 0004 : (PATCH) selectivity estimation using functional dependencies
> * 0005 : (PATCH) multivariate MCV lists
> * 0006 : (PATCH) multivariate histograms
> * 0007 : (WIP) selectivity estimation using ndistinct coefficients
> * 0008 : (WIP) use multiple statistics for estimation
> * 0009 : (WIP) psql tab completion basics


Unfortunately, this failed to compile because of the duplicate_oids error.
Partitioning patch consumed same OIDs as used in this patch.

I will try to read the patches in some more detail, but in the meantime,
here are some comments/nitpicks on the documentation:

No updates to doc/src/sgml/catalogs.sgml?

+  <para>
+   The examples presented in <xref linkend="row-estimation-examples"> used
+   statistics about individual columns to compute selectivity estimates.
+   When estimating conditions on multiple columns, the planner assumes
+   independence and multiplies the selectivities. When the columns are
+   correlated, the independence assumption is violated, and the estimates
+   may be seriously off, resulting in poor plan choices.
+  </para>

The term independence is used in isolation - independence of what?
Independence of the distributions of values in separate columns?  Also,
the phrase "seriously off" could perhaps be replaced by more rigorous
terminology; it might be unclear to some readers.  Perhaps: wildly
inaccurate, :)

+<programlisting>
+EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;
+                                           QUERY PLAN
+-------------------------------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual
time=0.031..2.870 rows=100 loops=1)
+   Filter: (a = 1)
+   Rows Removed by Filter: 9900
+ Planning time: 0.092 ms
+ Execution time: 3.103 ms

Is there a reason why examples in "67.2. Multivariate Statistics" (like
the one above) use EXPLAIN ANALYZE, whereas those in "67.1. Row Estimation
Examples" (also, other relevant chapters) uses just EXPLAIN.

+   the final 0.01% estimate. The plan however shows that this results in
+   a significant under-estimate, as the actual number of rows matching the

s/under-estimate/underestimate/g

+  <para>
+   For additional details about multivariate statistics, see
+   <filename>src/backend/utils/mvstats/README.statsc</>. There are additional
+   <literal>README</> for each type of statistics, mentioned in the following
+   sections.
+  </para>

Referring to source tree READMEs seems novel around this portion of the
documentation, but I think not too far away, there are some references.
This is under the VII. Internals chapter anyway, so that might be OK.

In any case, s/README.statsc/README.stats/g

Also, s/additional README/additional READMEs/g  (tags omitted for brevity)

+    used in definitions of database normal forms. When simplified, saying
that
+    <literal>b</> is functionally dependent on <literal>a</> means that

Maybe, s/When simplified/In simple terms/g

+    In normalized databases, only functional dependencies on primary keys
+    and super keys are allowed. In practice however many data sets are not
+    fully normalized, for example thanks to intentional denormalization for
+    performance reasons. The table <literal>t</> is an example of a data
+    with functional dependencies. As <literal>a=b</> for all rows in the
+    table, <literal>a</> is functionally dependent on <literal>b</> and
+    <literal>b</> is functionally dependent on <literal>a</literal>.

"super keys" sounds like a new term.

s/for example thanks to/for example, thanks to/g  (or due to instead of
thanks to)

How about: s/an example of a data with/an example of a schema with/g

Perhaps, s/a=b/a = b/g  (additional white space)

+    Similarly to per-column statistics, multivariate statistics are stored in

I notice that "similar to" is used more often than "similarly to".  But
that might be OK.

+     This shows that the statistics is defined on table <structname>t</>,

Perhaps: the statistics is -> the statistics are or the statistic is

+     lists <structfield>attnums</structfield> of the columns (references
+     <structname>pg_attribute</structname>).

While this text may be OK on the catalog description page, it might be
better to expand attnums here as "attribute numbers" dropping the
parenthesized phrase altogether.

+<programlisting>
+SELECT pg_mv_stats_dependencies_show(stadeps)
+  FROM pg_mv_statistic WHERE staname = 's1';
+
+ pg_mv_stats_dependencies_show
+-------------------------------
+ (1) => 2, (2) => 1
+(1 row)
+</programlisting>

Couldn't this somehow show actual column names, instead of attribute numbers?

Will read more later.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multivariate statistics (v19)

Reply via email to