Re: [HACKERS] WIP: multivariate statistics / proof of concept

Heikki Linnakangas Thu, 11 Dec 2014 08:55:08 -0800

On 10/13/2014 01:00 AM, Tomas Vondra wrote:

Hi,


attached is a WIP patch implementing multivariate statistics.


Great! Really glad to see you working on this.

+        * FIXME This sample sizing is mostly OK when computing stats for
+        *       individual columns, but when computing multi-variate stats
+        *       for multivariate stats (histograms, mcv, ...) it's rather
+        *       insufficient. For small number of dimensions it works, but
+        *       for complex stats it'd be nice use sample proportional to
+        *       the table (say, 0.5% - 1%) instead of a fixed size.

I don't think a fraction of the table is appropriate. As long as thesample is random, the accuracy of a sample doesn't depend much on thesize of the population. For example, if you sample 1,000 rows from atable with 100,000 rows, or 1000 rows from a table with 100,000,000rows, the accuracy is pretty much the same. That doesn't change when yougo from a single variable to multiple variables.

You do need a bigger sample with multiple variables, however. My gutfeeling is that if you sample N rows for a single variable, with twovariables you need to sample N^2 rows to get the same accuracy. But it'snot proportional to the table size. (I have no proof for that, but I'msure there is literature on this.)

+ * Multivariate histograms
+ *
+ * Histograms are a collection of buckets, represented by n-dimensional
+ * rectangles. Each rectangle is delimited by an array of lower and
+ * upper boundaries, so that for for the i-th attribute
+ *
+ *     min[i] <= value[i] <= max[i]
+ *
+ * Each bucket tracks frequency (fraction of tuples it contains),
+ * information about the inequalities, number of distinct values in
+ * each dimension (which is used when building the histogram) etc.
+ *
+ * The boundaries may be either inclusive or exclusive, or the whole
+ * dimension may be NULL.
+ *
+ * The buckets may overlap (assuming the build algorithm keeps the
+ * frequencies additive) or may not cover the whole space (i.e. allow
+ * gaps). This entirely depends on the algorithm used to build the
+ * histogram.

That sounds pretty exotic. These buckets are quite different from thesingle-dimension buckets we currently have.

The paper you reference in partition_bucket() function, M.Muralikrishna, David J. DeWitt: Equi-Depth Histograms For EstimatingSelectivity Factors For Multi-Dimensional Queries. SIGMOD Conference1988: 28-36, actually doesn't mention overlapping buckets at all. Ihaven't read the code in detail, but if it implements the algorithm fromthat paper, there will be no overlap.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: multivariate statistics / proof of concept

Reply via email to