Re: [HACKERS] multivariate statistics (v19)

Tomas Vondra Mon, 30 Jan 2017 11:34:17 -0800

On 01/26/2017 10:43 AM, Dilip Kumar wrote:


histograms
--------------
+ if (matches[i] == MVSTATS_MATCH_FULL)
+     s += mvhist->buckets[i]->ntuples;
+ else if (matches[i] == MVSTATS_MATCH_PARTIAL)
+     s += 0.5 * mvhist->buckets[i]->ntuples;

Wouldn't it be better to take some percentage of the bucket, based
on the number of distinct elements, for partially matching buckets?

I don't think so, for the same reason why ineq_histogram_selectivity() in selfuncs.c uses


    binfrac = 0.5;

for partial bucket matches - it minimizes the average error. Even if we knew the number of distinct items in the bucket, we would have no idea what the distribution within the bucket looks like. Maybe 99% of the bucket is covered by a single distinct value, maybe all the items are squashed on one side of the bucket, etc.

Moreover, we don't really know the number of distinct values in the bucket - we only know the number of distinct items in the sample, and only while building the histogram. I don't think it makes much sense to estimate the number of distinct items in a bucket, because the buckets contain only very few rows, so the estimates would be wildly inaccurate.
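
To illustrate the minimum-average-error argument, here is a minimal sketch (not code from the patch or from selfuncs.c). It assumes, purely for illustration, that the true matching fraction of a partially matched bucket is equally likely to be any point on an even grid over [0, 1]; mean_abs_error() is a hypothetical helper name.

```c
#include <assert.h>

/*
 * Mean absolute error of using `guess` as the matching fraction of a
 * partially matched bucket. Assumption (hypothetical, for illustration):
 * the true fraction is equally likely to be any of the grid points
 * 0, 1/steps, 2/steps, ..., 1.
 */
static double
mean_abs_error(double guess, int steps)
{
    double total = 0.0;

    assert(steps > 0);

    for (int i = 0; i <= steps; i++)
    {
        double diff = (double) i / steps - guess;

        /* accumulate |true fraction - guess| without needing libm */
        total += (diff < 0.0) ? -diff : diff;
    }

    return total / (steps + 1);
}
```

Under that assumption, mean_abs_error(0.5, 100) comes out smaller than the same call with any guess further from the middle, such as 0.25 or 0.9. A guess derived from an unreliable per-bucket ndistinct estimate can land anywhere in the range, which is the risk described above.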


+static int
+update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
+  int2vector *stakeys,
+  MVSerializedHistogram mvhist,
+  int nmatches, char *matches,
+  bool is_or)
+{
+ int i;

For each clause we are processing all the buckets. Can't we use some
data structure that makes searching multi-dimensional information
faster?

No, we're not processing all buckets for each clause. We're only processing buckets that were not "ruled out" by preceding clauses. That's the whole point of the bitmap.

For example, for condition (a=1) AND (b=2), the code will first evaluate (a=1) on all buckets, and then (b=2) but only on buckets where (a=1) was evaluated as true. Similarly for OR clauses.
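
A simplified sketch of that pruning for AND-ed equality clauses follows. This is not the patch's update_match_bitmap_histogram(); the Bucket and EqClause types, the MATCH_* values and apply_and_clauses() are all hypothetical names made up for this illustration, and buckets are reduced to integer min/max boundaries per dimension.

```c
#include <assert.h>

#define NDIMS 2

/* Hypothetical match states, loosely modeled on MVSTATS_MATCH_*. */
enum { MATCH_NONE = 0, MATCH_PARTIAL = 1, MATCH_FULL = 2 };

/* Hypothetical bucket: inclusive [min, max] range in each dimension. */
typedef struct { int min[NDIMS]; int max[NDIMS]; } Bucket;

/* Hypothetical clause of the form (column dim = value). */
typedef struct { int dim; int value; } EqClause;

/*
 * Apply AND-ed equality clauses to the match bitmap. Buckets already
 * ruled out by an earlier clause are skipped. Returns the number of
 * bucket checks actually performed.
 */
static int
apply_and_clauses(const Bucket *buckets, int nbuckets,
                  const EqClause *clauses, int nclauses,
                  char *matches)
{
    int nchecks = 0;

    /* for AND, start with every bucket matching and only downgrade */
    for (int i = 0; i < nbuckets; i++)
        matches[i] = MATCH_FULL;

    for (int c = 0; c < nclauses; c++)
    {
        int d = clauses[c].dim;
        int v = clauses[c].value;

        assert(d >= 0 && d < NDIMS);

        for (int i = 0; i < nbuckets; i++)
        {
            if (matches[i] == MATCH_NONE)
                continue;       /* ruled out earlier, nothing to do */

            nchecks++;

            if (v < buckets[i].min[d] || v > buckets[i].max[d])
                matches[i] = MATCH_NONE;
            else if (buckets[i].min[d] != buckets[i].max[d])
                matches[i] = MATCH_PARTIAL;
            /* else the bucket holds only v here: keep the current state */
        }
    }

    return nchecks;
}
```

With four buckets covering a in {[0,1], [2,3]} crossed with b in {[0,1], [2,3]}, the clauses (a=1) AND (b=2) take 6 bucket checks in this sketch rather than 8: the second clause only looks at the two buckets that survived the first.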

Something like an HTree or RTree? Or maybe storing the histogram in
these formats would be difficult?

Maybe, but I don't want to do that in the first version. I'm not opposed to doing that in the future, if we find out the v1 histograms are not efficient (I don't think we will, based on tests I did while working on the patch). Support for other histogram implementations is pretty much why there is a 'type' field in the struct.


For now I think we should stick with the simple implementation.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

