On 01/26/2017 10:43 AM, Dilip Kumar wrote:

histograms
--------------
+ if (matches[i] == MVSTATS_MATCH_FULL)
+ s += mvhist->buckets[i]->ntuples;
+ else if (matches[i] == MVSTATS_MATCH_PARTIAL)
+ s += 0.5 * mvhist->buckets[i]->ntuples;

Isn't it will be better that take some percentage of the bucket based
on the number of distinct element for partial matching buckets.


I don't think so, for the same reason why ineq_histogram_selectivity() in selfuncs.c uses

    binfrac = 0.5;

for partial bucket matches - it provides minimum average error. Even if we knew the number of distinct items in the bucket, we have no idea what the distribution within the bucket looks like. Maybe 99% of the bucket are covered by a single distinct value, maybe all the items are squashed on one side of the bucket, etc.

Moreover we don't really know the number of distinct values in the bucket - we only know the number of distinct items in the sample, and only while building the histogram. I don't think it makes much sense to estimate the number of distinct items in a bucket, because the buckets contain only very few rows so the estimates would be wildly inaccurate.


+static int
+update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
+  int2vector *stakeys,
+  MVSerializedHistogram mvhist,
+  int nmatches, char *matches,
+  bool is_or)
+{
+ int i;

For each clause we are processing all the buckets, can't we use some
data structure which can make multi-dimensions information searching
faster.
>

No, we're not processing all buckets for each clause. We're' only processing buckets that were not "ruled out" by preceding clauses. That's the whole point of the bitmap.

For example for condition (a=1) AND (b=2), the code will first evaluate (a=1) on all buckets, and then (b=2) but only on buckets where (a=1) was evaluated as true. Similarly for OR clauses.

>
Something like HTree, RTree, Maybe storing histogram in these formats
will be difficult?


Maybe, but I don't want to do that in the first version. I'm not opposed to doing that in the future, if we find out the v1 histograms are not efficient (I don't think we will, based on tests I did while working on the patch). Support for other histogram implementations is pretty much why there is 'type' field in the struct.

For now I think we should stick with the simple implementation.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to