Re: [HACKERS] Statistics and selectivity estimation for ranges

Heikki Linnakangas Tue, 14 Aug 2012 08:55:52 -0700

On 14.08.2012 09:45, Alexander Korotkov wrote:

After fixing few more bugs, I've a version with much more reasonable
accuracy.


Great! One little thing just occurred to me:

You're relying on the regular scalar selectivity estimators for the <<, >>, &< and &> operators. That seems bogus, in particular for << and &<, because ineq_histogram_selectivity then performs a binary search of the histogram using those operators. << and &< compare the *upper* bound of the value in table against the lower bound of constant, but the histogram is constructed using regular < operator, which sorts the entries by lower bound. I think the estimates you now get for those operators are quite bogus if there is a non-trivial amount of overlap between ranges. For example:


postgres=# create table range_test as
select int4range(-a, a) as r from generate_series(1,1000000) a; analyze range_test;
SELECT 1000000
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r << int4range(200000, 200001);
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=100 width=14) (actual time=0.060..1340.147 rows=200000 loops=1)
   Filter: (r << '[200000,200001)'::int4range)
   Rows Removed by Filter: 800000
 Total runtime: 1371.865 ms
(4 rows)

It would be quite easy to provide reasonable estimates for those operators, if we had a separate histogram of upper bounds. I also note that the estimation of overlap selectivity could be implemented using separate histograms of lower bounds and upper bounds, without requiring a histogram of range lengths, because a && b == NOT (a << b OR a >> b). I'm not sure if the estimates we'd get that way would be better or worse than your current method, but I think it would be easier to understand.
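To make the idea concrete, here is a minimal Python sketch (not PostgreSQL code) of estimating << and && from separate lower-bound and upper-bound histograms. Everything in it is an assumption for illustration: the function names are hypothetical, the histograms are simple equi-depth boundary lists, bounds are treated as continuous numbers, and empty ranges, infinite bounds and bound inclusivity are ignored. It reuses the data from the example above, int4range(-a, a) for a in 1..1000000.

```python
from bisect import bisect_left

def equi_depth_histogram(values, bins):
    """Return bins+1 boundaries that split the sorted values into equal-count bins."""
    s = sorted(values)
    n = len(s)
    return [s[round(i * (n - 1) / bins)] for i in range(bins + 1)]

def fraction_below(hist, x):
    """Estimate the fraction of values < x, interpolating linearly inside a bin."""
    if x <= hist[0]:
        return 0.0
    if x > hist[-1]:
        return 1.0
    i = bisect_left(hist, x) - 1            # now hist[i] < x <= hist[i + 1]
    within = (x - hist[i]) / (hist[i + 1] - hist[i])
    return (i + within) / (len(hist) - 1)

def left_of_selectivity(upper_hist, const_lower):
    # r << c holds when upper(r) <= lower(c), so only upper bounds matter.
    return fraction_below(upper_hist, const_lower)

def right_of_selectivity(lower_hist, const_upper):
    # r >> c holds when lower(r) >= upper(c), so only lower bounds matter.
    return 1.0 - fraction_below(lower_hist, const_upper)

def overlap_selectivity(lower_hist, upper_hist, const_lower, const_upper):
    # a && b == NOT (a << b OR a >> b); for non-empty ranges the two
    # cases cannot both hold, so their fractions simply add up.
    return (1.0
            - left_of_selectivity(upper_hist, const_lower)
            - right_of_selectivity(lower_hist, const_upper))

# Same data as the range_test table: int4range(-a, a) for a in 1..1000000.
N = 1_000_000
lower_hist = equi_depth_histogram((-a for a in range(1, N + 1)), 100)
upper_hist = equi_depth_histogram(range(1, N + 1), 100)

# Constant int4range(200000, 200001).
sel_left = left_of_selectivity(upper_hist, 200000)
sel_overlap = overlap_selectivity(lower_hist, upper_hist, 200000, 200001)
print(sel_left, sel_overlap)
```

With these inputs the << estimate comes out close to 0.2, which matches the 200000 of 1000000 rows actually returned, and the && estimate comes out close to 0.8.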

I don't think the &< and &> operators could be implemented in terms of a lower and upper bound histogram, though, so you'd still need the current length histogram method for that.

The code that traverses the lower bound and length histograms in lockstep looks quite scary. Any ideas on how to simplify that? My first thought is that there should be a helper function that gets a range length as argument, and returns the fraction of tuples with length >= argument. It would do the lookup in the length histogram to find the right histogram bin, and do the linear interpolation within the bin. You're assuming that length is independent of lower/upper bound, so you shouldn't need any other parameters than range length for that estimation.
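A rough Python sketch of such a helper, just to pin down what I mean. The name fraction_at_least and the representation of the length histogram as a sorted list of equi-depth bin boundaries are assumptions for illustration, not anything from the patch.

```python
from bisect import bisect_left

def fraction_at_least(length_hist, length):
    """Estimate the fraction of ranges whose length is >= `length`.

    length_hist is assumed to be a sorted list of bin boundaries of an
    equi-depth histogram, so every bin holds the same share of the rows.
    """
    nbins = len(length_hist) - 1
    if length <= length_hist[0]:
        return 1.0                      # every range is at least this long
    if length > length_hist[-1]:
        return 0.0                      # no range is this long
    # Binary search for the bin: length_hist[i] < length <= length_hist[i + 1]
    i = bisect_left(length_hist, length) - 1
    lo, hi = length_hist[i], length_hist[i + 1]
    # Linear interpolation inside the bin gives the fraction shorter than `length`.
    below = (i + (length - lo) / (hi - lo)) / nbins
    return 1.0 - below

# Four equal-count bins over lengths 0..100.
hist = [0, 25, 50, 75, 100]
print(fraction_at_least(hist, 60))
```

For the four-bin histogram above, a length of 60 falls 40% of the way into the third bin, so the helper reports that about 0.4 of the ranges are at least that long.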

You could then loop through only the lower bounds, and call the helper function for each bin to get the fraction of ranges long enough in that bin, instead of dealing with both histograms in the same loop. I think a helper function like that might simplify those scary loops significantly, but I wasn't sure if there's some more intelligence in the way you combine values from the length and lower bound histograms that you couldn't do with such a helper function.
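As an illustration of the shape such a loop could take, here is a Python sketch that estimates "range contains constant range" (r @> c) by walking only the lower bound histogram. The choice of @> as the operator, the function names, and the use of one midpoint per bin are all assumptions made for this sketch. It relies on the same independence assumption between length and lower bound, and it ignores empty ranges and bound inclusivity.

```python
from bisect import bisect_left

def fraction_at_least(length_hist, length):
    """Fraction of ranges with length >= `length` (equi-depth histogram)."""
    nbins = len(length_hist) - 1
    if length <= length_hist[0]:
        return 1.0
    if length > length_hist[-1]:
        return 0.0
    i = bisect_left(length_hist, length) - 1
    lo, hi = length_hist[i], length_hist[i + 1]
    return 1.0 - (i + (length - lo) / (hi - lo)) / nbins

def contains_selectivity(lower_hist, length_hist, c_lo, c_hi):
    """Estimate the fraction of ranges r with r @> [c_lo, c_hi).

    r contains the constant when lower(r) <= c_lo and
    length(r) >= c_hi - lower(r).
    """
    nbins = len(lower_hist) - 1
    total = 0.0
    for b_lo, b_hi in zip(lower_hist, lower_hist[1:]):
        if b_lo > c_lo:
            break                       # remaining bins start right of c_lo
        # Only the part of the bin with lower(r) <= c_lo can qualify.
        clipped_hi = min(b_hi, c_lo)
        part = (clipped_hi - b_lo) / (b_hi - b_lo) if b_hi > b_lo else 1.0
        # Use the midpoint of the qualifying part as a typical lower bound.
        mid = (b_lo + clipped_hi) / 2.0
        total += part / nbins * fraction_at_least(length_hist, c_hi - mid)
    return total

# Lower bounds and lengths both spread evenly over 0..100, four bins each.
lower_hist = [0, 25, 50, 75, 100]
length_hist = [0, 25, 50, 75, 100]
sel = contains_selectivity(lower_hist, length_hist, 50, 60)
print(sel)
```

For this toy input the result is about 0.325, which is also what you get by integrating analytically over uniformly distributed lower bounds and lengths. The loop body touches the length histogram only through the helper, which is the simplification I'm after.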


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


