On Tue, Jul 4, 2017 at 9:20 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Kuntal Ghosh <kuntalghosh.2...@gmail.com> writes: >> On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: >>> ... I have to admit that I've failed to wrap my brain around exactly >>> why it's correct. The arguments that I've constructed so far seem to >>> point in the direction of applying the opposite correction, which is >>> demonstrably wrong. Perhaps someone whose college statistics class >>> wasn't quite so long ago can explain this satisfactorily? > >> I guess that you're referring the last case, i.e. >> explain analyze select * from tenk1 where thousand between 10 and 10; > > No, the thing that is bothering me is why it seems to be correct to > apply a positive correction for ">=", a negative correction for "<", > and no correction for "<=" or ">". That seems weird and I can't > construct a plausible explanation for it. I think it might be a > result of the fact that, given a discrete distribution rather than > a continuous one, the histogram boundary values should be understood > as having some "width" rather than being zero-width points on the > distribution axis. But the arguments I tried to fashion on that > basis led to other rules that didn't actually work. > > It's also possible that this logic is in fact wrong and it just happens > to give the right answer anyway for uniformly-distributed cases. > So, here are two points I think: 1. When should we apply(add/subtract) the correction? 2. What should be the correction?
The first point: there can be further two cases, a) histfrac - actual_selectivity(p<=0) = 0. For this case, I've an explanation why applying the correction in above way works correctly. (This is the case with tenk1) Since, histfrac correctly calculates selectivity for (p<=0), hist_selec will either be <= or > (isgt is true). Hence, there is no need to apply the correction. A negative correction is needed for less than operator (sel(<=10) - sel(=10)). Similarly, a positive correction is needed for greater than and equals to operator (sel(>10) + sel(=10)). b) histfrac - actual_selectivity(p<=0) != 0. This is possible when we've high variance in the histogram buckets. I guess here things may go wrong. Please consider the following example, UPDATE tenk1 SET thousand=11 where thousand=10; VACUUM ANALYZE tenk1; explain analyze select * from tenk1 where thousand between 10 and 10; Bitmap Heap Scan on tenk1 (cost=4.39..39.52 rows=10 width=244) (actual time=0.018..0.018 rows=0 loops=1) The second point: In this case, the correction is calculated correctly as selectivity of (p=0) because of uniform distribution. Hence, it works. When we don't have uniform distribution, the current calculation of the correction may prove to be over-estimation for most of the time. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers