Hi, I found a problem with the selectivity estimate for range queries on master. Here is my test case:
postgres=# CREATE TABLE test(id int, val text);
CREATE TABLE
postgres=# INSERT INTO test SELECT i, 'testval' FROM generate_series(0,29999)i;
INSERT 0 30000
postgres=# VACUUM analyze test;
VACUUM
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE id > 0 and id < 10000;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..613.00 rows=150 width=12) (actual time=0.044..21.371 rows=9999 loops=1)
   Filter: ((id > 0) AND (id < 10000))
   Rows Removed by Filter: 20001
 Planning Time: 0.099 ms
 Execution Time: 37.142 ms
(5 rows)

Although the actual number of rows was 9999, the estimated number of rows was 150. So I dug into what happened, and I believe the following part of clauselist_selectivity causes this problem:

------------
    /*
     * Now scan the rangequery pair list.
     */
    while (rqlist != NULL)
    {
        RangeQueryClause *rqnext;

        if (rqlist->have_lobound && rqlist->have_hibound)
        {
            /* Successfully matched a pair of range clauses */
            Selectivity s2;

            /*
             * Exact equality to the default value probably means the
             * selectivity function punted. This is not airtight but should
             * be good enough.
             */
            if (rqlist->hibound == DEFAULT_INEQ_SEL ||
                rqlist->lobound == DEFAULT_INEQ_SEL)
            {
                s2 = DEFAULT_RANGE_INEQ_SEL;
            }
            else
            {
                s2 = rqlist->hibound + rqlist->lobound - 1.0;
------------

In my environment, the selectivity for id > 0 was 0.99990000000000001, and the selectivity for id < 10000 was 0.33333333333333331; rqlist->lobound and rqlist->hibound were set to those values respectively. Meanwhile, DEFAULT_INEQ_SEL is 0.3333333333333333, which is the same double value. As a result, the condition of the second if statement was satisfied, and the final selectivity was set to DEFAULT_RANGE_INEQ_SEL. However, both selectivities had actually been computed from the statistics, so the final selectivity should have been calculated as rqlist->hibound + rqlist->lobound - 1.0.
My test case might be uncommon, but I think it could cause real problems. To handle such cases I came up with an idea based on a previous commit [1], which solved a similar problem related to DEFAULT_NUM_DISTINCT. Following that approach, I added a flag to clauselist_selectivity indicating whether the selectivity was computed from the statistics or not, and used it as the condition of the if statement instead of comparing rqlist->hibound and rqlist->lobound against DEFAULT_INEQ_SEL. The changes turned out larger than I had expected, but with them we can estimate the selectivity correctly. Patch attached. Do you have any thoughts?

[1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
improve_selectivity_estimate_for_range_queries.patch
Description: Binary data