Hi, I found a problem with the selectivity estimate for range queries on master. Here is my test case:
postgres=# CREATE TABLE test(id int, val text);
CREATE TABLE
postgres=# INSERT INTO test SELECT i, 'testval' FROM generate_series(0,29999)i;
INSERT 0 30000
postgres=# VACUUM analyze test;
VACUUM
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE id > 0 and id < 10000;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..613.00 rows=150 width=12) (actual time=0.044..21.371 rows=9999 loops=1)
   Filter: ((id > 0) AND (id < 10000))
   Rows Removed by Filter: 20001
 Planning Time: 0.099 ms
 Execution Time: 37.142 ms
(5 rows)

Although the actual number of rows was 9999, the estimated number of rows was 150. So I dug into what happened, and I believe the following part of clauselist_selectivity causes this problem:

------------
    /*
     * Now scan the rangequery pair list.
     */
    while (rqlist != NULL)
    {
        RangeQueryClause *rqnext;

        if (rqlist->have_lobound && rqlist->have_hibound)
        {
            /* Successfully matched a pair of range clauses */
            Selectivity s2;

            /*
             * Exact equality to the default value probably means the
             * selectivity function punted. This is not airtight but should
             * be good enough.
             */
            if (rqlist->hibound == DEFAULT_INEQ_SEL ||
                rqlist->lobound == DEFAULT_INEQ_SEL)
            {
                s2 = DEFAULT_RANGE_INEQ_SEL;
            }
            else
            {
                s2 = rqlist->hibound + rqlist->lobound - 1.0;
------------

In my environment, the selectivity for id > 0 was 0.99990000000000001, and the selectivity for id < 10000 was 0.33333333333333331; rqlist->lobound and rqlist->hibound were set to those values respectively. Meanwhile, DEFAULT_INEQ_SEL is 0.3333333333333333, which is the same double value. As a result, the condition of the second if statement was satisfied, and the final selectivity was set to DEFAULT_RANGE_INEQ_SEL. However, both selectivities had actually been computed from the statistics, so the final selectivity should have been calculated as rqlist->hibound + rqlist->lobound - 1.0.
My test case might be uncommon, but I think it could cause real problems. To handle such cases I came up with an idea based on a previous commit [1], which solved a similar problem related to DEFAULT_NUM_DISTINCT. Following that approach, I added a flag to clauselist_selectivity indicating whether the selectivity was computed from the statistics or not, and used it as the condition of the if statement instead of comparing rqlist->hibound and rqlist->lobound against DEFAULT_INEQ_SEL. The changes turned out larger than I had expected, but with them we can estimate the selectivity correctly. Patch attached. Do you have any thoughts?

[1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
improve_selectivity_estimate_for_range_queries.patch
Description: Binary data