[PERFORM] row estimate very wrong for array type

Denis de Bernardy Wed, 04 May 2011 06:41:55 -0700

Hi,

I'm running into erroneous row estimates when using an array type, and I'm 
running out of ideas on how to steer postgres into the right direction... I've 
tried setting statistics to 1000, vacuuming and analyzing over and over, 
rewriting the query differently... to no avail.


The table looks like this:

create table test (
id serial primary key,
sortcol float unique,
intarr1 int[] not null default '{}',
intarr2 int[] not null default '{}'

);

It contains 40k rows of random data which, for the sake of the queries that 
follow, aren't too far from what they'd contain in production.


# select intarr1, intarr2, count(*) from test group by intarr1, intarr2;


 intarr1 | intarr2 | count 
--------------+-------------+-------
 {0}          | {2,3}       | 40018


The stats seem right:

# select * from pg_stats where tablename = 'test' and attname = 'intarr1';

 schemaname | tablename |   attname    | inherited | null_frac | avg_width | 
n_distinct | most_common_vals | most_common_freqs | histogram_bounds | 
correlation 
------------+-----------+--------------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 test       | test      | intarr1 | f         |         0 |        25 |         
 1 | {"{0}"}          | {1}               |                  |           1


A query without any condition on the array results in reasonable estimates and 
the proper use of the index on the sort column:

# explain analyze select * from test order by sortcol limit 10;

                                                             QUERY PLAN         
                                                    
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.00 rows=10 width=217) (actual time=0.098..0.109 rows=10 
loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..12019.08 
rows=40018 width=217) (actual time=0.096..0.105 rows=10 loops=1)
 Total runtime: 0.200 ms


After adding a condition on the array, however, the row estimate is completely 
off:


# explain analyze select * from test where intarr1 && '{0,1}' order 
by sortcol limit 10;

                                                            QUERY PLAN          
                                                  
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..605.96 rows=10 width=217) (actual time=0.131..0.175 rows=10 
loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..12119.13 rows=200 
width=217) (actual time=0.129..0.169 rows=10 loops=1)
         Filter: (intarr1 && '{0,1}'::integer[])


When there's a condition on both arrays, this then leads to a seq scan:

# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 && 
'{2,4}' order by sortcol limit 10;
                                                     QUERY PLAN                 
                                    
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3241.28..3241.29 rows=1 width=217) (actual time=88.260..88.265 
rows=10 loops=1)
   ->  Sort  (cost=3241.28..3241.29 rows=1 width=217) (actual 
time=88.258..88.260 rows=10 loops=1)
         Sort Key: sortcol
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Seq Scan on test  (cost=0.00..3241.27 rows=1 width=217) (actual 
time=0.169..68.785 rows=40018 loops=1)
               Filter: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && 
'{2,4}'::integer[]))
 Total runtime: 88.339 ms


Adding a GIN index on the two arrays results in similar ugliness:

# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 && 
'{2,4}' order by lft limit 10;
                                                                         QUERY 
PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.29..8.29 rows=1 width=217) (actual time=56.122..56.127 rows=10 
loops=1)
   ->  Sort  (cost=8.29..8.29 rows=1 width=217) (actual time=56.120..56.122 
rows=10 loops=1)
         Sort Key: sortcol
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Bitmap Heap Scan on test  (cost=4.26..8.28 rows=1 width=217) 
(actual time=19.635..39.824 rows=40018 loops=1)
               Recheck Cond: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && 
'{2,4}'::integer[]))
               ->  Bitmap Index Scan on test_intarr1_intarr2_idx  
(cost=0.00..4.26 rows=1 width=0) (actual time=19.387..19.387 rows=40018 loops=1)
                     Index Cond: ((intarr1 && '{0,1}'::integer[]) AND 
(intarr2 && '{2,4}'::integer[]))
 Total runtime: 56.210 ms


Might this be a bug in the operator's selectivity, or am I doing something 
wrong?

Thanks in advance,
Denis


-- 
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[PERFORM] row estimate very wrong for array type

Reply via email to