Hi,
I'm running into erroneous row estimates when using an array type, and I'm
running out of ideas on how to steer postgres into the right direction... I've
tried setting statistics to 1000, vacuuming and analyzing over and over,
rewriting the query differently... to no avail.
The table looks like this:
create table test (
id serial primary key,
sortcol float unique,
intarr1 int[] not null default '{}',
intarr2 int[] not null default '{}'
);
It contains 40k rows of random data which, for the sake of the queries that
follow, aren't too far from what they'd contain in production.
# select intarr1, intarr2, count(*) from test group by intarr1, intarr2;
intarr1 | intarr2 | count
--------------+-------------+-------
{0} | {2,3} | 40018
The stats seem right:
# select * from pg_stats where tablename = 'test' and attname = 'intarr1';
schemaname | tablename | attname | inherited | null_frac | avg_width |
n_distinct | most_common_vals | most_common_freqs | histogram_bounds |
correlation
------------+-----------+--------------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
test | test | intarr1 | f | 0 | 25 |
1 | {"{0}"} | {1} | | 1
A query without any condition on the array results in reasonable estimates and
the proper use of the index on the sort column:
# explain analyze select * from test order by sortcol limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..3.00 rows=10 width=217) (actual time=0.098..0.109 rows=10
loops=1)
-> Index Scan using test_sortcol_key on test (cost=0.00..12019.08
rows=40018 width=217) (actual time=0.096..0.105 rows=10 loops=1)
Total runtime: 0.200 ms
After adding a condition on the array, however, the row estimate is completely
off:
# explain analyze select * from test where intarr1 && '{0,1}' order
by sortcol limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..605.96 rows=10 width=217) (actual time=0.131..0.175 rows=10
loops=1)
-> Index Scan using test_sortcol_key on test (cost=0.00..12119.13 rows=200
width=217) (actual time=0.129..0.169 rows=10 loops=1)
Filter: (intarr1 && '{0,1}'::integer[])
When there's a condition on both arrays, this then leads to a seq scan:
# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 &&
'{2,4}' order by sortcol limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=3241.28..3241.29 rows=1 width=217) (actual time=88.260..88.265
rows=10 loops=1)
-> Sort (cost=3241.28..3241.29 rows=1 width=217) (actual
time=88.258..88.260 rows=10 loops=1)
Sort Key: sortcol
Sort Method: top-N heapsort Memory: 27kB
-> Seq Scan on test (cost=0.00..3241.27 rows=1 width=217) (actual
time=0.169..68.785 rows=40018 loops=1)
Filter: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 &&
'{2,4}'::integer[]))
Total runtime: 88.339 ms
Adding a GIN index on the two arrays results in similar ugliness:
# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 &&
'{2,4}' order by lft limit 10;
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.29..8.29 rows=1 width=217) (actual time=56.122..56.127 rows=10
loops=1)
-> Sort (cost=8.29..8.29 rows=1 width=217) (actual time=56.120..56.122
rows=10 loops=1)
Sort Key: sortcol
Sort Method: top-N heapsort Memory: 27kB
-> Bitmap Heap Scan on test (cost=4.26..8.28 rows=1 width=217)
(actual time=19.635..39.824 rows=40018 loops=1)
Recheck Cond: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 &&
'{2,4}'::integer[]))
-> Bitmap Index Scan on test_intarr1_intarr2_idx
(cost=0.00..4.26 rows=1 width=0) (actual time=19.387..19.387 rows=40018 loops=1)
Index Cond: ((intarr1 && '{0,1}'::integer[]) AND
(intarr2 && '{2,4}'::integer[]))
Total runtime: 56.210 ms
Might this be a bug in the operator's selectivity, or am I doing something
wrong?
Thanks in advance,
Denis
--
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance