Hi all,
hope this is the right list to post to.
We saw some bad choices from the query planner regarding the use of a GIN index
which got worse over time and performance started degrading seriously, so I did
some digging and I found a solution which works, but I'd like to get some
opinion on.
Here is the table in question:
Table "public.games"
Column | Type | Modifiers
| Storage | Stats target | Description
------------------+-----------------------------+----------------------------------------------------+----------+--------------+-------------
id | integer | not null default
nextval('games_id_seq'::regclass) | plain | |
runners | smallint |
| plain | |
player_id | integer |
| plain | 1000 |
partner1_id | integer |
| plain | 1000 |
partner2_id | integer |
| plain | 1000 |
partner3_id | integer |
| plain | 1000 |
created_at | timestamp without time zone |
| plain | |
Indexes:
"games_pkey" PRIMARY KEY, btree (id)
"index_games_on_created_at" btree (created_at)
"index_games_participants" gin ((ARRAY[player_id, partner1_id, partner2_id,
partner3_id])) WITH (fastupdate=off)
Has OIDs: no
I removed some columns from the output for clarity,. It has 300+ million rows.
And is freshly analyzed.
As you see, I've already increased the stats targets for the columns which go
into the GIN index before, but this had no visible effect on query plan choices.
Here's a typical query:
EXPLAIN (analyze, buffers) SELECT "games".* FROM "games" WHERE (ABS(runners)
>= '3') AND ((ARRAY[player_id, partner1_id, partner2_id, partner3_id]) @>
ARRAY[166866]) ORDER BY id DESC LIMIT 20 OFFSET 0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..13639.64 rows=20 width=74) (actual time=330.271..12372.777
rows=20 loops=1)
Buffers: shared hit=3453594 read=119394
-> Index Scan Backward using games_pkey on games (cost=0.57..15526034.64
rows=22767 width=74) (actual time=330.269..12372.763 rows=20 loops=1)
Filter: ((ARRAY[player_id, partner1_id, partner2_id, partner3_id] @>
'{166866}'::integer[]) AND (abs(runners) >= 3::smallint))
Rows Removed by Filter: 3687711
Buffers: shared hit=3453594 read=119394
Total runtime: 12372.848 ms
(7 rows)
This is plan is not the best choice, though. It would be much more efficient to
use the index_games_participants index. For some queries, there would be not
enough records which fullfill the conditions so bascially every row of the
table is scanned.
As \d+ index_games_participants showed that the index had an "array" column, I
found this:
SELECT attname, attstattarget from pg_attribute WHERE attrelid = (SELECT oid
FROM pg_class WHERE relname = 'index_games_participants');
attname | attstattarget
---------+---------------
array | -1
(1 row)
Also, I noticed that for that "array" GIN index column there is content in
pg_statistics, where as for the btree indices there isn't.
Because I didn't find any documentation or references on setting statistic
targets on indices, I just gave it a shot:
ALTER TABLE index_games_participants ALTER COLUMN "array" SET STATISTICS 1000;
After running ANALYZE on the table:
EXPLAIN (analyze, buffers) SELECT "games".* FROM "games" WHERE (ABS(runners)
>= '3') AND ((ARRAY[player_id, partner1_id, partner2_id, partner3_id]) @>
ARRAY[166866]) ORDER BY id DESC LIMIT 20 OFFSET 0;
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=33947.27..33947.32 rows=20 width=74) (actual
time=624.308..624.341 rows=20 loops=1)
Buffers: shared hit=4 read=17421
-> Sort (cost=33947.27..33961.61 rows=5736 width=74) (actual
time=624.306..624.318 rows=20 loops=1)
Sort Key: id
Sort Method: top-N heapsort Memory: 27kB
Buffers: shared hit=4 read=17421
-> Bitmap Heap Scan on games (cost=164.49..33794.64 rows=5736
width=74) (actual time=6.704..621.592 rows=1963 loops=1)
Recheck Cond: (ARRAY[player_id, partner1_id, partner2_id,
partner3_id] @> '{166866}'::integer[])
Filter: (abs(runners) >= 3::smallint)
Rows Removed by Filter: 17043
Buffers: shared hit=1 read=17421
-> Bitmap Index Scan on index_games_participants
(cost=0.00..163.05 rows=17207 width=0) (actual time=4.012..4.012 rows=19300
loops=1)
Index Cond: (ARRAY[player_id, partner1_id, partner2_id,
partner3_id] @> '{166866}'::integer[])
Buffers: shared hit=1 read=19
Total runtime: 624.572 ms
(15 rows)
Much better! This reduced the bad plan choices substantially.
Also, as one could expect, SELECT * from pg_statistic WHERE starelid = (SELECT
oid FROM pg_class WHERE relname = 'index_games_participants'); now had much
more data.
Is this a good idea? Am I missing something? Or should the GIN index actually
use the statistic targets derived from the table columns it depends on?
Best,
Dieter
--
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance