Hello. I have database with events with type from different souces identified by id. I have query which filters events by IN-clause with many ids (1-500 ids). I see poor perfomance of IN-clause and try to investigate this problem.
SELECT version(); version ------------------------------------------------------------------------------------------------------------------- PostgreSQL 10beta2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit -- Full table can fit in memory show shared_buffers; shared_buffers ---------------- 2GB show work_mem; work_mem ---------- 16MB SET max_parallel_workers_per_gather TO 0; SET max_parallel_workers TO 0; -- Create table with 10 000 000 rows with 500 bigints CREATE TABLE ids AS SELECT trunc(random() * 500)::bigint as id from generate_series(1, 10000000); -- IN (...) SELECT ('(' || string_agg(id::text, ',') || ')') AS in_clause FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN :in_clause; Aggregate (cost=2654265.02..2654265.03 rows=1 width=8) (actual time=17268.831..17268.831 rows=1 loops=1) Buffers: shared hit=44248 -> Seq Scan on ids (cost=0.00..2644260.48 rows=4001815 width=0) (actual time=0.066..16722.072 rows=3998646 loops=1) Filter: (id = ANY ('{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199}'::bigint[])) Rows Removed by Filter: 6001354 Buffers: shared hit=44248 Planning time: 3.324 ms Execution time: 17268.907 ms -- IN (VALUES ...) SELECT ('(VALUES ' || string_agg('(' || id::text || ')', ',') || ')') AS values_clause FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN :values_clause; Aggregate (cost=245006.46..245006.47 rows=1 width=8) (actual time=4086.188..4086.188 rows=1 loops=1) Buffers: shared hit=44248 -> Hash Join (cost=7.50..235006.42 rows=4000019 width=0) (actual time=0.978..3557.037 rows=3998646 loops=1) Hash Cond: (ids.id = "*VALUES*".column1) Buffers: shared hit=44248 -> Seq Scan on ids (cost=0.00..144248.48 rows=10000048 width=8) (actual time=0.031..1138.542 rows=10000000 loops=1) Buffers: shared hit=44248 -> Hash (cost=5.00..5.00 rows=200 width=4) (actual time=0.923..0.923 rows=200 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 16kB -> HashAggregate (cost=3.00..5.00 rows=200 width=4) (actual time=0.606..0.759 rows=200 loops=1) Group Key: "*VALUES*".column1 -> Values Scan on "*VALUES*" (cost=0.00..2.50 rows=200 width=4) (actual time=0.003..0.330 rows=200 loops=1) Planning time: 1.094 ms Execution time: 4086.333 ms -- '...'::hstore ? id SELECT ('''' || string_agg(id::text || '=>NULL', ',') || '''::hstore') AS hstore_clause FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE :hstore_clause ? id::text; Planning time: 0.206 ms Execution time: 5032.794 ms -- '...'::jsonb ? id SELECT ('''{' || string_agg('"' || id::text || '": null', ',') || '}''::jsonb') AS jsonb_clause FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE :jsonb_clause ? id::text; Planning time: 0.114 ms Execution time: 9277.307 ms IN-VALUES clause has the bestest perfomance. So I have some questions: - May be exist better solution? - Does PostgreSQL have support of hashset structure? Extension (I don't found)? - IN-VALUES clause adds new node to plan. Has additional node big overhead? How about filter by two or more IN-VALUES clause? Thanks.