Hi everyone,

I have some problems with a quite long query and the plan postgreSQL is choosing. The query joins 12 tables and in the WHERE clause I use a IN _expression_ with a lot of identifiers (up to 2000). The problem is that the planner is proposing a seq_scan on two tables 2M rows each (
internalexpressionprofile and expressionprofile)

I have just try this query (after doing a vacuum analyze), in the 'IN' clause there are 1552 identifiers, and the query should return 14K rows.
I'm using a PostgreSQL 8.0.2 on a SuSE 8.1 with 1GB of RAM.

explain analyze SELECT DISTINCT rset.replicatesetid, tra.value as value, tra.expressionprofileid, rep.*, epg.expprogeneid,  con.ordinal
FROM expprogene epg JOIN reporter rep ON  (epg.reporterid=rep.reporterid), expressionprofile epro,
transformedexpressionprofile tra, internalexpressionprofile int,
meanvalues mea, replicateset rset, replicateset_condition rsco, condition con,
"CLUSTER" clu, clustertree tre, clusteranalysis an
WHERE epg.expprogeneid IN (80174,84567,...) AND epg.expprogeneid=epro.expprogeneid
AND epro.expressionprofileid=tra.expressionprofileid AND tra.expressionprofileid=int.expressionprofileid
AND int.meanvaluesid=mea.meanvaluesid AND mea.replicatesetid=rset.replicatesetid
AND rset.replicatesetid=rsco.replicatesetid AND rsco.conditionid=con.conditionid
AND tra.clusterid=clu.clusterid AND clu.clustertreeid=tre.clustertreeid AND tre.clustertreeid=an.genetreeid
AND an.clusteranalysisid=1 AND con.clusteranalysisid = an.clusteranalysisid
ORDER BY epg.expprogeneid, con.ordinal;

The plan...

 Unique  (cost=129132.53..129132.59 rows=2 width=150) (actual time=12637.224..12676.016 rows=13968 loops=1)
   ->  Sort  (cost=129132.53..129132.54 rows=2 width=150) (actual time=12637.217..12646.484 rows=13968 loops=1)
         Sort Key: epg.expprogeneid, con.ordinal, rset.replicatesetid, tra.value, tra.expressionprofileid, rep.reporterid, rep.name, rep.anotation, rep.otherinfo, rep.incidences
         ->  Nested Loop  (cost=62927.42..129132.52 rows=2 width=150) (actual time=7112.942..12586.314 rows=13968 loops=1)
               Join Filter: ("outer".genetreeid = "inner".clustertreeid)
               ->  Nested Loop  (cost=62927.42..127893.86 rows=409 width=162) (actual time=7112.864..11960.324 rows=41904 loops=1)
                     ->  Nested Loop  (cost=62927.42..125727.31 rows=369 width=154) (actual time=7112.825..11500.645 rows=13968 loops=1)
                           ->  Merge Join  (cost=3.02..7.70 rows=1 width=12) (actual time=0.057..0.073 rows=1 loops=1)
                                 Merge Cond: ("outer".clustertreeid = "inner".genetreeid)
                                 ->  Index Scan using clustertree_pk on clustertree tre  (cost=0.00..4.35 rows=123 width=4) (actual time=0.017..0.024 rows=2 loops=1)
                                 ->  Sort  (cost=3.02..3.03 rows=1 width=8) (actual time=0.028..0.030 rows=1 loops=1)
                                       Sort Key: an.genetreeid
                                       ->  Index Scan using clusteranalysis_pk on clusteranalysis an  (cost=0.00..3.01 rows=1 width=8) (actual time=0.015..0.018 rows=1 loops=1)
                                             Index Cond: (clusteranalysisid = 1)
                           ->  Hash Join  (cost=62924.39..125715.53 rows=408 width=150) (actual time=7112.758..11455.797 rows=13968 loops=1)
                                 Hash Cond: ("outer".expressionprofileid = "inner".expressionprofileid)
                                 ->  Hash Join  (cost=15413.58..78079.33 rows=24339 width=134) (actual time=1489.347..5721.306 rows=41904 loops=1)
                                       Hash Cond: ("outer".expprogeneid = "inner".expprogeneid)
                                       ->  Seq Scan on expressionprofile epro  (cost=0.00..48263.24 rows=2831824 width=8) (actual time=0.039..3097.656 rows=2839676 loops=1)
                                       ->  Hash  (cost=15409.72..15409.72 rows=1546 width=130) (actual time=43.365..43.365 rows=0 loops=1)
                                             ->  Nested Loop  (cost=0.00..15409.72 rows=1546 width=130) (actual time=0.056..40.637 rows=1552 loops=1)
                                                   ->  Index Scan using expprogene_pk, expprogene_pk, [......] on expprogene epg  (cost=0.00..10698.83 rows=1546 width=8) (actual time=0.027..15.907 rows=1552 loops=1)
                                                         Index Cond: ((expprogeneid = 80174) OR (expprogeneid = 84567) OR (expprogeneid = 83608) OR [OR ....])
                                                   ->  Index Scan using reporter_pkey on reporter rep  (cost=0.00..3.03 rows=1 width=126) (actual time=0.009..0.010 rows=1 loops=1552)
                                                         Index Cond: ("outer".reporterid = rep.reporterid)
                                 ->  Hash  (cost=47403.68..47403.68 rows=42853 width=16) (actual time=5623.174..5623.174 rows=0 loops=1)
                                       ->  Hash Join  (cost=2369.91..47403.68 rows=42853 width=16) (actual time=346.040..5538.571 rows=75816 loops=1)
                                             Hash Cond: ("outer".meanvaluesid = "inner".meanvaluesid)
                                             ->  Seq Scan on internalexpressionprofile "int"  (cost=0.00..34506.16 rows=2019816 width=8) (actual time=0.003..2231.427 rows=2019816 loops=1)
                                             ->  Hash  (cost=2262.78..2262.78 rows=42853 width=16) (actual time=345.803..345.803 rows=0 loops=1)
                                                   ->  Nested Loop  (cost=17.49..2262.78 rows=42853 width=16) (actual time=1.965..259.363 rows=75816 loops=1)
                                                         ->  Hash Join  (cost=17.49..28.42 rows=6 width=16) (actual time=1.881..2.387 rows=9 loops=1)
                                                               Hash Cond: ("outer".replicatesetid = "inner".replicatesetid)
                                                               ->  Seq Scan on replicateset rset  (cost=0.00..9.58 rows=258 width=4) (actual time=0.003..0.295 rows=258 loops=1)
                                                               ->  Hash  (cost=17.47..17.47 rows=6 width=12) (actual time=1.575..1.575 rows=0 loops=1)
                                                                     ->  Hash Join  (cost=3.17..17.47 rows=6 width=12) (actual time=0.315..1.557 rows=9 loops=1)
                                                                           Hash Cond: ("outer".conditionid = "inner".conditionid)
                                                                           ->  Seq Scan on replicateset_condition rsco  (cost=0.00..10.83 rows=683 width=8) (actual time=0.004..0.688 rows=683 loops=1)
                                                                           ->  Hash  (cost=3.14..3.14 rows=9 width=12) (actual time=0.059..0.059 rows=0 loops=1)
                                                                                 ->  Index Scan using clustering_analysis_fk on condition con  (cost=0.00..3.14 rows=9 width=12) (actual time=0.019..0.039 rows=9 loops=1)
                                                                                       Index Cond: (clusteranalysisid = 1)
                                                         ->  Index Scan using has_meanvalues_fk on meanvalues mea  (cost=0.00..264.03 rows=8669 width=8) (actual time=0.027..13.032 rows=8424 loops=9)
                                                               Index Cond: ("outer".replicatesetid = mea.replicatesetid)
                     ->  Index Scan using comes_from_raw_fk on transformedexpressionprofile tra  (cost=0.00..5.86 rows=1 width=16) (actual time=0.010..0.018 rows=3 loops=13968)
                           Index Cond: (tra.expressionprofileid = "outer".expressionprofileid)
               ->  Index Scan using _cluster__pk on "CLUSTER" clu  (cost=0.00..3.01 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=41904)
                     Index Cond: ("outer".clusterid = clu.clusterid)
 Total runtime: 12696.289 ms
(48 rows)

I tried setting the enable_seq_scan to off and the query's runtime returned by the explain analyze is 4000ms.
Why postgre is not using the indexes?
What is the real impact of having such a big 'IN' clause?

Thanks in advance,

Luis Cornide

Reply via email to