On Tue, 24 Jun 2003, Stephan Szabo wrote:
> >  hs.exon.2=> explain select * from ga_psr_transcript_1 t,
> ga_psr_exon_1e where e.parent = t.id;
> >                                                      QUERY PLAN
> >
> ------------------------------------------------------------------------
> --------------------------------------------
> >  Merge Join  (cost=0.00..9087.71 rows=176908 width=98)
> >    Merge Cond: ("outer".id = "inner".parent)
> >    ->  Index Scan using ga_psr_transcript_1_pkey on
> ga_psr_transcript_1 t  (cost=0.00..1066.17 rows=43398 width=47)
> >    ->  Index Scan using ga_psr_exon_1_parent on ga_psr_exon_1 e
> (cost=0.00..5259.52 rows=176908 width=51)
> > (4 rows)
> >
> > If I do a join on the parent table, the optimizer refuses to use the
> > indicies:
> >
> > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e
> where e.parent = t.id;
> 
> In this case, you can't use a single index scan to get the rows in order
> so the part that makes the above a nice plan doesn't really apply.  If
> you're getting all the rows and sorting them, index scans are probably a
> waste of time unless you have alot of dead space.  If we supported
> multi-table indexes, that'd potentially let you get a plan like
> the above.

Because of the foreign key constraint, the database engine could do 
the above query on each of the child tables and concatenate the 
results. This is because there is a notion in our schema of "paired" 
inheritance where both the ga_psr_exon ahd ga_psr_transcript tables 
are subclassed by choromosome. IE:
  
   ga_psr_exon_1.parent --> ga_psr_transcript_1.id

Of course the foreign key is the only 
indication of this and I can't say that I'm entirely surprised that 
the optimizer doesn't catch this.

This constraint unfortunately breaks down when joining on a non-foreign 
key, such as range queries (the rows in these tables represent ranges 
in a one dimensional space) like:

 explain select * from ga_psr_exon_1 e1, ga_psr_exon_1 e2 where 
 e1.start >= e2.start and e1.start<=e2.stop and e1.stop >= e2.start and 
 e1.stop <= e2.stop;

 Nested Loop  (cost=0.00..995313942.65 rows=386375808 width=102)
   ->  Seq Scan on ga_psr_exon_1 e1  (cost=0.00..3691.08 rows=176908 width=51)
   ->  Index Scan using ga_psr_exon_1_start_stop on ga_psr_exon_1 e2  
(cost=0.00..5582.47 rows=2184 width=51)
         Index Cond: (("outer"."start" >= e2."start") AND ("outer".stop >= e2."start") 
AND ("outer"."start" <= e2.stop) AND ("outer".stop <= e2.stop))

versus

 explain select * from ga_psr_exon e1, ga_psr_exon e2 where 
 e1.start >= e2.start and e1.start<=e2.stop and e1.stop >= e2.start and 
 e1.stop <= e2.stop;

which results in a nested loop of seq scans. (It is actually worse 
than this as we would really like to query for overlapping ranges, not 
containment.)

Currently we use either a perl middleware or UNIONs to explicitly force 
these paired table relationships.

> 
> >
> ------------------------------------------------------------------------
> -----------------------
> >  Merge Join  (cost=1239155.37..70188119.40 rows=5514877218 width=334)
> >    Merge Cond: ("outer".id = "inner".parent)
> >    ->  Sort  (cost=243481.37..244816.14 rows=533908 width=165)
> >          Sort Key: t.id
> >          ->  Append  (cost=0.00..10980.08 rows=533908 width=165)
> [lots of seqscans snipped]
> >    ->  Sort  (cost=995674.00..1000838.64 rows=2065853 width=169)
> >          Sort Key: e.parent
> >          ->  Append  (cost=0.00..43563.52 rows=2065853 width=169)
> [more seqscans snipped]
> 
> > Same thing even if I'm querying for a specific tuple:
> >
> > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e
> > where e.parent = t.id and t.id = 123;
> 
> ISTM it's willing to use an index scan on at least some of t's
> subtables.
> Does explicitly saying e.parent=123 help?

Yes, adding e.parent=123 results in the desired result of index scans 
into both tables. However, without including this the optimizer still 
predicts 31 results from the index scans on ga_psr_transcript* and yet 
insists on using a seq scan into each ga_psr_exon* table. It expects 
to get 2065853 rows back from the ga_psr_exon* tables when in reality 
it is more like 310 rows.

Thanks for the response.

-Alan

FWIW, we subclass by chromosome for performance reasons. We have 
tables (at the chromosome level) with upwards of 6 million rows 
against which we run a variety of data mining queries.


---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match

Reply via email to