> -----Original Message-----
> From: Arjen van der Meijden 
> Sent: Wednesday, April 06, 2005 11:53 AM
> To: performance pgsql
> Subject: [PERFORM] Plan for relatively simple query seems to be very
> inefficient
> [...]
> data_main AS dm,
> postcodes AS p
> WHERE dm.range BETWEEN p.range_from AND p.range_till
> [...]
>   Aggregate  (cost=332586.85..332586.85 rows=1 width=0) (actual 
> time=22712.038..22712.039 rows=1 loops=1)
>     ->  Nested Loop  (cost=3.76..328945.96 rows=1456356 
> width=0) (actual 
> time=0.054..22600.826 rows=82688 loops=1)

I'm still a noob at reading EXPLAIN ANALYZE, but it seems to me
that your statistics are throwing off the planner here.  It 
estimates 1.4M and gets 82K, so it's off by a factor of about 20.  
Have you considered doing a VACUUM or upping your statistics?

> [...]
> When I do something completely bogus, which will result in 
> coupling the data per record from data_main on one record from
> postcodes, it still not very fast but acceptable:
> [...]
>   Aggregate  (cost=10076.98..10076.98 rows=1 width=0) (actual 
> time=1456.016..1456.017 rows=1 loops=1)
>     ->  Merge Join  (cost=8636.81..9913.13 rows=65537 
> width=0) (actual 
> time=1058.105..1358.571 rows=81920 loops=1)

Looks like Merge Join is faster than the Nested Loop for this
query.  If you notice, the row counts are a lot closer to the
estimates, too.  This is probably a "good" plan.

> [...]
> Doing something similarily bogus, but with less results is 
> much faster, even though it should have basically the same
> plan:
> data_main AS dm,
> postcodes AS p
> WHERE dm.range  = p.postcode_id
> [...]
>   Aggregate  (cost=2138.63..2138.63 rows=1 width=0) (actual 
> time=180.667..180.668 rows=1 loops=1)
>     ->  Hash Join  (cost=4.00..2087.02 rows=20642 width=0) (actual 
> time=180.645..180.645 rows=0 loops=1)

This one I don't understand at all.  Clearly, the Hash Join is
the way to go, but the estimates are way off (which probably 
explains why this plan isn't chosen in the first place).

>           Hash Cond: ("outer".range = "inner".postcode_id)
>           ->  Seq Scan on data_main dm  (cost=0.00..1262.20 
> rows=81920 
> width=2) (actual time=0.005..105.548 rows=81920 loops=1)
>           ->  Hash  (cost=3.60..3.60 rows=160 width=2) (actual 
> time=0.592..0.592 rows=0 loops=1)
>                 ->  Seq Scan on postcodes p  (cost=0.00..3.60 
> rows=160 
> width=2) (actual time=0.025..0.349 rows=160 loops=1)
>   Total runtime: 180.807 ms
> (7 rows)
> [...]

My completely amateur guess is that the planner is able to use
Merge Join and Hash Join on your contrived queries because you
are only trying to join one field to a single value (i.e.:
operator=).  But the BETWEEN clause is what forces the Nested
Loop.  You can see that here:

                ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 
width=4) (actual time=0.010..0.396 rows=160 loops=1)
vs. here:

          ->  Index Scan using postcodes_pkey on postcodes p 
(cost=0.00..5.76 rows=160 width=2) (actual time=0.034..0.507 rows=160 

So the first query forces a SeqScan on postcodes, while the
second can do an IndexScan.

David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match

Reply via email to