Re: [HACKERS] left outer join vs subplan

Simon Riggs Thu, 06 Sep 2007 01:22:28 -0700

On Wed, 2007-09-05 at 20:11 +0400, Teodor Sigaev wrote:

> I found two queries which do the same thing but they is very different in 
> time. 
> For test suite it's about 10^3 times, but on real data it can be 10^5 times. 
> It's observed on 8.1-current, 8.2-current and CVS HEAD versions. Interesting 
> that even without LIMIT clause they take approximately the same time, but 
> costs 
> is differ in 30 times. Is any way to tweaking pgsql to produce more 
> reasonable 
> plan for first query?


Times I get are:
Q1: ~950ms
Q2: ~5ms

> This query is auto-generated, so they may be more complex and I choose 
> simplest 
> example.

I think we need to build up a library of autogenerated queries, so we
can do things which address multiple use cases. Can you tell us more
about who/what generated it, so we can research?

The query formulation does seem a fairly common one.

> First query:
> explain analyze
> select *
> from
>      a
>      left outer join (
>          select b.id, sum(b.val)
>          from b
>          group by b.id
>      ) bagg
>          on bagg.id = a.id
> where
>      a.id > 10000
> order by a.addon, a.id
> limit 100;
>   Limit  (cost=9923.36..9923.61 rows=100 width=20) (actual 
> time=2232.437..2233.273 rows=100 loops=1)
>     ->  Sort  (cost=9923.36..10031.41 rows=43221 width=20) (actual 
> time=2232.428..2232.709 rows=100 loops=1)
>           Sort Key: a.addon, a.id
>           Sort Method:  top-N heapsort  Memory: 24kB
>           ->  Merge Right Join  (cost=0.00..8271.48 rows=43221 width=20) 
> (actual 
> time=313.198..2052.559 rows=40000 loops=1)
>                 Merge Cond: (b.id = a.id)
>                 ->  GroupAggregate  (cost=0.00..5725.41 rows=53292 width=12) 
> (actual time=0.266..1422.522 rows=50000 loops=1)
>                       ->  Index Scan using bidx on b  (cost=0.00..4309.26 
> rows=150000 width=12) (actual time=0.217..547.402 rows=150000 loops=1)
>                 ->  Index Scan using a1idx on a  (cost=0.00..1256.90 
> rows=40551 
> width=8) (actual time=0.171..155.073 rows=40000 loops=1)
>                       Index Cond: (a.id > 10000)
>   Total runtime: 2233.940 ms

The value of sum(b.val) is never used in the query, so the aggregate
itself could be discarded. I suspect there are other conditions you
aren't showing us that would make this impossible?

The aggregate prevents the condition bagg.id = a.id from being pushed
down so that we know b.id = a.id. If we knew that then we could use b.id
= ? as an index condition to retrieve the rows.

Since we can't use the best technique, we use another. That then hits a
third optimization problem. When an IndexScan is used to enforce order,
we don't estimate how much of the table needs to be scanned before we
start hitting rows. In the example you give we need to scan 65% of the
table using an IndexScan before we hit any rows. So we would probably be
better off doing a Sort<-SeqScan to apply the condition.

I think we need to do all 3 eventually.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] left outer join vs subplan

Reply via email to