On 10 October 2016 at 11:33, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
>> FROM order_line, order WHERE order_line.order_id = order.order_id
>> GROUP BY 1,2;
>
>> Doing the aggregation step first is likely to be much faster than
>> doing the join first here,
>
> Please provide some reason to believe that.  It's the nature of an
> aggregate that it's sensitive to the number of rows going through it,
> with only a tiny number of exceptions (and SUM ain't one).  So you could
> only push it down past joins that won't change the number of rows the
> aggregate will process, and how is that going to make it any faster?

I think there's a flaw in Robert's example, because the GROUP BY has
columns from both sides of the join; however, it's very simple to
come up with a real-world case which aggregate pushdown can improve.
We can simulate the aggregate pushdown with a simple subquery.

create table sale (sale_id int primary key, product_id int not null, quantity int not null);
create table product (product_id int primary key, productdesc varchar(32) not null);

insert into sale select x, x % 100 + 1, (random() * 10)::int + 1 from generate_series(1, 10000000) x(x);
insert into product select x, 'ABC' || x::text from generate_series(1, 100) x(x);
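
If reproducing this, it may be worth making sure the planner has fresh
statistics after the bulk load (autovacuum will usually get there on its
own eventually); a minimal, assumed setup step would be:

-- assumed setup step: refresh planner statistics after the bulk load
vacuum analyze sale;
vacuum analyze product;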

postgres=# explain (analyze, timing off) select p.product_id, p.productdesc, sum(s.quantity) qty from sale s inner join product p on s.product_id = p.product_id group by p.product_id;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=341558.25..341559.25 rows=100 width=17) (actual rows=100 loops=1)
   Group Key: p.product_id
   ->  Hash Join  (cost=3.25..291558.25 rows=10000000 width=13) (actual rows=10000000 loops=1)
         Hash Cond: (s.product_id = p.product_id)
         ->  Seq Scan on sale s  (cost=0.00..154055.00 rows=10000000 width=8) (actual rows=10000000 loops=1)
         ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 13kB
               ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
 Planning time: 0.308 ms
 Execution time: 7568.927 ms
(10 rows)

Time: 7569.842 ms (00:07.570)

postgres=# explain (analyze, timing off) select p.product_id, p.productdesc, s.qty from (select product_id, sum(quantity) qty from sale group by product_id) s inner join product p on s.product_id = p.product_id;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Hash Join  (cost=204058.25..204061.63 rows=100 width=17) (actual rows=100 loops=1)
   Hash Cond: (sale.product_id = p.product_id)
   ->  HashAggregate  (cost=204055.00..204056.00 rows=100 width=12) (actual rows=100 loops=1)
         Group Key: sale.product_id
         ->  Seq Scan on sale  (cost=0.00..154055.00 rows=10000000 width=8) (actual rows=10000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
 Planning time: 0.610 ms
 Execution time: 4267.145 ms
(10 rows)

So the pushed-down version runs in about 56% of the time of the
non-pushed-down version. Of course, as mentioned by Robert, both
versions would have to be costed, since if the join condition were
highly selective the aggregate would be processing rows the join was
about to throw away, and joining first could win.
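
To make that point concrete, here's a hypothetical variant (assumed for
illustration, not something I've timed) where the join is highly
selective: only one of the 100 products survives the filter, so the
pre-aggregated subquery spends effort hashing all 10 million sale rows
only for the join to discard 99 of the 100 resulting groups, and joining
first, which filters before aggregating, may well be the cheaper plan.

-- hypothetical example, not timed: with a highly selective join,
-- eagerly aggregating the whole sale table is mostly wasted effort
select p.product_id, p.productdesc, s.qty
from (select product_id, sum(quantity) qty
      from sale
      group by product_id) s
inner join product p on s.product_id = p.product_id
where p.productdesc = 'ABC1';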

There's a good paper which goes into a bit more detail on this:
http://www.vldb.org/conf/1995/P345.PDF

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

