On Sun, Oct 9, 2016 at 6:33 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Wed, Oct 5, 2016 at 9:42 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> I think this is fundamentally wrongheaded.  If we go that route,
>>> the only valid relids for any upper path would be the union of all
>>> baserel RTIs, ...
>> Hmm, but this is only true if the upper steps are always done last.
>> Hackers on this list have been hoping to reorder joins with aggregates
>> since at least 2008 - probably sooner, but that's when I started
>> reading this mailing list.   A simple example is:
>> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
>> FROM order_line, order WHERE order_line.order_id = order.order_id
>> GROUP BY 1,2;
>> Doing the aggregation step first is likely to be much faster than
>> doing the join first here,
> Please provide some reason to believe that.  It's the nature of an
> aggregate that it's sensitive to the number of rows going through it,
> with only a tiny number of exceptions (and SUM ain't one).  So you could
> only push it down past joins that won't change the number of rows the
> aggregate will process, and how is that going to make it any faster?

Hmm, I guess it only works if you know that the order (order_id) is
unique.  If that's true, then each order_line.order_id will match 0 or
1 values in order.order_id.  If you had duplicates in order.order_id,
then SUM(order_line.amount) would effectively be multiplied by the
number of duplicates, so pushing the aggregate down below the join
doesn't work.  But if order (order_id) is unique, then I think it's
OK.  For any given value
of order_line.order_id, either every row in the group matches the one
and only row in the order table where that value appears in
order.order_id, or there is no such row and the whole group vanishes.
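
That equivalence is easy to sanity-check on toy data.  The sketch
below uses Python's sqlite3 purely for illustration (any SQL engine
would do); the tables and values are made up, named after the example
above.  The second half shows the failure mode: when order_id is
duplicated on the orders side, aggregating first no longer matches
the original query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_line(order_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO order_line VALUES (1, 10.0), (1, 20.0), (2, 5.0);
    -- order 3 has no lines, so it simply produces no group either way
""")

# Original shape: join first, then aggregate.
join_first = con.execute("""
    SELECT ol.order_id, o.customer_id, SUM(ol.amount)
    FROM order_line ol JOIN orders o ON ol.order_id = o.order_id
    GROUP BY ol.order_id, o.customer_id
""").fetchall()

# Rewritten shape: aggregate order_line first, then join.
agg_first = con.execute("""
    SELECT x.order_id, o.customer_id, x.total
    FROM (SELECT order_id, SUM(amount) AS total
          FROM order_line GROUP BY order_id) x
    JOIN orders o ON x.order_id = o.order_id
""").fetchall()

# orders.order_id is unique, so the two forms agree:
# [(1, 100, 30.0), (2, 200, 5.0)]
assert sorted(join_first) == sorted(agg_first)

# Now break the uniqueness assumption with a duplicated order_id.
con.executescript("""
    CREATE TABLE orders_dup(order_id INTEGER, customer_id INTEGER);
    INSERT INTO orders_dup VALUES (1, 100), (1, 100);
""")

# Join-first (the original query's semantics): the duplicate doubles
# the sum, giving [(1, 100, 60.0)].
dup_join_first = con.execute("""
    SELECT ol.order_id, o.customer_id, SUM(ol.amount)
    FROM order_line ol JOIN orders_dup o ON ol.order_id = o.order_id
    GROUP BY ol.order_id, o.customer_id
""").fetchall()

# Aggregate-first instead repeats the undoubled sum once per duplicate:
# [(1, 100, 30.0), (1, 100, 30.0)] -- a different answer.
dup_agg_first = con.execute("""
    SELECT x.order_id, o.customer_id, x.total
    FROM (SELECT order_id, SUM(amount) AS total
          FROM order_line GROUP BY order_id) x
    JOIN orders_dup o ON x.order_id = o.order_id
""").fetchall()

assert sorted(dup_join_first) != sorted(dup_agg_first)
```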

As for why it should be faster, my intuition is that the aggregate
might be slightly faster because the rows passing through it can be
narrower, but the real win is the join should be quite a lot faster
because the number of rows that need to be joined is probably being
reduced by the aggregation step, possibly quite drastically.
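
Putting rough numbers on that intuition, using the row counts from
the plans below (a back-of-the-envelope sketch, not a cost model):

```python
# Row counts taken from the EXPLAIN ANALYZE output in this mail:
# lineitem has ~60M rows, but only 15M distinct order keys.
lineitem_rows = 59_986_052
distinct_orderkeys = 15_000_000

# Join-first: the merge join must emit one row per lineitem row.
join_first_rows_joined = lineitem_rows

# Aggregate-first: grouping lineitem on l_orderkey collapses it to
# one row per order key before the join ever runs.
agg_first_rows_joined = distinct_orderkeys

reduction = join_first_rows_joined / agg_first_rows_joined
print(f"the join sees {reduction:.1f}x fewer rows")  # roughly 4x here
```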

Here's basically that example with 10GB of TPC-H data:

rhaas=# explain analyze select l_orderkey, o_custkey,
sum(l_extendedprice) from lineitem, orders where l_orderkey =
o_orderkey group by 1, 2;

                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=15665868.56..17015554.73 rows=59986052 width=48) (actual time=114794.016..156967.012 rows=15000000 loops=1)
   Group Key: lineitem.l_orderkey, orders.o_custkey
   ->  Sort  (cost=15665868.56..15815833.69 rows=59986052 width=24) (actual time=114793.999..126303.591 rows=59986052 loops=1)
         Sort Key: lineitem.l_orderkey, orders.o_custkey
         Sort Method: external sort  Disk: 2031640kB
         ->  Merge Join  (cost=244.37..4225682.88 rows=59986052 width=24) (actual time=0.020..56368.444 rows=59986052 loops=1)
               Merge Cond: (orders.o_orderkey = lineitem.l_orderkey)
               ->  Index Scan using orders_pkey on orders  (cost=0.43..665749.44 rows=15000000 width=12) (actual time=0.009..5722.540 rows=15000000 loops=1)
               ->  Index Scan using idx_lineitem_orderkey on lineitem  (cost=0.56..2772694.34 rows=59986052 width=16) (actual time=0.008..25766.270 rows=59986052 loops=1)
 Planning time: 0.251 ms
 Execution time: 158320.203 ms
(11 rows)

rhaas=# explain analyze select l_orderkey, o_custkey, l_extendedprice
from (select l_orderkey, sum(l_extendedprice) l_extendedprice from
lineitem group by 1) x, orders where l_orderkey = o_orderkey;

                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1.00..3789127.49 rows=378670 width=48) (actual time=0.104..76107.364 rows=15000000 loops=1)
   Merge Cond: (lineitem.l_orderkey = orders.o_orderkey)
   ->  GroupAggregate  (cost=0.56..3077357.98 rows=378670 width=40) (actual time=0.071..59870.322 rows=15000000 loops=1)
         Group Key: lineitem.l_orderkey
         ->  Index Scan using idx_lineitem_orderkey on lineitem  (cost=0.56..2772694.34 rows=59986052 width=16) (actual time=0.057..24382.053 rows=59986052 loops=1)
   ->  Index Scan using orders_pkey on orders  (cost=0.43..665749.44 rows=15000000 width=12) (actual time=0.030..6115.856 rows=15000000 loops=1)
 Planning time: 0.117 ms
 Execution time: 77266.600 ms
(8 rows)

Actually, the major culprit here is that the join-first plan
introduces a totally useless sort, but it looks like the
aggregate-first plan would be modestly faster even if the planner
were smart enough not to do that.

If I set work_mem to 1GB and max_parallel_workers_per_gather to 8,
the time for the aggregate-first plan improves to 66s, but the
join-first plan doesn't change.

I'm not sure this is actually a great example; 15 million output rows
is a lot.  But I've seen this kind of thing plenty of times in the
real world; see also David Rowley's example.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
