On Thu, May 14, 2015 at 10:54 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> In any case, the key question if we're to have Paths representing
> higher-level computations is "what do we hang our lists of such Paths
> off of?".

Yeah, I was wondering about that, too.

> If we have say both GROUP BY and LIMIT, it's important to
> distinguish Paths that purport to do only the grouping step from those
> that do both the grouping and the limit.  For the scan/join part of
> planning, we do this by attaching the Paths to RelOptInfos that denote
> various levels of joining ... but how does that translate to the higher
> level processing steps?  Perhaps we could make dummy RelOptInfos that
> correspond to having completed different steps of processing; but I've
> not got a clear idea about how many such RelOptInfos we'd need, and
> in particular not about whether we need to cater for completing those
> steps in different orders.

Well, I'm just shooting from the hip here, but it seems to me that the
basic pipeline as it exists today is Join -> Aggregate -> SetOp ->
Limit -> LockRows. I don't think Limit or LockRows can be moved any
earlier.  SetOps have a lot in common with Aggregates, though, and
might be able to commute.  For instance, if you had an EXCEPT that was
going to knock out most of the rows and also a DISTINCT, you might
want to postpone the DISTINCT until after you do the EXCEPT, or you
might just want to do both at once:

rhaas=# explain select distinct * from (values (1), (1), (2), (3)) x
except select 3;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.06..0.17 rows=4 width=3)
   ->  Append  (cost=0.06..0.16 rows=5 width=3)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.06..0.14 rows=4 width=4)
               ->  HashAggregate  (cost=0.06..0.10 rows=4 width=4)
                     Group Key: "*VALUES*".column1
                     ->  Values Scan on "*VALUES*"  (cost=0.00..0.05 rows=4 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=0)
               ->  Result  (cost=0.00..0.01 rows=1 width=0)
(8 rows)

In an ideal world, we'd only hash once.  Here the HashAggregate builds
a hash table to implement the DISTINCT and then the HashSetOp builds a
second one to implement the EXCEPT, even though a single HashSetOp over
the raw VALUES rows would produce the same answer, since EXCEPT already
removes duplicates.

And both set operations and aggregates can sometimes commute with
joins, which seems like the stickiest wicket.  There can't be more than
a couple of levels of grouping or aggregation in the same subquery
(GROUP BY, DISTINCT, UNION?), but there can be lots of joins, and if a
particular aggregate can be implemented at lots of different points in
the join tree, things start to get complicated.
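
For concreteness, here's a made-up example (the table and column names
are hypothetical, not from anything upthread) where the same aggregate
could run either above or below a join:

-- Aggregate above the join: join first, then group.
SELECT c.id, c.name, sum(li.amount)
FROM customer c JOIN line_item li ON li.customer_id = c.id
GROUP BY c.id, c.name;

-- Aggregate below the join: group line_item first, then join the
-- (presumably much smaller) result to customer.
SELECT c.id, c.name, li.total
FROM customer c
     JOIN (SELECT customer_id, sum(amount) AS total
           FROM line_item
           GROUP BY customer_id) li ON li.customer_id = c.id;

Both forms give the same answer, but a planner that only considers
grouping at the top of the join tree will never find the second plan.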

Like, if you've got a SELECT DISTINCT ON (a.x) ... FROM
...lots...-type query, I think you can pretty much slap the DISTINCT
on there at any point in the join nest, probably ideally at the point
where the number of rows is the lowest it's going to get, or maybe
where the sort order is convenient.  For a GROUP BY query with
ungrouped dependent columns, you can do the aggregation before or
after the joins that merely fetch those dependent columns, but you
must do it after the joins that are needed to provide all the values
the aggregates are computed over.  If you model this with RelOptInfos,
you're basically going to need a RelOptInfo to say "I include these
relids and these aggregation or setop operations".  So in the worst
case each agg/setop doubles the number of RelOptInfos, although in
practice it probably won't, because you won't have infinite
flexibility to reorder things.
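
And here's the kind of GROUP BY case I mean, again with invented
names, where orders.id is assumed to be the primary key:

SELECT o.id, c.name, count(*) AS n_items
FROM orders o
     JOIN order_line ol ON ol.order_id = o.id
     JOIN customer c ON c.id = o.customer_id
GROUP BY o.id, c.name;

The join to order_line has to happen before the aggregation, because
it feeds count(*); the join to customer only fetches c.name, which is
determined by o.id, so it could be done either before or after the
aggregate.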

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

