Recently, I've been researching and putting together a proposal for window
functions. I have not finished this but when I do, I will post it. A nice
list of examples can be found here[1].

Rather than spend a lot of time talking about the problems window
functions present to the planner and executor, I'd like to bring up the
topic of an existing piece of SQL which is well understood and presents a
similar problem. Consider the following query:

select saledate, count(distinct prodid), count(distinct sellerid),
       sum(price) from sales group by 1;

The point here is that aggregates usually just receive the input that the
lower level of the plan generates. When qualified with distinct, however,
that changes. Notice that the counts on prodid and sellerid receive
unique input while the sum on price does not.

We do not create separate plan nodes to achieve this. Rather, it is a
property of the aggregation code inside the executor[2].
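As a rough sketch of this behaviour (plain Python with invented names, not
the actual executor code), each DISTINCT-qualified aggregate buffers its
argument and sorts/deduplicates it before running its transition function,
while plain aggregates consume every input row:

```python
# Hypothetical sketch of per-aggregate DISTINCT handling inside an
# aggregation node (illustrative only; not PostgreSQL internals).
# DISTINCT aggregates buffer their argument values, then sort and
# deduplicate them before the transition function runs; plain
# aggregates see every row directly.

def aggregate_group(rows, aggs):
    """rows: list of dicts for one group; aggs: list of agg specs."""
    states = [agg["init"] for agg in aggs]
    collected = [[] for _ in aggs]  # buffered args for DISTINCT aggs

    for row in rows:
        for i, agg in enumerate(aggs):
            value = row[agg["arg"]]
            if agg["distinct"]:
                collected[i].append(value)  # buffer; dedup later
            else:
                states[i] = agg["trans"](states[i], value)

    for i, agg in enumerate(aggs):
        if agg["distinct"]:
            # This is the implicit sort/unique step described above.
            for value in sorted(set(collected[i])):
                states[i] = agg["trans"](states[i], value)
    return states

rows = [
    {"prodid": 1, "sellerid": 7, "price": 10},
    {"prodid": 1, "sellerid": 8, "price": 20},
    {"prodid": 2, "sellerid": 7, "price": 30},
]
aggs = [
    {"arg": "prodid",   "distinct": True,  "init": 0, "trans": lambda s, v: s + 1},
    {"arg": "sellerid", "distinct": True,  "init": 0, "trans": lambda s, v: s + 1},
    {"arg": "price",    "distinct": False, "init": 0, "trans": lambda s, v: s + v},
]
```

With the sample rows, the two DISTINCT counts see the deduplicated values
[1, 2] and [7, 8] respectively, while the sum on price sees all three rows.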

It seems to me that by creating actual plan nodes for this distinct step
we can improve the range of options for executing these types of queries.
For example, consider a more complex query than the above that
groups over a join using a join key of saledate, prodid (and that the
planner implements with a merge join). This means that the sort order is
preserved and count(distinct prodid) will receive sorted input. As far as
I can tell, however, the executor doesn't know this but the planner
does. That is, the sort step inside the aggregate code is redundant.

Another way it could be improved is if we ever introduce a 'hash distinct'
execution node. This has been discussed before.
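For illustration only (a Python sketch, not PostgreSQL internals): if the
planner knows the input is already sorted, "distinct" degenerates to
comparing each value with its predecessor, while a hash distinct trades
the sort for a hash table and gives up output order:

```python
# Two ways to produce distinct input for an aggregate (sketch only).

def distinct_sorted(values):
    """Streaming dedup: valid only when the input is already sorted
    (e.g. order preserved through a merge join), so no extra sort."""
    prev = object()  # sentinel unequal to any input value
    for v in values:
        if v != prev:
            yield v
        prev = v

def distinct_hashed(values):
    """'Hash distinct': no sort required, but output order is
    whatever order the values first appear in."""
    seen = set()
    for v in values:
        if v not in seen:
            seen.add(v)
            yield v
```

The first variant is what a sort-aware executor could use to skip the
redundant sort; the second is the hypothetical "hash distinct" node.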

My interest here is not so much to do with distinct aggregates but,
rather, window functions. Window functions have this same problem as the
input to the functions is generally sorted by different keys.

So, hypothetically, let's say we wanted to create a plan for the above
query which had an explicit Sort -> Unique 'branch' for each of the
distinct aggregates. This is actually difficult to represent with the
existing plan constructs, as it turns out.

Currently, the plan looks something like:

                 Agg
                  ^
                  |
                 Sort
              (saledate)
                  ^
                  |
                 Scan

(the per-aggregate sort/unique work is hidden inside the Agg node).

What we want to do is have a kind of 'sub plan' for each aggregate. In
effect, the plan might start looking like a directed graph.  Here is part
of the plan as a directed graph.

              ^                 ^
              |                 |
              |               Unique
              |                 ^
              |                 |
            Sort               Sort
          (saledate)    (saledate,prodid)
              ^                 ^
              |                 |
              -------------- Fan Out ------------...

This idea was presented by Brian Hagenbuch at Greenplum. He calls it a
'Fan Out' plan. It is trivial to rejoin the data because all data input to
the aggregates is sorted by the same primary key. Also, we could/would
optimize this to perform a single sort for those columns which require the
same sort order (saledate and sum(price) in the example query). An extra
step would be required for (some) window functions because they won't
necessarily preserve order. That's not important now.
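A minimal sketch of the fan-out idea in Python (names and sample data are
invented; illustrative only): one shared scan feeds three branches, each
with its own sort (and unique step where needed), and the branch results
are trivially rejoined on saledate:

```python
# Sketch of a 'Fan Out' plan: one scan output feeds several branches,
# each sorting (and optionally deduplicating) by its own key; branch
# outputs are rejoined per group key (saledate). Illustrative only.

from itertools import groupby
from operator import itemgetter

def fan_out(scan_rows):
    rows = list(scan_rows)  # the shared scan output

    # Branch 1: sort by (saledate, prodid), unique, count per saledate
    b1 = sorted(set((r["saledate"], r["prodid"]) for r in rows))
    cp = {k: sum(1 for _ in g) for k, g in groupby(b1, key=itemgetter(0))}

    # Branch 2: sort by (saledate, sellerid), unique, count per saledate
    b2 = sorted(set((r["saledate"], r["sellerid"]) for r in rows))
    cs = {k: sum(1 for _ in g) for k, g in groupby(b2, key=itemgetter(0))}

    # Branch 3: sort by (saledate) only, no unique, sum(price)
    b3 = sorted((r["saledate"], r["price"]) for r in rows)
    sp = {k: sum(p for _, p in g) for k, g in groupby(b3, key=itemgetter(0))}

    # Rejoin: trivial because every branch is keyed by saledate.
    return {d: (cp[d], cs[d], sp[d]) for d in sp}

sample_rows = [
    {"saledate": "d1", "prodid": 1, "sellerid": 7, "price": 10},
    {"saledate": "d1", "prodid": 1, "sellerid": 8, "price": 20},
    {"saledate": "d2", "prodid": 2, "sellerid": 7, "price": 30},
]
```

Note how branches 1 and 2 each pay for their own sort, which is exactly
what the single-sort optimization mentioned above would avoid for columns
sharing a sort order.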

An alternative approach would be a 'pipeline' design. This would fit into
the existing plan structure better. In this approach, each aggregate
would be a distinct step in the plan.

                        Finalize (like Result)
                         Agg (sum(price))
                           Sort (saledate)
                         Agg (count(sellerid))
                    Sort (saledate, sellerid)/Unique
                         Agg (count(prodid))
                    Sort (saledate, prodid)/Unique

Now, this would actually work as follows: the input is received from
the Scan node. We sort the input by the key (saledate, prodid) and produce
a unique result. This is fed to the aggregate, which produces a result for
a distinct grouping expression. We then pass the output of the aggregate
for this grouping expression up the tree along with a pointer to the scan
data. We do not discard the result of the sort on (saledate, prodid)
because we will need to use it again (for the next grouping expression).

We head up the tree. We produce a new sort, this time sorted on the
key (saledate, sellerid). We get the result for the current grouping
expression. We do this until we hit the top level of the plan and produce
a single tuple. The top level continues to pull from the plan until it
runs out of data, naturally.
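The walk just described might be sketched as follows (hypothetical Python,
with invented names; not executor code): each step receives the scan data
plus the results accumulated so far, performs its own sort (with Unique
for DISTINCT aggregates), and passes both up the tree:

```python
# Sketch of the 'pipeline' design: each aggregate is its own plan step.
# A step re-sorts the shared scan output by its key, applies Unique when
# the aggregate is DISTINCT, aggregates per saledate, and passes its
# per-group results up together with the scan data. Illustrative only.

def agg_step(scan_rows, results, key_cols, agg_col, agg_fn, distinct, out):
    tuples = [(tuple(r[c] for c in key_cols), r[agg_col]) for r in scan_rows]
    if distinct:
        tuples = set(tuples)           # the Sort/Unique step
    per_group = {}
    for key, value in sorted(tuples):  # sorted: one group at a time
        per_group.setdefault(key[0], []).append(value)
    results[out] = {g: agg_fn(vs) for g, vs in per_group.items()}
    return scan_rows, results          # scan data flows up unchanged

sample_rows = [
    {"saledate": "d1", "prodid": 1, "sellerid": 7, "price": 10},
    {"saledate": "d1", "prodid": 1, "sellerid": 8, "price": 20},
    {"saledate": "d2", "prodid": 2, "sellerid": 7, "price": 30},
]
state = (sample_rows, {})
state = agg_step(*state, ("saledate", "prodid"), "prodid", len, True, "cp")
state = agg_step(*state, ("saledate", "sellerid"), "sellerid", len, True, "cs")
state = agg_step(*state, ("saledate",), "price", sum, False, "sp")
```

Each call corresponds to one Sort/Unique + Agg pair in the diagram above,
with the scan data threaded through rather than re-read.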

This approach looks inefficient but it should use a similar amount of
resources as we use at the moment. There are optimizations we can make
specifically for window functions and probably for the sample query as
well but that isn't really the point. A query which would generate a plan
like this is:

     select xx.saledate, xx.cp, yy.cs
     from
         (select saledate, count(prodid) as cp
          from (select distinct saledate, prodid from sales) x
          group by saledate
          ) xx,
         (select saledate, count(sellerid) as cs
          from (select distinct saledate, sellerid from sales) y
          group by saledate
          ) yy
     where xx.saledate = yy.saledate;

But with the output of the scan on sales shared between the two subqueries.

The question is, what would be the best way for data to flow when you have
aggregates expecting different multisets as their input? Are there other,
better ways to represent it than I have presented here? Which of these
approaches is best (if they are worthwhile at all)?



[2] See advance_aggregates() and process_sorted_aggregate().
