Recently, I've been researching and putting together a proposal for window functions. I haven't finished it yet, but when I do, I will post it. A nice list of examples can be found here[1].
Rather than spend a lot of time talking about the problems window functions present to the planner and executor, I'd like to bring up an existing piece of SQL which is well understood and presents a similar problem. Consider the following query:

  select saledate, count(distinct prodid), count(distinct sellerid),
         sum(price)
  from sales
  group by 1;

The point here is that aggregates usually just receive the input that the lower level of the plan generates. When qualified with distinct, however, that changes: notice that the counts on prodid and sellerid receive unique input while the sum on price does not. We do not create separate plan nodes to achieve this. Rather, it is a property of the aggregation code inside the executor[2].

It seems to me that by creating actual plan nodes for this distinct step we can improve the range of options for executing these types of queries. For example, consider a more complex query than the above which groups over a join using a join key of (saledate, prodid), and which the planner implements with a merge join. This means that the sort order is preserved and count(distinct prodid) will receive sorted input. As far as I can tell, however, the executor doesn't know this even though the planner does. That is, the sort step inside the aggregate code is redundant. Another way it could be improved is if we ever introduce a 'hash distinct' execution node; this has been discussed before.

My interest here is not so much distinct aggregates but, rather, window functions, which have the same problem: the input to the functions is generally sorted by different keys. So, hypothetically, let's say we wanted to create a plan for the above query which had an explicit Sort -> Unique 'branch' for each of the distinct aggregates. As it turns out, this is actually difficult to represent with the existing plan constructs.
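To make the "different input per aggregate" point concrete, here is a minimal Python sketch (not PostgreSQL code; the table data is made up) of what the executor effectively does inside a single grouping step: the distinct-qualified aggregates see deduplicated input while sum(price) sees every row.

```python
# Sketch, not PostgreSQL code: how distinct-qualified aggregates see
# different input than a plain aggregate within the same group.
from itertools import groupby
from operator import itemgetter

# (saledate, prodid, sellerid, price) rows, already sorted by saledate
sales = [
    ("2007-01-01", 1, 10, 5.0),
    ("2007-01-01", 1, 11, 7.0),   # duplicate prodid 1
    ("2007-01-01", 2, 10, 3.0),   # duplicate sellerid 10
    ("2007-01-02", 1, 10, 4.0),
]

result = []
for saledate, grp in groupby(sales, key=itemgetter(0)):
    rows = list(grp)
    result.append((
        saledate,
        len({r[1] for r in rows}),   # count(distinct prodid): unique input
        len({r[2] for r in rows}),   # count(distinct sellerid): unique input
        sum(r[3] for r in rows),     # sum(price): sees every row
    ))

print(result)
# [('2007-01-01', 2, 2, 15.0), ('2007-01-02', 1, 1, 4.0)]
```

The deduplication here happens inside the per-group loop, which is the analogue of it being a property of the aggregation code rather than a visible plan node.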
Currently, the plan looks something like:

  GroupAggregate
        ^
        |
     Sort Op
        ^
        |
      Scan

What we want to do is have a kind of 'sub plan' for each aggregate. In effect, the plan might start looking like a directed graph. Here is part of the plan as a directed graph:

               GroupAggregate
     /---------------^---------------...
     |               |
     |               ^
     |             Unique
     |               ^
     |               |
    Sort            Sort
  (saledate)  (saledate,prodid)
     ^               ^
     |               |
     ----------- Fan Out ------------...
                 ^
                 |
               Scan

This idea was presented by Brian Hagenbuch at Greenplum. He calls it a 'Fan Out' plan. It is trivial to rejoin the data because all data input to the aggregates is sorted by the same primary key. Also, we could/would optimize this to perform a single sort for those columns which require the same sort order (saledate and sum(price) in the example query). An extra step would be required for (some) window functions because they won't necessarily preserve order, but that's not important now.

An alternative approach would be a 'pipeline' design, which would fit into the existing plan structure better. In this approach, each aggregate would be a distinct step in the plan:

  Finalize (like Result)
        ^
        |
  Agg (sum(price))
        ^
        |
  Sort (saledate)
        ^
        |
  Agg (count(sellerid))
        ^
        |
  Sort (saledate, sellerid)/Unique
        ^
        |
  Agg (count(prodid))
        ^
        |
  Sort (saledate, prodid)/Unique
        ^
        |
      Scan

This would actually work as follows: the input is received from the Scan node. We sort the input by the key (saledate, prodid) and produce a unique result. That is input to the aggregate, and we produce a result for a distinct grouping expression. We then pass the output of the aggregate for this grouping expression up the tree along with a pointer to the scan data. We do not discard the result of the sort on (saledate, prodid) because we will need it again (for the next grouping expression). We head up the tree, produce a new sort, this time on the key (saledate, sellerid), and get the result for the current grouping expression.
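The pipeline of per-aggregate Sort[/Unique] -> Agg steps can be sketched roughly as follows (a Python sketch under the same hypothetical sales data, not PostgreSQL code; each step re-derives its own ordering from the shared scan output and a Finalize step stitches the per-group results back together on saledate):

```python
# Sketch of the 'pipeline' plan shape: each aggregate is its own step that
# sorts the shared scan output on its own key, optionally uniquifies it,
# aggregates per saledate, and hands its results upward.

sales = [
    ("2007-01-01", 1, 10, 5.0),
    ("2007-01-01", 1, 11, 7.0),
    ("2007-01-01", 2, 10, 3.0),
    ("2007-01-02", 1, 10, 4.0),
]

def step(rows, extract, combine, distinct):
    """One Sort[/Unique] -> Agg step: sort (saledate, x), aggregate per date."""
    pairs = [(r[0], extract(r)) for r in rows]
    if distinct:
        pairs = set(pairs)          # the Unique node
    grouped = {}
    for d, v in sorted(pairs):      # the Sort node
        grouped.setdefault(d, []).append(v)
    return {d: combine(vs) for d, vs in grouped.items()}

cp = step(sales, lambda r: r[1], len, distinct=True)   # count(distinct prodid)
cs = step(sales, lambda r: r[2], len, distinct=True)   # count(distinct sellerid)
sp = step(sales, lambda r: r[3], sum, distinct=False)  # sum(price)

# Finalize: rejoin the per-step results on the shared grouping key.
final = [(d, cp[d], cs[d], sp[d]) for d in sorted(cp)]
print(final)
# [('2007-01-01', 2, 2, 15.0), ('2007-01-02', 1, 1, 4.0)]
```

The rejoin in Finalize is cheap for the same reason the Fan Out rejoin is: every step's output is grouped on the same key (saledate), so the streams line up.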
We do this until we hit the top level of the plan and produce a single tuple. This top level continues to pull from the plan until it runs out of data, naturally. This approach looks inefficient, but it should use a similar amount of resources as we do at the moment. There are optimizations we can make specifically for window functions, and probably for the sample query as well, but that isn't really the point. A query which would generate a plan like this is:

  select xx.saledate, xx.cp, yy.cs
  from (select saledate, count(prodid) as cp
        from (select distinct saledate, prodid from sales) x
        group by saledate) xx,
       (select saledate, count(sellerid) as cs
        from (select distinct saledate, sellerid from sales) y
        group by saledate) yy
  where xx.saledate = yy.saledate;

but with sharing of the output of the scan on sales.

The question is: what would be the best way for data to flow when you have aggregates expecting different multisets as their input? Are there other, better ways to represent it than I have presented here? Which of these approaches is best (if they are worthwhile at all)?

Thanks,

Gavin

[1] http://www.akadia.com/services/ora_analytic_functions.html
[2] See advance_aggregates() and process_sorted_aggregate().