Hi all,

The purpose of this email is to describe how window functions are currently computed and to try to come up with "better" ways to do it.
DRILL-3200 <https://issues.apache.org/jira/browse/DRILL-3200> added support for RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK and CUME_DIST, but it also made some significant improvements to the way Drill computes window functions. The general idea was to update the code to only support the default frame, which makes it run faster and use less memory.

WindowFrameRecordBatch works similarly to StreamingAggregate: it requires the data to be sorted on the partition and order-by columns, and it only computes one frame at a time. With the default frame we only need to aggregate every row once. Memory consumption depends on the data, but in general each record batch is kept in memory until we are ready to process all its rows, which becomes possible once we find the last peer row of the batch's last row. Drill's external sort can spill to disk if the data is too big, and we only need to keep at most one partition's worth of data in memory for the window functions to be computed (when the OVER clause doesn't contain an ORDER BY).

Each time a batch is ready to be processed, we do the following:
1- we start with its first row (the current row),
2- we compute the length of the current row's frame (in this case, we find the number of peer rows of the current row),
3- we aggregate all rows of the current frame (this includes computing the window function values),
4- we write the aggregated value into each row of the current frame,
5- we then move to the first non-peer row, which becomes the current row,
6- if we haven't reached the end of the current batch, we go back to step 2.
(A rough Java sketch of this loop is included below my signature.)

With all this in mind, how can we improve the performance of window functions?

Thanks!

--
Abdelhakim Deneche
Software Engineer
<http://www.mapr.com/>
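
P.S. For anyone who wants something more concrete, here is a minimal, self-contained sketch of the loop above. This is NOT Drill's actual code: it runs over a plain in-memory list instead of value vectors and record batches, and the names (Row, isPeer, computeRunningSum) are made up for the example. It computes SUM(value) OVER (PARTITION BY part ORDER BY ord) with the default frame, assuming the rows are already sorted on the partition and order-by columns.

import java.util.Arrays;
import java.util.List;

/**
 * Toy illustration (not Drill's code) of the frame-by-frame loop described
 * above, for SUM(value) OVER (PARTITION BY part ORDER BY ord) with the
 * default frame. Input rows must already be sorted on part and ord.
 */
public class DefaultFrameSketch {

  static final class Row {
    final String part;  // partition column
    final int ord;      // order-by column
    final long value;   // value being aggregated
    long sum;           // window function output

    Row(String part, int ord, long value) {
      this.part = part;
      this.ord = ord;
      this.value = value;
    }
  }

  /** Two rows are peers when they are in the same partition and equal on the order-by column. */
  static boolean isPeer(Row a, Row b) {
    return a.part.equals(b.part) && a.ord == b.ord;
  }

  static void computeRunningSum(List<Row> rows) {
    int current = 0;      // 1- start with the first row (current row)
    long runningSum = 0;  // accumulates over the partition, so each row is aggregated only once
    while (current < rows.size()) {
      Row frameStart = rows.get(current);
      if (current > 0 && !rows.get(current - 1).part.equals(frameStart.part)) {
        runningSum = 0;   // new partition: reset the accumulation
      }
      // 2- compute the frame length: count the peer rows of the current row
      int frameEnd = current;
      while (frameEnd < rows.size() && isPeer(frameStart, rows.get(frameEnd))) {
        frameEnd++;
      }
      // 3- aggregate all rows of the current frame
      for (int i = current; i < frameEnd; i++) {
        runningSum += rows.get(i).value;
      }
      // 4- write the aggregated value into each row of the current frame
      for (int i = current; i < frameEnd; i++) {
        rows.get(i).sum = runningSum;
      }
      // 5- move to the first non-peer row; 6- loop until we reach the end of the data
      current = frameEnd;
    }
  }

  public static void main(String[] args) {
    List<Row> rows = Arrays.asList(
        new Row("a", 1, 10), new Row("a", 1, 20), new Row("a", 2, 5),
        new Row("b", 1, 7),  new Row("b", 2, 3));
    computeRunningSum(rows);
    for (Row r : rows) {
      System.out.println(r.part + " " + r.ord + " -> " + r.sum);
    }
    // prints: a 1 -> 30, a 1 -> 30, a 2 -> 35, b 1 -> 7, b 2 -> 10
  }
}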
