GitHub user hvanhovell opened a pull request:

    https://github.com/apache/spark/pull/7057

    [SPARK-8638] [SQL] Window Function Performance Improvements

    ## Description
    Performance improvements for Spark Window functions. This PR will also 
serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA 
tickets SPARK-8638 and SPARK-7712 for more information.
    
    The original work including some benchmarking code for the running case can 
be here: https://github.com/hvanhovell/spark-window
    
    ## Improvements
    * Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED 
PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current 
implementation in spark uses a sliding window approach in these cases. This 
means that an aggregate is maintained for every row, so space usage is N (N 
being the number of rows). This also means that all these aggregates all need 
to be updated separately, this takes N*(N-1)/2 updates. The running case 
differs from the Sliding case because we are only adding data to an aggregate 
function (no reset is required), we only need to maintain one aggregate (like 
in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each 
row, and get the aggregate value after each update. This is what the new 
implementation does. This approach only uses 1 buffer, and only requires N 
updates; I am currently working on data with window sizes of 500-1000 doing 
running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED 
FOLLOWING case als
 o uses this approach and the fact that aggregate operations are communitative, 
there is one twist though it will process the input buffer in reverse.
    * Fewer comparisons in the sliding case. The current implementation 
determines frame boundaries for every input row. The new implementation makes 
more use of the fact that the window is sorted, maintains the boundaries, and 
only moves them when the current row order changes. This is a minor improvement.
    * A single Window node is able to process all types of Frames for the same 
Partitioning/Ordering. This saves a little time/memory spent buffering and 
managing partitions. This will be enabled in a follow-up PR.
    * A lot of the staging code is moved from the execution phase to the 
initialization phase. Minor performance improvement, and improves readability 
of the execution code.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark SPARK-8638

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7057.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7057
    
----
commit f161920218c880e4f2804b19db129936d0d34d61
Author: Herman van Hovell <hvanhov...@questtec.nl>
Date:   2015-06-27T15:39:53Z

    Major overhaul of Window operator.

commit 22e51d39c833d0e69a788ad41b8bd790688b9bd9
Author: Herman van Hovell <hvanhov...@questtec.nl>
Date:   2015-06-27T15:54:00Z

    Fixed "aggregation and range betweens with unbounded" test - two different 
window frames were compared.

commit ad7820c6ac04f188e8a6239d314b680c8a6b4551
Author: Herman van Hovell <hvanhov...@questtec.nl>
Date:   2015-06-27T16:00:23Z

    Disabled Tests 42 & 43 because tiny numerical differences in answers.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to