[ 
https://issues.apache.org/jira/browse/SPARK-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8638:
-----------------------------------

    Assignee:     (was: Apache Spark)

> Window Function Performance Improvements
> ----------------------------------------
>
>                 Key: SPARK-8638
>                 URL: https://issues.apache.org/jira/browse/SPARK-8638
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Herman van Hovell tot Westerflier
>
> Improve the performance of Spark Window Functions in the following cases:
> # Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED 
> PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current 
> implementation in spark uses a sliding window approach in these cases. This 
> means that an aggregate is maintained for every row, so space usage is N (N 
> being the number of rows). This also means that all these aggregates all need 
> to be updated separately, this takes N*(N-1)/2 updates. The running case 
> differs from the Sliding case because we are only adding data to an aggregate 
> function (no reset is required), we only need to maintain one aggregate (like 
> in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each 
> row, and get the aggregate value after each update. This is what the new 
> implementation does. This approach only uses 1 buffer, and only requires N 
> updates; I am currently working on data with window sizes of 500-1000 doing 
> running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED 
> FOLLOWING case also uses this approach and the fact that aggregate operations 
> are communitative, there is one twist though it will process the input buffer 
> in reverse.
> #. Fewer comparisons in the sliding case. The current implementation 
> determines frame boundaries for every input row. The new implementation makes 
> more use of the fact that the window is sorted, maintains the boundaries, and 
> only moves them when the current row order changes. This is a minor 
> improvement.
> # A single Window node is able to process all types of Frames for the same 
> Partitioning/Ordering. This saves a little time/memory spent buffering and 
> managing partitions. This will be enabled in a follow-up PR.
> # A lot of the staging code is moved from the execution phase to the 
> initialization phase. Minor performance improvement, and improves readability 
> of the execution code.
> The original work including some benchmarking code for the running case can 
> be here: https://github.com/hvanhovell/spark-window



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to