[
https://issues.apache.org/jira/browse/SPARK-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Herman van Hovell tot Westerflier updated SPARK-8638:
-----------------------------------------------------
Summary: Window Function Performance Improvements (was: Spark Window
Function Performance Improvements)
> Window Function Performance Improvements
> ----------------------------------------
>
> Key: SPARK-8638
> URL: https://issues.apache.org/jira/browse/SPARK-8638
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Herman van Hovell tot Westerflier
>
> Improve the performance of Spark Window Functions in the following cases:
> # Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED
> PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current
> implementation in spark uses a sliding window approach in these cases. This
> means that an aggregate is maintained for every row, so space usage is N (N
> being the number of rows). This also means that all these aggregates all need
> to be updated separately, this takes N*(N-1)/2 updates. The running case
> differs from the Sliding case because we are only adding data to an aggregate
> function (no reset is required), we only need to maintain one aggregate (like
> in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each
> row, and get the aggregate value after each update. This is what the new
> implementation does. This approach only uses 1 buffer, and only requires N
> updates; I am currently working on data with window sizes of 500-1000 doing
> running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED
> FOLLOWING case also uses this approach and the fact that aggregate operations
> are communitative, there is one twist though it will process the input buffer
> in reverse.
> #. Fewer comparisons in the sliding case. The current implementation
> determines frame boundaries for every input row. The new implementation makes
> more use of the fact that the window is sorted, maintains the boundaries, and
> only moves them when the current row order changes. This is a minor
> improvement.
> # A single Window node is able to process all types of Frames for the same
> Partitioning/Ordering. This saves a little time/memory spent buffering and
> managing partitions. This will be enabled in a follow-up PR.
> # A lot of the staging code is moved from the execution phase to the
> initialization phase. Minor performance improvement, and improves readability
> of the execution code.
> The original work including some benchmarking code for the running case can
> be here: https://github.com/hvanhovell/spark-window
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]