[
https://issues.apache.org/jira/browse/SPARK-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yin Huai updated SPARK-8638:
----------------------------
Assignee: Herman van Hovell
> Window Function Performance Improvements
> ----------------------------------------
>
> Key: SPARK-8638
> URL: https://issues.apache.org/jira/browse/SPARK-8638
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Herman van Hovell
> Assignee: Herman van Hovell
> Fix For: 1.5.0
>
> Attachments: perf_test.scala, perf_test2.scala, perf_test3.scala
>
>
> Improve the performance of Spark Window Functions in the following cases:
> # Much better performance (10x) in the running case (e.g. BETWEEN UNBOUNDED
> PRECEDING AND CURRENT ROW). The current implementation in spark uses a
> sliding window approach in these cases. This means that an aggregate is
> maintained for every row, so space usage is N (N being the number of rows).
> This also means that all these aggregates all need to be updated separately,
> this takes N*(N-1)/2 updates. The running case differs from the Sliding case
> because we are only adding data to an aggregate function (no reset is
> required), we only need to maintain one aggregate (like in the UNBOUNDED
> PRECEDING AND UNBOUNDED case), update the aggregate for each row, and get the
> aggregate value after each update. This is what the new implementation does.
> This approach only uses 1 buffer, and only requires N updates; I am currently
> working on data with window sizes of 500-1000 doing running sums and this
> saves a lot of time.
> #. Fewer comparisons in the sliding case. The current implementation
> determines frame boundaries for every input row. The new implementation makes
> more use of the fact that the window is sorted, maintains the boundaries, and
> only moves them when the current row order changes. This is a minor
> improvement.
> # A single Window node is able to process all types of Frames for the same
> Partitioning/Ordering. This saves a little time/memory spent buffering and
> managing partitions. This will be enabled in a follow-up PR.
> # A lot of the staging code is moved from the execution phase to the
> initialization phase. Minor performance improvement, and improves readability
> of the execution code.
> The attached perf_test.scala file contains s number of queries which can be
> used to measure the differences between the current and the proposed window
> function implementation. In the tests the new implementation outperforms the
> current implementation by a factor 7x in sliding window cases, and by a
> factor 14x in the running window cases.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]