GitHub user hvanhovell opened a pull request:
https://github.com/apache/spark/pull/7057
[SPARK-8638] [SQL] Window Function Performance Improvements
## Description
Performance improvements for Spark Window functions. This PR will also
serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA
tickets SPARK-8638 and SPARK-7712 for more information.
The original work including some benchmarking code for the running case can
be here: https://github.com/hvanhovell/spark-window
## Improvements
* Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current
implementation in spark uses a sliding window approach in these cases. This
means that an aggregate is maintained for every row, so space usage is N (N
being the number of rows). This also means that all these aggregates all need
to be updated separately, this takes N*(N-1)/2 updates. The running case
differs from the Sliding case because we are only adding data to an aggregate
function (no reset is required), we only need to maintain one aggregate (like
in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each
row, and get the aggregate value after each update. This is what the new
implementation does. This approach only uses 1 buffer, and only requires N
updates; I am currently working on data with window sizes of 500-1000 doing
running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED
FOLLOWING case als
o uses this approach and the fact that aggregate operations are communitative,
there is one twist though it will process the input buffer in reverse.
* Fewer comparisons in the sliding case. The current implementation
determines frame boundaries for every input row. The new implementation makes
more use of the fact that the window is sorted, maintains the boundaries, and
only moves them when the current row order changes. This is a minor improvement.
* A single Window node is able to process all types of Frames for the same
Partitioning/Ordering. This saves a little time/memory spent buffering and
managing partitions. This will be enabled in a follow-up PR.
* A lot of the staging code is moved from the execution phase to the
initialization phase. Minor performance improvement, and improves readability
of the execution code.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hvanhovell/spark SPARK-8638
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7057.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7057
----
commit f161920218c880e4f2804b19db129936d0d34d61
Author: Herman van Hovell <[email protected]>
Date: 2015-06-27T15:39:53Z
Major overhaul of Window operator.
commit 22e51d39c833d0e69a788ad41b8bd790688b9bd9
Author: Herman van Hovell <[email protected]>
Date: 2015-06-27T15:54:00Z
Fixed "aggregation and range betweens with unbounded" test - two different
window frames were compared.
commit ad7820c6ac04f188e8a6239d314b680c8a6b4551
Author: Herman van Hovell <[email protected]>
Date: 2015-06-27T16:00:23Z
Disabled Tests 42 & 43 because tiny numerical differences in answers.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]