GitHub user hvanhovell opened a pull request:
https://github.com/apache/spark/pull/18470
[SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling
## What changes were proposed in this pull request?
`WindowExec` currently stores complex objects (UnsafeRow, UnsafeArrayData,
UnsafeMapData, UTF8String) improperly during aggregation: the buffer used by
`GeneratedMutableProjections` keeps a reference to the actual input data.
Things go wrong when the input object (or its backing bytes) is reused for
other things. This can happen in window functions once they start spilling to
disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses
the buffer to which the `UnsafeRow` points, leading to weird corruption
scenarios. Note that this only happens for aggregate functions that preserve
(parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.
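The failure mode described above can be illustrated with a small, self-contained Python analogy. This is only a sketch of the aliasing problem, not Spark code: `read_spilled_rows` and `first_by_reference` are hypothetical names, the shared `bytearray` stands in for the buffer that the spill reader reuses, and a `memoryview` stands in for an `UnsafeRow` that points into that buffer.

```python
# Illustrative analogy only: none of these names are Spark APIs.

def read_spilled_rows(rows, buffer):
    """Yield each row as a view into one shared buffer that is overwritten
    for every row (mimicking a reader that recycles its read buffer)."""
    for row in rows:
        memoryview(buffer)[:len(row)] = row   # overwrite the shared bytes
        yield memoryview(buffer)[:len(row)]   # hand out a view, not a copy


def first_by_reference(records):
    """A FIRST-style aggregate that keeps a reference to its input."""
    first = None
    for record in records:
        if first is None:
            first = record  # stores a view into the shared buffer
    return bytes(first)


rows = [b"aaa", b"bbb", b"ccc"]
result = first_by_reference(read_spilled_rows(rows, bytearray(3)))
print(result)  # b'ccc', although the first row was b'aaa'
```

The aggregate believes it holds the first row, but the bytes behind its reference have been overwritten by every later row.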
This was not seen before because the spilling logic rarely performed actual
spills and instead used an in-memory page. This page was not cleaned up during
window processing, which ensured that unsafe objects pointed to their own
dedicated memory location. This was changed by
https://github.com/apache/spark/pull/16909; after that PR, Spark spills more
eagerly.
This PR provides a surgical fix because we are close to releasing Spark
2.2. This change just makes sure that there cannot be any object reuse, at the
expense of a little bit of performance. We will follow up with a more subtle
solution at a later point.
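The trade-off can be sketched in plain Python. The actual patch is not shown in this message, so the following is an assumption-laden analogy rather than the change itself: `read_spilled_rows` and `first_with_copy` are hypothetical names, and the shared `bytearray` stands in for a reused read buffer. Copying the value on capture costs an extra allocation per preserved value, but the result no longer depends on what later happens to the buffer.

```python
# Illustrative analogy only: none of these names are Spark APIs.

def read_spilled_rows(rows, buffer):
    """Yield each row as a view into one shared buffer that is overwritten
    for every row (mimicking a reader that recycles its read buffer)."""
    for row in rows:
        memoryview(buffer)[:len(row)] = row
        yield memoryview(buffer)[:len(row)]


def first_with_copy(records):
    """A FIRST-style aggregate that copies the value it preserves."""
    first = None
    for record in records:
        if first is None:
            first = bytes(record)  # defensive copy: owns its own memory
    return first


rows = [b"aaa", b"bbb", b"ccc"]
result = first_with_copy(read_spilled_rows(rows, bytearray(3)))
print(result)  # b'aaa'
```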
## How was this patch tested?
Added a regression test to `DataFrameWindowFunctionsSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hvanhovell/spark SPARK-21258
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18470.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18470
----
commit a41335fd7f297628542214d8e1aebe737bcc0828
Author: Herman van Hovell <[email protected]>
Date: 2017-06-29T21:56:28Z
Fix WindowExec complex object preserving aggregation in combination with
spilling.
----