GitHub user hvanhovell opened a pull request:

    https://github.com/apache/spark/pull/18470

    [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling

    ## What changes were proposed in this pull request?
    `WindowExec` currently stores complex objects (UnsafeRow, 
UnsafeArrayData, UnsafeMapData, UTF8String) improperly during aggregation: the 
buffer used by `GeneratedMutableProjections` keeps a reference to the actual 
input data. Things go wrong when the input object (or its backing bytes) is 
reused for other things. This can happen in window functions once spilling to 
disk starts. When reading back the spill files, the `UnsafeSorterSpillReader` 
reuses the buffer to which the `UnsafeRow` points, leading to weird corruption 
scenarios. Note that this only happens for aggregate functions that preserve 
(parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.
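
The aliasing problem described above can be illustrated with a small, self-contained toy. This is only a sketch in plain Python, not Spark code: `ReusingReader` and `first_by_reference` are hypothetical names, the `bytearray` stands in for the buffer a spill reader recycles, and the `memoryview` stands in for an unsafe object that merely points at those bytes.

```python
class ReusingReader:
    """Toy stand-in for a spill reader that recycles one buffer for every record."""

    def __init__(self, records):
        self._records = records
        self._buffer = bytearray(max(len(r) for r in records))

    def __iter__(self):
        for record in self._records:
            # Overwrite the shared buffer in place, as a reusing reader would.
            self._buffer[:len(record)] = record
            # Hand out a view into the shared buffer, not a copy.
            yield memoryview(self._buffer)[:len(record)]


def first_by_reference(rows):
    """FIRST-style aggregate that keeps only a reference to the first row."""
    first = None
    for row in rows:
        if first is None:
            first = row  # aliasing: still points into the reader's buffer
    return bytes(first)


records = [b"aaa", b"bbb", b"ccc"]
result = first_by_reference(ReusingReader(records))
print(result)  # b'ccc': the "first" value was silently overwritten
```

An aggregate that only computes a new value from its input (a sum, for instance) would not be affected in this toy, because it never holds on to the view itself.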
    
    This was not seen before because the spilling logic rarely performed actual 
spills and instead used an in-memory page. This page was not cleaned up during 
window processing, which ensured that unsafe objects pointed to their own 
dedicated memory location. This was changed by 
https://github.com/apache/spark/pull/16909; since that PR, Spark spills more 
eagerly.
    
    This PR provides a surgical fix because we are close to releasing Spark 
2.2. This change just makes sure that there cannot be any object reuse, at the 
expense of a little bit of performance. We will follow up with a more subtle 
solution at a later point.
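
The trade-off mentioned above (no object reuse, slightly slower) can be sketched with a defensive copy. This is a toy in plain Python under stated assumptions, not the code of this patch: `reusing_reader` and `first_with_copy` are hypothetical names, and the `bytearray` again stands in for a buffer that the reader recycles between records.

```python
def reusing_reader(records):
    """Toy reader that yields views into one shared, recycled buffer."""
    buffer = bytearray(max(len(r) for r in records))
    for record in records:
        buffer[:len(record)] = record  # overwrite the shared buffer in place
        yield memoryview(buffer)[:len(record)]


def first_with_copy(rows):
    """FIRST-style aggregate that copies the row before storing it."""
    first = None
    for row in rows:
        if first is None:
            # The copy costs one extra allocation, but the stored value no
            # longer depends on what the reader does with its buffer.
            first = bytes(row)
    return first


result = first_with_copy(reusing_reader([b"aaa", b"bbb", b"ccc"]))
print(result)  # b'aaa'
```

Copying every stored value is the simplest way to rule out aliasing; a more selective approach would copy only when the input is known to be backed by a reused buffer.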
    
    ## How was this patch tested?
    Added a regression test to `DataFrameWindowFunctionsSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark SPARK-21258

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18470.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18470
    
----
commit a41335fd7f297628542214d8e1aebe737bcc0828
Author: Herman van Hovell <[email protected]>
Date:   2017-06-29T21:56:28Z

    Fix WindowExec complex object preserving aggregation in combination with 
spilling.

----

