GitHub user hvanhovell opened a pull request:

    https://github.com/apache/spark/pull/13723

    [SPARK-15822][SQL] Prevent byte array backed classes from referencing freed 
memory

    ## What changes were proposed in this pull request?
    `UTF8String` and all `Unsafe*` classes are backed by either on-heap or 
off-heap byte arrays. The code-generated version of `SortMergeJoin` buffers the 
left-hand-side join keys during iteration. This is a problem in off-heap mode 
when one of the keys is a `UTF8String` (or any other `Unsafe*` object) and the 
left-hand-side iterator has been exhausted (and has released its memory): the 
buffered keys then reference freed memory. Using one of these buffered keys 
causes segfaults and other undefined behavior.
    
    This PR fixes this problem by creating copies of the buffered variables. I 
have added a general method to the `CodeGenerator` for this. I have checked all 
places in which this could happen, and only `SortMergeJoin` had this problem.
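    
    The failure mode and the fix can be sketched in a few lines of plain Java. 
This is a minimal illustration, not the code in this PR: `Page`, `StringView` 
and `demo()` are hypothetical stand-ins, not Spark classes, and the simulated 
page throws on a read after `free()`, whereas real off-heap memory would give 
undefined behavior instead of a clean exception.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-ins for illustration only; none of these are Spark classes.
final class BufferedKeySketch {

    /** Simulates a block of memory owned by an iterator; free() invalidates it. */
    static final class Page {
        private byte[] bytes;

        Page(byte[] contents) {
            this.bytes = contents.clone();
        }

        void free() {
            bytes = null;
        }

        byte[] read(int offset, int length) {
            if (bytes == null) {
                throw new IllegalStateException("read from freed page");
            }
            return Arrays.copyOfRange(bytes, offset, offset + length);
        }
    }

    /** A string that is only a view (page, offset, length) over memory it does not own. */
    static final class StringView {
        private final Page page;
        private final int offset;
        private final int length;

        StringView(Page page, int offset, int length) {
            this.page = page;
            this.offset = offset;
            this.length = length;
        }

        /** Copies the bytes into a private page, so the result outlives the source page. */
        StringView copy() {
            return new StringView(new Page(page.read(offset, length)), 0, length);
        }

        @Override
        public String toString() {
            return new String(page.read(offset, length), StandardCharsets.UTF_8);
        }
    }

    /** Buffers one key by reference and one by copy, then frees the source page. */
    static String demo() {
        Page leftPage = new Page("key-1".getBytes(StandardCharsets.UTF_8));
        StringView current = new StringView(leftPage, 0, 5);

        StringView bufferedByReference = current;   // the problematic pattern
        StringView bufferedByCopy = current.copy(); // the pattern this sketch recommends

        leftPage.free(); // stands in for the exhausted iterator releasing its memory

        String byReference;
        try {
            byReference = bufferedByReference.toString();
        } catch (IllegalStateException e) {
            byReference = "error: " + e.getMessage();
        }
        return bufferedByCopy.toString() + " | " + byReference;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

    In this sketch the copied key still reads back as `key-1` after the page 
is freed, while the key buffered by reference fails on access.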
    
    This PR is largely based on the work of @robbinspg and he should be 
credited for this.
    
    closes https://github.com/apache/spark/pull/13707
    
    ## How was this patch tested?
    Manually tested on problematic workloads.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark SPARK-15822-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13723.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13723
    
----
commit d201034e628456b9640eb8483da849217aa80c92
Author: Pete Robbins <[email protected]>
Date:   2016-06-16T14:33:31Z

    Create copy of UTF8String in SMJ

commit 1288f1e88f90b751cd0640864d7cc6cb5a9dfeca
Author: Pete Robbins <[email protected]>
Date:   2016-06-16T15:21:23Z

    make copy() public

commit d3ddaa9ab31be84d8a9c12ec52706c8529cf5594
Author: Pete Robbins <[email protected]>
Date:   2016-06-16T15:36:25Z

    Fix scalastyle

commit 535cd1591d83e4b98f4bb277b2c119349aef30b2
Author: Pete Robbins <[email protected]>
Date:   2016-06-16T18:43:59Z

    CHeck for data type during generation

commit be0484c8c7cc00812fd626f90f68fc85d1d3e6bc
Author: Herman van Hovell <[email protected]>
Date:   2016-06-16T20:52:17Z

    Merge remote-tracking branch 'apache-github/master' into SPARK-15822-2

commit 8d2d8078bb8c6ddde697b05e1afcf5b4e8812e3d
Author: Herman van Hovell <[email protected]>
Date:   2016-06-17T00:19:37Z

    Move buffering logic into code generator. Use UTF8String clone().

commit 5c33eac8139bcca81a59327cd30d3eac89310bb9
Author: Herman van Hovell <[email protected]>
Date:   2016-06-17T00:41:16Z

    Clean-up

----

