GitHub user hvanhovell opened a pull request:
https://github.com/apache/spark/pull/13723
[SPARK-15822][SQL] Prevent byte array backed classes from referencing freed
memory
## What changes were proposed in this pull request?
`UTF8String` and all `Unsafe*` classes are backed by either on-heap or
off-heap byte arrays. The code-generated version of `SortMergeJoin` buffers the
left-hand-side join keys during iteration. This is a problem in off-heap mode
when one of the keys is a `UTF8String` (or any other `Unsafe*` object) and the
left-hand-side iterator has been exhausted (and has released its memory): the
buffered keys then reference freed memory. Using one of these buffered keys
causes segfaults and other undefined behavior.
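The failure mode can be illustrated with a small analogy. This is a sketch only: it is written in Python, not Spark code, and `Page`, `view`, `free`, and `demo` are hypothetical names invented for the illustration. A `memoryview` stands in for an object backed by a byte array it does not own, and overwriting the buffer stands in for the iterator releasing its memory.

```python
class Page:
    """Hypothetical stand-in for a memory page owned by an iterator."""

    def __init__(self, data: bytes) -> None:
        self.buf = bytearray(data)

    def view(self, start: int, length: int) -> memoryview:
        # Returns a view that shares the page's bytes (no copy is made).
        return memoryview(self.buf)[start:start + length]

    def free(self) -> None:
        # Simulate the memory being released and reused by someone else:
        # the bytes are overwritten in place.
        for i in range(len(self.buf)):
            self.buf[i] = 0xFF


def demo() -> tuple[bytes, bytes]:
    page = Page(b"join-key")
    buffered_view = page.view(0, 8)         # still points into the page
    buffered_copy = bytes(page.view(0, 8))  # independent copy of the bytes
    page.free()                             # iterator exhausted, memory released
    return bytes(buffered_view), buffered_copy
```

After `free()`, the view no longer holds the original key, while the copy does. In the off-heap case described above the outcome is worse than stale bytes, because the memory really is freed.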
This PR fixes this problem by creating copies of the buffered variables. I
have added a general method to the `CodeGenerator` for this. I have checked all
places in which this could happen, and only `SortMergeJoin` had this problem.
This PR is largely based on the work of @robbinspg and he should be
credited for this.
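As a rough sketch of the idea, and not the actual `CodeGenerator` method added by this PR, a code-generation helper could choose the buffering statement from the data type of the variable. The function name `buffered_assignment` and the type names used here are hypothetical. The emitted `clone()` call follows the "Use UTF8String clone()" commit message, and emitting `copy()` for the other byte-array-backed types is an assumption.

```python
def buffered_assignment(data_type: str, target: str, source: str) -> str:
    """Return a Java statement (as text) that buffers `source` into `target`.

    Hypothetical helper: the type names and the copy()/clone() choice are
    assumptions made for illustration only.
    """
    if data_type == "string":
        # Strings are backed by a byte array, so take an owned copy.
        return f"{target} = {source}.clone();"
    if data_type in ("struct", "array", "map"):
        # Other byte-array-backed values are copied as well.
        return f"{target} = {source}.copy();"
    # Primitive values hold no reference to the underlying memory.
    return f"{target} = {source};"
```

For example, `buffered_assignment("string", "bufferedKey", "leftKey")` yields `bufferedKey = leftKey.clone();`, whereas an `int` key is assigned directly.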
closes https://github.com/apache/spark/pull/13707
## How was this patch tested?
Manually tested on problematic workloads.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hvanhovell/spark SPARK-15822-2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13723.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13723
----
commit d201034e628456b9640eb8483da849217aa80c92
Author: Pete Robbins <[email protected]>
Date: 2016-06-16T14:33:31Z
Create copy of UTF8String in SMJ
commit 1288f1e88f90b751cd0640864d7cc6cb5a9dfeca
Author: Pete Robbins <[email protected]>
Date: 2016-06-16T15:21:23Z
make copy() public
commit d3ddaa9ab31be84d8a9c12ec52706c8529cf5594
Author: Pete Robbins <[email protected]>
Date: 2016-06-16T15:36:25Z
Fix scalastyle
commit 535cd1591d83e4b98f4bb277b2c119349aef30b2
Author: Pete Robbins <[email protected]>
Date: 2016-06-16T18:43:59Z
CHeck for data type during generation
commit be0484c8c7cc00812fd626f90f68fc85d1d3e6bc
Author: Herman van Hovell <[email protected]>
Date: 2016-06-16T20:52:17Z
Merge remote-tracking branch 'apache-github/master' into SPARK-15822-2
commit 8d2d8078bb8c6ddde697b05e1afcf5b4e8812e3d
Author: Herman van Hovell <[email protected]>
Date: 2016-06-17T00:19:37Z
Move buffering logic into code generator. Use UTF8String clone().
commit 5c33eac8139bcca81a59327cd30d3eac89310bb9
Author: Herman van Hovell <[email protected]>
Date: 2016-06-17T00:41:16Z
Clean-up
----