GitHub user mateiz opened a pull request:
https://github.com/apache/spark/pull/1722
SPARK-2792. Fix reading too much or too little data from each stream in
ExternalMap / Sorter
All these changes are from @mridulm's work in #1609, but extracted here to
fix this specific issue and make it easier to merge into 1.1. This particular
set of changes is to make sure that we read exactly the right range of bytes
from each spill file in EAOM: some serializers can write bytes after the last
object (e.g. the TC_RESET flag in Java serialization) and that would confuse
the previous code into reading it as part of the next batch. There are also
improvements to cleanup to make sure files are closed.
In addition to bringing in the changes to ExternalAppendOnlyMap, I also
copied them to the corresponding code in ExternalSorter and updated its test
suite to test for the same issues.
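
As a rough illustration of the batch-boundary problem described above, here is
a minimal Java sketch. It is not the Spark code: `BatchedSpillDemo`,
`writeBatches`, `limit` and `readBatches` are hypothetical names, and an
in-memory byte array stands in for the spill file. It assumes each batch is
written with its own ObjectOutputStream and that `reset()` is called after
every object, so each batch ends with a TC_RESET byte. The reader records the
byte length of every batch and hands each deserializer a stream bounded to
exactly that length, so trailing bytes cannot leak into the next batch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; none of these names come from Spark.
final class BatchedSpillDemo {

    // Append each batch to the "spill file" with a fresh ObjectOutputStream
    // and record how many bytes that batch occupies.
    static long[] writeBatches(ByteArrayOutputStream file, List<List<String>> batches)
            throws IOException {
        long[] sizes = new long[batches.size()];
        for (int i = 0; i < batches.size(); i++) {
            int start = file.size();
            ObjectOutputStream out = new ObjectOutputStream(file);
            for (String s : batches.get(i)) {
                out.writeObject(s);
                out.reset(); // writes TC_RESET, including after the last object
            }
            out.flush();
            sizes[i] = file.size() - start; // includes the trailing TC_RESET
        }
        return sizes;
    }

    // Expose at most `limit` bytes of `in`, then report end of stream.
    static InputStream limit(final InputStream in, final long limit) {
        return new InputStream() {
            private long left = limit;

            @Override public int read() throws IOException {
                if (left <= 0) return -1;
                int b = in.read();
                if (b >= 0) left--;
                return b;
            }

            @Override public int read(byte[] buf, int off, int len) throws IOException {
                if (len == 0) return 0;
                if (left <= 0) return -1;
                int n = in.read(buf, off, (int) Math.min(len, left));
                if (n > 0) left -= n;
                return n;
            }
        };
    }

    // Read every batch back, giving each deserializer exactly its own byte range.
    static List<List<String>> readBatches(byte[] data, long[] sizes)
            throws IOException, ClassNotFoundException {
        InputStream file = new ByteArrayInputStream(data);
        List<List<String>> result = new ArrayList<>();
        for (long size : sizes) {
            ObjectInputStream in = new ObjectInputStream(limit(file, size));
            List<String> batch = new ArrayList<>();
            try {
                while (true) {
                    batch.add((String) in.readObject());
                }
            } catch (EOFException endOfBatch) {
                // The bounded stream ended; the trailing TC_RESET was consumed
                // by the last readObject() call before it hit end of stream.
            }
            result.add(batch);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c", "d", "e"));
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        long[] sizes = writeBatches(file, batches);
        System.out.println(readBatches(file.toByteArray(), sizes).equals(batches));
    }
}
```

The point of the bounded wrapper is that the reader never has to guess where a
batch ends from the objects it has decoded so far; the recorded byte count is
the single source of truth, whatever the serializer appends after the last
object.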
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mateiz/spark spark-2792
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1722.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1722
----
commit 9a78e4b2fdf6ca20667aca478e39ad7fa5a34e11
Author: Matei Zaharia <[email protected]>
Date: 2014-08-01T22:02:13Z
Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
All these changes are from @mridulm's work in #1609, but extracted here
to fix this specific issue. This particular set of changes is to make
sure that we read exactly the right range of bytes from each spill file
in EAOM: some serializers can write bytes after the last object (e.g.
the TC_RESET flag in Java serialization) and that would confuse the
previous code into reading it as part of the next batch. There are also
improvements to cleanup to make sure files are closed.
commit 0d6dad7dc0f38cc7accb89b75c848a2b31fe254c
Author: Matei Zaharia <[email protected]>
Date: 2014-08-01T22:12:28Z
Added Mridul's test changes for ExternalAppendOnlyMap
commit bda37bb431d44c11c097497fd389d6ab2b97c69c
Author: Matei Zaharia <[email protected]>
Date: 2014-08-01T22:38:01Z
Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
Modified ExternalSorterSuite to also set a low object stream reset and
batch size, and verified that it failed before the changes and succeeded
after.
----