GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/6415
[SPARK-7873] [WIP] Fix another bug related to KryoSerializerInstance re-use
in sort-shuffle
This is a somewhat obscure bug, but I think that it will seriously impact
KryoSerializer users whose custom registrators disable auto-reset. With
auto-reset disabled, some of our shuffle paths break, because they end up
creating multiple OutputStreams from the same shared SerializerInstance
(which is unsafe).
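
To make the hazard concrete, here is a minimal toy model in Python. It is a sketch and not Kryo's or Spark's actual implementation: the names ToySerializerInstance, ToyOutputStream, read_stream and second_stream_is_readable are all hypothetical, and the "back-reference table" is an assumed stand-in for whatever state a serializer carries between writes when auto-reset is off. The point it illustrates is that state shared by two streams can leak from one stream into the other, so the second stream is no longer readable on its own.

```python
class ToySerializerInstance:
    """Hypothetical stand-in for a serializer instance that keeps
    back-reference state between writes (assumption, not Kryo's code)."""

    def __init__(self, auto_reset=True):
        self.auto_reset = auto_reset
        # id(obj) -> reference number; shared by every stream of this instance
        self._seen = {}

    def serialize_stream(self):
        return ToyOutputStream(self)


class ToyOutputStream:
    def __init__(self, instance):
        self._instance = instance
        self.records = []

    def write_object(self, obj):
        seen = self._instance._seen
        if id(obj) in seen:
            # Object was already written somewhere: emit a back-reference.
            self.records.append(("ref", seen[id(obj)]))
        else:
            seen[id(obj)] = len(seen)
            self.records.append(("value", obj))
        if self._instance.auto_reset:
            # With auto-reset, no state survives past a single write.
            seen.clear()


def read_stream(records):
    """Read one stream in isolation, resolving back-references locally."""
    table, out = [], []
    for kind, payload in records:
        if kind == "value":
            table.append(payload)
            out.append(payload)
        else:
            # Raises IndexError if the reference points into another stream.
            out.append(table[payload])
    return out


def second_stream_is_readable(auto_reset):
    instance = ToySerializerInstance(auto_reset=auto_reset)
    first = instance.serialize_stream()
    second = instance.serialize_stream()
    record = ["some", "record"]
    first.write_object(record)
    second.write_object(record)  # same object, different stream
    try:
        return read_stream(second.records) == [record]
    except IndexError:
        return False
```

In this model, second_stream_is_readable(True) succeeds, while second_stream_is_readable(False) fails because the second stream contains only a reference to an object that was written to the first stream.
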
The bug was introduced by a patch (SPARK-3386) which enables serializer re-use
in some of the shuffle paths, since constructing new serializer instances is
actually pretty costly for KryoSerializer. We had already fixed another
corner-case bug related to this (SPARK-7766), but missed this one. From an
engineering risk management perspective, we probably should have just reverted
the original serializer reuse patch and added a big
cross-product-of-configurations-and-shuffle-managers test suite before
attempting to fix the defects.
I think that I have a pretty simple fix for this, but we still might want
to consider a revert for 1.4 just to be safe.
For now, this PR adds a regression test for this bug. I'll push follow-up
commits that add asserts to find any other instances of this bug, followed by
another commit to try to fix this issue.
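
As an illustration of what such an assert could look like, the Python sketch below guards a serializer instance so that it refuses to hand out a second stream while the previous one is still open. This is only one possible shape for the check, under the assumption that "one open stream per instance" is the invariant being enforced; the names GuardedSerializerInstance and GuardedStream are hypothetical, and the actual follow-up commits are not shown in this message.

```python
class GuardedStream:
    """Hypothetical stream that only tracks whether it has been closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class GuardedSerializerInstance:
    """Hypothetical instance enforcing at most one open stream at a time."""

    def __init__(self):
        self._current = None

    def serialize_stream(self):
        # Fail fast if the caller tries to interleave two streams.
        if self._current is not None and not self._current.closed:
            raise AssertionError(
                "previous stream from this serializer instance is still open")
        self._current = GuardedStream()
        return self._current
```

Opening a stream, closing it, and opening another is allowed; asking for a second stream before closing the first raises AssertionError, which would surface any code path that shares an instance across concurrently open streams.
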
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark SPARK-7873
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6415.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6415
----
commit 735088693fd7d02dacaf2173a7c314099a9ee391
Author: Josh Rosen <[email protected]>
Date: 2015-05-26T17:25:01Z
Add failing regression test for SPARK-7873
----