GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/6415

    [SPARK-7873] [WIP] Fix another bug related to KryoSerializerInstance re-use 
in sort-shuffle

    This is a somewhat obscure bug, but I think that it will seriously impact 
KryoSerializer users who use custom registrators that disable auto-reset. 
When auto-reset is disabled, some of our shuffle paths break because they 
end up creating multiple OutputStreams from the same shared 
SerializerInstance, which is unsafe. 
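
    To illustrate the failure mode, here is a toy model in Python. It is not 
Kryo's API or wire format: `ToySerializerInstance`, `read_all`, and 
`write_to_two_streams` are hypothetical names invented for this sketch. It 
assumes only that a serializer remembers what it has already written and 
emits back-references, and that auto-reset clears that memory after each 
write.

```python
import io


class ToySerializerInstance:
    """Hypothetical stand-in for a serializer instance; not Kryo's API."""

    def __init__(self, auto_reset):
        self.auto_reset = auto_reset
        self._ids = {}  # value -> id assigned the first time it was written

    def write(self, stream, value):
        if value in self._ids:
            # Already seen: emit a compact back-reference instead of the value.
            stream.write(f"ref:{self._ids[value]}\n")
        else:
            self._ids[value] = len(self._ids)
            stream.write(f"val:{value}\n")
        if self.auto_reset:
            # Forget everything after each write, so no state can leak.
            self._ids.clear()


def read_all(data):
    """Decode one stream on its own, without access to any other stream."""
    defined, out = [], []
    for line in data.splitlines():
        kind, _, payload = line.partition(":")
        if kind == "val":
            defined.append(payload)
            out.append(payload)
        else:
            # Raises IndexError when the id was defined in a different stream.
            out.append(defined[int(payload)])
    return out


def write_to_two_streams(auto_reset):
    """Write the same record to two streams through one shared instance."""
    shared = ToySerializerInstance(auto_reset)
    first, second = io.StringIO(), io.StringIO()
    shared.write(first, "record")
    shared.write(second, "record")
    return first.getvalue(), second.getvalue()
```

    In this model, `write_to_two_streams(auto_reset=True)` produces two 
streams that each decode independently. With `auto_reset=False`, the second 
stream contains only a back-reference to state held in the first, so 
`read_all` fails on it with an IndexError.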
    
    This was introduced by a patch (SPARK-3386) that enabled serializer re-use 
in some of the shuffle paths, since constructing new serializer instances is 
pretty costly for KryoSerializer.  We had already fixed another corner-case 
bug related to this (SPARK-7766), but missed this one.  From an engineering 
risk management perspective, we probably should have just reverted the 
original serializer re-use patch and added a big 
cross-product-of-configurations-and-shuffle-managers test suite before 
attempting to fix the defects.
    
    I think that I have a pretty simple fix for this, but we still might want 
to consider a revert for 1.4 just to be safe.
    
    For now, this PR adds a regression test for this bug.  I'll push follow-up 
commits that add asserts to find any other instances of this bug, followed by 
another commit that tries to fix this issue.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-7873

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6415.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6415
    
----
commit 735088693fd7d02dacaf2173a7c314099a9ee391
Author: Josh Rosen <[email protected]>
Date:   2015-05-26T17:25:01Z

    Add failing regression test for SPARK-7873

----

