GitHub user JoshRosen reopened a pull request:

    https://github.com/apache/spark/pull/15043

    [SPARK-17491] Close serialization stream to fix wrong answer bug in 
putIteratorAsBytes()

    ## What changes were proposed in this pull request?
    
    `MemoryStore.putIteratorAsBytes()` may silently lose values when used with 
`KryoSerializer` because it does not properly close the serialization stream 
before attempting to deserialize the already-serialized values, which may cause 
values buffered in Kryo's internal buffers to not be read.
    
    This is the root cause behind a user-reported "wrong answer" bug in PySpark 
caching reported by @bennoleslie on the Spark user mailing list in a thread 
titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's 
automatic use of KryoSerializer for "safe" types (such as byte arrays, 
primitives, etc.) this misuse of serializers manifested itself as silent data 
corruption rather than a StreamCorrupted error (which you might get from 
JavaSerializer). 
    
    The minimal fix, implemented here, is to close the serialization stream 
before attempting to deserialize written values.
    
    ## How was this patch tested?
    
    The original bug was masked by an invalid assert in the memory store test 
cases: the old assert compared two results record-by-record with `zip` but 
didn't first check that the lengths of the two collections were equal, causing 
missing records to go unnoticed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark 
partially-serialized-block-values-iterator-bugfix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15043.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15043
    
----
commit bb6fe380c46e7a0d3dad7623f50d81861a07b1d6
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-10T06:40:25Z

    Enhance existing tests to demonstrate bug.

commit 35a32e712e6a0a18f28ff776e35783ae034084ec
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-10T06:41:21Z

    Minimal (?) fix.

commit 189283d182528cbcfc595be40a96ac65eadeb525
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T20:50:46Z

    Merge remote-tracking branch 'origin/master' into 
partially-serialized-block-values-iterator-bugfix

commit 5b3e4e315a4890a241d85f36677c423c9f12a099
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T23:51:20Z

    Add new test suite and define invariants more clearly.

commit bd38509052d2b4b8a5cf7267ba6150530015225c
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T23:57:05Z

    Enforce API contract

commit 2acd2d264f8316be65900d5abbf5913d7bf317c9
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-14T00:05:36Z

    Fix invalid discard() call that could cause double-free of unroll memory.

commit 2f43e69c69e28ae76364155b9c8a178380b55ff3
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-14T00:06:42Z

    Scalastyle.

commit c4e50e6133921f5377deb9f86eb6a1c7653e95a7
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-14T18:36:08Z

    Fix test in MemoryStoreSuite by being more explicit about laziness.

commit ccf929fbfbea6b2ed71c29ae973398235a920ed7
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-14T19:36:28Z

    Enforce that BBOS is closed before use and is not written after close().

commit daf447bdd612289f014d199007ed84049d79a009
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-14T22:28:13Z

    Fix ChunkedByteBufferOutputStreamSuite

commit 4f009d32d73f6c5bcdc5f2b46f01dffedd8cc926
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-15T23:23:44Z

    Remove redundant dispose() call; add comment.

commit 55c5f1542e0a246a02823618143b5c0a456c1e04
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-15T23:46:33Z

    Properly release off-heap memory in PartiallyUnrolledIterator

commit 0d70774e1db04edb46b312efc4b1646d7201fb03
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-15T23:49:35Z

    Fix overly-clever side-effecting method.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to