GitHub user JoshRosen reopened a pull request: https://github.com/apache/spark/pull/15043
[SPARK-17491] Close serialization stream to fix wrong answer bug in putIteratorAsBytes() ## What changes were proposed in this pull request? `MemoryStore.putIteratorAsBytes()` may silently lose values when used with `KryoSerializer` because it does not properly close the serialization stream before attempting to deserialize the already-serialized values, which may cause values buffered in Kryo's internal buffers to not be read. This is the root cause behind a user-reported "wrong answer" bug in PySpark caching reported by @bennoleslie on the Spark user mailing list in a thread titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's automatic use of KryoSerializer for "safe" types (such as byte arrays, primitives, etc.) this misuse of serializers manifested itself as silent data corruption rather than a StreamCorrupted error (which you might get from JavaSerializer). The minimal fix, implemented here, is to close the serialization stream before attempting to deserialize written values. ## How was this patch tested? The original bug was masked by an invalid assert in the memory store test cases: the old assert compared two results record-by-record with `zip` but didn't first check that the lengths of the two collections were equal, causing missing records to go unnoticed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/JoshRosen/spark partially-serialized-block-values-iterator-bugfix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15043.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15043 ---- commit bb6fe380c46e7a0d3dad7623f50d81861a07b1d6 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-10T06:40:25Z Enhance existing tests to demonstrate bug. commit 35a32e712e6a0a18f28ff776e35783ae034084ec Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-10T06:41:21Z Minimal (?) fix. commit 189283d182528cbcfc595be40a96ac65eadeb525 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-13T20:50:46Z Merge remote-tracking branch 'origin/master' into partially-serialized-block-values-iterator-bugfix commit 5b3e4e315a4890a241d85f36677c423c9f12a099 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-13T23:51:20Z Add new test suite and define invariants more clearly. commit bd38509052d2b4b8a5cf7267ba6150530015225c Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-13T23:57:05Z Enforce API contract commit 2acd2d264f8316be65900d5abbf5913d7bf317c9 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-14T00:05:36Z Fix invalid discard() call that could cause double-free of unroll memory. commit 2f43e69c69e28ae76364155b9c8a178380b55ff3 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-14T00:06:42Z Scalastyle. commit c4e50e6133921f5377deb9f86eb6a1c7653e95a7 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-14T18:36:08Z Fix test in MemoryStoreSuite by being more explicit about laziness. commit ccf929fbfbea6b2ed71c29ae973398235a920ed7 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-14T19:36:28Z Enforce that BBOS is closed before use and is not written after close(). commit daf447bdd612289f014d199007ed84049d79a009 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-14T22:28:13Z Fix ChunkedByteBufferOutputStreamSuite commit 4f009d32d73f6c5bcdc5f2b46f01dffedd8cc926 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-15T23:23:44Z Remove redundant dispose() call; add comment. commit 55c5f1542e0a246a02823618143b5c0a456c1e04 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-15T23:46:33Z Properly release off-heap memory in PartiallyUnrolledIterator commit 0d70774e1db04edb46b312efc4b1646d7201fb03 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-15T23:49:35Z Fix overly-clever side-effecting method. ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org