GitHub user JoshRosen reopened a pull request:
https://github.com/apache/spark/pull/15043
[SPARK-17491] Close serialization stream to fix wrong answer bug in
putIteratorAsBytes()
## What changes were proposed in this pull request?
`MemoryStore.putIteratorAsBytes()` may silently lose values when used with
`KryoSerializer` because it does not properly close the serialization stream
before attempting to deserialize the already-serialized values, which may cause
values buffered in Kryo's internal buffers to not be read.
This is the root cause behind a user-reported "wrong answer" bug in PySpark
caching reported by @bennoleslie on the Spark user mailing list in a thread
titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's
automatic use of KryoSerializer for "safe" types (such as byte arrays,
primitives, etc.) this misuse of serializers manifested itself as silent data
corruption rather than a StreamCorrupted error (which you might get from
JavaSerializer).
The minimal fix, implemented here, is to close the serialization stream
before attempting to deserialize written values.
## How was this patch tested?
The original bug was masked by an invalid assert in the memory store test
cases: the old assert compared two results record-by-record with `zip` but
didn't first check that the lengths of the two collections were equal, causing
missing records to go unnoticed.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark
partially-serialized-block-values-iterator-bugfix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15043.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15043
----
commit bb6fe380c46e7a0d3dad7623f50d81861a07b1d6
Author: Josh Rosen <[email protected]>
Date: 2016-09-10T06:40:25Z
Enhance existing tests to demonstrate bug.
commit 35a32e712e6a0a18f28ff776e35783ae034084ec
Author: Josh Rosen <[email protected]>
Date: 2016-09-10T06:41:21Z
Minimal (?) fix.
commit 189283d182528cbcfc595be40a96ac65eadeb525
Author: Josh Rosen <[email protected]>
Date: 2016-09-13T20:50:46Z
Merge remote-tracking branch 'origin/master' into
partially-serialized-block-values-iterator-bugfix
commit 5b3e4e315a4890a241d85f36677c423c9f12a099
Author: Josh Rosen <[email protected]>
Date: 2016-09-13T23:51:20Z
Add new test suite and define invariants more clearly.
commit bd38509052d2b4b8a5cf7267ba6150530015225c
Author: Josh Rosen <[email protected]>
Date: 2016-09-13T23:57:05Z
Enforce API contract
commit 2acd2d264f8316be65900d5abbf5913d7bf317c9
Author: Josh Rosen <[email protected]>
Date: 2016-09-14T00:05:36Z
Fix invalid discard() call that could cause double-free of unroll memory.
commit 2f43e69c69e28ae76364155b9c8a178380b55ff3
Author: Josh Rosen <[email protected]>
Date: 2016-09-14T00:06:42Z
Scalastyle.
commit c4e50e6133921f5377deb9f86eb6a1c7653e95a7
Author: Josh Rosen <[email protected]>
Date: 2016-09-14T18:36:08Z
Fix test in MemoryStoreSuite by being more explicit about laziness.
commit ccf929fbfbea6b2ed71c29ae973398235a920ed7
Author: Josh Rosen <[email protected]>
Date: 2016-09-14T19:36:28Z
Enforce that BBOS is closed before use and is not written after close().
commit daf447bdd612289f014d199007ed84049d79a009
Author: Josh Rosen <[email protected]>
Date: 2016-09-14T22:28:13Z
Fix ChunkedByteBufferOutputStreamSuite
commit 4f009d32d73f6c5bcdc5f2b46f01dffedd8cc926
Author: Josh Rosen <[email protected]>
Date: 2016-09-15T23:23:44Z
Remove redundant dispose() call; add comment.
commit 55c5f1542e0a246a02823618143b5c0a456c1e04
Author: Josh Rosen <[email protected]>
Date: 2016-09-15T23:46:33Z
Properly release off-heap memory in PartiallyUnrolledIterator
commit 0d70774e1db04edb46b312efc4b1646d7201fb03
Author: Josh Rosen <[email protected]>
Date: 2016-09-15T23:49:35Z
Fix overly-clever side-effecting method.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]