GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/15043
[SPARK-17491] Close serialization stream to fix wrong answer bug in putIteratorAsBytes()
## What changes were proposed in this pull request?
`MemoryStore.putIteratorAsBytes()` may silently lose values when used with
`KryoSerializer` because it does not close the serialization stream before
attempting to deserialize the already-serialized values. As a result, values
still held in Kryo's internal buffers may be missing when the data is read back.
This is the root cause of a "wrong answer" bug in PySpark caching that
@bennoleslie reported on the Spark user mailing list in a thread titled
"pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Because Spark 2.0
automatically uses `KryoSerializer` for "safe" types (such as byte arrays and
primitives), this misuse of serializers manifested itself as silent data
corruption rather than as a `StreamCorruptedException` (which you might get
from `JavaSerializer`).
The minimal fix, implemented here, is to close the serialization stream
before attempting to deserialize written values.
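
To illustrate the failure mode, here is a minimal Python sketch of a buffering serialization stream. It is not Spark or Kryo code: `BufferingSerializationStream`, `deserialize`, and `round_trip` are hypothetical names, and the 16-byte buffer and 4-byte integer encoding are assumptions chosen to keep the example small. It only shows why reading the underlying bytes before closing the stream can drop the most recently written values.

```python
class BufferingSerializationStream:
    """Hypothetical stand-in for a serializer that buffers output internally."""

    def __init__(self, sink, buffer_size=16):
        self._sink = sink              # underlying bytearray that readers see
        self._buf = bytearray()        # internal buffer, invisible to readers
        self._buffer_size = buffer_size

    def write_object(self, value):
        # Encode each int as 4 big-endian bytes (an assumption for this sketch).
        self._buf.extend(value.to_bytes(4, "big"))
        if len(self._buf) >= self._buffer_size:
            self._flush()

    def _flush(self):
        self._sink.extend(self._buf)
        self._buf.clear()

    def close(self):
        # Closing pushes any remaining buffered bytes to the sink.
        self._flush()


def deserialize(data):
    """Decode a byte string of 4-byte big-endian ints back into a list."""
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


def round_trip(values, close_before_read):
    sink = bytearray()
    stream = BufferingSerializationStream(sink, buffer_size=16)
    for v in values:
        stream.write_object(v)
    if close_before_read:
        stream.close()
    return deserialize(bytes(sink))


# Without closing, the last two values are still in the internal buffer.
print(round_trip(list(range(10)), close_before_read=False))
# Closing first makes all ten values visible to the reader.
print(round_trip(list(range(10)), close_before_read=True))
```

In this sketch no error is raised in the unclosed case; the reader simply sees fewer records, which matches the silent nature of the bug described above.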
## How was this patch tested?
The original bug was masked by an invalid assert in the memory store test
cases: the old assert compared two results record-by-record with `zip` but
didn't first check that the lengths of the two collections were equal, causing
missing records to go unnoticed.
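
The same pitfall exists in Python, where `zip` also stops at the shorter input. The sketch below is not the Spark test code; `assert_same_records_lenient` and `assert_same_records_strict` are hypothetical helpers that show how a record-by-record comparison can pass despite missing records, and how a length check first exposes them.

```python
def assert_same_records_lenient(expected, actual):
    # zip stops at the shorter input, so trailing missing records are ignored.
    for e, a in zip(expected, actual):
        assert e == a, f"record mismatch: {e!r} != {a!r}"


def assert_same_records_strict(expected, actual):
    # Check lengths first so that dropped records fail the test.
    assert len(expected) == len(actual), (
        f"expected {len(expected)} records but got {len(actual)}"
    )
    for e, a in zip(expected, actual):
        assert e == a, f"record mismatch: {e!r} != {a!r}"


expected = [1, 2, 3, 4]
actual = [1, 2, 3]  # last record lost

assert_same_records_lenient(expected, actual)  # passes, hiding the loss

try:
    assert_same_records_strict(expected, actual)
except AssertionError as err:
    print("strict check failed:", err)
```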
TODO: I should also add a dedicated test for `PartiallySerializedBlock` in
its own suite.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark partially-serialized-block-values-iterator-bugfix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15043.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15043
----
commit bb6fe380c46e7a0d3dad7623f50d81861a07b1d6
Author: Josh Rosen <[email protected]>
Date: 2016-09-10T06:40:25Z
Enhance existing tests to demonstrate bug.
commit 35a32e712e6a0a18f28ff776e35783ae034084ec
Author: Josh Rosen <[email protected]>
Date: 2016-09-10T06:41:21Z
Minimal (?) fix.
----