GitHub user aray opened a pull request:
https://github.com/apache/spark/pull/16121
[SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records
## What changes were proposed in this pull request?
Fixes a batching-related bug in the Python implementation of the RDD cartesian
product. The bug showed up in repeated (chained) cartesian products, which
returned seemingly random results. The root cause was multiple iterators
pulling from the same stream in the wrong order because of incorrect batching logic.
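To illustrate the failure mode in isolation, here is a minimal sketch that does not use Spark at all. `shared_stream` is a hypothetical stand-in for the serialized stream, and the two `islice` views play the role of the iterators that were supposed to read it in a fixed order.

```python
import itertools


def shared_stream():
    # Hypothetical stand-in for a single underlying stream of records.
    yield from [1, 2, 3, 4]


stream = shared_stream()

# Two lazy views over the SAME underlying iterator.
first = itertools.islice(stream, 2)   # intended to read 1, 2
second = itertools.islice(stream, 2)  # intended to read 3, 4

# Consuming the views in the wrong order hands each one the other's data.
second_items = list(second)
first_items = list(first)

print(first_items, second_items)
```

Because neither view owns its data, whichever one is consumed first gets the head of the stream, which is the same kind of ordering hazard described above.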
`CartesianDeserializer` was changed to implement
`_load_stream_without_unbatching` and borrow the one line implementation of
`load_stream` from `BatchedSerializer`. The default implementation of
`_load_stream_without_unbatching` was changed to give consistent results
(always an iterable) so that it could be used without additional checks.
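The shape of this change can be sketched roughly as follows. This is a simplified, self-contained approximation rather than the code in the patch: the `Toy*` classes are hypothetical, and the "stream" is assumed to be an iterator of already-decoded batches in which key batches and value batches alternate.

```python
from itertools import chain, product


class ToyBatchedDeserializer:
    """Hypothetical stand-in for a batched serializer."""

    def _load_stream_without_unbatching(self, stream):
        # Always yields iterables (whole batches), never bare items.
        for batch in stream:
            yield batch

    def load_stream(self, stream):
        # Flatten the batches into individual items.
        return chain.from_iterable(self._load_stream_without_unbatching(stream))


class ToyCartesianDeserializer:
    """Hypothetical stand-in pairing a key batch with the following value batch."""

    def __init__(self, key_ser, val_ser):
        self.key_ser = key_ser
        self.val_ser = val_ser

    def _load_stream_without_unbatching(self, stream):
        key_batches = self.key_ser._load_stream_without_unbatching(stream)
        val_batches = self.val_ser._load_stream_without_unbatching(stream)
        # zip pulls one key batch, then one value batch, in a fixed order.
        for key_batch, val_batch in zip(key_batches, val_batches):
            # The product of the two batches is emitted as a single batch.
            yield product(key_batch, val_batch)

    def load_stream(self, stream):
        return chain.from_iterable(self._load_stream_without_unbatching(stream))


# Assumed layout: key batch, value batch, key batch, value batch.
stream = iter([[1, 2], ["a", "b"], [3], ["c"]])
deser = ToyCartesianDeserializer(ToyBatchedDeserializer(), ToyBatchedDeserializer())
pairs = list(deser.load_stream(stream))
print(pairs)
```

Emitting each product as one batch means an outer deserializer that wraps this one still sees whole batches, which is what a chained cartesian product needs.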
`PairDeserializer` was slightly modified to remove its inheritance from
`CartesianDeserializer`, as that inheritance was not really proper and no longer worked.
Both `CartesianDeserializer` and `PairDeserializer` now only extend
`Serializer` (which has no `dump_stream` implementation) since they are only
meant for *de*serialization.
## How was this patch tested?
Additional unit tests (sourced from #14248)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aray/spark fix-cartesian
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16121.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16121
----
commit a73c1a2afb0d9ae3838cff8f83bc4c13010a9e66
Author: Andrew Ray <[email protected]>
Date: 2016-12-01T20:20:56Z
unit test
commit 4ed8c388a9077b89341aaaad57301c241b6b2d2d
Author: Andrew Ray <[email protected]>
Date: 2016-12-02T15:23:11Z
working
commit a0e36522175bed10a6309b2fe2d37793d746584b
Author: Andrew Ray <[email protected]>
Date: 2016-12-02T15:32:39Z
remove unneeded debug vars and add comment
----