GitHub user aray opened a pull request:

    https://github.com/apache/spark/pull/16121

    [SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records

    ## What changes were proposed in this pull request?
    
    Fixes a bug in the Python implementation of the RDD cartesian product. The 
bug is related to batching and showed up as seemingly random results in 
repeated (chained) cartesian products. The root cause was multiple iterators 
pulling from the same stream in the wrong order because of incorrect batching 
logic.
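
A toy illustration of this failure mode, not the actual PySpark code: 
two lazy readers share one underlying stream, and the order in which they pull 
items decides whether the result is correct. The function names and the 
batch size of 2 are assumptions made for this sketch.

```python
from itertools import islice

def read_in_order(stream, batch_size=2):
    # Materialize the first batch completely before touching the second,
    # so each reader sees exactly the items that were written for it.
    left = list(islice(stream, batch_size))
    right = list(islice(stream, batch_size))
    return [(l, r) for l in left for r in right]

def read_interleaved(stream, batch_size=2):
    # Both readers are lazy and share the same stream, so the nested loop
    # makes them steal items from each other.
    left = islice(stream, batch_size)
    right = islice(stream, batch_size)
    return [(l, r) for l in left for r in right]

good = read_in_order(iter([1, 2, "a", "b"]))
bad = read_interleaved(iter([1, 2, "a", "b"]))
print(len(good), len(bad))
```

The in-order reader returns all four pairs, while the interleaved reader 
returns a different number of records with mismatched contents.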
    
    `CartesianDeserializer` was changed to implement 
`_load_stream_without_unbatching` and borrow the one-line implementation of 
`load_stream` from `BatchedSerializer`. The default implementation of 
`_load_stream_without_unbatching` was changed to give consistent results 
(always an iterable) so that it can be used without additional checks.
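
The batch-wise approach described above might look roughly like the 
following. This is a self-contained sketch, not the patch itself: 
`load_batches` and `load_stream` are hypothetical stand-ins that take 
already-deserialized batches as plain Python lists instead of reading from a 
real stream.

```python
from itertools import chain, product

def load_batches(key_batches, val_batches):
    # Pair up corresponding batches and yield one iterable per pair, so the
    # result is always an iterable of batches that callers can rely on.
    for key_batch, val_batch in zip(key_batches, val_batches):
        yield product(key_batch, val_batch)

def load_stream(key_batches, val_batches):
    # Flatten the batches into a single stream of records in one line.
    return chain.from_iterable(load_batches(key_batches, val_batches))

records = list(load_stream([[1, 2], [3]], [["a"], ["b", "c"]]))
print(records)
```

Keeping each cartesian product as one batch means an outer deserializer can 
treat the output like any other batched stream.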
    
    `PairDeserializer` was modified slightly to remove its inheritance from 
`CartesianDeserializer`, which was not really proper and no longer worked.
    
    Both `CartesianDeserializer` and `PairDeserializer` now only extend 
`Serializer` (which has no `dump_stream` implementation) since they are only 
meant for *de*serialization.
    
    ## How was this patch tested?
    
    Additional unit tests (sourced from #14248)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aray/spark fix-cartesian

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16121.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16121
    
----
commit a73c1a2afb0d9ae3838cff8f83bc4c13010a9e66
Author: Andrew Ray <ray.and...@gmail.com>
Date:   2016-12-01T20:20:56Z

    unit test

commit 4ed8c388a9077b89341aaaad57301c241b6b2d2d
Author: Andrew Ray <ray.and...@gmail.com>
Date:   2016-12-02T15:23:11Z

    working

commit a0e36522175bed10a6309b2fe2d37793d746584b
Author: Andrew Ray <ray.and...@gmail.com>
Date:   2016-12-02T15:32:39Z

    remove unneeded debug vars and add comment

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
