GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/16574
[SPARK-19189] Optimize CartesianRDD to avoid partition re-computation and
re-serialization
## What changes were proposed in this pull request?
In the current CartesianRDD implementation, suppose RDDA cartesian RDDB
generates RDDC: each RDDA partition is read by multiple RDDC partitions, and
RDDB has the same problem.
As a result, when an RDDC partition is computed, the corresponding partition
data of RDDA and RDDB is serialized (and transferred over the network)
repeatedly, and if RDDA or RDDB has not been persisted, its partitions are also
recomputed repeatedly.
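As a quick illustration of the problem (not code from this PR; the object name
and accumulator are made up for the demo), a minimal sketch with an unpersisted
parent, run locally:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of the PR.
object CartesianRecomputeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cartesian-recompute-demo"))

    // Count how many times rddA's map function runs; rddA is NOT persisted.
    val evalCount = sc.longAccumulator("rddA evaluations")
    val rddA = sc.parallelize(1 to 4, 2).map { x => evalCount.add(1); x }
    val rddB = sc.parallelize(1 to 6, 3)

    // The cartesian product has 2 * 3 = 6 partitions. Each rddA partition
    // is a parent of 3 of them, so its data is recomputed (and, with remote
    // tasks, re-serialized and re-shipped) 3 times.
    val rddC = rddA.cartesian(rddB)
    println(rddC.count())     // 4 * 6 = 24 pairs
    println(evalCount.value)  // 12, not 4: each rddA element was evaluated 3 times

    sc.stop()
  }
}
```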
In this PR, I change the dependencies in `CartesianRDD` from
`NarrowDependency` to `ShuffleDependency`, while keeping the way the parent
RDDs are partitioned. The computation of the CartesianRDD partitions keeps the
current implementation.
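For context, a self-contained sketch (simplified, not code from this PR; the
object name is illustrative) that mirrors the NarrowDependency wiring the
current CartesianRDD uses, which this PR proposes to replace with a
ShuffleDependency. Each child partition index maps to one partition of each
parent, so every parent partition backs several child partitions:

```scala
import org.apache.spark.{NarrowDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Standalone sketch; mirrors the existing dependency wiring, not the PR's change.
object CartesianDependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cartesian-dependency-sketch"))

    val rdd1: RDD[Int] = sc.parallelize(1 to 4, 2)
    val rdd2: RDD[Int] = sc.parallelize(1 to 6, 3)
    val numPartitionsInRdd2 = rdd2.partitions.length

    // Child partition i depends on rdd1 partition (i / numPartitionsInRdd2)
    // and rdd2 partition (i % numPartitionsInRdd2).
    val deps: Seq[NarrowDependency[_]] = List(
      new NarrowDependency(rdd1) {
        override def getParents(id: Int): Seq[Int] = List(id / numPartitionsInRdd2)
      },
      new NarrowDependency(rdd2) {
        override def getParents(id: Int): Seq[Int] = List(id % numPartitionsInRdd2)
      }
    )

    // Every rdd1 partition is the parent of numPartitionsInRdd2 = 3 child
    // partitions, which is why its data is recomputed/reshipped 3 times.
    (0 until 6).foreach { childId =>
      println(s"child $childId <- rdd1 ${deps(0).getParents(childId)}, " +
        s"rdd2 ${deps(1).getParents(childId)}")
    }

    sc.stop()
  }
}
```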
## How was this patch tested?
Add a Cartesian test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark optimize_cartesian
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16574.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16574
----
commit ff7ff0dfb349bea7e41237be1410007a66f1eb28
Author: WeichenXu <[email protected]>
Date: 2017-01-07T02:14:52Z
init commit
commit 14ba3b24373d7a1d627bbc8b4b3d60ab6a92da07
Author: WeichenXu <[email protected]>
Date: 2017-01-11T04:56:15Z
init pr
----