GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/16574
[SPARK-19189] Optimize CartesianRDD to avoid partition re-computation and re-serialization

## What changes were proposed in this pull request?

In the current CartesianRDD implementation, suppose RDDA cartesian RDDB generates RDDC: each partition of RDDA is read by multiple partitions of RDDC, and RDDB has the same problem. As a result, while the partitions of RDDC are being computed, each partition's data in RDDA or RDDB is repeatedly serialized (and then transferred over the network), and if RDDA or RDDB has not been persisted, its partitions are also repeatedly recomputed.

In this PR, I change the dependency in `CartesianRDD` from `NarrowDependency` to `ShuffleDependency`, while keeping the way the parent RDDs are partitioned. The computation of `CartesianRDD` keeps the current implementation.

## How was this patch tested?

Added a Cartesian test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark optimize_cartesian

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16574.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16574

----

commit ff7ff0dfb349bea7e41237be1410007a66f1eb28
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2017-01-07T02:14:52Z

    init commit

commit 14ba3b24373d7a1d627bbc8b4b3d60ab6a92da07
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2017-01-11T04:56:15Z

    init pr
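The recomputation cost the PR describes can be sketched with a small self-contained Scala simulation (no Spark required; all names here are illustrative, not Spark APIs). It models why, without caching, each partition of parent A is recomputed once per partition of B, and vice versa, since the cartesian product has one output partition per (A partition, B partition) pair:

```scala
object CartesianRecomputeSketch {
  def main(args: Array[String]): Unit = {
    val aParts = 3 // number of partitions in parent RDD A (assumed for illustration)
    val bParts = 4 // number of partitions in parent RDD B

    // Count how many times each parent partition is computed/serialized.
    val aComputes = Array.fill(aParts)(0)
    val bComputes = Array.fill(bParts)(0)

    // The cartesian product has aParts * bParts output partitions; each one
    // pulls exactly one partition from each parent. With a NarrowDependency
    // and no persistence, each pull recomputes (and re-serializes) the
    // parent partition from scratch.
    for (i <- 0 until aParts; j <- 0 until bParts) {
      aComputes(i) += 1
      bComputes(j) += 1
    }

    // Every A partition is computed bParts times; every B partition aParts times.
    assert(aComputes.forall(_ == bParts))
    assert(bComputes.forall(_ == aParts))
    println(s"each A partition computed ${bParts} times, " +
      s"each B partition computed ${aParts} times")
  }
}
```

Materializing each parent partition once, which is what replacing the `NarrowDependency` with a `ShuffleDependency` achieves, reduces this from O(m×n) parent computations to m + n.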