GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/16574
[SPARK-19189] Optimize CartesianRDD to avoid partition re-computation and
re-serialization
## What changes were proposed in this pull request?
In the current CartesianRDD implementation, suppose RDDA cartesian RDDB
generates RDDC: each RDDA partition is read by multiple RDDC partitions, and
RDDB has the same problem.
As a result, when an RDDC partition is computed, the corresponding partition
data of RDDA and RDDB is serialized (and transferred over the network)
repeatedly, and if RDDA or RDDB has not been persisted, its partitions are also
recomputed repeatedly.
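As a quick illustration of the problem (not code from this PR; the object name
and accumulator are made up for the demo), a minimal sketch with an unpersisted
parent, run locally:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of the PR.
object CartesianRecomputeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cartesian-recompute-demo"))

    // Count how many times rddA's map function runs; rddA is NOT persisted.
    val evalCount = sc.longAccumulator("rddA evaluations")
    val rddA = sc.parallelize(1 to 4, 2).map { x => evalCount.add(1); x }
    val rddB = sc.parallelize(1 to 6, 3)

    // The cartesian product has 2 * 3 = 6 partitions. Each rddA partition
    // is a parent of 3 of them, so its data is recomputed (and, with remote
    // tasks, re-serialized and re-shipped) 3 times.
    val rddC = rddA.cartesian(rddB)
    println(rddC.count())     // 4 * 6 = 24 pairs
    println(evalCount.value)  // 12, not 4: each rddA element was evaluated 3 times

    sc.stop()
  }
}
```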
In this PR, I change the dependencies in `CartesianRDD` from
`NarrowDependency` to `ShuffleDependency`, while keeping the way the parent
RDDs are partitioned. The computation of the CartesianRDD partitions keeps the
current implementation.
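For context, a self-contained sketch (simplified, not code from this PR; the
object name is illustrative) that mirrors the NarrowDependency wiring the
current CartesianRDD uses, which this PR proposes to replace with a
ShuffleDependency. Each child partition index maps to one partition of each
parent, so every parent partition backs several child partitions:

```scala
import org.apache.spark.{NarrowDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Standalone sketch; mirrors the existing dependency wiring, not the PR's change.
object CartesianDependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cartesian-dependency-sketch"))

    val rdd1: RDD[Int] = sc.parallelize(1 to 4, 2)
    val rdd2: RDD[Int] = sc.parallelize(1 to 6, 3)
    val numPartitionsInRdd2 = rdd2.partitions.length

    // Child partition i depends on rdd1 partition (i / numPartitionsInRdd2)
    // and rdd2 partition (i % numPartitionsInRdd2).
    val deps: Seq[NarrowDependency[_]] = List(
      new NarrowDependency(rdd1) {
        override def getParents(id: Int): Seq[Int] = List(id / numPartitionsInRdd2)
      },
      new NarrowDependency(rdd2) {
        override def getParents(id: Int): Seq[Int] = List(id % numPartitionsInRdd2)
      }
    )

    // Every rdd1 partition is the parent of numPartitionsInRdd2 = 3 child
    // partitions, which is why its data is recomputed/reshipped 3 times.
    (0 until 6).foreach { childId =>
      println(s"child $childId <- rdd1 ${deps(0).getParents(childId)}, " +
        s"rdd2 ${deps(1).getParents(childId)}")
    }

    sc.stop()
  }
}
```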
## How was this patch tested?
Add a Cartesian test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark optimize_cartesian
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16574.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16574
----
commit ff7ff0dfb349bea7e41237be1410007a66f1eb28
Author: WeichenXu <[email protected]>
Date: 2017-01-07T02:14:52Z
init commit
commit 14ba3b24373d7a1d627bbc8b4b3d60ab6a92da07
Author: WeichenXu <[email protected]>
Date: 2017-01-11T04:56:15Z
init pr
----