Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/17936
rdd1.cartesian(rdd2): for each task we need to pull all the data of rdd1 (or
rdd2) from the cluster. If we have n tasks running in parallel on the same
executor, that means we pull the same data to that executor n times. This
change can reduce the GC pressure and the network I/O (and maybe the disk
I/O, if the fetched data can't fit entirely in memory).
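
To illustrate the access pattern being discussed, here is a minimal sketch
(names and sizes are illustrative, not from the PR) showing why cartesian
causes repeated fetches: every output partition pairs one partition of rdd1
with one partition of rdd2, so each task materializes a full partition of the
other RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CartesianSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cartesian-sketch").setMaster("local[4]"))

    val rdd1 = sc.parallelize(1 to 1000, numSlices = 4)
    val rdd2 = sc.parallelize(1 to 1000, numSlices = 4)

    // cartesian produces 4 x 4 = 16 output partitions. Each task iterates
    // over one partition of rdd1 and one partition of rdd2, so with n tasks
    // running in parallel on an executor, the same remote blocks of rdd2
    // (or rdd1) can be pulled n times instead of once.
    val product = rdd1.cartesian(rdd2)

    println(product.count()) // 1000 * 1000 = 1000000 pairs
    sc.stop()
  }
}
```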