Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/17936
rdd1.cartesian(rdd2): for each task we need to pull all the data of rdd1 (or
rdd2) from the cluster. If we have n tasks running in parallel on the same
executor, that means we pull the same data to that executor n times. This
change can reduce the GC pressure and the network I/O (and maybe the disk
I/O, if the fetched data can't fit entirely in memory).
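
To illustrate the access pattern being discussed, here is a minimal sketch
(names and sizes are illustrative, not from the PR) showing why cartesian
causes repeated fetches: every output partition pairs one partition of rdd1
with one partition of rdd2, so each task materializes a full partition of the
other RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CartesianSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cartesian-sketch").setMaster("local[4]"))

    val rdd1 = sc.parallelize(1 to 1000, numSlices = 4)
    val rdd2 = sc.parallelize(1 to 1000, numSlices = 4)

    // cartesian produces 4 x 4 = 16 output partitions. Each task iterates
    // over one partition of rdd1 and one partition of rdd2, so with n tasks
    // running in parallel on an executor, the same remote blocks of rdd2
    // (or rdd1) can be pulled n times instead of once.
    val product = rdd1.cartesian(rdd2)

    println(product.count()) // 1000 * 1000 = 1000000 pairs
    sc.stop()
  }
}
```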