Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/5572#issuecomment-105425372
  
    > You are right if we don't change anything, i.e., don't even implement 
idea one. The advantage is when we only implement idea 1 but not idea two. In 
this case we will fetch each partition in RDD2 once for each partition in RDD1 
(and vice versa).So fewer partitions means fewer repeated fetches, but we still 
need enough partitions to exploit the parallelism.
    
    That makes sense. However, then we will need to consider the case when 
multiple partitions in a same worker. Each partition of RDD2 will be fetched 
more then once. Of course less partitions mean that it is less probable to have 
such case. Best case would be one partition per node.
    
    Another thing we should notice is how we safely process the RDD2 partition 
without causing OOM as @cloud-fan pointed. `unrollSafely` might help it.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to