[GitHub] spark pull request: [SPARK-6307][Core] Speed up RDD.cartesian by c...

tbertelsen Mon, 25 May 2015 12:26:22 -0700

Github user tbertelsen commented on the pull request:

    https://github.com/apache/spark/pull/5572#issuecomment-105293184
  
    @viirya  @squito 
    I think you're raising some very good points.
    
    Let's say that rrd1  has n elements and m partitions. Without any changes 
the partitions from rrd2 will be fetched n times. Implementing only idea one 
they will be fetched m times. If idea two is also implemented then the 
partitions from rrd2 will only be fetched once per executor, assuming there 
will be no race conditions with the cache. 
    
    m will very likely be equal to total number of threads or some small 
multiple there off. If the multiple is two  then we will go from 2 fetches per 
thread to a fetch per executor. This will come at the cost of in the worst-case 
caching rrd2 m times as @squito points out.  Given that trade-off it is very 
likely that caching is not worth it.
    
    @squito's idea about coalescing the RDDs could help a lot. In general we 
will like the local RDD to have as few partitions as possible but are we sure 
which of the RDD that are the local one? I'll guess we cannot know that, so we 
want both RDDs to have few partitions. 
    
    I will suggest that we coalesced by some reasonably value automatically and 
then give the user a change to override if is he or she wants. If the number of 
executors is higher than the number of threads per executor we should coalesce 
RDDs to have one partition per executor. If not we could set the number of 
partitions of RDD2 to be `totalThreadCount / numberOfExecutors`. 
    
    
    
     Use Case
    ------
     I have personally (tried to) used it to analyze all pairs off tens of 
thousands of elements. The analyses has been one offs where it was cheaper to 
have hundreds of cores running for a full-day than to spend weeks on 
implementing an efficient algorithm.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-6307][Core] Speed up RDD.cartesian by c...

Reply via email to