Github user squito commented on the pull request:

    https://github.com/apache/spark/pull/5572#issuecomment-105252687
  
    why do you think that just "idea 1" will not result in a speedup?  It will 
also have the desired effect of only fetching the remote blocks once.  So I 
think it will be just as good for improving performance, based on the discussion 
so far about why it's slow.
    
    To summarize the advantages of each approach:
    
    Idea 1 (just pull `rdd2.iterator.toArray` out of the for loop)
    ====
    * only fetch remote blocks once (so it should be fast)
    * fetched data never goes into Spark's cache, just a normal object on the 
heap, so it can be gc'ed as soon as the `cartesian` computation is done
    * no changes to caching layer or `RDD` interface
    * even if `rdd2` is not cached anywhere (either locally or remotely), this 
still avoids computing it multiple times
    * **con**: if `rdd2` doesn't fit in memory, you'll get an 
`OutOfMemoryError`
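    To make idea 1 concrete, here's a minimal Spark-free sketch of the change -- 
`computeRdd2` is an invented stand-in for `rdd2.iterator` (each call simulates 
recomputing or refetching `rdd2`'s partition), not the real `RDD` API:

```scala
object CartesianSketch {
  var rdd2Computations = 0  // counts how often "rdd2"'s partition is recomputed/refetched

  // Stand-in for rdd2.iterator: every call recomputes (or refetches) the partition.
  def computeRdd2(): Iterator[Int] = { rdd2Computations += 1; Iterator(10, 20) }

  // Current behavior: rdd2's partition is recomputed for every element of rdd1's partition.
  def cartesianCurrent(rdd1Part: Seq[Int]): Seq[(Int, Int)] =
    for (x <- rdd1Part; y <- computeRdd2().toSeq) yield (x, y)

  // Idea 1: materialize rdd2's partition once, outside the loop.  The array is a
  // plain heap object, never touches the cache layer, and is gc'ed after use.
  def cartesianIdea1(rdd1Part: Seq[Int]): Seq[(Int, Int)] = {
    val block = computeRdd2().toArray  // fetched exactly once
    for (x <- rdd1Part; y <- block) yield (x, y)
  }
}
```

    The `toArray` is exactly where the con above comes from: if `rdd2`'s 
partition doesn't fit in memory, materializing it blows up.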
    
    
    Idea 2 (add remote caching)
    ====
    * only fetch remote blocks once (so it should be fast)
    * fetched data can be stored in serialized format (if it's been stored 
serialized on the remote node, it'll be stored serialized locally as well)
    * if `rdd2` doesn't fit in memory, the cache layer will not cache anything.  
(So you won't get a speedup, but at least you won't get an `OOM`.)  Or if 
`rdd2` happened to be cached to disk, then at least it can always be cached 
locally.
    * **con**: changes to `RDD` interface and caching layer
    * **con**: newly cached blocks effectively add some replication, but use 
extra memory.  Users will most likely find that extra caching quite 
unexpected, and they won't have any way to undo it.
    
    Those cons for idea 2 are pretty big, imo.  And the only advantage is 
avoiding OOMs.  That is definitely nice, but it doesn't seem to be the major 
concern.  You could *slightly* help the OOM problem with idea 1 by adding 
another arg to `cartesian` to control whether or not to store `rdd2`.  I dunno 
how much that would help, as users would be unlikely to know how to set that 
arg -- you'd probably have to have the extra memory usage off by default.  But 
it seems it would at least help in your case, since you know that's what you 
want.
    
    You could try some hybrid solution, at additional complexity: inside 
`CartesianRDD`, you could try to stick `rdd2.iterator` into Spark's cache, and 
then remove it from Spark's cache at the end.  And you could even allow the 
full set of caching options, e.g. `MEMORY_AND_DISK_SER` etc.  But then you 
would need to do some extra book-keeping to figure out whether you actually 
need to do any extra caching, what blocks need to be removed afterwards, etc.  
Honestly I am not sure how complex it would be; it is just a vague idea.
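    The book-keeping part of that vague idea could look roughly like this.  
This is a Spark-free sketch under stated assumptions: `ToyBlockStore` and all 
names here are invented stand-ins for Spark's block manager, not the real API.  
The point is just the bookkeeping -- only evict the block afterwards if we were 
the one who cached it:

```scala
import scala.collection.mutable

// Toy stand-in for Spark's block store (names invented for illustration).
object ToyBlockStore {
  private val store = mutable.Map[String, Array[Int]]()

  // Returns true iff we inserted the block (i.e. it wasn't already cached);
  // `compute` is only evaluated on a cache miss.
  def putIfAbsent(key: String)(compute: => Iterator[Int]): Boolean =
    if (store.contains(key)) false
    else { store(key) = compute.toArray; true }

  def get(key: String): Array[Int] = store(key)
  def remove(key: String): Unit = store -= key
  def contains(key: String): Boolean = store.contains(key)
}

object HybridCartesian {
  def cartesian(rdd1Part: Seq[Int], blockKey: String)
               (computeRdd2: () => Iterator[Int]): Seq[(Int, Int)] = {
    // Book-keeping: remember whether *we* cached the block, so we don't
    // evict something another consumer (or the user) had already cached.
    val weCachedIt = ToyBlockStore.putIfAbsent(blockKey)(computeRdd2())
    val block = ToyBlockStore.get(blockKey)
    val result = for (x <- rdd1Part; y <- block) yield (x, y)
    if (weCachedIt) ToyBlockStore.remove(blockKey)
    result
  }
}
```

    A real version would also have to pick a storage level, handle failures 
between the put and the remove, and coordinate concurrent tasks hitting the 
same block -- which is where the complexity I can't estimate comes in.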

