Github user squito commented on the pull request:
https://github.com/apache/spark/pull/5572#issuecomment-104784328
Shouldn't `CartesianRDD` be changed so that it calls `rdd1.iterator(...,
cacheRemote=true)` (same for `rdd2`)? Or does that happen somewhere and I'm
missing it?
Also, I'm worried about making such a broad change just for `cartesian`. Do
we have any good way to evict the newly cached blocks? Also, I'd think of
adding methods to `RDD` only when it serves some broad purpose. I'd strongly
favor @tbertelsen 's "idea 1", of just pulling the repeated computation of
`rdd2.iterator` out of the loop. Then the stored values of `rdd2.iterator`
will get cleaned up as part of normal Java GC, and no changes to other parts of
Spark are required.
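To make "idea 1" concrete, here is a minimal, self-contained sketch (plain Scala, not the actual `CartesianRDD` code; `rdd2Iterator` and the counter are hypothetical stand-ins for recomputing a partition) showing the difference between recomputing the inner iterator per outer element and materializing it once before the loop:

```scala
// Hypothetical sketch of "idea 1": hoist the inner iterator's
// computation out of the cartesian loop so it runs once per partition.
object CartesianSketch {
  // Stand-in for an expensive rdd2.iterator(split, context) call;
  // the counter just makes the recomputation visible.
  var recomputations = 0
  def rdd2Iterator(): Iterator[Int] = {
    recomputations += 1
    (1 to 3).iterator
  }

  // Naive version: the inner iterator is recomputed for every
  // element of the outer sequence.
  def naiveCartesian(rdd1: Seq[Int]): Seq[(Int, Int)] =
    for (x <- rdd1; y <- rdd2Iterator().toSeq) yield (x, y)

  // "Idea 1": materialize the inner values once; the array is an
  // ordinary local object, so it is reclaimed by normal Java GC
  // after the task finishes.
  def materializedCartesian(rdd1: Seq[Int]): Seq[(Int, Int)] = {
    val rdd2Values = rdd2Iterator().toArray
    for (x <- rdd1; y <- rdd2Values) yield (x, y)
  }
}
```

Both versions produce the same pairs; the materialized one computes the inner values a single time instead of once per outer element, at the cost of holding them all in memory.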
The downsides are that this won't work if `rdd2.iterator` doesn't fit in
memory. (The proposed solution has the same problem, since you still call
`blockResult.data.toArray`, though actually I think the block manager has other
APIs that would let you cache the block only if memory is available.) Also,
that idea would store `rdd2.iterator` as Java objects, whereas your solution
could use serialization configured at the block level, which could save a lot
of memory.
Nonetheless, I think this is too narrow a problem to deserve such a big
change to caching behavior, so I'd really prefer it's kept to just
`CartesianRDD`. You could add an option for whether or not to save
`rdd2.iterator` in memory, so the current behavior would still be possible in
those cases where it doesn't fit in memory.