Github user squito commented on the pull request:
https://github.com/apache/spark/pull/5572#issuecomment-104784328
Shouldn't `CartesianRDD` be changed so that it calls `rdd1.iterator(...,
cacheRemote=true)` (same for `rdd2`)? Or does that happen somewhere and I'm
missing it?
Also, I'm worried about making such a broad change just for `cartesian`. Do
we have any good way to evict the newly cached blocks? Also, I'd think of
adding methods to `RDD` only when it serves some broad purpose. I'd strongly
favor @tbertelsen 's "idea 1", of just pulling the repeated computation of
`rdd2.iterator` out of the loop. Then the stored values of `rdd2.iterator`
will get cleaned up as part of normal Java GC, and no changes to other parts of
Spark are required.
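To make "idea 1" concrete, here is a minimal, self-contained sketch (plain Scala, not the actual `CartesianRDD` code; `rdd2Iterator` and the counter are hypothetical stand-ins for recomputing a partition) showing the difference between recomputing the inner iterator per outer element and materializing it once before the loop:

```scala
// Hypothetical sketch of "idea 1": hoist the inner iterator's
// computation out of the cartesian loop so it runs once per partition.
object CartesianSketch {
  // Stand-in for an expensive rdd2.iterator(split, context) call;
  // the counter just makes the recomputation visible.
  var recomputations = 0
  def rdd2Iterator(): Iterator[Int] = {
    recomputations += 1
    (1 to 3).iterator
  }

  // Naive version: the inner iterator is recomputed for every
  // element of the outer sequence.
  def naiveCartesian(rdd1: Seq[Int]): Seq[(Int, Int)] =
    for (x <- rdd1; y <- rdd2Iterator().toSeq) yield (x, y)

  // "Idea 1": materialize the inner values once; the array is an
  // ordinary local object, so it is reclaimed by normal Java GC
  // after the task finishes.
  def materializedCartesian(rdd1: Seq[Int]): Seq[(Int, Int)] = {
    val rdd2Values = rdd2Iterator().toArray
    for (x <- rdd1; y <- rdd2Values) yield (x, y)
  }
}
```

Both versions produce the same pairs; the materialized one computes the inner values a single time instead of once per outer element, at the cost of holding them all in memory.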
The downsides are that this won't work if `rdd2.iterator` doesn't fit in
memory. (The proposed solution has the same problem, since you still call
`blockResult.data.toArray`, though actually I think the block manager has other
APIs that would let you cache the block only if memory is available.) Also,
that idea would store `rdd2.iterator` as Java objects, whereas your solution
could use serialization configured at the block level, which could save a lot
of memory.
Nonetheless, I think this is too narrow a problem to deserve such a big
change to caching behavior, so I'd really prefer it's kept to just
`CartesianRDD`. You could add an option for whether or not to save
`rdd2.iterator` in memory, so the current behavior would still be possible in
those cases where it doesn't fit in memory.