Github user squito commented on the pull request:
https://github.com/apache/spark/pull/5572#issuecomment-105252687
why do you think that just "idea 1" will not result in a speedup? It will
also have the desired effect of only fetching the remote blocks once, so
based on the discussion so far about why it's slow, I think it will be just
as good for improving performance.
To summarize the advantages of each approach:
Idea 1 (just pull `rdd2.iterator.toArray` out of the for loop)
====
* only fetch remote blocks once (so it should be fast)
* fetched data never goes into Spark's cache -- it's just a normal object on
the heap, so it can be gc'ed as soon as `cartesian` is done computing
* no changes to caching layer or `RDD` interface
* even if `rdd2` is not cached anywhere (either local or remote), still
avoids computing it multiple times
* **con**: if `rdd2` doesn't fit in memory, you'll get an
`OutOfMemoryError`
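To make idea 1 concrete, here is a minimal sketch (plain Scala, no Spark; all names are made up here). `rdd2Iterator()` stands in for `rdd2.iterator`, which re-fetches/recomputes the partition on every call; pulling `.toArray` out of the for-comprehension means it only happens once:

```scala
object CartesianSketch {
  var computeCount = 0

  // stands in for rdd2.iterator: each call re-fetches / recomputes the partition
  def rdd2Iterator(): Iterator[Int] = { computeCount += 1; Iterator(10, 20, 30) }

  def run(): (Int, Int) = {
    val rdd1 = Seq(1, 2, 3)

    // naive: the inner iterator is recreated for every outer element
    computeCount = 0
    val naive = for (x <- rdd1; y <- rdd2Iterator()) yield (x, y)
    val naiveCount = computeCount

    // idea 1: materialize once, outside the loop
    computeCount = 0
    val rdd2Vals = rdd2Iterator().toArray
    val pulled = for (x <- rdd1; y <- rdd2Vals) yield (x, y)
    val pulledCount = computeCount

    assert(naive == pulled)   // same cartesian product either way
    (naiveCount, pulledCount) // (3, 1): one fetch instead of one per outer row
  }
}
```

Same output, but the inner side is computed once instead of once per element of `rdd1` -- which is the whole speedup.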
Idea 2 (add remote caching)
====
* only fetch remote blocks once (so it should be fast)
* fetched data can be stored in serialized format (if it's been stored
serialized on the remote, it'll be stored serialized locally as well)
* If `rdd2` doesn't fit in memory, the cache layer will simply not cache
anything. (So you won't get a speedup, but at least you won't get an `OOM`.)
And if `rdd2` happened to be cached to disk, then at least it can always be
cached locally.
* **con**: changes to `RDD` interface and caching layer
* **con**: the newly cached blocks effectively add some replication, at the
cost of extra memory. Users will most likely find that extra caching quite
unexpected, and they won't have any way to undo it.
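The "no OOM, but no speedup" point about idea 2 can be sketched like this (a toy model, not Spark's cache layer; `capacity`, `fetchRemote` etc. are invented for illustration): a block that doesn't fit is served but never stored, so every access re-fetches:

```scala
object RemoteCacheSketch {
  val capacity = 2                      // toy limit on what the cache will hold
  var cache: Option[Array[Int]] = None  // the locally cached copy, if any
  var fetches = 0

  // stands in for fetching rdd2's blocks from a remote executor
  def fetchRemote(): Array[Int] = { fetches += 1; Array(10, 20, 30) }

  def blocks(): Array[Int] = cache.getOrElse {
    val data = fetchRemote()
    if (data.length <= capacity) cache = Some(data) // fits: keep a local copy
    data                                            // too big: serve it, don't cache -- no OOM
  }
}
```

Here the data (3 elements) exceeds `capacity` (2), so calling `blocks()` twice fetches twice: safe, but no faster than before.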
Those cons for idea 2 are pretty big, imo, and the only advantage is
avoiding OOMs. That is definitely nice, but it doesn't seem to be the major
concern. You could *slightly* mitigate the OOM risk of idea 1 by adding
another arg to `cartesian` to control whether or not to store `rdd2`. I dunno
how much that would help, since users would be unlikely to know how to set
that arg -- you'd probably have to have the extra memory usage off by
default. But it would at least help in your case, since you know that is
what you want.
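That extra arg might look something like this (a sketch only; the flag name `materializeOther` and the simplified shapes are invented here, not a proposal for the actual `RDD` signature):

```scala
object CartesianFlagSketch {
  // `left` stands in for rdd1's partition, `right` for a way to re-derive
  // rdd2's partition (like rdd2.iterator). The flag is off by default, so
  // the extra memory usage is opt-in.
  def cartesian[T, U](left: Seq[T], right: () => Iterator[U],
                      materializeOther: Boolean = false): Seq[(T, U)] = {
    if (materializeOther) {
      val buf = right().toArray                    // one pass, extra memory
      for (x <- left; y <- buf) yield (x, y)
    } else {
      for (x <- left; y <- right()) yield (x, y)   // re-derive per outer element
    }
  }
}
```

Users who know `rdd2` fits in memory pass `materializeOther = true`; everyone else keeps today's behavior.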
You could try some hybrid solution, at additional complexity: inside
`CartesianRDD`, you could stick `rdd2.iterator` into Spark's cache, and
then remove it from Spark's cache at the end. You could even allow the
full set of caching options, e.g. `MEMORY_AND_DISK_SER` etc. But then you would
need to do some extra book-keeping to figure out whether you actually need to
do any extra caching, which blocks need to be removed afterwards, etc. Honestly
I am not sure how complex it would be; it is just a vague idea.
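The core of that book-keeping -- only evict blocks *we* inserted, and always evict them even if the computation throws -- could look roughly like this (a toy `Map` standing in for Spark's block manager; `withTempCache` and the key names are invented here):

```scala
import scala.collection.mutable

object HybridSketch {
  // stands in for Spark's cache, keyed by block id
  val cache = mutable.Map[String, Array[Int]]()

  def withTempCache[A](key: String, compute: () => Array[Int])
                      (body: Array[Int] => A): A = {
    // remember whether the block was already cached by someone else
    val alreadyCached = cache.contains(key)
    val data = cache.getOrElseUpdate(key, compute())
    try body(data)
    finally if (!alreadyCached) cache -= key // evict only blocks we added
  }
}
```

So after the `cartesian` computation finishes (or fails), the cache looks exactly as it did before -- which addresses the "users can't undo the extra caching" con of idea 2, at the cost of the tracking logic above.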