Github user tbertelsen commented on the pull request:
https://github.com/apache/spark/pull/5572#issuecomment-105293184
@viirya @squito
I think you're raising some very good points.
Let's say that rdd1 has n elements and m partitions. Without any changes,
the partitions from rdd2 will be fetched n times. With only idea one
implemented, they will be fetched m times. If idea two is also implemented, the
partitions from rdd2 will be fetched only once per executor, assuming there
are no race conditions with the cache.
m will very likely be equal to the total number of threads or some small
multiple thereof. If the multiple is two, we will go from 2 fetches per
thread to one fetch per executor. This comes at the cost of, in the worst case,
caching rdd2 m times, as @squito points out. Given that trade-off, it is very
likely that caching is not worth it.
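To make the counts above concrete, here is a rough back-of-the-envelope sketch in Python. It does not touch Spark at all; the function name `rdd2_fetch_counts`, its parameters, and the example cluster size are made up for illustration, and it simply assumes the per-element, per-partition, and per-executor fetch patterns described above.

```python
def rdd2_fetch_counts(n_elements, m_partitions, num_executors, total_threads):
    """Rough counts for how often the rdd2 partitions are fetched.

    Pure arithmetic under the assumptions in this comment:
    n fetches without changes, m with idea one, and one per
    executor when idea two (caching) is added.
    """
    return {
        "no_changes": n_elements,
        "idea_one": m_partitions,
        "idea_one_and_two": num_executors,
        # With idea one only, fetches spread over the available threads.
        "idea_one_fetches_per_thread": m_partitions / total_threads,
        # Worst case pointed out by @squito: one cached copy per rdd1 partition.
        "worst_case_cached_copies": m_partitions,
    }


# Hypothetical cluster: 4 executors x 8 threads, m = 2 x total threads.
counts = rdd2_fetch_counts(
    n_elements=10_000, m_partitions=64, num_executors=4, total_threads=32
)
for name, value in counts.items():
    print(name, value)
```

With these made-up numbers, idea two saves only a handful of fetches per thread while the worst case keeps m cached copies around, which is the trade-off in question.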
@squito's idea about coalescing the RDDs could help a lot. In general we
would like the local RDD to have as few partitions as possible, but are we sure
which of the RDDs is the local one? I guess we cannot know that, so we
want both RDDs to have few partitions.
I suggest that we coalesce to some reasonable value automatically and
then give the user a chance to override it if he or she wants. If the number of
executors is higher than the number of threads per executor, we should coalesce
the RDDs to have one partition per executor. If not, we could set the number of
partitions of rdd2 to be `totalThreadCount / numberOfExecutors`.
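A minimal sketch of that default, written as plain Python rather than against the Spark API. The function name `suggested_partitions` and the optional `user_override` parameter are hypothetical, and the sketch assumes every executor has the same number of threads and that integer division is acceptable.

```python
def suggested_partitions(total_thread_count, number_of_executors, user_override=None):
    """Pick a default partition count following the rule proposed above.

    Hypothetical helper: assumes equal thread counts on every executor
    and rounds down with integer division.
    """
    if user_override is not None:
        # The user always gets the final say.
        return user_override
    if number_of_executors <= 0:
        raise ValueError("number_of_executors must be positive")

    threads_per_executor = total_thread_count // number_of_executors
    if number_of_executors > threads_per_executor:
        # More executors than threads per executor: one partition per executor.
        return number_of_executors
    # Otherwise: totalThreadCount / numberOfExecutors.
    return total_thread_count // number_of_executors


print(suggested_partitions(32, 4))     # 4 executors x 8 threads
print(suggested_partitions(16, 8))     # 8 executors x 2 threads
print(suggested_partitions(32, 4, 3))  # explicit override
```

Under the equal-threads assumption, the rule works out to the larger of the executor count and the threads per executor.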
Use Case
------
I have personally used (or tried to use) it to analyze all pairs of tens of
thousands of elements. The analyses have been one-offs where it was cheaper to
have hundreds of cores running for a full day than to spend weeks
implementing an efficient algorithm.