[
https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005136#comment-15005136
]
Zhan Zhang commented on SPARK-11704:
------------------------------------
I think we can add a cleanup hook in SQLContext, and when the query is done, we
invoke all the cleanup hook registered. By this way, the CartesianProduct we
can cache the rightResult and register the cleanup handler(unpersist). By this
way, we avoid the recomputation of RDD2.
In my testing, because rdd2 is quite small, I actually reverse the cartesian
join by reverting cartesian(rdd1, rdd2) to cartesian(rdd2, rdd1). By this way,
the computation is done quite fast, but the original form cannot be finished.
> Optimize the Cartesian Join
> ---------------------------
>
> Key: SPARK-11704
> URL: https://issues.apache.org/jira/browse/SPARK-11704
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Zhan Zhang
>
> Currently CartesianProduct relies on RDD.cartesian, in which the computation
> is realized as follows
> override def compute(split: Partition, context: TaskContext): Iterator[(T,
> U)] = {
> val currSplit = split.asInstanceOf[CartesianPartition]
> for (x <- rdd1.iterator(currSplit.s1, context);
> y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
> }
> From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times.
> Which is really heavy and may never finished if n is large, especially when
> rdd2 is coming from ShuffleRDD.
> We should have some optimization on CartesianProduct by caching rightResults.
> The problem is that we don’t have cleanup hook to unpersist rightResults
> AFAIK. I think we should have some cleanup hook after query execution.
> With the hook available, we can easily optimize such Cartesian join. I
> believe such cleanup hook may also benefit other query optimizations.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]