[ 
https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005111#comment-15005111
 ] 

Takeshi Yamamuro commented on SPARK-11704:
------------------------------------------

ISTM that some earlier stages in rdd2 are skipped in all the iterations except 
the first one in case of rdd2 comming from ShuffleRDD.
That said, it is worth doing this optimization.

> Optimize the Cartesian Join
> ---------------------------
>
>                 Key: SPARK-11704
>                 URL: https://issues.apache.org/jira/browse/SPARK-11704
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Zhan Zhang
>
> Currently CartesianProduct relies on RDD.cartesian, in which the computation 
> is realized as follows
>   override def compute(split: Partition, context: TaskContext): Iterator[(T, 
> U)] = {
>     val currSplit = split.asInstanceOf[CartesianPartition]
>     for (x <- rdd1.iterator(currSplit.s1, context);
>          y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
>   }
> From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times. 
> Which is really heavy and may never finished if n is large, especially when 
> rdd2 is coming from ShuffleRDD.
> We should have some optimization on CartesianProduct by caching rightResults. 
> The problem is that we don’t have cleanup hook to unpersist rightResults 
> AFAIK. I think we should have some cleanup hook after query execution.
> With the hook available, we can easily optimize such Cartesian join. I 
> believe such cleanup hook may also benefit other query optimizations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to