Hi Folks, I did some performance measurement based on TPC-H recently, and want to bring up some performance issue I observed. Both are related to cartesian join.
1. CartesianProduct implementation.

Currently CartesianProduct relies on RDD.cartesian, in which the computation is realized as follows:

  override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
  }

As the loop above shows, if rdd1 has n records, rdd2 is recomputed n times. That is really heavy and may never finish when n is large, especially when rdd2 comes from a ShuffleRDD. We could optimize CartesianProduct by caching rightResults. The problem is that, AFAIK, we have no cleanup hook to unpersist rightResults after query execution. I think we should add such a cleanup hook; with it available, we could easily optimize this kind of cartesian join, and I believe it may benefit other query optimizations as well.

2. Unnecessary CartesianProduct join.

When we have a query similar to the following (I don't remember the exact form):

  select * from a, b, c, d
  where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 = d.key3

there will be a cartesian join between a and b. But if we simply change the table order, for example to "from a, c, b, d", the cartesian join is eliminated. Without such manual tuning, the query will never finish if a and c are big, and we should not rely on manual optimization for this.

Please provide your input. If both are valid, I will open JIRAs for each.

Thanks.

Zhan Zhang

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
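To make the recomputation cost in point 1 concrete, here is a minimal, self-contained Scala model of the for-comprehension in RDD.cartesian's compute. Plain iterators stand in for the two RDDs' partition iterators, and the counter is hypothetical, added only for illustration; it is not part of Spark:

```scala
// Minimal model of RDD.cartesian's compute: the for-comprehension
// desugars to leftIter.flatMap(x => rightIter.map(y => (x, y))),
// so the right-hand iterator is rebuilt once per left element.
object CartesianRecompute {
  def run(): (Int, Int) = {
    // Counts how many times the "rdd2" side is (re)computed.
    var rightComputations = 0

    // Stand-ins for rdd1.iterator(...) and rdd2.iterator(...).
    def leftIter: Iterator[Int] = Iterator(1, 2, 3)
    def rightIter: Iterator[String] = {
      rightComputations += 1 // in Spark this would replay rdd2's lineage
      Iterator("a", "b")
    }

    val pairs = (for (x <- leftIter; y <- rightIter) yield (x, y)).toList
    (pairs.length, rightComputations)
  }

  def main(args: Array[String]): Unit = {
    val (nPairs, nRecomputes) = run()
    // 3 left records x 2 right records = 6 pairs,
    // but the right side was computed 3 times (once per left record).
    println(s"pairs=$nPairs recomputesOfRight=$nRecomputes")
  }
}
```

With n = 3 left records the right side is rebuilt 3 times; when the right side is the output of a shuffle, each rebuild replays the whole lineage, which is why caching rightResults (plus an unpersist hook) would help.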