Hi Folks,

I did some performance measurements based on TPC-H recently, and I want to bring 
up two performance issues I observed. Both are related to cartesian joins.

1. CartesianProduct implementation.

Currently CartesianProduct relies on RDD.cartesian, in which the computation is 
realized as follows:

  override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
  }

From the above loop, if a partition of rdd1 contains n records, the corresponding 
partition of rdd2 is recomputed n times. This is really heavy and may never finish 
when n is large, especially when rdd2 comes from a ShuffleRDD.
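
A quick way to observe this in spark-shell (a toy example with made-up RDDs, not 
from the actual code):

  // Both RDDs have a single partition, so the cartesian join runs in one task.
  val left  = sc.parallelize(1 to 3, 1)
  val right = sc.parallelize(1 to 2, 1).mapPartitions { it =>
    println("recomputing right partition")  // fires on every recomputation
    it
  }
  left.cartesian(right).count()
  // "recomputing right partition" is printed 3 times, once per record of the
  // left partition, because rdd2.iterator(...) is re-entered for every x.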

We could optimize CartesianProduct by caching rightResults. The problem is that, 
AFAIK, we have no cleanup hook to unpersist rightResults once the query finishes. 
I think we should add such a cleanup hook after query execution.
With the hook available, we can easily optimize this kind of cartesian join, and 
I believe such a cleanup hook may also benefit other query optimizations.
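
At the RDD level the idea looks roughly like this (a minimal sketch with made-up 
data, not the actual CartesianProduct code; the explicit unpersist at the end is 
exactly the step that needs a hook):

  import org.apache.spark.storage.StorageLevel

  val left  = sc.parallelize(1 to 100000)
  // Persist the right side once, so evaluating the cartesian join does not
  // recompute it for every record of the left side.
  val right = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK)
  println(left.cartesian(right).count())
  // The missing piece: after the query finishes, something has to release the
  // cached data. Today there is no such cleanup hook, AFAIK.
  right.unpersist()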


2. Unnecessary CartesianProduct join.

When we have a query similar to the following (I don't remember the exact form):

  select * from a, b, c, d
  where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 = d.key3

There will be a cartesian join between a and b. But if we simply change the 
table order, for example to a, c, b, d, the cartesian join is eliminated, 
because every adjacent pair of tables then shares an equi-join key.
Without such manual tuning, the query may never finish if a and b are big, but 
we should not rely on such manual optimization.
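
The difference is easy to see in spark-shell, assuming the four tables are 
registered and a 1.3-style sqlContext whose sql(...) returns a DataFrame with 
explain():

  // Only the FROM order differs between the two queries.
  sqlContext.sql(
    "select * from a, b, c, d " +
    "where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 = d.key3").explain()
  // the physical plan contains a CartesianProduct between a and b

  sqlContext.sql(
    "select * from a, c, b, d " +
    "where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 = d.key3").explain()
  // every adjacent join now has an equi-join key, so no CartesianProduct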


Please provide your inputs. If both are valid, I will open a JIRA for each.

Thanks.

Zhan Zhang

