[jira] [Created] (SPARK-2183) Avoid loading/shuffling data twice in self-join query

Reynold Xin (JIRA) Wed, 18 Jun 2014 15:48:39 -0700

Reynold Xin created SPARK-2183:
----------------------------------

             Summary: Avoid loading/shuffling data twice in self-join query
                 Key: SPARK-2183
                 URL: https://issues.apache.org/jira/browse/SPARK-2183
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Reynold Xin



{code}
scala> hql("select * from src a join src b on (a.key=b.key)")

res2: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[3] at RDD at SchemaRDD.scala:100
== Query Plan ==
Project [key#3:0,value#4:1,key#5:2,value#6:3]
 HashJoin [key#3], [key#5], BuildRight
  Exchange (HashPartitioning [key#3:0], 200)
   HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)), 
None
  Exchange (HashPartitioning [key#5:0], 200)
   HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)), 
None
{code}

The optimal execution strategy for the above example is to load data only once 
and repartition once. 

If we want to hyper optimize it, we can also have a self join operator that 
builds the hashmap and then simply traverses the hashmap ...



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2183) Avoid loading/shuffling data twice in self-join query

Reply via email to