[ https://issues.apache.org/jira/browse/SPARK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275109#comment-15275109 ]
Khaled Hammouda commented on SPARK-2183:
----------------------------------------
Same here. We're running a self-join query on data retrieved from Redshift
(using spark-redshift), and we can confirm that the Redshift UNLOAD command is
run twice.
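One workaround we are evaluating is to cache the DataFrame returned by spark-redshift before the self-join, so the unloaded data is materialized once and both sides of the join read the cached copy. Sketch only, untested; the connection options, table name, and key column below are placeholders, and it assumes the com.databricks.spark.redshift data source:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Placeholder connection settings for spark-redshift.
val src = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
  .option("dbtable", "src")
  .option("tempdir", "s3n://some-bucket/tmp")
  .load()
  .cache()

// Force the cache to be populated once, so the self-join below
// scans the in-memory copy instead of triggering UNLOAD again.
src.count()

val joined = src.as("a").join(src.as("b"), $"a.key" === $"b.key")
{code}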
> Avoid loading/shuffling data twice in self-join query
> -----------------------------------------------------
>
> Key: SPARK-2183
> URL: https://issues.apache.org/jira/browse/SPARK-2183
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Priority: Minor
>
> {code}
> scala> hql("select * from src a join src b on (a.key=b.key)")
> res2: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[3] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#3:0,value#4:1,key#5:2,value#6:3]
>  HashJoin [key#3], [key#5], BuildRight
>   Exchange (HashPartitioning [key#3:0], 200)
>    HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)), None
>   Exchange (HashPartitioning [key#5:0], 200)
>    HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)), None
> {code}
> The optimal execution strategy for the above example is to load the data only
> once and repartition it once.
> If we want to hyper-optimize it, we could also have a self-join operator that
> builds the hash map and then simply traverses the same hash map ...
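> A minimal sketch of that hash-map idea on plain Scala collections (illustration only, not Spark's operator API; the selfHashJoin name is made up): one pass builds the hash map from the single input, and a second pass over the same map emits every matching pair, so the input is scanned and partitioned only once.
> {code}
> // Self equi-join that reads its input exactly once.
> def selfHashJoin[K, V](rows: Seq[(K, V)]): Seq[(V, V)] = {
>   // Build the hash map once, keyed by the join key.
>   val buckets: Map[K, Seq[(K, V)]] = rows.groupBy(_._1)
>   // Traverse the same map; every pair within a bucket matches.
>   for {
>     (_, bucket) <- buckets.toSeq
>     (_, left)   <- bucket
>     (_, right)  <- bucket
>   } yield (left, right)
> }
>
> // selfHashJoin(Seq(1 -> "a", 1 -> "b", 2 -> "c")) pairs rows sharing a key:
> //   (a,a), (a,b), (b,a), (b,b), (c,c) (in some order)
> {code}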