[
https://issues.apache.org/jira/browse/SPARK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275109#comment-15275109
]
Khaled Hammouda edited comment on SPARK-2183 at 5/7/16 5:41 AM:
----------------------------------------------------------------
Same here. We're running a self-join query on data retrieved from redshift
(using spark-redshift), and we can confirm that the redshift UNLOAD command is
being run twice.
The only way we could avoid the data being loaded twice is to cache the
DataFrame that represents the source data; this causes spark to fetch the data
once.
was (Author: khaledh):
Same here. We're running a self-join query on data retrieved from redshift
(using spark-redshift), and we can confirm that the redshift UNLOAD command is
being run twice.
> Avoid loading/shuffling data twice in self-join query
> -----------------------------------------------------
>
> Key: SPARK-2183
> URL: https://issues.apache.org/jira/browse/SPARK-2183
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Priority: Minor
>
> {code}
> scala> hql("select * from src a join src b on (a.key=b.key)")
> res2: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[3] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#3:0,value#4:1,key#5:2,value#6:3]
> HashJoin [key#3], [key#5], BuildRight
> Exchange (HashPartitioning [key#3:0], 200)
> HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)),
> None
> Exchange (HashPartitioning [key#5:0], 200)
> HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)),
> None
> {code}
> The optimal execution strategy for the above example is to load data only
> once and repartition once.
> If we want to hyper optimize it, we can also have a self join operator that
> builds the hashmap and then simply traverses the hashmap ...
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]