[
https://issues.apache.org/jira/browse/SPARK-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yash Datta updated SPARK-7097:
------------------------------
Description:
Currently when deciding about whether to create HashJoin or ShuffleHashJoin,
the size estimation of partitioned tables involved considers the size of entire
table. This results in many query plans using shuffle hash joins , where infact
only a small number of partitions may be being referred by the actual query
(due to additional filters), and hence these could be run using BroadCastHash
join.
The query plan should consider the size of only the referred partitions in such
cases
was:
Currently when deciding about whether to create HashJoin or ShuffleHashJoin,
the size estimation of partitioned tables involved considers the size of entire
table. This results in many query plans using shuffle hash joins , where infact
only a small number of partitions may be being referred by the actual query
(due to additional filters), and hence these could be run using Map side hash
join.
The query plan should consider the size of only the referred partitions in such
cases
> Partitioned tables should only consider referred partitions in query during
> size estimation for checking against autoBroadcastJoinThreshold
> -------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-7097
> URL: https://issues.apache.org/jira/browse/SPARK-7097
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1
> Reporter: Yash Datta
> Fix For: 1.4.0
>
>
> Currently when deciding about whether to create HashJoin or ShuffleHashJoin,
> the size estimation of partitioned tables involved considers the size of
> entire table. This results in many query plans using shuffle hash joins ,
> where infact only a small number of partitions may be being referred by the
> actual query (due to additional filters), and hence these could be run using
> BroadCastHash join.
> The query plan should consider the size of only the referred partitions in
> such cases
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]