[
https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931054#comment-15931054
]
Zhan Zhang commented on SPARK-20006:
------------------------------------
The default ShuffledHashJoin threshold can fallback to the broadcast one. A
separate configuration does provide us opportunities to optimize the join
dramatically. It would be great if CBO can automatically find the best
strategy. But probably I miss something. Currently the CBO does not collect
right statistics, especially for partitioned table.
https://issues.apache.org/jira/browse/SPARK-19890
> Separate threshold for broadcast and shuffled hash join
> -------------------------------------------------------
>
> Key: SPARK-20006
> URL: https://issues.apache.org/jira/browse/SPARK-20006
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Zhan Zhang
> Priority: Minor
>
> Currently both canBroadcast and canBuildLocalHashMap use the same
> configuration: AUTO_BROADCASTJOIN_THRESHOLD.
> But the memory model may be different. For broadcast, currently the hash map
> is always build on heap. For shuffledHashJoin, the hash map may be build on
> heap(longHash), or off heap(other map if off heap is enabled). The same
> configuration makes the configuration hard to tune (how to allocate memory
> onheap/offheap). Propose to use different configuration. Please comments
> whether it is reasonable.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]