[
https://issues.apache.org/jira/browse/HIVE-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885458#comment-15885458
]
Rui Li commented on HIVE-16046:
-------------------------------
Details of why we didn't choose broadcast for map join can be found in HIVE-7613.
But I agree we may want to revisit this.
> Broadcasting small table for Hive on Spark
> ------------------------------------------
>
> Key: HIVE-16046
> URL: https://issues.apache.org/jira/browse/HIVE-16046
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang_intel
>
> Currently the Spark plan is:
> {code}
> 1. TS(Small table)->Sel/Fil->HashTableSink
>
> 2. TS(Small table)->Sel/Fil->HashTableSink
>
>
> 3. HashTableDummy ----
>                        \
>    HashTableDummy ------\
>                          \
>    RootTS(Big table) -> Sel/Fil -> MapJoin -> Sel/Fil -> FileSink
> {code}
> 1. Run the small-table SparkWorks on the Spark cluster; each dumps its
> hash map to a file.
> 2. Run the SparkWork for the big table on the Spark cluster. Mappers
> look up the small-table hash maps from the files using the HashTableDummy's
> loader.
> The disadvantage of the current implementation is that distributing the
> hash table via the distributed cache takes a long time when the table is
> large. Here we want to use sparkContext.broadcast() to ship the small
> table instead, although that keeps the broadcast variable on the driver
> and may cause some performance decline there.
> [~Fred], [~xuefuz], [~lirui] and [~csun], please give some suggestions on it.
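For reference, the map-side lookup that each mapper performs against the small-table hash map can be sketched in plain Java (an illustrative sketch with made-up names, not Hive's actual HashTableSink/MapJoin code; the same in-memory map is what a sparkContext.broadcast() variable would carry to executors instead of a dumped file):

```java
import java.util.*;

// Illustrative sketch (not Hive's actual code): a map-side hash join.
// The small table is materialized as an in-memory hash map -- the
// structure HashTableSink dumps to a file today, and what
// sparkContext.broadcast() would ship to executors instead.
public class MapJoinSketch {
    // Small table: id -> name (small enough to fit in each mapper's memory).
    static final Map<Integer, String> SMALL_TABLE = Map.of(1, "us", 2, "uk");

    // Each "mapper" streams big-table rows and probes the hash map;
    // rows with no match are dropped (inner join).
    static List<String> mapJoin(List<Map.Entry<Integer, String>> bigRows) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> row : bigRows) {
            String name = SMALL_TABLE.get(row.getKey());
            if (name != null) {
                out.add(row.getKey() + "," + row.getValue() + "," + name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> big = List.of(
            Map.entry(1, "a"), Map.entry(3, "b"), Map.entry(2, "c"));
        System.out.println(mapJoin(big)); // prints [1,a,us, 2,c,uk]
    }
}
```

Whether the map arrives via a distributed-cache file or a broadcast variable, the per-row probe is the same; the trade-off discussed above is only about how the map reaches the executors.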
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)