[ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189547#comment-14189547
 ] 

Xuefu Zhang commented on HIVE-8621:
-----------------------------------

[~ssatish] Thanks for sharing your thoughts and findings. We have been
reevaluating Spark's broadcast variables for broadcasting small tables.
Spark's broadcast variables work well for small amounts of data, but memory
issues mount when broadcasting large amounts of data. For a bucket
join, the table to be broadcast isn't necessarily small. To make things worse,
Spark needs to keep the variable live at the driver, even after the variable is
broadcast. For this reason, we are considering using MR's way of broadcasting
the small tables. I'm working on a writeup and will create subtasks for this
piece. Hopefully, we can reuse or clone a fair amount of code.

> Dump small table join data into appropriate number of broadcast variables 
> [Spark Branch]
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-8621
>                 URL: https://issues.apache.org/jira/browse/HIVE-8621
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Suhas Satish
>            Assignee: Suhas Satish
>
> The number of broadcast variables that must be created is m x n, where
> m is the number of small tables in the (m+1)-way join and n is the number
> of buckets per table. If the tables are unbucketed, n = 1.
> This is a sub-task of map-join for Spark
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
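
As a rough illustration of the m x n count in the description above, here is a
minimal sketch. The function name and its parameters are hypothetical and are
not taken from the Hive or Spark code base; it assumes every small table has
the same number of buckets.

```python
def broadcast_variable_count(small_tables: int, buckets: int = 1) -> int:
    """Return m x n, the number of broadcast variables for an (m+1)-way join.

    small_tables -- m, the number of small tables in the join (hypothetical name)
    buckets      -- n, the number of buckets per table; 1 if unbucketed
    """
    if small_tables < 0:
        raise ValueError("small_tables must be non-negative")
    if buckets < 1:
        raise ValueError("buckets must be at least 1")
    return small_tables * buckets


# A 3-way join has m = 2 small tables; with n = 4 buckets that is 2 * 4 = 8.
print(broadcast_variable_count(2, 4))  # prints 8

# Unbucketed tables use the default n = 1, so the count is just m.
print(broadcast_variable_count(2))     # prints 2
```

The count grows linearly with the bucket count, which is one reason the total
broadcast volume for a bucket join need not be small.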



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
