[jira] [Commented] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]

Suhas Satish (JIRA) Wed, 29 Oct 2014 19:46:08 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189531#comment-14189531
 ]


Suhas Satish commented on HIVE-8621:
------------------------------------

So far in the Spark implementation we are not tagging the small 
tables, but I realized that we need to tag them so that we can use different 
broadcast variables for different tables. 

Also, we have 2 reduce sinks (RS) for the 2 small tables in a 3-way map-join. 

In M/R, we have only one HashTableSink operator (HTS) for all small tables 
combined. This conversion from RS -> HTS happens in LocalMapJoinProcFactory 
and is triggered by rule R7 
(MapReduceCompiler: MapJoinFactory.getTableScanMapJoin) in the 
TaskCompiler.optimizeTaskPlan phase. 

Using logic similar to LocalMapJoinProcFactory in SparkMapJoinResolver, we 
will end up with 2 HashTableSinks (or in general, (n-1) HTS for an n-way join). 
Each of these will generate its own broadcast variable. 
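
To make the counting and tagging concrete, here is a minimal sketch. It 
assumes tags are simply the table positions 0..n-1 and that the big table's 
tag is skipped. MapJoinSinkSketch, its methods, and the key format are 
hypothetical names for illustration, not Hive or Spark APIs. 

```java
// Illustrative sketch only: MapJoinSinkSketch and its methods are hypothetical
// names, not classes from Hive or Spark.
final class MapJoinSinkSketch {

    // An n-way map-join has one big table and (n-1) small tables,
    // so it needs (n-1) HashTableSinks.
    static int hashTableSinkCount(int joinWays) {
        if (joinWays < 2) {
            throw new IllegalArgumentException("a join needs at least 2 tables");
        }
        return joinWays - 1;
    }

    // Derive one broadcast-variable key per small table from its tag.
    // Assumption: tags are the table positions 0..n-1; the big table is skipped.
    static String[] broadcastKeys(int joinWays, int bigTableTag) {
        String[] keys = new String[hashTableSinkCount(joinWays)];
        if (bigTableTag < 0 || bigTableTag >= joinWays) {
            throw new IllegalArgumentException("big table tag out of range");
        }
        int next = 0;
        for (int tag = 0; tag < joinWays; tag++) {
            if (tag != bigTableTag) {
                keys[next++] = "small-table-tag-" + tag;
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // 3-way map-join with the big table at position 0: two small tables.
        for (String key : broadcastKeys(3, 0)) {
            System.out.println(key);
        }
    }
}
```

Without distinct tags there would be nothing to key the separate broadcast 
variables on, which is why the tagging is needed first. 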

After going through Sandy Ryza's Spark presentation here, 
http://www.slideshare.net/SandyRyza/spark-job-failures-talk
it looks like the recommended way to distribute compute in Spark is to have a 
large number of SparkTasks. So I think it's better to have each MapWork from 
each small table as a separate SparkTask. This can be tackled independently in 
this JIRA, if you guys agree: 
https://issues.apache.org/jira/browse/HIVE-8622


> Dump small table join data into appropriate number of broadcast variables 
> [Spark Branch]
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-8621
>                 URL: https://issues.apache.org/jira/browse/HIVE-8621
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Suhas Satish
>            Assignee: Suhas Satish
>
> The number of broadcast variables that must be created is m x n, where
> m is the number of small tables in the (m+1)-way join and n is the number 
> of buckets of the tables. If the tables are unbucketed, n=1.
> This is a sub-task of map-join for Spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
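
As a quick check of the m x n count in the issue description above, a 
minimal sketch follows. BroadcastVariableCountSketch and its method are 
hypothetical names, and the sketch assumes every small table has the same 
number of buckets. 

```java
// Illustrative sketch only: the class and method names are hypothetical.
final class BroadcastVariableCountSketch {

    // m small tables times n buckets; n is 1 when the tables are unbucketed.
    // Assumption: all small tables share the same bucket count.
    static int broadcastVariableCount(int smallTables, int buckets) {
        if (smallTables < 1 || buckets < 1) {
            throw new IllegalArgumentException("both counts must be at least 1");
        }
        return Math.multiplyExact(smallTables, buckets);
    }

    public static void main(String[] args) {
        // 3-way join (m = 2), unbucketed (n = 1).
        System.out.println(broadcastVariableCount(2, 1));
        // 3-way join (m = 2), 4 buckets (n = 4).
        System.out.println(broadcastVariableCount(2, 4));
    }
}
```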



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
