[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189531#comment-14189531 ]
Suhas Satish commented on HIVE-8621:
------------------------------------

So far in the Spark implementation we are not tagging the small tables, but I realized that we need to tag them to be able to use different broadcast variables for different tables.

Also, we have 2 reduce sinks (RS) for the 2 small tables in a 3-way map-join. In M/R, we have only one HashTableSink operator (HTS) for all small tables combined. This conversion from RS -> HTS happens in LocalMapJoinProcFactory and is triggered by rule R7 (MapReduceCompiler: MapJoinFactory.getTableScanMapJoin) in the TaskCompiler.optimizeTaskPlan phase.

If we use logic similar to LocalMapJoinProcFactory in SparkMapJoinResolver, we will end up with 2 HashTableSinks (or in general, (n-1) HTS for an n-way join). Each of these will generate its own broadcast variable.

After going through Sandy Ryza's Spark presentation here,
http://www.slideshare.net/SandyRyza/spark-job-failures-talk
it looks like the recommended way to distribute compute in Spark is to have a large number of SparkTasks. So I think it's better to have each MapWork from each small table as a separate SparkTask. This can be tackled independently in this jira if you guys agree:
https://issues.apache.org/jira/browse/HIVE-8622

> Dump small table join data into appropriate number of broadcast variables [Spark Branch]
> ----------------------------------------------------------------------------------------
>
> Key: HIVE-8621
> URL: https://issues.apache.org/jira/browse/HIVE-8621
> Project: Hive
> Issue Type: Sub-task
> Reporter: Suhas Satish
> Assignee: Suhas Satish
>
> The number of broadcast variables that must be created is m x n, where
> 'm' is the number of small tables in the (m+1)-way join and 'n' is the number
> of buckets of the tables. If unbucketed, n=1.
> This is a sub-task of map-join for Spark:
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join:
> https://issues.apache.org/jira/browse/HIVE-8616

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
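
To make the m x n count from the issue description concrete, here is a minimal sketch in Java. It is not Hive code: the class BroadcastVariablePlan, its methods, and the "tag<i>-bucket<j>" key format are hypothetical names chosen for illustration, and the tags are assumed to be simple indices over the small tables only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive: enumerates one broadcast-variable
// key per (small-table tag, bucket) pair, following the m x n rule.
final class BroadcastVariablePlan {
    private BroadcastVariablePlan() {}

    // m small tables (from an (m+1)-way join) times n buckets.
    // Pass buckets = 1 for unbucketed tables.
    static int count(int smallTables, int buckets) {
        if (smallTables < 0 || buckets < 1) {
            throw new IllegalArgumentException(
                "smallTables must be >= 0 and buckets must be >= 1");
        }
        return smallTables * buckets;
    }

    // One key per (tag, bucket) pair; the key format is an assumption.
    static List<String> keys(int smallTables, int buckets) {
        List<String> keys = new ArrayList<>(count(smallTables, buckets));
        for (int tag = 0; tag < smallTables; tag++) {
            for (int bucket = 0; bucket < buckets; bucket++) {
                keys.add("tag" + tag + "-bucket" + bucket);
            }
        }
        return keys;
    }
}
```

For the 3-way map-join mentioned in the comment (2 small tables, unbucketed), count(2, 1) gives 2, one broadcast variable per small table; with 4 buckets it would give 8.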