[ https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Szehon Ho updated HIVE-8639:
----------------------------
    Affects Version/s: spark-branch
               Status: Patch Available  (was: Open)

I have a patch for this JIRA. Instead of making an SMB -> MapJoin path, I introduce a new unified join processor called 'SparkJoinOptimizer' in the logical layer. It calls the SMB or MapJoin optimizers in a certain order, depending on which flags are set and which optimizer applies, so there is no need to write a separate SMB -> MapJoin path.

Two issues so far during this refactoring:

1. The NonBlockingOpDeDupProc optimizer does not update joinContext, which prevents any SMB optimizer from running after it. Submitted a patch in HIVE-9060 that should be committed to trunk; it is also included in this spark-branch patch.

2. auto_sortmerge_join_9 failure. This test was passing until yesterday, when bucket-map join was enabled in HIVE-8638. As expected, by choosing MapJoin over SMB join, the join may become a BucketMapJoin. Some of the more complicated queries in the test get converted to BucketMapJoin and fail. Can probably file a new JIRA to fix this test, as it's a BucketMapJoin issue. Might need the help of [~jxiang] on this one.
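For clarity, the dispatch idea behind the unified processor can be sketched roughly as below. This is only an illustrative sketch, not the actual Hive API: the class, method, and flag names here are hypothetical stand-ins for the real optimizer classes and HiveConf settings.

```java
// Hypothetical sketch of a unified join-processor dispatch: try the
// optimizers in a fixed order depending on which flags are set and
// which conversion actually applies. Names are illustrative only.
public class SparkJoinOptimizerSketch {

    enum JoinStrategy { SMB_JOIN, MAP_JOIN, COMMON_JOIN }

    static JoinStrategy chooseJoin(boolean smbEnabled,
                                   boolean mapJoinEnabled,
                                   boolean tablesSortedAndBucketed,
                                   boolean smallTableFitsInMemory) {
        // Prefer map join when the small table fits in memory, since a
        // partitioned SMB join may leave each mapper only a tiny chunk
        // of a partition with a single key.
        if (mapJoinEnabled && smallTableFitsInMemory) {
            return JoinStrategy.MAP_JOIN;
        }
        // Otherwise fall back to SMB join if the tables qualify.
        if (smbEnabled && tablesSortedAndBucketed) {
            return JoinStrategy.SMB_JOIN;
        }
        // Neither conversion applies: keep the common (shuffle) join.
        return JoinStrategy.COMMON_JOIN;
    }

    public static void main(String[] args) {
        System.out.println(chooseJoin(true, true, true, true));   // MAP_JOIN
        System.out.println(chooseJoin(true, false, true, false)); // SMB_JOIN
    }
}
```

Because each conversion is attempted in one place, there is no need for a second pass that rewrites an already-converted SMB join back into a map join.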
Exception is below:
{noformat}
2014-12-10 15:31:38,527 WARN [task-result-getter-3]: scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 1.0 in stage 50.0 (TID 80, 172.19.8.203): java.lang.RuntimeException: Hive Runtime Error while closing operators
	at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:207)
	at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.closeRecordProcessor(HiveMapFunctionResultList.java:57)
	at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:108)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
	at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
	at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
	at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
	at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.closeOp(SparkHashTableSinkOperator.java:87)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
	at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:185)
	... 15 more
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.hive.ql.plan.BucketMapJoinContext.getMappingBigFile(BucketMapJoinContext.java:187)
	at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.flushToFile(SparkHashTableSinkOperator.java:100)
	at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.closeOp(SparkHashTableSinkOperator.java:81)
	... 21 more
{noformat}

> Convert SMBJoin to MapJoin [Spark Branch]
> -----------------------------------------
>
>                 Key: HIVE-8639
>                 URL: https://issues.apache.org/jira/browse/HIVE-8639
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: spark-branch
>            Reporter: Szehon Ho
>            Assignee: Szehon Ho
>         Attachments: HIVE-8639.1-spark.patch
>
>
> HIVE-8202 supports auto-conversion of SMB Join. However, if the tables are
> partitioned, there could be a slowdown as each mapper would need to get a
> very small chunk of a partition which has a single key. Thus, in some
> scenarios it's beneficial to convert SMB join to map join.
> The task is to research and support the conversion from SMB join to map join
> for the Spark execution engine. See the MapReduce equivalent in
> SortMergeJoinResolver.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)