[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274401#comment-14274401 ] Szehon Ho commented on HIVE-7613:

A very belated Happy New Year! I uploaded this PDF as a wiki page, which is more maintainable: [Hive on Spark: Join Design Master|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Join+Design+Master]. It is linked as a child of the [DesignDocs|https://cwiki.apache.org/confluence/display/Hive/DesignDocs] page.

Research optimization of auto convert join to map join [Spark branch]
----------------------------------------------------------------------

                Key: HIVE-7613
                URL: https://issues.apache.org/jira/browse/HIVE-7613
            Project: Hive
         Issue Type: Sub-task
         Components: Spark
           Reporter: Chengxiang Li
           Assignee: Suhas Satish
           Priority: Minor
             Labels: TODOC-SPARK
            Fix For: spark-branch
        Attachments: HIve on Spark Map join background.docx, Hive on Spark Join Master Design.pdf, small_table_broadcasting.pdf

ConvertJoinMapJoin is an optimization that replaces a common join (aka shuffle join) with a map join (aka broadcast or fragment replicate join) when possible. We need to research how to make it work with Hive on Spark.
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262043#comment-14262043 ] Szehon Ho commented on HIVE-7613:

That's a good idea; it would be useful. I'll look into that when I get back after New Year's.
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262057#comment-14262057 ] Lefty Leverenz commented on HIVE-7613:

Thanks [~szehon], and Happy New Year! I added a TODOC-SPARK label just to help us keep track of this.
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260943#comment-14260943 ] Lefty Leverenz commented on HIVE-7613:

Should this join design doc be added to the wiki? Or, if not, should the existing Hive on Spark: Getting Started page include a link to it?
* [Hive on Spark: Getting Started|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started]
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185878#comment-14185878 ] Suhas Satish commented on HIVE-7613:

Submitted a patch for HIVE-8616. This can be used as the baseline patch for subsequent sub-tasks.
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138567#comment-14138567 ] Suhas Satish commented on HIVE-7613:

Hi Xuefu, that's a good idea. I was thinking along the lines of calling SparkContext's addFile method in each of the N-1 Spark jobs in HashTableSinkOperator.java to write the hash tables out as files, and then reading them back in the map-only join job in MapJoinOperator. But that approach doesn't involve RDDs.
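For reference, here is a minimal, hypothetical sketch of the addFile-based alternative described above. Only SparkContext.addFile and SparkFiles.get are real Spark APIs; the class, method names, and the serialization step (a dumpHashTableToFile helper that is not shown) are illustrative assumptions, not the actual Hive code paths in HashTableSinkOperator/MapJoinOperator.

{code:java}
import java.io.File;

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class AddFileApproachSketch {

  // Driver side: after a hypothetical dumpHashTableToFile() has serialized the
  // small table's hash table to a local file, register that file with the
  // SparkContext so every executor can fetch a copy of it.
  public static void shipHashTable(JavaSparkContext sc, String localHashTablePath) {
    sc.addFile(localHashTablePath);
  }

  // Executor side (e.g. where MapJoinOperator loads its hash table): resolve the
  // shipped file by name and deserialize it before streaming the big table's rows.
  public static File locateHashTable(String fileName) {
    return new File(SparkFiles.get(fileName));
  }
}
{code}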
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139611#comment-14139611 ] Suhas Satish commented on HIVE-7613:

{{ConvertJoinMapJoin}} heavily uses {{OptimizeTezProcContext}}. Although we have an equivalent {{OptimizeSparkProcContext}}, the two are not derived from a common ancestor class. We will need some class-hierarchy redesign/refactoring to make {{ConvertJoinMapJoin}} generic enough to support multiple execution frameworks. For now, I am thinking of proceeding with a cloned {{SparkConvertJoinMapJoin}} class that uses {{OptimizeSparkProcContext}}. We might need to open a JIRA for this refactoring.
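One possible shape of that refactoring, purely as an illustration: extract a small engine-neutral interface that both {{OptimizeTezProcContext}} and {{OptimizeSparkProcContext}} could implement, so {{ConvertJoinMapJoin}} depends only on the shared surface instead of being cloned per execution framework. The interface and method names below are hypothetical and do not exist in Hive; only HiveConf and ParseContext are real Hive classes.

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.parse.ParseContext;

// Hypothetical common ancestor: OptimizeTezProcContext and OptimizeSparkProcContext
// would both implement it, letting ConvertJoinMapJoin stay engine-agnostic.
public interface OptimizeProcContext {
  HiveConf getConf();
  ParseContext getParseContext();
}
{code}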
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138560#comment-14138560 ] Xuefu Zhang commented on HIVE-7613:

Here is what I have in mind:
1. For an N-way join being converted to a map join, we can run N-1 Spark jobs, one for each small input to the join (assuming a transformation is needed; if not, no Spark job is required). Each job produces an RDD, so we end up with N-1 RDDs.
2. Dump the content of the RDDs into the data structure (hash tables) that is needed by MapJoinOperator.
3. Call SparkContext.broadcast() on that data structure. This will broadcast the data structure to all nodes.
4. Then we can launch the map-only join job, which can load the broadcast data structure via the HashTableLoader interface.
For more information about Spark's broadcast variables, please refer to http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables. (A rough sketch of steps 2 and 3 appears below.)
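As a rough illustration of steps 2 and 3 only (not the actual Hive implementation), the Java sketch below collects a small-table RDD into an in-memory hash table and broadcasts it with Spark's broadcast-variable API; the map-only join job of step 4 would then read the broadcast value from inside a HashTableLoader implementation. The key/value types are placeholders for whatever row containers MapJoinOperator actually uses.

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class SmallTableBroadcastSketch {

  // Step 2: pull the small-table RDD produced by one of the N-1 jobs into an
  // in-memory hash table keyed on the join key.
  // Step 3: broadcast that hash table so every executor gets a read-only copy.
  public static Broadcast<Map<String, List<String>>> broadcastSmallTable(
      JavaSparkContext sc, JavaPairRDD<String, List<String>> smallTableRdd) {

    Map<String, List<String>> hashTable = new HashMap<String, List<String>>();
    for (Tuple2<String, List<String>> row : smallTableRdd.collect()) {
      hashTable.put(row._1(), row._2());
    }
    return sc.broadcast(hashTable);
  }
}
{code}

The appeal of broadcasting is that only the small inputs move across the cluster; each executor then joins its big-table splits against its local copy, so no shuffle of the big table is needed.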
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120697#comment-14120697 ] Suhas Satish commented on HIVE-7613:

As part of this work, we should also enable auto_sortmerge_join_1.q, which currently fails with:
{code:title=auto_sortmerge_join_1.stackTrace|borderStyle=solid}
2014-09-03 16:12:59,607 ERROR [main]: spark.SparkClient (SparkClient.java:execute(166)) - Error executing Spark Plan
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 1, localhost): java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {key:0,value:val_0,ds:2008-04-08}
        org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:151)
        org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:47)
        org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:28)
        org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:99)
        scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1177)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1166)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1165)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1165)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1383)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108600#comment-14108600 ] Brock Noland commented on HIVE-7613:

As part of this work we should enable auto_sortmerge_join_13.q, which currently fails with:
{noformat}
Done query: auto_sortmerge_join_12.q elapsedTime=8s
Begin query: auto_sortmerge_join_13.q
java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:455)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:836)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:583)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:175)
        at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.closeRecordProcessor(HiveMapFunctionResultList.java:57)
        at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:111)
        at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:459)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:836)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:583)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:595)
        at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:175)
        at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.closeRecordProcessor(HiveMapFunctionResultList.java:57)
        at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:111)
        at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:455)
{noformat}
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101475#comment-14101475 ] Brock Noland commented on HIVE-7613:

Note that MapJoin is a stretch goal of M1 and a specific goal of M2.
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085956#comment-14085956 ] Xuefu Zhang commented on HIVE-7613:

Thanks for logging this, [~chengxiang li]. This is a good area to look at. Since it's an optimization, it doesn't belong to either Milestone 1 or 2. Thus, let's give this one a lower priority, unless there are no tasks at a higher priority.