> On Sept. 19, 2014, 5:45 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java, line 142
> > <https://reviews.apache.org/r/25394/diff/3/?file=693788#file693788line142>
> >
> >     Here we are mapping the children of lca to lca itself. Why is this
> >     necessary, as you can find the children of lca later without the map?
> >     Can't we just store lca here?
> 
> Chao Sun wrote:
>     The problem is that we are only generating one FS (FileSink) but multiple
>     TSs (TableScans). After the FS and the first TS are generated, the
>     child-parent relation is lost (since the op tree is modified), and hence
>     we need to store this information somewhere else, to be used when
>     processing the remaining TSs.

It might be tricky to just store the LCA. When the graph walker reaches a node, it needs to check whether that node is a child of the LCA, and if so, break the plan. You could argue that since we have the LCA, we also have all of its children's info. However, after the first child is processed, the LCA's children change, so we need to store this info somewhere else, IMHO.

- Chao
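[Editor's note: to make the bookkeeping above concrete, here is a minimal, self-contained sketch of the child-to-LCA map idea. This is not the actual SparkTableScanProcessor code: the Op class and the recordLcaChildren/isLcaChild methods are hypothetical stand-ins for Hive's operator tree and the processor's internal map.]

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LcaChildMapSketch {

        // Hypothetical stand-in for a node in Hive's operator tree.
        static class Op {
            final String name;
            final List<Op> children = new ArrayList<>();
            final List<Op> parents = new ArrayList<>();
            Op(String name) { this.name = name; }
            void addChild(Op c) { children.add(c); c.parents.add(this); }
        }

        // Snapshot of child -> LCA links, taken before the op tree is mutated.
        private final Map<Op, Op> childToLca = new HashMap<>();

        // Record every current child of the LCA before any plan-breaking.
        void recordLcaChildren(Op lca) {
            for (Op child : lca.children) {
                childToLca.put(child, lca);
            }
        }

        // Called on each node the walker visits. The live lca.children list
        // may already have been rewired by an earlier break, so we consult
        // the snapshot instead of the tree itself.
        boolean isLcaChild(Op node) {
            return childToLca.containsKey(node);
        }

        public static void main(String[] args) {
            Op lca = new Op("LCA");
            Op child1 = new Op("child1");
            Op child2 = new Op("child2");
            lca.addChild(child1);
            lca.addChild(child2);

            LcaChildMapSketch sketch = new LcaChildMapSketch();
            sketch.recordLcaChildren(lca);

            // Simulate breaking the plan at the first child: the original
            // link is severed (in the real plan a new TS would become
            // child1's parent), but the snapshot still remembers that
            // child1 was a child of the LCA.
            lca.children.remove(child1);
            child1.parents.remove(lca);

            System.out.println(sketch.isLcaChild(child1)); // true, via the map
            System.out.println(sketch.isLcaChild(child2)); // true
        }
    }

The point of the map is simply that it survives mutation of the operator tree: once the first branch has been cut over, lca.children no longer reflects the relationships the walker still needs to recognize, while the snapshot does.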
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25394/#review53871
-----------------------------------------------------------


On Sept. 18, 2014, 6:38 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25394/
> -----------------------------------------------------------
> 
> (Updated Sept. 18, 2014, 6:38 p.m.)
> 
> 
> Review request for hive, Brock Noland and Xuefu Zhang.
> 
> 
> Bugs: HIVE-7503
>     https://issues.apache.org/jira/browse/HIVE-7503
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> For Hive's multi-insert query
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), there
> may be an MR job for each insert. When we implement this with Spark, it
> would be nice if all the inserts could happen concurrently.
> This functionality doesn't seem to be available in Spark. To make things
> worse, the source of the insert may be re-computed unless it's staged. Even
> with staging, the inserts will happen sequentially, hurting performance.
> This task is to find out what it takes in Spark to enable this without
> requiring staging of the source or sequential insertion. If this has to be
> solved in Hive, find an optimal way to do it.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 4211a0703f5b6bfd8a628b13864fac75ef4977cf 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 695d8b90cb1989805a7ff4e39a9635bbcea9c66c 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 864965e03a3f9d665e21e1c1b10b19dc286b842f 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 76fc290f00430dbc34dbbc1a0cef0d0eb59e6029 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 5fcaf643a0e90fc4acc21187f6d78cefdb1b691a 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25394/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>