-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25394/#review53871
-----------------------------------------------------------


Nice work.

Besides the comments below, I think there are some improvements that can be done, 
either here or in a different patch:

1. If we have a module that can compile an op tree (given by its top ops) into a 
spark task, then we can reuse it after the original op tree is broken into 
several trees: we compile each tree into a spark task, and in the end we hook up 
the parent-child relationships. The current logic is a little complicated and 
hard to understand.
2. Tests 
3. Optimizations
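
A rough, purely illustrative sketch of the structure suggested in item 1, using 
stand-in Tree/Task classes rather than the real Hive/Spark ones:

```java
import java.util.ArrayList;
import java.util.List;

public class CompileSketch {
  // Hypothetical stand-ins for an op tree and a spark task.
  static class Tree {
    final String name;
    Tree(String name) { this.name = name; }
  }

  static class Task {
    final String name;
    final List<Task> children = new ArrayList<>();
    Task(String name) { this.name = name; }
  }

  // The reusable module: compile one op tree (given by its top ops)
  // into one spark task.
  static Task compile(Tree tree) {
    return new Task("task:" + tree.name);
  }

  public static void main(String[] args) {
    // After the original op tree is broken into several trees...
    List<Tree> trees = List.of(new Tree("t1"), new Tree("t2"), new Tree("t3"));
    List<Task> tasks = new ArrayList<>();
    for (Tree t : trees) {
      tasks.add(compile(t));
    }
    // ...hook up the parent-child relationships among the resulting
    // tasks (here t1 parents t2 and t3, purely for illustration).
    tasks.get(0).children.add(tasks.get(1));
    tasks.get(0).children.add(tasks.get(2));
    System.out.println(tasks.get(0).name + " -> "
        + tasks.get(0).children.size() + " children"); // prints "task:t1 -> 2 children"
  }
}
```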


ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java
<https://reviews.apache.org/r/25394/#comment93732>

    maybe we can call it opToParentMap?



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
<https://reviews.apache.org/r/25394/#comment93733>

    Comment here?



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java
<https://reviews.apache.org/r/25394/#comment93736>

    We should be able to reuse the hash map by emptying the previous one.



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
<https://reviews.apache.org/r/25394/#comment93817>

    Let's use meaningful variable names even though they are local.



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
<https://reviews.apache.org/r/25394/#comment93820>

    I feel that the logic here can be simplified. Could we just pop all paths 
and then check if the root is the same and keep doing so until the common 
parent is found?
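
The path-popping idea could look roughly like this (a hypothetical helper 
operating on root-to-operator paths as plain lists, not the actual Hive 
Operator objects): walk both paths from the root and keep advancing while they 
agree; the last matching node is the common parent.

```java
import java.util.List;

public class LcaSketch {
  // Given root-to-operator paths for two operators, advance from the
  // root while both paths agree; the last matching node is the lowest
  // common ancestor reachable from the root.
  static <T> T commonParent(List<T> pathA, List<T> pathB) {
    T common = null;
    int n = Math.min(pathA.size(), pathB.size());
    for (int i = 0; i < n; i++) {
      if (!pathA.get(i).equals(pathB.get(i))) {
        break;
      }
      common = pathA.get(i);
    }
    return common;
  }

  public static void main(String[] args) {
    // Operator names here are illustrative only.
    List<String> p1 = List.of("TS", "FIL", "SEL", "FS1");
    List<String> p2 = List.of("TS", "FIL", "GBY", "FS2");
    System.out.println(commonParent(p1, p2)); // prints "FIL"
  }
}
```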



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
<https://reviews.apache.org/r/25394/#comment93821>

    This seems to cover only the case where all FSs have a common FORWARD 
parent. What if only some of them share a FORWARD parent, but the other FSs and 
the FORWARD operator share some common parent?
    
    I think the rule for whether to break the plan goes like this:
    
    A plan needs to be broken if and only if there is more than one 
FileSinkOperator that can be traced back to a common parent, and the tracing has 
to pass a ReduceSinkOperator on the way.
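
A minimal sketch of that rule, assuming simplified operator nodes with a type 
tag and parent links (not the real Hive Operator classes), and reading the rule 
as requiring an RS on each FS's trace to the shared ancestor:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BreakRuleSketch {
  // Hypothetical, simplified operator node: a type tag plus parent links.
  static class Op {
    final String type; // e.g. "TS", "RS", "FS"
    final List<Op> parents = new ArrayList<>();
    Op(String type) { this.type = type; }
  }

  // Collect every ancestor of 'op' that is reached through at least one
  // ReduceSinkOperator ("RS") on the way up.
  static Set<Op> ancestorsPastRS(Op op) {
    Set<Op> result = new HashSet<>();
    walkUp(op, false, result);
    return result;
  }

  private static void walkUp(Op op, boolean passedRS, Set<Op> result) {
    for (Op p : op.parents) {
      boolean passed = passedRS || p.type.equals("RS");
      if (passed) {
        result.add(p);
      }
      walkUp(p, passed, result);
    }
  }

  // The plan needs to be broken iff two FileSinkOperators share an
  // ancestor that each of them reaches through an RS.
  static boolean needsBreak(List<Op> fileSinks) {
    for (int i = 0; i < fileSinks.size(); i++) {
      Set<Op> a = ancestorsPastRS(fileSinks.get(i));
      for (int j = i + 1; j < fileSinks.size(); j++) {
        if (!Collections.disjoint(a, ancestorsPastRS(fileSinks.get(j)))) {
          return true;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // TS feeds two RS branches, each ending in an FS: must break.
    Op ts = new Op("TS");
    Op rs1 = new Op("RS"); rs1.parents.add(ts);
    Op fs1 = new Op("FS"); fs1.parents.add(rs1);
    Op rs2 = new Op("RS"); rs2.parents.add(ts);
    Op fs2 = new Op("FS"); fs2.parents.add(rs2);
    System.out.println(needsBreak(List.of(fs1, fs2))); // prints "true"
  }
}
```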



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
<https://reviews.apache.org/r/25394/#comment93847>

    Here we are mapping the children of lca to lca itself. Why is this 
necessary, given that you can find the children of lca later without the map? 
Can't we just store lca here?


- Xuefu Zhang


On Sept. 18, 2014, 6:38 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25394/
> -----------------------------------------------------------
> 
> (Updated Sept. 18, 2014, 6:38 p.m.)
> 
> 
> Review request for hive, Brock Noland and Xuefu Zhang.
> 
> 
> Bugs: HIVE-7503
>     https://issues.apache.org/jira/browse/HIVE-7503
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> For Hive's multi insert query 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), there 
> may be an MR job for each insert. When we achieve this with Spark, it would 
> be nice if all the inserts can happen concurrently.
> It seems that this functionality isn't available in Spark. To make things 
> worse, the source of the insert may be re-computed unless it's staged. Even 
> with this, the inserts will happen sequentially, making the performance 
> suffer.
> This task is to find out what it takes in Spark to enable this without requiring 
> staging the source and sequential insertion. If this has to be solved in 
> Hive, find out an optimum way to do this.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 
> 4211a0703f5b6bfd8a628b13864fac75ef4977cf 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 
> 695d8b90cb1989805a7ff4e39a9635bbcea9c66c 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 
> 864965e03a3f9d665e21e1c1b10b19dc286b842f 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 
> 76fc290f00430dbc34dbbc1a0cef0d0eb59e6029 
>   
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java
>  5fcaf643a0e90fc4acc21187f6d78cefdb1b691a 
>   
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25394/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>
