Re: Review Request 15261: PIG-3555 Initial implementation of Tez combiner optimization

Cheolsoo Park Thu, 07 Nov 2013 23:38:47 -0800


> On Nov. 6, 2013, 10:54 p.m., Mark Wagner wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java, 
> > line 180
> > <https://reviews.apache.org/r/15261/diff/1/?file=379002#file379002line180>
> >
> >     Does this work properly? I was thinking that the key class, comparator, 
> > etc would also need to be a part of this conf. They're not right now but I 
> > thought tez was falling back if no edge conf was present


The key class, comparator, etc are added to the payload of edges by 
setIntermediateInputKeyValue() and setIntermediateOutputKeyValue() calls in 
addCombiner().


> On Nov. 6, 2013, 10:54 p.m., Mark Wagner wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java,
> >  line 76
> > <https://reviews.apache.org/r/15261/diff/1/?file=379007#file379007line76>
> >
> >     Does this still hold true for tez? What about a load + group bys on 
> > different keys?

True. Can we punt this for now?


> On Nov. 6, 2013, 10:54 p.m., Mark Wagner wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java, 
> > lines 306-313
> > <https://reviews.apache.org/r/15261/diff/1/?file=379002#file379002line306>
> >
> >     Move to edge creation

Right now, the vertex pipeline still has POPackage in plan, so setting input 
keys in the payload of vertex is necessary.


> On Nov. 6, 2013, 10:54 p.m., Mark Wagner wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java, 
> > line 297
> > <https://reviews.apache.org/r/15261/diff/1/?file=379002#file379002line297>
> >
> >     This should go with the rest of the edge creation. Anything that's 
> > related to shuffles or input/outputs needs to be done per edge.

Now I create a new configuration object per edge and set up input/output keys 
there. However, I still need to set input/output keys in vertex because I do 
not eliminate local rearrange and package from its pipeline. Please correct me 
if I am misunderstanding.


> On Nov. 6, 2013, 10:54 p.m., Mark Wagner wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/CombinerOptimizer.java,
> >  line 39
> > <https://reviews.apache.org/r/15261/diff/1/?file=379000#file379000line39>
> >
> >     I think a more global view of things is needed here. For example, I 
> > think a multi-parent node will have trouble. This could show up with a 
> > group by + join on the same key, where the group by may need a combiner.
> >     
> >     It may be necessary to add more information when the TezOps are 
> > constructed.

I actually solved this in a different way. Instead of keeping the reference to 
the previously visited vertex, I use TezOperPlan (parent plan of each 
TezOperator) to retrieve predecessors while visiting a vertex. So now I can 
handle the multi-parent node case nicely.


> On Nov. 6, 2013, 10:54 p.m., Mark Wagner wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java, 
> > line 294
> > <https://reviews.apache.org/r/15261/diff/1/?file=379002#file379002line294>
> >
> >     We can remove this since the work I'm referencing is essentially this 
> > JIRA


- Cheolsoo


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15261/#review28317
-----------------------------------------------------------


On Nov. 6, 2013, 11:55 p.m., Cheolsoo Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15261/
> -----------------------------------------------------------
> 
> (Updated Nov. 6, 2013, 11:55 p.m.)
> 
> 
> Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini 
> Palaniswamy.
> 
> 
> Bugs: PIG-3555
>     https://issues.apache.org/jira/browse/PIG-3555
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Initial implementation of Tez combiner optimizer. The patch includes the 
> following changes-
> * Factored out CombinerOptimizer code into a utility class called 
> CombinerOptimizerUtil. So both MR and Tez CombinerOptimizer use this utility 
> class instead of duplicating code.
> * Introduced a new class called TezEdgeDescriptor that holds combine plans as 
> well as various edge properties.
> * Added TezEdgeDescriptors to TezOperator. Note that I added multiple 
> descriptors for inbound edges but a single descriptor for all the outbound 
> edges. This is because TezDagBuilder always creates an edge by connecting 
> predecessors to the current vertex. Please let me know if you think we should 
> allow multiple descriptors for outbound edges too.
> * Refactored some code in TezDagBuilder while touching it.
> 
> 
> Diffs
> -----
> 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
>  18a382b 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/CombinerOptimizer.java 
> e69de29 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 
> 0b1f3c9 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 
> 45e47b0 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java 
> e69de29 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java 
> 3f14644 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java 
> e612d88 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java 
> 5a42ded 
>   
> src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java
>  e69de29 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC1.gld 925f07e 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC2.gld a3974fe 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC3.gld a8c942b 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld fb7c903 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld e6cd25e 
> 
> Diff: https://reviews.apache.org/r/15261/diff/
> 
> 
> Testing
> -------
> 
> ant test-tez passes.
> ant test-e2e-tez passes.
> 
> I didn't add new test cases, but an e2e test case (Checkin_3) includes an 
> algebraic udf (count) following group-by. I also manually tested it on a live 
> cluster.
> 
> 
> Thanks,
> 
> Cheolsoo Park
> 
>

Re: Review Request 15261: PIG-3555 Initial implementation of Tez combiner optimization

Reply via email to