-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15261/
-----------------------------------------------------------

(Updated Nov. 8, 2013, 7:57 a.m.)


Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini 
Palaniswamy.


Changes
-------

Updated patch includes the following improvements-
# CombinerOptimizer stores TezOperPlan (parent plan of TezOperators), so I can 
retrieve predecessor vertices via getPredecessors(). To detect the 
POLocalRearrange -> POPackage pattern, 1) I find POPackage in the current 
vertex, and 2) I look up predecessors to find POLocalRearrange. If I find them 
in both 1 and 2, I add combiners.
# Combine plan is set to both input and output descriptors (as per Rohini's 
suggestion). Since they're identical, I do not print them twice in the explain 
output.
# TestCombiner unit tests pass with TezMiniCluster. The unit tests remain 
almost unchanged except that 1) PigServer dynamically takes a exec type via a 
system property, and 2) PigServer.shutdown() is called after each test case to 
delete temporary files.

Only remaining issues that I haven't addressed are what's related to multiple 
input/outputs. I am wondering whether we can resolve them in a separate jira 
given that Mark and Rohini are actively working on them. What do others think?


Bugs: PIG-3555
    https://issues.apache.org/jira/browse/PIG-3555


Repository: pig-git


Description
-------

Initial implementation of Tez combiner optimizer. The patch includes the 
following changes-
* Factored out CombinerOptimizer code into a utility class called 
CombinerOptimizerUtil. So both MR and Tez CombinerOptimizer use this utility 
class instead of duplicating code.
* Introduced a new class called TezEdgeDescriptor that holds combine plans as 
well as various edge properties.
* Added TezEdgeDescriptors to TezOperator. Note that I added multiple 
descriptors for inbound edges but a single descriptor for all the outbound 
edges. This is because TezDagBuilder always creates an edge by connecting 
predecessors to the current vertex. Please let me know if you think we should 
allow multiple descriptors for outbound edges too.
* Refactored some code in TezDagBuilder while touching it.


Diffs (updated)
-----

  src/org/apache/pig/PigServer.java c0826ea 
  
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
 18a382b 
  
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRExecType.java
 07d737d 
  src/org/apache/pig/backend/hadoop/executionengine/tez/CombinerOptimizer.java 
e69de29 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 
0b1f3c9 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 
45e47b0 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java 
e69de29 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java 
3f14644 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java 5a42ded 
  
src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java
 e69de29 
  test/org/apache/pig/test/TestCombiner.java 6252b51 
  test/org/apache/pig/test/data/GoldenFiles/TEZC1.gld 925f07e 
  test/org/apache/pig/test/data/GoldenFiles/TEZC2.gld a3974fe 
  test/org/apache/pig/test/data/GoldenFiles/TEZC3.gld a8c942b 
  test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld fb7c903 
  test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld e6cd25e 
  test/tez-tests c22a448 

Diff: https://reviews.apache.org/r/15261/diff/


Testing
-------

ant test-tez passes.
ant test-e2e-tez passes.

I didn't add new test cases, but an e2e test case (Checkin_3) includes an 
algebraic udf (count) following group-by. I also manually tested it on a live 
cluster.


Thanks,

Cheolsoo Park

Reply via email to