-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15261/
-----------------------------------------------------------
(Updated Nov. 8, 2013, 7:57 a.m.)
Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini
Palaniswamy.
Changes
-------
Updated patch includes the following improvements-
# CombinerOptimizer stores TezOperPlan (parent plan of TezOperators), so I can
retrieve predecessor vertices via getPredecessors(). To detect the
POLocalRearrange -> POPackage pattern, 1) I find POPackage in the current
vertex, and 2) I look up predecessors to find POLocalRearrange. If I find them
in both 1 and 2, I add combiners.
# Combine plan is set to both input and output descriptors (as per Rohini's
suggestion). Since they're identical, I do not print them twice in the explain
output.
# TestCombiner unit tests pass with TezMiniCluster. The unit tests remain
almost unchanged except that 1) PigServer dynamically takes a exec type via a
system property, and 2) PigServer.shutdown() is called after each test case to
delete temporary files.
Only remaining issues that I haven't addressed are what's related to multiple
input/outputs. I am wondering whether we can resolve them in a separate jira
given that Mark and Rohini are actively working on them. What do others think?
Bugs: PIG-3555
https://issues.apache.org/jira/browse/PIG-3555
Repository: pig-git
Description
-------
Initial implementation of Tez combiner optimizer. The patch includes the
following changes-
* Factored out CombinerOptimizer code into a utility class called
CombinerOptimizerUtil. So both MR and Tez CombinerOptimizer use this utility
class instead of duplicating code.
* Introduced a new class called TezEdgeDescriptor that holds combine plans as
well as various edge properties.
* Added TezEdgeDescriptors to TezOperator. Note that I added multiple
descriptors for inbound edges but a single descriptor for all the outbound
edges. This is because TezDagBuilder always creates an edge by connecting
predecessors to the current vertex. Please let me know if you think we should
allow multiple descriptors for outbound edges too.
* Refactored some code in TezDagBuilder while touching it.
Diffs (updated)
-----
src/org/apache/pig/PigServer.java c0826ea
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
18a382b
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRExecType.java
07d737d
src/org/apache/pig/backend/hadoop/executionengine/tez/CombinerOptimizer.java
e69de29
src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
0b1f3c9
src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java
45e47b0
src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java
e69de29
src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java
3f14644
src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java 5a42ded
src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java
e69de29
test/org/apache/pig/test/TestCombiner.java 6252b51
test/org/apache/pig/test/data/GoldenFiles/TEZC1.gld 925f07e
test/org/apache/pig/test/data/GoldenFiles/TEZC2.gld a3974fe
test/org/apache/pig/test/data/GoldenFiles/TEZC3.gld a8c942b
test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld fb7c903
test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld e6cd25e
test/tez-tests c22a448
Diff: https://reviews.apache.org/r/15261/diff/
Testing
-------
ant test-tez passes.
ant test-e2e-tez passes.
I didn't add new test cases, but an e2e test case (Checkin_3) includes an
algebraic udf (count) following group-by. I also manually tested it on a live
cluster.
Thanks,
Cheolsoo Park