-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15261/
-----------------------------------------------------------
(Updated Nov. 6, 2013, 11:55 p.m.)
Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini
Palaniswamy.
Changes
-------
Uploaded a new patch that includes the following changes:
* Adds two Map<OperatorKey, TezEdgeDescriptor> maps to TezOperator.
* Adds combine plans to outbound edges (map/onfilesortedoutput) instead of
inbound edges (reduce/shufflemergeinput). This is the same as MR-Pig.
* Adds a few Pig-specific properties to the edge payload to make PigCombiner
work.
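To illustrate the first two bullets above, here is a minimal sketch of an
operator that keeps one descriptor map per edge direction and attaches the
combine plan to the outbound side. OpKey, EdgeDesc, OperatorSketch, and every
field and method name are hypothetical stand-ins rather than the classes in the
patch, and the combine plan is reduced to a plain string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Pig's OperatorKey; only identity matters here.
final class OpKey {
    final String scope;
    final long id;

    OpKey(String scope, long id) {
        this.scope = scope;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof OpKey)) {
            return false;
        }
        OpKey k = (OpKey) o;
        return id == k.id && scope.equals(k.scope);
    }

    @Override
    public int hashCode() {
        return 31 * scope.hashCode() + Long.hashCode(id);
    }
}

// Hypothetical stand-in for TezEdgeDescriptor: a combine plan plus edge properties.
final class EdgeDesc {
    String combinePlan; // assumption: the real class holds a physical plan, not a string
    final Map<String, String> properties = new HashMap<>();

    boolean hasCombiner() {
        return combinePlan != null;
    }
}

// Hypothetical stand-in for TezOperator with one map per edge direction.
final class OperatorSketch {
    final OpKey key;
    final Map<OpKey, EdgeDesc> inEdges = new HashMap<>();
    final Map<OpKey, EdgeDesc> outEdges = new HashMap<>();

    OperatorSketch(OpKey key) {
        this.key = key;
    }

    // Connects from -> to, registers a descriptor on both sides,
    // and returns the outbound descriptor so a combine plan can be set on it.
    static EdgeDesc connect(OperatorSketch from, OperatorSketch to) {
        EdgeDesc out = new EdgeDesc();
        from.outEdges.put(to.key, out);
        to.inEdges.put(from.key, new EdgeDesc());
        return out;
    }
}
```

In this sketch, only the descriptor returned by connect() would carry the
combine plan; the inbound descriptor on the successor stays without one.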
I still have to go through Mark's comments, but with this patch, combiners seem
to work now. I can see counters in the task logs as follows:
Combine input records=3, Combine output records=8
Bugs: PIG-3555
https://issues.apache.org/jira/browse/PIG-3555
Repository: pig-git
Description
-------
Initial implementation of the Tez combiner optimizer. The patch includes the
following changes:
* Factored out the CombinerOptimizer code into a utility class called
CombinerOptimizerUtil, so that both the MR and Tez CombinerOptimizer use this
utility class instead of duplicating code.
* Introduced a new class called TezEdgeDescriptor that holds combine plans as
well as various edge properties.
* Added TezEdgeDescriptors to TezOperator. Note that I added multiple
descriptors for inbound edges but a single descriptor for all the outbound
edges. This is because TezDagBuilder always creates an edge by connecting
predecessors to the current vertex. Please let me know if you think we should
allow multiple descriptors for outbound edges too.
* Refactored some code in TezDagBuilder while touching it.
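The refactoring in the first bullet above follows a common pattern: one static
helper decides whether a combine plan can be built, and two thin optimizers
store the result wherever their backend needs it. The sketch below shows that
pattern only. All class and method names are hypothetical, the list of
algebraic functions is an assumption made for the example, and the real
CombinerOptimizerUtil operates on Pig physical plans rather than strings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical shared helper standing in for CombinerOptimizerUtil.
final class CombinerUtilSketch {
    // Assumption for this sketch: treat only these functions as algebraic.
    private static final Set<String> ALGEBRAIC =
            new HashSet<>(Arrays.asList("COUNT", "SUM", "MIN", "MAX"));

    private CombinerUtilSketch() {
    }

    // Returns a combine-plan label if every function after group-by is algebraic.
    static Optional<String> buildCombinePlan(List<String> funcsAfterGroupBy) {
        if (funcsAfterGroupBy.isEmpty()) {
            return Optional.empty();
        }
        for (String f : funcsAfterGroupBy) {
            if (!ALGEBRAIC.contains(f)) {
                return Optional.empty();
            }
        }
        return Optional.of("combine" + funcsAfterGroupBy);
    }
}

// Hypothetical MR-side optimizer: stores the plan on the operator itself.
final class MrOptimizerSketch {
    String combinePlan;

    void visit(List<String> funcs) {
        CombinerUtilSketch.buildCombinePlan(funcs).ifPresent(p -> combinePlan = p);
    }
}

// Hypothetical Tez-side optimizer: stores the plan on an edge descriptor field.
final class TezOptimizerSketch {
    String edgeCombinePlan;

    void visit(List<String> funcs) {
        CombinerUtilSketch.buildCombinePlan(funcs).ifPresent(p -> edgeCombinePlan = p);
    }
}
```

Both optimizers call the same helper, so a change to the combinability check
would only need to be made in one place.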
Diffs (updated)
-----
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 18a382b
src/org/apache/pig/backend/hadoop/executionengine/tez/CombinerOptimizer.java e69de29
src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0b1f3c9
src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 45e47b0
src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java e69de29
src/org/apache/pig/backend/hadoop/executionengine/tez/TezLauncher.java 3f14644
src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java e612d88
src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java 5a42ded
src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java e69de29
test/org/apache/pig/test/data/GoldenFiles/TEZC1.gld 925f07e
test/org/apache/pig/test/data/GoldenFiles/TEZC2.gld a3974fe
test/org/apache/pig/test/data/GoldenFiles/TEZC3.gld a8c942b
test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld fb7c903
test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld e6cd25e
Diff: https://reviews.apache.org/r/15261/diff/
Testing
-------
ant test-tez passes.
ant test-e2e-tez passes.
I didn't add new test cases, but an existing e2e test case (Checkin_3) includes
an algebraic UDF (COUNT) following a group-by. I also tested the patch manually
on a live cluster.
Thanks,
Cheolsoo Park