-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19724/
-----------------------------------------------------------
(Updated March 27, 2014, 5:27 p.m.)
Review request for pig, Cheolsoo Park and Daniel Dai.
Changes
-------
Fixed TestCombiner failure and TestTezCompiler failure for union after the skew
fix.
Also optimized PORank by avoiding the tuple copy in this jira itself, as it was
a simple enough change. Ran the Rank e2e tests for both MR (Rank_8 failed due to
a multiquery issue that will be fixed by Mark's POPackage refactor) and Tez.
Also ran TestRank1, TestRank2 and TestRank3 (local mode) to ensure this change
does not cause failures.
Bugs: PIG-3814
https://issues.apache.org/jira/browse/PIG-3814
Repository: pig
Description (updated)
-------
The Rank implementation in Tez is different from the MR implementation.
* The MR implementation has 1 map-only job (POCounter) which sets the current
taskId at position 0 of the tuple and the local map task counter at position 1.
It also emits job counters for the number of records in that map task.
JobControlCompiler collects those, calculates the offsets and launches the next
map-only job (PORank) with that offset information in the jobconf.
* The Tez implementation has 3 vertices. Vertex 1 outputs tuples from POCounter
to Vertex 3. It also outputs the counters to Vertex 2, which calculates the
offsets and broadcasts them to Vertex 3.
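The offset calculation that both implementations rely on can be sketched as a
prefix sum over the per-task record counts. This is only an illustration, not
the code in the patch: the class and method names are hypothetical, and it
assumes the local counter is 1-based so that the global rank is the task offset
plus the local counter.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Pig: illustrates the offset step only.
class RankOffsets {

    // For each task id, returns the number of records emitted by all tasks
    // with a smaller task id (a prefix sum in task id order).
    static Map<Integer, Long> computeOffsets(Map<Integer, Long> countsByTask) {
        Map<Integer, Long> offsets = new TreeMap<>();
        long running = 0L;
        // TreeMap iterates in ascending task id order.
        for (Map.Entry<Integer, Long> e : new TreeMap<>(countsByTask).entrySet()) {
            offsets.put(e.getKey(), running);
            running += e.getValue();
        }
        return offsets;
    }

    // Assumption: localCounter starts at 1 within each task.
    static long globalRank(Map<Integer, Long> offsets, int taskId, long localCounter) {
        return offsets.get(taskId) + localCounter;
    }
}
```

With counts of 3, 5 and 2 records for tasks 0, 1 and 2, the offsets come out as
0, 3 and 8, so the first record of task 1 gets rank 4. In MR the offsets would
travel through the jobconf, in Tez through the broadcast from Vertex 2.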
Common (MR and Tez) perf optimizations made:
- Changed taskid to be an Integer instead of a String to reduce memory overhead.
- POCounter used to set the current taskId at position 0 of the tuple and the
counter at position 1. PORank then created a new tuple of size-1 to remove the
task id and copied over the rest, which is a lot of overhead. The task id is now
set as the last element of the tuple and is removed from the arraylist instead
of doing a copy.
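The difference between the two approaches can be sketched with a plain
java.util.List standing in for the tuple's backing arraylist. This is an
illustration under assumptions, not the patch itself: the class and method
names are hypothetical, and the position of the counter in the new layout is
not shown here since only the task id placement matters for the copy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: contrasts copy vs. in-place removal.
class TaskIdStrip {

    // Old layout: task id at position 0. Dropping it means allocating a new
    // list of size-1 and copying every remaining field.
    static List<Object> stripByCopy(List<Object> fields) {
        List<Object> out = new ArrayList<>(fields.size() - 1);
        for (int i = 1; i < fields.size(); i++) {
            out.add(fields.get(i));
        }
        return out;
    }

    // New layout: task id is the last element. Removing the tail of an
    // ArrayList shifts no elements and allocates nothing.
    static int stripInPlace(List<Object> fields) {
        return (Integer) fields.remove(fields.size() - 1);
    }
}
```

Removing the last element of an ArrayList is a constant-time operation, while
the copy touches every field of every tuple, which is where the overhead in the
old layout came from.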
Diffs (updated)
-----
http://svn.apache.org/repos/asf/pig/branches/tez/ivy/libraries.properties
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigMapReduceCounter.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCounter.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORank.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/POValueOutputTez.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezTaskConfigurable.java
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/POCounterStatsTez.java
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/POCounterTez.java
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/PORankTez.java
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/test/e2e/pig/drivers/TestDriverPig.pm
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/test/e2e/pig/tests/nightly.conf
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/TestCombiner.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/data/GoldenFiles/TEZC19.gld
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/data/GoldenFiles/TEZC20.gld
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/data/GoldenFiles/TEZC21.gld
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/tez/TestTezCompiler.java
1582317
Diff: https://reviews.apache.org/r/19724/diff/
Testing
-------
Enabled Rank e2e tests for Tez. All pass except Rank 9 and 11. Rank 9 hits some
Tez map output data corruption issue that is yet to be investigated. Rank 11 is
an issue with SPLIT, and the reason is known: the output keys need to be updated
in MultiQueryOptimizerTez after Tez operators have been merged. That is already
done for POFRJoinTez, but I am trying to think of a generic way to do this (new
interfaces to get input keys and output keys), so that we don't have to add
every operator to MultiQueryOptimizerTez. Will do that in a separate jira.
Thanks,
Rohini Palaniswamy