-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19724/
-----------------------------------------------------------
(Updated March 27, 2014, 5:30 p.m.)
Review request for pig, Cheolsoo Park and Daniel Dai.
Bugs: PIG-3814
https://issues.apache.org/jira/browse/PIG-3814
Repository: pig
Description
-------
The Rank implementation in Tez differs from the MR implementation.
* The MR implementation has one map-only job (POCounter) that sets the current
task id at position 0 of the tuple and the local map task counter at position 1.
It also emits job counters for the number of records in that map task.
JobControlCompiler collects those counters, calculates the offsets, and launches
the next map-only job (PORank) with the offset information in the jobconf.
* The Tez implementation has three vertices. Vertex 1 outputs tuples from
POCounter to Vertex 3. It also outputs the counters to Vertex 2, which
calculates the offsets and broadcasts them to Vertex 3.
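The offset calculation that both implementations rely on can be pictured as an exclusive prefix sum over the per-task record counts. The following is only a minimal sketch, not the code in this patch: the class and method names are hypothetical, task ids are assumed to be dense integers starting at 0, and the local counter is assumed to be added directly to the task's offset.

```java
// Hypothetical illustration only; none of these names come from the patch.
final class RankOffsetSketch {

    /**
     * Exclusive prefix sum: offsets[i] is the total number of records
     * counted by tasks 0..i-1 (assumes dense task ids starting at 0).
     */
    static long[] computeOffsets(long[] recordsPerTask) {
        long[] offsets = new long[recordsPerTask.length];
        long runningTotal = 0;
        for (int taskId = 0; taskId < recordsPerTask.length; taskId++) {
            offsets[taskId] = runningTotal;
            runningTotal += recordsPerTask[taskId];
        }
        return offsets;
    }

    /** Global value = offset of the producing task + that task's local counter. */
    static long globalRank(long[] offsets, int taskId, long localCounter) {
        return offsets[taskId] + localCounter;
    }
}
```

In MR this step would correspond to what JobControlCompiler does between the two jobs; in Tez it would correspond to Vertex 2, with the resulting offsets broadcast to Vertex 3.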
Common (MR and Tez) performance optimizations:
- Changed the task id to be an Integer instead of a String to reduce memory
overhead.
- POCounter sets the current task id at position 0 of the tuple and the counter
at position 1. PORank creates a new tuple of size-1 to remove the task id and
copies over the rest, which is a lot of overhead. This change sets the task id
as the last element of the tuple and removes it from the ArrayList instead of
doing a copy.
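To illustrate why moving the task id to the end helps, here is a minimal sketch that uses a plain java.util.List in place of Pig's tuple. The class and method names are hypothetical and the exact field layout is an assumption; the point is only the difference between the two removal strategies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a List<Object> stands in for a Pig tuple.
final class TaskIdStripSketch {

    /**
     * Old layout (task id first): dropping index 0 by allocating a new
     * list of size-1 and copying every remaining field.
     */
    static List<Object> stripLeadingTaskIdByCopy(List<Object> fields) {
        List<Object> out = new ArrayList<>(fields.size() - 1);
        for (int i = 1; i < fields.size(); i++) {
            out.add(fields.get(i));
        }
        return out;
    }

    /**
     * New layout (task id last): removing the final element of an
     * ArrayList happens in place, with no allocation and no shifting.
     */
    static Object stripTrailingTaskId(List<Object> fields) {
        return fields.remove(fields.size() - 1);
    }
}
```

The copy variant costs one allocation plus a pass over all fields for every tuple, while the trailing removal is constant time per tuple.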
Diffs
-----
http://svn.apache.org/repos/asf/pig/branches/tez/ivy/libraries.properties
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigMapReduceCounter.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCounter.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORank.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/POValueOutputTez.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezTaskConfigurable.java
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/POCounterStatsTez.java
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/POCounterTez.java
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/PORankTez.java
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/test/e2e/pig/drivers/TestDriverPig.pm
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/test/e2e/pig/tests/nightly.conf
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/TestCombiner.java
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/data/GoldenFiles/TEZC19.gld
1582317
http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/data/GoldenFiles/TEZC20.gld
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/data/GoldenFiles/TEZC21.gld
PRE-CREATION
http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/tez/TestTezCompiler.java
1582317
Diff: https://reviews.apache.org/r/19724/diff/
Testing (updated)
-------
Enabled the Rank e2e tests for Tez. All pass except Rank 9 and Rank 11. Rank 9
hits a Tez map output data corruption issue that I have yet to investigate.
Rank 11 is an issue with SPLIT, and the reason is known: the input keys need to
be updated in MultiQueryOptimizerTez after Tez operators have been merged. That
is already done for POFRJoinTez, but I am trying to think of a generic way to do
this (new interfaces to get input keys and output keys) so that we don't have to
add every operator to MultiQueryOptimizerTez. Will do that in a separate JIRA.
Thanks,
Rohini Palaniswamy
