-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19724/
-----------------------------------------------------------

Review request for pig, Cheolsoo Park and Daniel Dai.


Bugs: PIG-3814
    https://issues.apache.org/jira/browse/PIG-3814


Repository: pig


Description
-------

The Rank implementation in Tez differs from the MR implementation.
  * The MR implementation has one map-only job (POCounter), which sets the
current taskId at position 0 of the tuple and the local map task counter at
position 1. It also emits job counters for the number of records in that map
task. JobControlCompiler collects those counters, calculates the offsets, and
launches the next map-only job (PORank) with the offset information in the
jobconf.
  * The Tez implementation has three vertices. Vertex 1 outputs tuples from
POCounter to Vertex 3. It also outputs the counters to Vertex 2, which
calculates the offsets and broadcasts them to Vertex 3.
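
The offset arithmetic that both implementations rely on can be sketched as
follows. This is a minimal illustration, not code from the patch: the class
and method names are hypothetical, task ids are assumed to be dense integers
starting at 0, and the local counter is assumed to start at 1.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the patch.
final class RankOffsetSketch {

    // counts[i] is the number of records emitted by task i (the per-task
    // counter). offsets[i] is the total emitted by tasks 0..i-1.
    static long[] computeOffsets(long[] counts) {
        long[] offsets = new long[counts.length];
        long running = 0;
        for (int i = 0; i < counts.length; i++) {
            offsets[i] = running;
            running += counts[i];
        }
        return offsets;
    }

    // Global rank of a record = offset of its task + its local counter.
    // Assumes the local counter is 1-based.
    static long globalRank(long[] offsets, int taskId, long localCounter) {
        return offsets[taskId] + localCounter;
    }

    public static void main(String[] args) {
        long[] offsets = computeOffsets(new long[] {3, 5, 2});
        System.out.println(Arrays.toString(offsets));   // [0, 3, 8]
        System.out.println(globalRank(offsets, 1, 2));  // 5
    }
}
```

In the MR flow this calculation would happen between the two jobs; in the Tez
flow it corresponds to the work done in Vertex 2 before the broadcast.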

Other changes made:
   - Changed taskid to be Integer instead of String to reduce memory overhead.

Possible optimizations:
   - POCounter sets the current taskId at position 0 of the tuple and the
counter at position 1. PORank creates a new tuple of size - 1 to remove the
task id and copies over the rest, which is a lot of overhead. We could instead
set the task id as the last element of the tuple and remove it from the
ArrayList, avoiding the array copy. Will create a separate jira for that.
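
A rough sketch of the difference between the two layouts, using a plain
ArrayList in place of a Pig tuple. The class and method names are
hypothetical and only illustrate the idea; the actual change would be made in
POCounter and PORank.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a plain list stands in for a Pig tuple.
final class TaskIdPositionSketch {

    // Current layout: [taskId, counter, field...]. Dropping the task id
    // means allocating a new list and copying every remaining element.
    static List<Object> dropLeadingTaskId(List<Object> fields) {
        List<Object> out = new ArrayList<>(fields.size() - 1);
        for (int i = 1; i < fields.size(); i++) {
            out.add(fields.get(i));
        }
        return out;
    }

    // Proposed layout: [counter, field..., taskId]. Removing the last
    // element of an ArrayList shifts nothing and allocates nothing.
    static Object dropTrailingTaskId(ArrayList<Object> fields) {
        return fields.remove(fields.size() - 1);
    }
}
```

With the trailing layout the tuple is modified in place, so the cost of
stripping the task id no longer grows with the number of fields.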


Diffs
-----

  http://svn.apache.org/repos/asf/pig/branches/tez/ivy/libraries.properties 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigMapReduceCounter.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCounter.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORank.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/POValueOutputTez.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezEdgeDescriptor.java 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/TezTaskConfigurable.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/POCounterStatsTez.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/POCounterTez.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/src/org/apache/pig/backend/hadoop/executionengine/tez/operators/PORankTez.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/test/e2e/pig/drivers/TestDriverPig.pm 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/test/e2e/pig/tests/nightly.conf 1582317
  http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/data/GoldenFiles/TEZC20.gld PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/test/data/GoldenFiles/TEZC21.gld PRE-CREATION
  http://svn.apache.org/repos/asf/pig/branches/tez/test/org/apache/pig/tez/TestTezCompiler.java 1582317

Diff: https://reviews.apache.org/r/19724/diff/


Testing
-------

Enabled the Rank e2e tests for Tez. All pass except Rank 9 and Rank 11. Rank 9
hits a Tez map output data corruption issue that is yet to be investigated.
Rank 11 fails because of an issue with SPLIT, and the cause is known: the
output keys need to be updated in MultiQueryOptimizerTez after Tez operators
have been merged. That is already done for POFRJoinTez, but I am trying to
think of a generic way to do this (new interfaces to get input keys and output
keys) so that we don't have to add every operator to MultiQueryOptimizerTez.
Will do that in a separate jira.


Thanks,

Rohini Palaniswamy
