[
https://issues.apache.org/jira/browse/PIG-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohini Palaniswamy updated PIG-3814:
------------------------------------
Status: Patch Available (was: Open)
The Rank implementation in Tez differs from the MR implementation.
* The MR implementation has 1 map-only job (POCounter), which sets the current
task id at position 0 of the tuple and the local map task counter at position 1.
It also emits job counters for the number of records in that map task.
JobControlCompiler collects those counters, calculates the offsets, and launches
the next map-only job (PORank) with the offset information in the jobconf.
* The Tez implementation has 3 vertices. Vertex 1 outputs tuples from POCounter
to Vertex 3. It also outputs the counters to Vertex 2, which calculates the
offsets and broadcasts them to Vertex 3.
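The offset calculation that both implementations rely on can be sketched
roughly as below. This is only an illustration, not the code from the patch:
the class RankOffsetSketch and its methods are hypothetical names, and it
assumes task ids are dense integers starting at 0 and that the local counter
within a task starts at 1.

```java
/** Hypothetical sketch of the per-task offset calculation (not the patch code). */
final class RankOffsetSketch {

    /**
     * recordsPerTask[i] is the record count reported by counter task i.
     * Returns, for each task, the number of records in all earlier tasks
     * (an exclusive prefix sum).
     */
    static long[] computeOffsets(long[] recordsPerTask) {
        long[] offsets = new long[recordsPerTask.length];
        long runningTotal = 0;
        for (int taskId = 0; taskId < recordsPerTask.length; taskId++) {
            offsets[taskId] = runningTotal;
            runningTotal += recordsPerTask[taskId];
        }
        return offsets;
    }

    /**
     * Global rank of a record, given its task id and its local counter.
     * Assumes the local counter starts at 1 within each task.
     */
    static long globalRank(long[] offsets, int taskId, long localCounter) {
        return offsets[taskId] + localCounter;
    }
}
```

With record counts {3, 5, 2}, the offsets would be {0, 3, 8}, so the first
record of task 1 gets rank 4. In MR the offsets would reach PORank through the
jobconf; in Tez they would arrive through the broadcast from Vertex 2.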
Common (MR and Tez) performance optimizations made:
- Changed the task id to be an Integer instead of a String to reduce memory
overhead.
- Previously, POCounter set the current task id at position 0 of the tuple and
the counter at position 1. PORank then created a new tuple of size-1 to remove
the task id and copied over the rest, which is a lot of overhead. The task id is
now set as the last element of the tuple and is removed from the ArrayList
instead of doing a copy.
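The difference between the two ways of dropping the task id can be sketched
roughly as below. This uses a plain java.util.ArrayList rather than Pig's Tuple
classes, and the class TaskIdStripSketch and its method names are hypothetical,
so treat it as an illustration of the idea rather than the patch itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch comparing the two ways of dropping the task id. */
final class TaskIdStripSketch {

    /**
     * Old approach: the task id is at position 0, so a new list of
     * size-1 is allocated and every remaining field is copied over.
     */
    static List<Object> stripLeadingTaskIdByCopy(List<Object> fields) {
        List<Object> result = new ArrayList<>(fields.size() - 1);
        for (int i = 1; i < fields.size(); i++) {
            result.add(fields.get(i));
        }
        return result;
    }

    /**
     * New approach: the task id is the last element, so it can be removed
     * in place. Removing the last element of an ArrayList shifts nothing
     * and allocates nothing.
     */
    static void stripTrailingTaskId(ArrayList<Object> fields) {
        fields.remove(fields.size() - 1);
    }
}
```

The in-place variate mutates the list it is given, which is acceptable here only
under the assumption that nothing else still needs the task id on that tuple.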
> Implement RANK in Tez
> ---------------------
>
> Key: PIG-3814
> URL: https://issues.apache.org/jira/browse/PIG-3814
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: tez-branch
>
> Attachments: PIG-3814-1.patch, PIG-3814-2.patch
>
>