[ 
https://issues.apache.org/jira/browse/PIG-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5041:
------------------------------------
    Attachment: PIG-5041-1.patch

> RoundRobinPartitioner is not deterministic when order of input records change
> -----------------------------------------------------------------------------
>
>                 Key: PIG-5041
>                 URL: https://issues.apache.org/jira/browse/PIG-5041
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>            Priority: Critical
>             Fix For: 0.16.1
>
>         Attachments: PIG-5041-1.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
> up successfully pulling partitions from first run of the map while other half 
> could pull from the rerun after shuffle fetch failures. If the data is not 
> partitioned by the Partitioner exactly the same way every time then it could 
> lead to incorrect results (loss of records and duplicated records). 
> There is a good probability of order of input records changing
>     - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted 
> but values can be in any order as the shuffle and merge depends on the order 
> in which inputs are fetched. Anything involving FLATTEN can produce different 
> order of output records.
>     - With UnorderedKVInput, the records could be in any order depending on 
> order of shuffle fetch. 
> RoundRobinPartitioner can partition records differently everytime as order of 
> input records change which is very bad. We need to get rid of 
> RoundRobinPartitioner. Since the key is empty whenever we use  
> RoundRobinPartitioner we need to partitioning based on hashcode of values to 
> produce consistent partitioning. It adds a lot of performance overhead, but 
> required for correctness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to