[ https://issues.apache.org/jira/browse/PIG-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-5041:
------------------------------------
    Attachment: PIG-5041-1.patch

> RoundRobinPartitioner is not deterministic when order of input records change
> -----------------------------------------------------------------------------
>
>                 Key: PIG-5041
>                 URL: https://issues.apache.org/jira/browse/PIG-5041
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>            Priority: Critical
>             Fix For: 0.16.1
>
>         Attachments: PIG-5041-1.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end
> up successfully pulling partitions from the first run of the map, while the
> other half could pull from the rerun after shuffle fetch failures. If the
> Partitioner does not partition the data exactly the same way every time, this
> could lead to incorrect results (loss of records and duplicated records).
> There is a good probability of the order of input records changing:
>     - With OrderedGroupedMergedKVInput (shuffle input), the keys are sorted
> but the values can be in any order, as the shuffle and merge depend on the
> order in which inputs are fetched. Anything involving FLATTEN can produce a
> different order of output records.
>     - With UnorderedKVInput, the records could be in any order depending on
> the order of shuffle fetch.
> RoundRobinPartitioner can partition records differently every time the order
> of input records changes, which is very bad. We need to get rid of
> RoundRobinPartitioner. Since the key is empty whenever we use
> RoundRobinPartitioner, we need to partition based on the hashcode of the
> values to produce consistent partitioning. This adds a lot of performance
> overhead, but it is required for correctness.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
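
To illustrate the problem described in the issue, here is a minimal, self-contained Java sketch. It is not the code from PIG-5041-1.patch and does not use the Pig or Hadoop Partitioner API; the class and method names (PartitionDemo, roundRobin, hashOfValue) are hypothetical, and it assumes distinct String values. It compares the partition each value receives on a first run and on a rerun that delivers the same values in a different order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of Pig.
public class PartitionDemo {

    // Round-robin: the partition depends only on the arrival position of a value.
    public static Map<String, Integer> roundRobin(List<String> values, int numPartitions) {
        Map<String, Integer> assignment = new HashMap<>();
        int next = 0;
        for (String value : values) {
            assignment.put(value, next);
            next = (next + 1) % numPartitions;
        }
        return assignment;
    }

    // Hash of value: the partition depends only on the value itself.
    public static Map<String, Integer> hashOfValue(List<String> values, int numPartitions) {
        Map<String, Integer> assignment = new HashMap<>();
        for (String value : values) {
            // Mask the sign bit so the modulo result is never negative.
            assignment.put(value, (value.hashCode() & Integer.MAX_VALUE) % numPartitions);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> firstRun = Arrays.asList("a", "b", "c", "d");
        List<String> rerun = Arrays.asList("d", "c", "b", "a"); // same records, new order

        // Round-robin assigns "b" to partition 1 on the first run but 2 on the rerun.
        System.out.println(roundRobin(firstRun, 3).equals(roundRobin(rerun, 3)));   // false
        // Hash-based assignment is identical regardless of input order.
        System.out.println(hashOfValue(firstRun, 3).equals(hashOfValue(rerun, 3))); // true
    }
}
```

With round-robin, a reducer that fetched partition 1 from the first run and another that fetched partition 2 from the rerun would both receive "b", which is the duplication the issue warns about. Hashing the value avoids this at the cost of computing a hash per record.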