Rohini Palaniswamy created PIG-5041:
Summary: RoundRobinPartitioner is not deterministic when order of
input records change
Issue Type: Bug
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
Fix For: 0.16.1
Maps can be rerun due to shuffle fetch failures. Half of the reducers can end
up successfully pulling partitions from first run of the map while other half
could pull from the rerun after shuffle fetch failures. If the data is not
partitioned by the Partitioner exactly the same way every time then it could
lead to incorrect results (loss of records and duplicated records).
There is a good probability of order of input records changing
- With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted
but values can be in any order as the shuffle and merge depends on the order in
which inputs are fetched. Anything involving FLATTEN can produce different
order of output records.
- With UnorderedKVInput, the records could be in any order depending on
order of shuffle fetch.
RoundRobinPartitioner can partition records differently everytime as order of
input records change which is very bad. We need to get rid of
RoundRobinPartitioner. Since the key is empty whenever we use
RoundRobinPartitioner we need to partitioning based on hashcode of values to
produce consistent partitioning. It adds a lot of performance overhead, but
required for correctness.
This message was sent by Atlassian JIRA