[ 
https://issues.apache.org/jira/browse/PIG-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5041:
------------------------------------
    Description: 
Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
up successfully pulling partitions from first run of the map while other half 
could pull from the rerun after shuffle fetch failures. If the data is not 
partitioned by the Partitioner exactly the same way every time then it could 
lead to incorrect results (loss of records and duplicated records). 

There is a good probability of order of input records changing
    - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted 
but values can be in any order as the shuffle and merge depends on the order in 
which inputs are fetched. Anything involving FLATTEN can produce different 
order of output records.
    - With UnorderedKVInput, the records could be in any order depending on 
order of shuffle fetch. 

RoundRobinPartitioner can partition records differently everytime as order of 
input records change which is very bad. We need to get rid of 
RoundRobinPartitioner. Since the key is empty whenever we use  
RoundRobinPartitioner we need to partitioning based on hashcode of values to 
produce consistent partitioning.

Partitioning based on hashcode is required for correctness, but disadvantage is 
that it
    - adds a lot of performance overhead with hashcode computation
    - with the random distribution due to hashcode (as opposed to batched round 
robin) input records sorted on some column could get distributed to different 
reducers and if union is followed by a store, the output can have bad 
compression.


  was:
Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
up successfully pulling partitions from first run of the map while other half 
could pull from the rerun after shuffle fetch failures. If the data is not 
partitioned by the Partitioner exactly the same way every time then it could 
lead to incorrect results (loss of records and duplicated records). 

There is a good probability of order of input records changing
    - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted 
but values can be in any order as the shuffle and merge depends on the order in 
which inputs are fetched. Anything involving FLATTEN can produce different 
order of output records.
    - With UnorderedKVInput, the records could be in any order depending on 
order of shuffle fetch. 

RoundRobinPartitioner can partition records differently everytime as order of 
input records change which is very bad. We need to get rid of 
RoundRobinPartitioner. Since the key is empty whenever we use  
RoundRobinPartitioner we need to partitioning based on hashcode of values to 
produce consistent partitioning. It adds a lot of performance overhead, but 
required for correctness.



> RoundRobinPartitioner is not deterministic when order of input records change
> -----------------------------------------------------------------------------
>
>                 Key: PIG-5041
>                 URL: https://issues.apache.org/jira/browse/PIG-5041
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>            Priority: Critical
>             Fix For: 0.16.1
>
>         Attachments: PIG-5041-1.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
> up successfully pulling partitions from first run of the map while other half 
> could pull from the rerun after shuffle fetch failures. If the data is not 
> partitioned by the Partitioner exactly the same way every time then it could 
> lead to incorrect results (loss of records and duplicated records). 
> There is a good probability of order of input records changing
>     - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted 
> but values can be in any order as the shuffle and merge depends on the order 
> in which inputs are fetched. Anything involving FLATTEN can produce different 
> order of output records.
>     - With UnorderedKVInput, the records could be in any order depending on 
> order of shuffle fetch. 
> RoundRobinPartitioner can partition records differently everytime as order of 
> input records change which is very bad. We need to get rid of 
> RoundRobinPartitioner. Since the key is empty whenever we use  
> RoundRobinPartitioner we need to partitioning based on hashcode of values to 
> produce consistent partitioning.
> Partitioning based on hashcode is required for correctness, but disadvantage 
> is that it
>     - adds a lot of performance overhead with hashcode computation
>     - with the random distribution due to hashcode (as opposed to batched 
> round robin) input records sorted on some column could get distributed to 
> different reducers and if union is followed by a store, the output can have 
> bad compression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to