Rohini Palaniswamy created PIG-5040:
---------------------------------------

             Summary: Order by and CROSS partitioning is not deterministic due to usage of Random
                 Key: PIG-5040
                 URL: https://issues.apache.org/jira/browse/PIG-5040
             Project: Pig
          Issue Type: Bug
            Reporter: Rohini Palaniswamy
            Assignee: Rohini Palaniswamy
            Priority: Critical
             Fix For: 0.17.0, 0.16.1


Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
up successfully pulling partitions from the first run of the map while the 
other half pull from the rerun. If the Partitioner does not partition the data 
exactly the same way every time, this can lead to incorrect results (lost and 
duplicated records). The issue has existed for 8 years with order by and 
affects MapReduce as well, but it was found with Tez, where the frequency of 
reruns due to shuffle fetch failures is high (the order by partitioner gets its 
data from a 1-1 edge, so there are no retries and a shuffle fetch failure 
triggers a rerun immediately).
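
The failure mode above can be illustrated with a minimal sketch. This is not 
Pig's partitioner code or the actual fix; the function names, the task id 
string and the candidate reducer list are all hypothetical. It assumes a map 
task reads the same records in the same order on a rerun, so seeding the random 
generator from a stable task identifier reproduces the same assignments.

```python
import random
import zlib


def partition_unseeded(records, candidates):
    """Spread records over candidate reducers using an unseeded RNG.

    The generator is seeded from OS entropy, so a rerun of the same
    task can send the same record to a different reducer.
    """
    rng = random.Random()
    return [(rec, rng.choice(candidates)) for rec in records]


def partition_seeded(records, candidates, task_id):
    """Spread records over candidate reducers using a reproducible RNG.

    The seed is derived from a stable checksum of the (hypothetical)
    task id, so every attempt of the same task makes the same choices
    as long as the records arrive in the same order.
    """
    seed = zlib.crc32(task_id.encode("utf-8"))
    rng = random.Random(seed)
    return [(rec, rng.choice(candidates)) for rec in records]


# A skewed key whose records are spread over reducers 3, 4 and 5.
records = ["k1"] * 6
first_run = partition_seeded(records, [3, 4, 5], "map_000007")
rerun = partition_seeded(records, [3, 4, 5], "map_000007")
print(first_run == rerun)  # True
```

With the unseeded variant, reducers that fetch from the first run and reducers 
that fetch from the rerun can see different splits of the same records, which 
is how records end up lost or duplicated.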



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
