[ https://issues.apache.org/jira/browse/PIG-5040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-5040:
------------------------------------
Attachment: PIG-5040-1.patch
PIG-5040-1-nowhitespacechanges.patch
> Order by and CROSS partitioning is not deterministic due to usage of Random
> ---------------------------------------------------------------------------
>
> Key: PIG-5040
> URL: https://issues.apache.org/jira/browse/PIG-5040
> Project: Pig
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Priority: Critical
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5040-1-nowhitespacechanges.patch, PIG-5040-1.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end
> up successfully pulling partitions from the first run of the map, while the
> other half pull from the rerun after the shuffle fetch failures. If the
> Partitioner does not partition the data exactly the same way every time, this
> can lead to incorrect results (lost records and duplicated records). The issue
> has existed for 8 years with order by and also affects MapReduce, but it was
> found with Tez, where reruns due to shuffle fetch failures are frequent (the
> order by partitioner gets its data from a 1-1 edge, so there are no retries
> and shuffle fetch failures trigger a rerun immediately).
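
To illustrate the failure mode described above, here is a minimal Java sketch. It is not taken from the attached patch and does not use Pig's or Hadoop's Partitioner classes. The class name, the method names, and the idea of deriving the seed from a stable task identifier are all assumptions made for illustration. It simulates a map attempt spreading records across reducers: an unseeded java.util.Random can assign the same record to a different reducer on a rerun, while a Random seeded with a value that is identical across attempts reproduces the same assignment.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch; these names do not exist in Pig.
class DeterministicPartitionSketch {

    // Simulates one map attempt that picks a reducer for each record using an
    // unseeded Random. Each attempt gets a different seed, so a rerun is very
    // likely to route the same record to a different reducer.
    static int[] assignUnseeded(int numRecords, int numReducers) {
        Random random = new Random();
        int[] partitions = new int[numRecords];
        for (int i = 0; i < numRecords; i++) {
            partitions[i] = random.nextInt(numReducers);
        }
        return partitions;
    }

    // Same simulation, but the seed is assumed to come from something stable
    // across attempts of the same task (for example a task index). Records
    // must also arrive in the same order for the assignment to repeat.
    static int[] assignSeeded(long stableTaskSeed, int numRecords, int numReducers) {
        Random random = new Random(stableTaskSeed);
        int[] partitions = new int[numRecords];
        for (int i = 0; i < numRecords; i++) {
            partitions[i] = random.nextInt(numReducers);
        }
        return partitions;
    }

    public static void main(String[] args) {
        int[] firstRun = assignSeeded(42L, 8, 4);
        int[] rerun = assignSeeded(42L, 8, 4);
        // Same seed and same record order give the same partition sequence.
        System.out.println(Arrays.equals(firstRun, rerun)); // prints "true"

        int[] unseededFirst = assignUnseeded(8, 4);
        int[] unseededRerun = assignUnseeded(8, 4);
        // Usually differs between attempts, which is the bug described above.
        System.out.println(Arrays.toString(unseededFirst));
        System.out.println(Arrays.toString(unseededRerun));
    }
}
```

With the seeded variant, reducers that fetch from the first attempt and reducers that fetch from the rerun see a consistent split of the map output. With the unseeded variant, a record can land in a partition that was already fetched from the other attempt (a duplicate) or in none of the fetched partitions (a loss).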
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)