[
https://issues.apache.org/jira/browse/PIG-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833204#action_12833204
]
Richard Ding commented on PIG-1169:
-----------------------------------
This bug occurs because the output of the order by statement is consumed by
both the store (to 'full') and the limit statements, so the limit operation
ends up in a subsequent MR job. Since data crossing an MR job boundary is not
order-preserving, the script produces the wrong TopN results.
The proposed solution is to replace the limit operator with an
orderby-with-limit operator, i.e. the orderby operation runs twice, once for
the full output and once more together with the limit. This ensures the
correctness of the TopN results.
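To make the failure mode concrete, here is a small Python sketch of the
behaviour described above. It is only a toy model, not Pig's implementation:
the function names (order_by_desc, limit_in_next_job, order_with_limit) and
the idea of modelling the sorted output as a list of part files read in an
arbitrary order are assumptions made for illustration.

```python
import heapq


def order_by_desc(values, parallel):
    """Model a total-order sort: return up to `parallel` part files, each
    sorted descending, with part 0 holding the largest values."""
    ranked = sorted(values, reverse=True)
    size = -(-len(ranked) // parallel)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def limit_in_next_job(parts, n, arrival_order):
    """Model a plain limit running in a later job: the parts are read in
    `arrival_order` (hypothetical), which need not match the sort order."""
    rows = [v for idx in arrival_order for v in parts[idx]]
    return rows[:n]


def order_with_limit(parts, n):
    """Model the orderby-with-limit: sort the rows again, keep the first n."""
    return heapq.nlargest(n, (v for part in parts for v in part))


values = list(range(100))
parts = order_by_desc(values, parallel=10)

# Parts arrive in reverse order, so the plain limit keeps the smallest rows.
naive_topn = limit_in_next_job(parts, 10, arrival_order=range(9, -1, -1))

# Sorting again before limiting does not depend on the arrival order.
fixed_topn = order_with_limit(parts, 10)

print(naive_topn)
print(fixed_topn)
```

In this model the plain limit is correct only if the parts happen to arrive
in sort order, while the orderby-with-limit gives the same answer for any
arrival order, at the cost of a second sort.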
> Top-N queries produce incorrect results when a store statement is added
> between order by and limit statement
> ------------------------------------------------------------------------------------------------------------
>
> Key: PIG-1169
> URL: https://issues.apache.org/jira/browse/PIG-1169
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.7.0
>
>
> ??We tried to get top N results after a groupby and sort, and got different
> results with or without storing the full sorted results. Here is a skeleton
> of our pig script.??
> {code}
> raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
> grouped = group raw_data by (f1, f2);
> data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
> ordered = order data by value DESC parallel 10;
> topn = limit ordered 10;
> store ordered into 'outputdir/full';
> store topn into 'outputdir/topn';
> {code}
> ??With the statement 'store ordered ...', top N results are incorrect, but
> without the statement, results are correct. Has anyone seen this before? I
> know a similar bug has been fixed in the multi-query release. We are on Pig
> 0.4 and Hadoop 0.20.1.??