[
https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631554#action_12631554
]
Shravan Matthur Narayanamurthy commented on PIG-364:
----------------------------------------------------
I see a few problems with the patch.
1) When you insert the new limitAdjustMROp MapReduceOper into the plan, the
case where multiple successors exist is not handled. For instance, if there is
a split after the limit, you will only insert the limitMROp for the first
outgoing edge (see the first sketch after these points).
2) This is more of a semantic issue. With the approach in the patch, the
semantics of limit do not hold. Consider the following:
A = load 'URLs' as (url:chararray, pagerank:double);
B = order A by pagerank parallel 100;
C = limit B 10;
D = foreach C generate url, CRAWL(url);
store D into 'crawledpages';
Here I would expect to crawl only the top 10 pages. However, with the current
patch, I would probably crawl up to 1000 pages (10 from each of the 100
reducers) and only then trim the result to 10 (the second sketch after these
points spells out the arithmetic). This might not be what users want.
3) If we do decide to go with this approach after fixing 1, it is probably a
good idea to introduce a limit operator into the map plan of limitAdjustMROp
(see the third sketch after these points).
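
To make 1) concrete, below is a minimal sketch of splicing the new operator
onto every outgoing edge of the limited operator instead of only the first
one. The adjacency-list plan, the operator names, and the insertLimitAdjust
helper are all invented for this sketch; it is not written against Pig's real
plan classes.

import java.util.*;

// Toy MR plan as an adjacency list of operator names. The point: the new
// limit-adjusting operator has to be spliced onto EVERY outgoing edge of the
// limited operator, not just the first one.
public class LimitAdjustSketch {

    static Map<String, List<String>> plan = new LinkedHashMap<>();

    static void connect(String from, String to) {
        plan.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        plan.computeIfAbsent(to, k -> new ArrayList<>());
    }

    // Insert limitAdjust between 'limited' and all of its successors.
    static void insertLimitAdjust(String limited, String limitAdjust) {
        List<String> successors =
                new ArrayList<>(plan.getOrDefault(limited, List.of()));
        plan.put(limited, new ArrayList<>()); // drop the old outgoing edges
        connect(limited, limitAdjust);        // limited -> limitAdjust
        for (String succ : successors) {      // limitAdjust -> each old successor
            connect(limitAdjust, succ);
        }
    }

    public static void main(String[] args) {
        // A split after the limit gives the limited operator two successors.
        connect("limitMROp", "storeA");
        connect("limitMROp", "storeB");

        insertLimitAdjust("limitMROp", "limitAdjustMROp");

        // Prints {limitMROp=[limitAdjustMROp], storeA=[], storeB=[],
        //         limitAdjustMROp=[storeA, storeB]}
        System.out.println(plan);
    }
}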
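
For 2), here is the rough arithmetic behind the concern, with the reducer
count and limit taken from the script above. This is plain Java mirroring the
numbers, not Pig code.

// With "parallel 100", each of the 100 reducers can keep its own 10 records,
// so up to 100 * 10 = 1000 records reach the foreach (and the CRAWL UDF)
// unless a final single-reduce limit runs before it.
public class LimitSemanticsSketch {
    public static void main(String[] args) {
        int reducers = 100;  // parallel 100
        int k = 10;          // limit B 10

        // Each reducer applies limit k locally; the outputs are concatenated.
        long withPerReducerLimit = (long) reducers * k;  // up to 1000 records
        long expected = k;                               // the global top 10

        System.out.println("records CRAWL may run on: " + withPerReducerLimit);
        System.out.println("records CRAWL should run on: " + expected);
    }
}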
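
And a toy model of the suggestion in 3): give the new limitAdjustMROp a limit
operator in its map plan as well, so each map task already trims its input to
k records before the single reducer applies the final limit. The MROp and Op
classes and the limit helper are stand-ins made up for this sketch, not Pig's
MapReduceOper or its physical operators.

import java.util.ArrayList;
import java.util.List;

public class LimitInMapSketch {

    // A plan step that transforms a batch of records.
    interface Op { List<Long> apply(List<Long> in); }

    // Keeps only the first k records it sees, like a limit operator would.
    static Op limit(int k) {
        return in -> in.subList(0, Math.min(k, in.size()));
    }

    // Stand-in for a map-reduce operator with a map plan and a reduce plan.
    static class MROp {
        final List<Op> mapPlan = new ArrayList<>();
        final List<Op> reducePlan = new ArrayList<>();
    }

    public static void main(String[] args) {
        int k = 10;
        MROp limitAdjustMROp = new MROp();
        limitAdjustMROp.mapPlan.add(limit(k));     // trim early, in each map task
        limitAdjustMROp.reducePlan.add(limit(k));  // final trim in the single reducer

        // A map task fed 1000 records forwards only 10 of them to the shuffle.
        List<Long> mapInput = new ArrayList<>();
        for (long i = 0; i < 1000; i++) mapInput.add(i);
        List<Long> out = mapInput;
        for (Op op : limitAdjustMROp.mapPlan) out = op.apply(out);
        System.out.println(out.size());  // prints 10
    }
}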
> Limit returns incorrect records when we use multiple reducers
> -------------------------------------------------------------
>
> Key: PIG-364
> URL: https://issues.apache.org/jira/browse/PIG-364
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Fix For: types_branch
>
> Attachments: PIG-364-2.patch, PIG-364.patch
>
>
> Currently we put the Limit(k) operator in the reducer plan. However, in the
> case of n reducers, we will get up to n*k output records.