[
https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631554#action_12631554
]
Shravan Matthur Narayanamurthy commented on PIG-364:
----------------------------------------------------
I see a few problems with the patch.
1) When you insert the new limitAdjustMROp MapReduceOper into the plan, the
case where multiple successors exist is not handled. For instance, if there is
a split after the limit, you will only insert the limitMROp for the first
outgoing edge (see the first sketch after these points).
2) This is more of a semantic issue. With the approach in the patch, the
semantics of limit do not hold. Consider the following:
A = load 'URLs' as (url:chararray, pagerank:double);
B = order A by pagerank parallel 100;
C = limit B 10;
D = foreach C generate url, CRAWL(url);
store D into 'crawledpages';
Here I would expect to crawl only the top 10 pages. However, with the current
patch, I would probably crawl up to 1000 pages (10 from each of the 100
reducers) and only then trim the result to 10 (the second sketch after these
points spells out the arithmetic). This might not be what users want.
3) If we do decide to go with this approach after fixing 1, it is probably a
good idea to introduce a limit operator into the map plan of limitAdjustMROp
(see the third sketch after these points).
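
To make 1) concrete, below is a minimal sketch of splicing the new operator
onto every outgoing edge of the limited operator instead of only the first
one. The adjacency-list plan, the operator names, and the insertLimitAdjust
helper are all invented for this sketch; it is not written against Pig's real
plan classes.

import java.util.*;

// Toy MR plan as an adjacency list of operator names. The point: the new
// limit-adjusting operator has to be spliced onto EVERY outgoing edge of the
// limited operator, not just the first one.
public class LimitAdjustSketch {

    static Map<String, List<String>> plan = new LinkedHashMap<>();

    static void connect(String from, String to) {
        plan.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        plan.computeIfAbsent(to, k -> new ArrayList<>());
    }

    // Insert limitAdjust between 'limited' and all of its successors.
    static void insertLimitAdjust(String limited, String limitAdjust) {
        List<String> successors =
                new ArrayList<>(plan.getOrDefault(limited, List.of()));
        plan.put(limited, new ArrayList<>()); // drop the old outgoing edges
        connect(limited, limitAdjust);        // limited -> limitAdjust
        for (String succ : successors) {      // limitAdjust -> each old successor
            connect(limitAdjust, succ);
        }
    }

    public static void main(String[] args) {
        // A split after the limit gives the limited operator two successors.
        connect("limitMROp", "storeA");
        connect("limitMROp", "storeB");

        insertLimitAdjust("limitMROp", "limitAdjustMROp");

        // Prints {limitMROp=[limitAdjustMROp], storeA=[], storeB=[],
        //         limitAdjustMROp=[storeA, storeB]}
        System.out.println(plan);
    }
}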
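
For 2), here is the rough arithmetic behind the concern, with the reducer
count and limit taken from the script above. This is plain Java mirroring the
numbers, not Pig code.

// With "parallel 100", each of the 100 reducers can keep its own 10 records,
// so up to 100 * 10 = 1000 records reach the foreach (and the CRAWL UDF)
// unless a final single-reduce limit runs before it.
public class LimitSemanticsSketch {
    public static void main(String[] args) {
        int reducers = 100;  // parallel 100
        int k = 10;          // limit B 10

        // Each reducer applies limit k locally; the outputs are concatenated.
        long withPerReducerLimit = (long) reducers * k;  // up to 1000 records
        long expected = k;                               // the global top 10

        System.out.println("records CRAWL may run on: " + withPerReducerLimit);
        System.out.println("records CRAWL should run on: " + expected);
    }
}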
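
And a toy model of the suggestion in 3): give the new limitAdjustMROp a limit
operator in its map plan as well, so each map task already trims its input to
k records before the single reducer applies the final limit. The MROp and Op
classes and the limit helper are stand-ins made up for this sketch, not Pig's
MapReduceOper or its physical operators.

import java.util.ArrayList;
import java.util.List;

public class LimitInMapSketch {

    // A plan step that transforms a batch of records.
    interface Op { List<Long> apply(List<Long> in); }

    // Keeps only the first k records it sees, like a limit operator would.
    static Op limit(int k) {
        return in -> in.subList(0, Math.min(k, in.size()));
    }

    // Stand-in for a map-reduce operator with a map plan and a reduce plan.
    static class MROp {
        final List<Op> mapPlan = new ArrayList<>();
        final List<Op> reducePlan = new ArrayList<>();
    }

    public static void main(String[] args) {
        int k = 10;
        MROp limitAdjustMROp = new MROp();
        limitAdjustMROp.mapPlan.add(limit(k));     // trim early, in each map task
        limitAdjustMROp.reducePlan.add(limit(k));  // final trim in the single reducer

        // A map task fed 1000 records forwards only 10 of them to the shuffle.
        List<Long> mapInput = new ArrayList<>();
        for (long i = 0; i < 1000; i++) mapInput.add(i);
        List<Long> out = mapInput;
        for (Op op : limitAdjustMROp.mapPlan) out = op.apply(out);
        System.out.println(out.size());  // prints 10
    }
}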
> Limit returns incorrect records when we use multiple reducers
> -------------------------------------------------------------
>
> Key: PIG-364
> URL: https://issues.apache.org/jira/browse/PIG-364
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Fix For: types_branch
>
> Attachments: PIG-364-2.patch, PIG-364.patch
>
>
> Currently we put the Limit(k) operator in the reducer plan. However, in the
> case of n reducers, we will get up to n*k output records.