[ 
https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625601#action_12625601
 ] 

Daniel Dai commented on PIG-171:
--------------------------------

Hi Amir,
Thanks for your suggestion. Here are my concerns:

1. To find an optimal setting automatically, we would need to run experiments 
with different settings on various clusters, which does not seem possible for 
now. So for the time being we will take the parameters from the user: the user 
has to tell Pig whether to use multi-stage, and the number of stages / fan-out. 
Since we do not have pig.property now, shall we put them on the command line?

2. This issue sounds more general in nature to me. Many map-reduce jobs can be 
divided into multiple stages. So at which layer should it be addressed: in the 
limit operator, the Pig layer, or the map-reduce layer?
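
To make the two parameters in point 1 concrete, here is a minimal Python sketch 
of a multi-stage limit. It is only an illustration, not Pig code: the function 
names, the `limit` and `fan_out` parameters, and the assumption that each stage 
merges `fan_out` partial results and keeps the first `limit` rows are all 
hypothetical.

```python
from typing import List, Sequence, Tuple


def limit_stage(partials: Sequence[Sequence[int]], limit: int,
                fan_out: int) -> List[List[int]]:
    """Run one stage: merge every `fan_out` partial results, keep `limit` rows."""
    merged = []
    for start in range(0, len(partials), fan_out):
        group = partials[start:start + fan_out]
        rows = [row for part in group for row in part]
        merged.append(rows[:limit])
    return merged


def multi_stage_limit(partials: Sequence[Sequence[int]], limit: int,
                      fan_out: int) -> Tuple[List[int], int]:
    """Repeat stages until one result is left; return (rows, stages used)."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    # Each producer already trims its own output to `limit` rows.
    current = [list(part[:limit]) for part in partials]
    stages = 0
    while len(current) > 1:
        current = limit_stage(current, limit, fan_out)
        stages += 1
    return (current[0] if current else []), stages
```

Under these assumptions, a smaller fan-out means more stages but less data 
merged per task; with 8 partial results, a fan-out of 2 needs 3 stages while a 
fan-out of 8 needs 1.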

> Top K
> -----
>
>                 Key: PIG-171
>                 URL: https://issues.apache.org/jira/browse/PIG-171
>             Project: Pig
>          Issue Type: Sub-task
>    Affects Versions: types_branch
>            Reporter: Amir Youssefi
>             Fix For: types_branch
>
>         Attachments: limit1.patch, limit2.patch, limit3.patch
>
>
> Frequently, users are interested in top results (especially the Top K rows). 
> This can be implemented efficiently in the Pig / Map Reduce setting to deliver 
> rapid results with low network bandwidth and memory usage.
>  
>  The key point is to prune the data on the map side and keep only a small set 
> of rows meeting the Top criteria. We can do this in an Algebraic function 
> (combiner) with multiple-value output, so that only a small data set leaves 
> the mapper node.
> The same idea can be applied to variants of this problem:
>   - An Algebraic Function for 'Top K Rows'
>   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense 
> Rank K')
>   - TOP K ORDER BY.
> In other words, the implementation is similar to combiners for aggregate 
> functions, but instead of one value we get multiple ones. 
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY 
> to clarify details.
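
The map-side pruning described in the quoted issue can be sketched roughly as 
follows. This is a minimal Python illustration, not the attached patches and 
not Pig's Algebraic interface: the function names `initial`, `intermediate` and 
`final` are hypothetical stand-ins for the map, combiner and reduce steps, and 
rows are assumed to be plain comparable values.

```python
import heapq
from itertools import chain
from typing import Iterable, List


def initial(rows: Iterable[int], k: int) -> List[int]:
    """Map side: keep only the k largest rows of one input split."""
    return heapq.nlargest(k, rows)


def intermediate(partials: Iterable[List[int]], k: int) -> List[int]:
    """Combiner: merge several partial top-k lists into one top-k list."""
    return heapq.nlargest(k, chain.from_iterable(partials))


def final(partials: Iterable[List[int]], k: int) -> List[int]:
    """Reduce side: same merge, producing the overall top k."""
    return intermediate(partials, k)


def top_k(splits: Iterable[Iterable[int]], k: int) -> List[int]:
    """Run the three steps over all splits, as a single-process stand-in."""
    map_outputs = [initial(split, k) for split in splits]
    return final([intermediate(map_outputs, k)], k)
```

Each step emits at most k rows, so the amount of data leaving a mapper is 
bounded by k regardless of the split size. Unlike a combiner for an aggregate 
such as SUM, every step returns a list rather than a single value.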

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
