[ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606136#action_12606136 ]

Daniel Dai commented on PIG-171:
--------------------------------

I am working on the implementation of "LIMIT". There are two use cases for 
"LIMIT": sampling and top k. We can keep "LIMIT" as a "quick and dirty" way to 
get a sample; the sample is not perfect, but it is cheaper. We can treat a 
dedicated "SAMPLE" statement as a separate case if people ask for it.
The current design of "LIMIT" is:
# There is a single interface to the user, e.g.,
   B = limit A 100;
   We generate LOLimit for this statement.
# The logical layer optimizer will try to push LOLimit up; if there is a LOSort in 
front of LOLimit, it merges LOLimit into LOSort and puts a hint on LOSort. LOSort 
will do a "limited sort" in this case.
# In the physical layer, we have LOLimit->POLimit and LOSort->POSort corresponding 
to this new operation
   #* POLimit will retrieve the first k records
   #* POSort will do a limited sort if there is a "limit" hint, optimizing the 
internal sorting process (see the sketch below)
# In map-reduce
   #* POLimit can be put in
      \* the reduce plan, or
      \* both the map plan and the reduce plan, to speed things up further
     If the map plan is already done, simply put POLimit into the reduce plan. If 
not, add POLimit to the map plan, close the map plan, and then put another POLimit 
into the reduce plan. In that case we may increase the number of map-reduce 
jobs by 1.
   #* POSort is compiled into two map-reduce jobs: getQuantile and sort. 
There is no change in the first job; for the second, we will:
     \* keep only the top k records in each mapper (this code should go into the 
combine plan), and
     \* put POLimit in the reduce plan
Also, we need to take the new Hadoop 18 combiner behavior into account 
(https://issues.apache.org/jira/browse/PIG-274).
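
To make the POLimit and "limited sort" pieces above concrete, here is a minimal 
standalone sketch (plain Java with made-up names; it only illustrates the idea and 
is not the actual POLimit/POSort code in Pig):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class LimitSketch {

    // POLimit idea: pass at most 'limit' records through, then stop reading.
    static <T> List<T> firstK(Iterator<T> input, int limit) {
        List<T> out = new ArrayList<T>(limit);
        while (input.hasNext() && out.size() < limit) {
            out.add(input.next());
        }
        return out;
    }

    // "Limited sort" idea: keep only the k smallest values seen so far in a
    // bounded max-heap, so a mapper/combiner emits at most k records.
    static List<Long> limitedSort(Iterator<Long> input, int k) {
        PriorityQueue<Long> heap =
                new PriorityQueue<Long>(k, Collections.<Long>reverseOrder());
        while (input.hasNext()) {
            long v = input.next();
            if (heap.size() < k) {
                heap.add(v);
            } else if (v < heap.peek()) {   // smaller than the largest kept value
                heap.poll();
                heap.add(v);
            }
        }
        List<Long> result = new ArrayList<Long>(heap);
        Collections.sort(result);           // ascending, like an ORDER BY ... ASC
        return result;
    }
}
{code}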

What do others think?

> Top K
> -----
>
>                 Key: PIG-171
>                 URL: https://issues.apache.org/jira/browse/PIG-171
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Amir Youssefi
>            Assignee: Amir Youssefi
>
> Frequently, users are interested in Top results (especially Top K rows). 
> This can be implemented efficiently in Pig/Map-Reduce settings to deliver 
> rapid results with low network bandwidth/memory usage.
>  
> The key point is to prune the data on the map side and keep only a small set of 
> rows satisfying the Top criteria. We can do this in an Algebraic function 
> (combiner) with multiple-value output, so only a small data set leaves each mapper node.
> The same idea is applicable to variants of this problem:
>   - An Algebraic Function for 'Top K Rows'
>   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense 
> Rank K')
>   - TOP K ORDER BY.
> In other words, the implementation is similar to combiners for aggregate functions, 
> but instead of one value we get multiple ones. 
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY 
> to clarify details.
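
As a rough illustration of the "Algebraic function (combiner) with multiple value 
output" idea described above: each mapper/combiner would emit its own top-k list, 
and the reduce side would merge those partial lists and again keep only k records. 
The sketch below uses made-up names and plain Java; it is not Pig API code.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopKMerge {

    // Merge several per-mapper top-k lists into one final top-k list,
    // keeping the k largest values overall in a bounded min-heap.
    static List<Long> mergeTopK(List<List<Long>> partials, int k) {
        PriorityQueue<Long> heap = new PriorityQueue<Long>(k);
        for (List<Long> partial : partials) {
            for (long v : partial) {
                if (heap.size() < k) {
                    heap.add(v);
                } else if (v > heap.peek()) {   // larger than the smallest kept value
                    heap.poll();
                    heap.add(v);
                }
            }
        }
        List<Long> result = new ArrayList<Long>(heap);
        Collections.sort(result, Collections.<Long>reverseOrder());  // largest first
        return result;
    }
}
{code}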
