[jira] Commented: (PIG-171) Top K

Alan Gates (JIRA) Fri, 28 Mar 2008 10:46:42 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583106#action_12583106
 ]


Alan Gates commented on PIG-171:
--------------------------------

A few questions/comments:

1) In your example of TOP(123, A) rows, what does the A mean?

2) I don't understand the differentiation between the three bullet points you 
give in the description.  Could you elaborate and give examples of how each 
would be used?

3) You propose doing this as a UDF, but that only gives you some of what we 
really want.  A UDF will allow Pig to use the combiner.  Eventually, to offer 
full functionality, we'll want to be able to do this on non-grouped/ordered 
data (just being able to see the first X records of a file is great for 
experimentation and development).  This doesn't mean we can't support it as a 
UDF for now, and promote it later.  But it does mean we need to think 
carefully about how we want to do it.

4) You're counting on using the combiner to make this efficient.  But in the 
current implementation the combiner won't be used except in very specific 
circumstances (a group by followed by a foreach that includes the group).  
General use of the combiner won't be in place until the pipeline rework is 
ready.

5) Syntax question: do we want to use TOPK or LIMIT?  I tend to think of TOPK 
as implying the top results of an aggregation, whereas LIMIT just means a 
certain number of rows, without implying any grouping.  Maybe others don't 
make this distinction.  LIMIT also allows an offset (give me rows 10000-20000) 
in addition to allowing just the first X rows.  I don't care which we use, but 
it seems like we ought to discuss it in case some people have strong views one 
way or another.
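
To make the distinction in (5) concrete, here is a minimal Python sketch (not 
Pig Latin, and not a proposed syntax).  The function names limit and top_k and 
the sample rows are hypothetical, chosen only for illustration: limit returns a 
window of rows in input order with an optional offset, while top_k orders by a 
key before cutting.

```python
import heapq
from itertools import islice


def limit(rows, count, offset=0):
    """Return `count` rows starting at `offset`, in input order.

    No sorting or grouping is implied (hypothetical helper).
    """
    return list(islice(rows, offset, offset + count))


def top_k(rows, k, key):
    """Return the k rows with the largest `key`, largest first.

    Hypothetical helper; implies an ordering criterion, unlike limit().
    """
    return heapq.nlargest(k, rows, key=key)


# Made-up sample data: (name, score) pairs.
rows = [("a", 3), ("b", 9), ("c", 1), ("d", 7), ("e", 5)]

first_two = limit(rows, 2)                      # first rows as stored
window = limit(rows, 2, offset=2)               # rows 3-4 as stored
best_two = top_k(rows, 2, key=lambda r: r[1])   # highest scores
```

Under these assumptions the two operations only coincide when the input 
happens to be sorted by the key already.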

> Top K
> -----
>
>                 Key: PIG-171
>                 URL: https://issues.apache.org/jira/browse/PIG-171
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Amir Youssefi
>            Assignee: Amir Youssefi
>
> Frequently, users are interested in top results (especially the top K rows). 
> This can be implemented efficiently in Pig/MapReduce settings to deliver 
> rapid results with low network bandwidth and memory usage.
>  
> The key point is to prune data on the map side and keep only a small set of 
> rows that meet the top criteria. We can do this in an Algebraic function 
> (combiner) with multiple-value output, so only a small data set leaves each 
> mapper node.
> The same idea applies to variants of this problem:
>   - An Algebraic Function for 'Top K Rows'
>   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense 
> Rank K')
>   - TOP K ORDER BY.
> In other words, the implementation is similar to combiners for aggregate 
> functions, but instead of one value we get multiple ones. 
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY 
> to clarify details.
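
The map-side pruning idea in the description can be sketched as follows.  This 
is a plain Python illustration, not the reporter's implementation and not Pig's 
UDF API.  The names initial, intermediate and final, the value of K, and the 
sample inputs are all assumptions; they loosely stand for the map, combine and 
reduce stages of an algebraic function that emits up to K values instead of 
one.

```python
import heapq

K = 3  # hypothetical K, chosen only for this example


def initial(rows, k=K):
    """Map side: keep only the k largest rows seen by one mapper."""
    return heapq.nlargest(k, rows)


def intermediate(partials, k=K):
    """Combine step: merge partial top-k lists into at most k rows."""
    return heapq.nlargest(k, (row for part in partials for row in part))


def final(partials, k=K):
    """Reduce side: the same merge, producing the global top k."""
    return intermediate(partials, k)


# Made-up data: each inner list stands for the rows seen by one mapper.
mapper_inputs = [[5, 1, 9, 3], [8, 2, 7], [4, 10, 6]]

# Each mapper ships at most K rows, regardless of its input size.
partials = [initial(rows) for rows in mapper_inputs]
result = final(partials)
```

The sketch works because the top K of a union is always contained in the union 
of the per-part top K lists, so pruning early loses nothing.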

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
