[ 
https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603362#action_12603362
 ] 

Pi Song commented on PIG-171:
-----------------------------

Ted (From mailing-list):
bq. An efficient implementation of top K without full histogramming would still 
be very, very useful.

Logically (not from experience), I am still concerned about TOP K without an 
order. Does it really have a good use? The formal definition of TOP K always 
comes with a scoring function. Naturally, we also say we want the TOP K ordered 
by something.

The only use case I can think of for TOP K without an order is working with 
sample data. But TOP K is not going to give a statistically good 
representation, because it depends on the order in which rows happen to 
arrive. I think it would be better to design the language so that it does not 
let people do the wrong thing.

If people want to do approximate queries, I think we had better provide a 
proper way, such as adding:

{code}
X = SAMPLE 10% OF A ;
Y = SAMPLE 100 OF B ;
{code}
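
To make the intended semantics concrete, here is a rough sketch in Python (not Pig) of what the two forms might mean. This is only my assumption: SAMPLE 10% OF A as keeping each row independently with probability 0.1, and SAMPLE 100 OF B as a uniform fixed-size sample taken in one pass (reservoir sampling). The function names sample_fraction and sample_count are made up for illustration and are not part of Pig.

```python
import random


def sample_fraction(rows, fraction, rng):
    """Bernoulli sampling: keep each row independently with probability `fraction`.

    Hypothetical stand-in for `SAMPLE 10% OF A` (fraction = 0.10).
    The output size is random; only its expected value is fraction * len(rows).
    """
    return [row for row in rows if rng.random() < fraction]


def sample_count(rows, k, rng):
    """Reservoir sampling: a uniform sample of k rows in a single pass.

    Hypothetical stand-in for `SAMPLE 100 OF B` (k = 100).
    If the input has fewer than k rows, all of them are returned.
    """
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    rng = random.Random(42)
    data = range(10_000)
    print(len(sample_fraction(data, 0.10, rng)))  # roughly 1000, varies with the seed
    print(len(sample_count(data, 100, rng)))
```

Unlike taking the first K rows, both forms give every row the same chance of being picked, regardless of the order in which the rows arrive.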

What do you think?

> Top K
> -----
>
>                 Key: PIG-171
>                 URL: https://issues.apache.org/jira/browse/PIG-171
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Amir Youssefi
>            Assignee: Amir Youssefi
>
> Frequently, users are interested in top results (especially the top K rows). 
> This can be implemented efficiently in Pig/MapReduce settings to deliver 
> rapid results with low network bandwidth and memory usage.
>  
>  The key point is to prune data on the map side and keep only a small set of 
> rows that meet the top criteria. We can do this in an Algebraic function 
> (combiner) with multiple-value output. Only a small data set leaves the 
> mapper node.
> The same idea applies to variants of this problem:
>   - An Algebraic Function for 'Top K Rows'
>   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense 
> Rank K')
>   - TOP K ORDER BY.
> In other words, the implementation is similar to combiners for aggregate 
> functions, but instead of one value we get multiple ones. 
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY 
> to clarify details.
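
For reference, the map-side pruning described in the issue can be sketched roughly as below. This is a Python illustration under my own assumptions, not the Pig Algebraic interface and not the reporter's implementation; top_k_partial, top_k_merge, and the score argument are hypothetical names. Note that it needs a scoring function, which is the point made in the comment above.

```python
import heapq


def top_k_partial(rows, k, score):
    """Map/combiner side: keep only the k highest-scoring rows of one split."""
    return heapq.nlargest(k, rows, key=score)


def top_k_merge(partials, k, score):
    """Reduce side: merge the small partial lists and take the global top k.

    This is correct because every row in the global top k must also be in the
    top k of the split it came from.
    """
    merged = [row for partial in partials for row in partial]
    return heapq.nlargest(k, merged, key=score)


if __name__ == "__main__":
    splits = [[5, 1, 9], [7, 3], [8, 2, 6]]  # three hypothetical map inputs
    partials = [top_k_partial(s, 2, score=lambda x: x) for s in splits]
    print(top_k_merge(partials, 2, score=lambda x: x))  # [9, 8]
```

Each mapper emits at most k rows, so the data shipped to the reducer is bounded by k times the number of splits.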

