Yes, I agree. TOP and SAMPLE are different operators.

Haijun

-----Original Message-----
From: Daniel Dai (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 19, 2008 12:20 PM
To: [email protected]
Subject: [jira] Commented: (PIG-171) Top K

[ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606518#action_12606518 ]

Daniel Dai commented on PIG-171:
--------------------------------

If we use "SAMPLE" instead of "LIMIT" for the first k rows of output, people
will expect a fairly random sample. They may not notice that the sample
they've got is just the "first k". To me, that seems more confusing. What Pi
suggested is a dedicated "SAMPLE" operator. It should produce a random sample
and should have a different implementation. What do you think?

> Top K
> -----
>
> Key: PIG-171
> URL: https://issues.apache.org/jira/browse/PIG-171
> Project: Pig
> Issue Type: New Feature
> Reporter: Amir Youssefi
>
> Frequently, users are interested in top results (especially the Top K rows).
> This can be implemented efficiently in Pig/MapReduce settings to deliver
> rapid results with low network bandwidth and memory usage.
>
> The key point is to prune the data on the map side and keep only the small
> set of rows that meet the Top criteria. We can do this in an Algebraic
> function (combiner) with multiple-value output, so only a small data set
> leaves each mapper node.
> The same idea applies to variants of this problem:
> - An Algebraic function for 'Top K Rows'
> - An Algebraic function for 'Top K' values ('Top Rank K' and 'Top Dense
>   Rank K')
> - TOP K ORDER BY
> In other words, the implementation is similar to combiners for aggregate
> functions, but instead of one value we get multiple ones.
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY
> to clarify details.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
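
For illustration only, here is a minimal Python sketch of the three behaviours discussed in this thread: map-side Top K pruning as described in the issue, a LIMIT-style "first k", and a SAMPLE-style random sample. This is not Pig code and not the implementation proposed in PIG-171. The function names (top_k_partial, top_k, first_k, random_sample) and the list-of-lists stand-in for mapper partitions are hypothetical, and reservoir sampling is just one assumed way to get a uniform sample.

```python
import heapq
import random
from itertools import islice


def top_k_partial(rows, k, key=lambda r: r):
    """Map/combine step (hypothetical): keep only the k largest rows of one partition."""
    return heapq.nlargest(k, rows, key=key)


def top_k(partitions, k, key=lambda r: r):
    """Reduce step (hypothetical): merge per-partition candidates, then prune again."""
    candidates = []
    for part in partitions:
        # At most k rows leave each simulated mapper.
        candidates.extend(top_k_partial(part, k, key))
    return heapq.nlargest(k, candidates, key=key)


def first_k(rows, k):
    """LIMIT-like: the first k rows in arrival order. This is not a random sample."""
    return list(islice(rows, k))


def random_sample(rows, k, rng=random):
    """SAMPLE-like: uniform reservoir sample of k rows (assumed technique)."""
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Each inner list stands in for the rows seen by one mapper.
partitions = [[5, 1, 9, 3], [7, 2, 8], [4, 6]]
print(top_k(partitions, 3))       # [9, 8, 7]
print(first_k([5, 1, 9, 3], 3))   # [5, 1, 9]
print(random_sample(range(100), 3))
```

The partial step can be applied repeatedly because the top k of a union is always contained in the union of each part's top k, which is what makes the combiner-style pruning safe. The first-k and random-sample functions return different rows in general, which is the reason for keeping them as separate operators.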
