[
https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598862#action_12598862
]
amirhyoussefi edited comment on PIG-171 at 5/21/08 4:46 PM:
------------------------------------------------------------
As Pi Song mentioned LIMIT n,m (or LIMIT n TO m) doesn't show many optimization
opportunities on map side. We have to send all top m rows from mapper to
reducer.
Now consider we have those sent to one reducer. Next step is to run TOP (m-n+1)
function over results of maps already sent to reducer. We can use the same
MERGER as we had in simple LIMIT but this time make it keep only TOP (m-n+1)
while it merges locally and we are done.
I don't recommend making it more complicated but if somebody is concerned about
chocking one reducer with large m row data-sets we can write a custom version
that combines partial results of the first step to multiple reducers creating
TOP (m-n+1). A second Map Reduce job gets TOP (m-n+1) with single reducer and
writes final results.
I suggest we think about LIMIT n,m as extension of LIMIT n and do that later.
was (Author: amirhyoussefi):
As Pi Song mentioned LIMIT n,m (or LIMIT n TO m) doesn't show many
optimization opportunities on map side. We have to send all top m rows from
mapper to reducer.
Now consider we have those sent to one reducer. Next step is to run BOTTOM
(m-n+1) function over results of maps already sent to reducer. We can have a
BOTTOM AWARE MERGER (which is reverse of results on one node and we are done.
I don't recommend making it more complicated but if somebody is concerned about
chocking one reducer with large m row data-sets we can send results of the
first step to multiple reducers and then running a second Map Reduce job with
TOP (m-n+1) with comparator of -1 * MY_COMPARATOR() and single reducer to get
the final results.
I suggest we think about LIMIT n,m as extension of LIMIT n and do that later.
> Top K
> -----
>
> Key: PIG-171
> URL: https://issues.apache.org/jira/browse/PIG-171
> Project: Pig
> Issue Type: New Feature
> Reporter: Amir Youssefi
> Assignee: Amir Youssefi
>
> Frequently, users are interested on Top results (especially Top K rows) .
> This can be implemented efficiently in Pig /Map Reduce settings to deliver
> rapid results and low Network Bandwidth/Memory usage.
>
> Key point is to prune all data on the map side and keep only small set of
> rows with Top criteria . We can do it in Algebraic function (combiner) with
> multiple value output. Only a small data-set gets out of mapper node.
> The same idea is applicable to solve variants of this problem:
> - An Algebraic Function for 'Top K Rows'
> - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense
> Rank K')
> - TOP K ORDER BY.
> Another words implementation is similar to combiners for aggregate functions
> but instead of one value we get multiple ones.
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY
> to clarify details.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.