[ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611703#action_12611703 ]
Alan Gates commented on PIG-171:
--------------------------------
Daniel, the patch looks good. A few small comments:
1) In LOLimit, I think Santhosh has gone back and changed all the getSchema
calls to check only mIsSchemaComputed, removing the check for whether
mSchema is null.
2) In POLimit, it's swallowing nulls. I don't think it should. Nulls should
be returned and counted as one of the returned records.
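
To illustrate point 2, here is a minimal sketch of a limit that passes nulls
through and counts them toward the limit. It is not the actual POLimit code;
the class and method names (LimitSketch, limit) are hypothetical, and a plain
Iterator stands in for Pig's operator pipeline.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a limit operator; not Pig's POLimit.
final class LimitSketch {
    // Returns at most `limit` records from `input`, in input order.
    static <T> List<T> limit(Iterator<T> input, long limit) {
        List<T> out = new ArrayList<>();
        while (out.size() < limit && input.hasNext()) {
            // A null record is emitted like any other record, so it
            // counts toward the limit instead of being skipped.
            out.add(input.next());
        }
        return out;
    }
}
```

With the input (a, null, b, c) and a limit of 3, this returns (a, null, b):
the null takes up one of the three slots.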
This patch also makes use of the combiner. I want to add general combiner
functionality next week, so I'm going to hold off applying this until I've
figured out in general how I want to push things into the combiner.
> Top K
> -----
>
> Key: PIG-171
> URL: https://issues.apache.org/jira/browse/PIG-171
> Project: Pig
> Issue Type: New Feature
> Reporter: Amir Youssefi
> Attachments: limit1.patch, limit2.patch
>
>
> Frequently, users are interested in top results (especially Top K rows).
> This can be implemented efficiently in Pig/MapReduce settings to deliver
> rapid results and low network bandwidth and memory usage.
>
> The key point is to prune the data on the map side and keep only a small set
> of rows that meet the Top criteria. We can do it in an Algebraic function
> (combiner) with multiple-value output. Only a small data set leaves the mapper node.
> The same idea is applicable to solve variants of this problem:
> - An Algebraic Function for 'Top K Rows'
> - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense
> Rank K')
> - TOP K ORDER BY.
> In other words, the implementation is similar to combiners for aggregate
> functions, but instead of one value we get multiple ones.
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY
> to clarify details.
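
The map-side pruning described in the issue can be sketched roughly as
follows. This is an illustrative sketch only, not the attached patch or Pig's
Algebraic interface: the class and method names (TopKSketch, topK, merge) are
hypothetical, and plain non-null Long values stand in for rows. Each
partition keeps at most K values in a bounded min-heap, and the partial
results are merged by applying the same pruning again.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of combiner-style Top K; not Pig code.
final class TopKSketch {
    // Map/combine step: keep only the k largest values of one partition.
    // Assumes the values are non-null.
    static List<Long> topK(Iterable<Long> values, int k) {
        List<Long> out = new ArrayList<>();
        if (k <= 0) {
            return out;
        }
        // Min-heap: the smallest of the values kept so far is at the head.
        PriorityQueue<Long> heap = new PriorityQueue<>();
        for (Long v : values) {
            if (heap.size() < k) {
                heap.add(v);
            } else if (v > heap.peek()) {
                heap.poll(); // evict the smallest kept value
                heap.add(v);
            }
        }
        out.addAll(heap);
        out.sort(Collections.reverseOrder()); // largest first
        return out;
    }

    // Reduce step: merge the partial Top K lists and prune once more.
    static List<Long> merge(List<List<Long>> partials, int k) {
        List<Long> all = new ArrayList<>();
        for (List<Long> p : partials) {
            all.addAll(p);
        }
        return topK(all, k);
    }
}
```

Because each partition emits at most K values, the amount of data leaving a
mapper is bounded by K regardless of the input size, and the step produces
several values rather than the single value of an ordinary aggregate.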