[ https://issues.apache.org/jira/browse/HADOOP-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630855#action_12630855 ]
jsensarma edited comment on HADOOP-4086 at 9/14/08 1:18 AM:
--------------------------------------------------------------------

some questions:
- The extra reducesink (in the limitmap -> reducesink -> linkreduce) - what will it reduce on?
- in many cases - the limit does not seem to need a reduce. for example - in the dumbest case - select * limit N - we just need to run the mappers and then keep concatenating mapper outputs until we have N rows.
- in the other case where the prior output is sorted/grouped - we need to have a top-N operator as the limit - one that merges the prior output and gets the top N.

based on the last 2 observations - i find it much easier to understand the limit operator implementation as:
- a simple select * like operator on a dataset (a table - whether it's an intermediate dataset or not)
- there are two cases:
  - if the table/data is sorted/grouped - then the limit operator needs to do a merge of all the table's files and produce the top N
  - if the table/data is not sorted/grouped - then the limit task needs to get any N rows - possibly by scanning one file at a time

the limit operator is sequential by definition. the limit operator can run in a single-mapper, map-only hadoop job in case it's writing to a file - or if it's writing to console (select * limit N) - it can just run from the client side. this is orthogonal to what it does.

was (Author: jsensarma):
some questions:
- The extra reducesink (in the limitmap -> reducesink -> linkreduce) - what will it reduce on?
- in many cases - the limit does not seem to need a reduce. for example - in the dumbest case - select * limit N - we just need to run the mappers and then keep concatenating mapper outputs until we have N rows.
- in the other case where the output is sorted/grouped - we need to have N from each mapper and then limit N in the reducer (standard top N operator)

based on the last 2 observations - i find it much easier to understand the limit operator implementation as:
- a simple select * like operator on a dataset (a table - whether it's an intermediate dataset or not)
- there are two cases:
  - if the table/data is sorted/grouped - then the limit operator needs to do a merge of all the table's files and produce the top N
  - if the table/data is not sorted/grouped - then the limit task needs to get any N rows - possibly by scanning one file at a time

the limit operator is sequential by definition. the limit task can run in a single-mapper, map-only hadoop job in case it's writing to a file - or if it's writing to console (select * limit N) - it can just run from the client side. this is orthogonal to what it does.

> Add limit to Hive QL
> --------------------
>
>                 Key: HADOOP-4086
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4086
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/hive
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>
> Add a limit feature to the Hive Query language.
> so you can do the following things:
> SELECT * FROM T LIMIT 10;
> and this would just return the 10 rows.
> No guarantees are made on which 10 rows are returned by the query.
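
The two cases from the comment (sorted/grouped input vs. unsorted input) can be illustrated with a small standalone sketch. This is not Hive code: the function names `limit_unsorted` and `limit_sorted` are hypothetical, each "file" is modelled as a plain iterable of rows, and the sorted case assumes every file is already sorted in ascending order on the same key.

```python
import heapq
from itertools import chain, islice
from typing import Iterable, List, Sequence


def limit_unsorted(files: Iterable[Iterable[str]], n: int) -> List[str]:
    """Return any n rows by scanning one file at a time.

    chain.from_iterable is lazy, so files after the one that
    supplies the n-th row are never read.
    """
    return list(islice(chain.from_iterable(files), n))


def limit_sorted(files: Sequence[Iterable[str]], n: int) -> List[str]:
    """Return the top n rows of a k-way merge.

    Assumes each file is already sorted ascending; heapq.merge
    then yields rows in global order and islice stops after n.
    """
    return list(islice(heapq.merge(*files), n))


# Hypothetical per-file contents, each sorted on its own.
part0 = ["a", "d", "g"]
part1 = ["b", "e"]
part2 = ["c", "f"]

any_four = limit_unsorted([part0, part1, part2], 4)  # rows in scan order
top_four = limit_sorted([part0, part1, part2], 4)    # rows in merged order
```

Both functions are sequential, matching the observation that the limit operator is sequential by definition. Whether they run inside a single-mapper job or on the client side does not change the logic.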