[jira] Issue Comment Edited: (HADOOP-4086) Add limit to Hive QL

Joydeep Sen Sarma (JIRA) Sun, 14 Sep 2008 01:20:08 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630855#action_12630855
 ]


jsensarma edited comment on HADOOP-4086 at 9/14/08 1:18 AM:
--------------------------------------------------------------------

some questions:

- The extra reducesink (in the limitmap -> reducesink -> linkreduce) - what 
will it reduce on?
- in many cases - the limit does not seem to need a reduce. for example - in 
the dumbest case - select * limit N - we just need to run the mappers and then 
keep concatenating mapper outputs until we have N rows.
- in the other case where the prior output is sorted/grouped - we need to have a 
top-N operator as the limit - one that merges the prior output and gets the top N.

based on last 2 observations - i find it much easier to understand the limit 
operator implementation as:
- a simple select * like operator on a dataset (a table - whether it's an 
intermediate dataset or not)
- there are two cases:
  - if the table/data is sorted/grouped - then the limit operator needs to do a 
merge of all the table's files and produce the top N
  - if the table/data is not sorted/grouped - then the limit task needs to get 
any N rows - possibly by scanning one file at a time
the limit operator is sequential by definition.
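
the two cases above can be sketched roughly as follows. this is only an 
illustration in python, not hive code: limit_rows and its arguments are 
hypothetical names, each "file" is modelled as an iterable of rows, and in the 
sorted case every file is assumed to be sorted ascending already.

```python
import heapq
from itertools import chain, islice
from typing import Iterable, Iterator, List


def limit_rows(files: List[Iterable], n: int, is_sorted: bool) -> List:
    """Return at most n rows from a dataset stored as several files.

    `files` is a hypothetical stand-in for the table's files: each element
    is an iterable of rows. When is_sorted is True, every file is assumed
    to be sorted ascending already.
    """
    if is_sorted:
        # Sorted/grouped case: k-way merge of all files, keep the first n.
        rows: Iterator = heapq.merge(*files)
    else:
        # Unsorted case: any n rows will do, so scan one file at a time.
        rows = chain.from_iterable(files)
    # islice stops pulling rows as soon as n have been produced.
    return list(islice(rows, n))
```

both branches are a single sequential pass, and no reduce step is involved. 
in the unsorted branch, if the files are lazy iterators, later files are never 
read once the first ones have supplied n rows.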

the limit operator can run as a single-mapper, map-only hadoop job when it's 
writing to a file - or, if it's writing to console (select * limit N), it can 
just run from the client side. where it runs is orthogonal to what it does.





      was (Author: jsensarma):
    some questions:

- The extra reducesink (in the limitmap -> reducesink -> linkreduce) - what 
will it reduce on?
- in many cases - the limit does not seem to need a reduce. for example - in 
the dumbest case - select * limit N - we just need to run the mappers and then 
keep concatenating mapper outputs until we have N rows.
- in the other case where the output is sorted/grouped - we need to have N from 
each mapper and then limit N in reducer (standard top N operator)

based on last 2 observations - i find it much easier to understand the limit 
operator implementation as:
- a simple select * like operator on a dataset (a table - whether it's an 
intermediate dataset or not)
- there are two cases:
  - if the table/data is sorted/grouped - then the limit operator needs to do a 
merge of all the table's files and produce top N
  - if the table/data is not sorted/grouped - then the limit task needs to get 
any N rows - possibly by scanning one file at a time
the limit operator is sequential by definition.

the limit task can run in a single mapper map-only hadoop job in case it's 
writing to a file - or if it's writing to console (select * limit N) - can just 
run from the client side. this is orthogonal to what it does.




  
> Add limit to Hive QL
> --------------------
>
>                 Key: HADOOP-4086
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4086
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/hive
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>
> Add a limit feature to the Hive Query language.
> so you can do the following things:
> SELECT * FROM T LIMIT 10;
> and this would just return the 10 rows.
> No guarantees are made on which 10 rows are returned by the query.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
