[jira] Commented: (HIVE-588) LIMIT n is slower than it needs to be

Adam Kramer (JIRA) Tue, 14 Jul 2009 21:27:42 -0700

    [ https://issues.apache.org/jira/browse/HIVE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731263#action_12731263 ]


Adam Kramer commented on HIVE-588:
----------------------------------

This is because SELECT * lets Hive output whole rows at a time, while 
specifying columns requires that each row be split and only certain fields 
returned, hence a map job. That's reasonable, but plain column selects that 
need no transform could be optimized as well.

At the very least, once any mapper has output 10 rows, Hive should return 
those 10 rows and kill the rest of the job.
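
A minimal sketch of the behavior being asked for, in Python rather than 
anything Hive-specific: the function is applied lazily, and the pipeline 
stops pulling rows as soon as the limit is reached. The names `limited_map`, 
`expensive`, and `table` are hypothetical and exist only for illustration; 
this is not how Hive's operators are implemented.

```python
from itertools import islice


def limited_map(rows, func, limit):
    """Apply func lazily and stop reading input once `limit` rows are emitted."""
    # The generator expression only calls func when islice asks for a row,
    # so nothing past the limit is ever evaluated.
    return list(islice((func(r) for r in rows), limit))


calls = 0


def expensive(a):
    """Stand-in for function(a); counts how many times it is invoked."""
    global calls
    calls += 1
    return a * 2


# Stand-in for table t: one million rows, never materialized in full.
table = range(1_000_000)

first_ten = limited_map(table, expensive, 10)
print(first_ten)
print(calls)
```

In this sketch the function runs 10 times rather than a million, which is 
the saving the issue is after.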

> LIMIT n is slower than it needs to be
> -------------------------------------
>
>                 Key: HIVE-588
>                 URL: https://issues.apache.org/jira/browse/HIVE-588
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Adam Kramer
>
> SELECT a FROM t LIMIT 10;
> ...simply prints the output of the first 10 lines of the first file in the 
> database. That's good.
> However,
> SELECT function(a) FROM t LIMIT 10;
> appears to send all of t to the mappers, run the function, and then 
> return the first 10 rows from whatever mapper(s) finish first. This is very 
> slow in some cases!
> Appropriate behavior for LIMIT would be to use ONE mapper, and to push files 
> from the table into that mapper, and then auto-kill the mapper once it has 
> output 10 rows...just take the first 10 rows and kill the whole task if 
> necessary. On dying, throw some informative error message like, "Dying 
> intentionally; LIMIT has been reached." This should be the case even for 
> TRANSFORMs in the mapper...the TRANSFORM could spit out 20 rows, but once it 
> has spit out 10, the whole task should die and the 10 should be returned 
> immediately.
> The purpose of LIMIT is not just to have "only one response," but it's also 
> to speed up queries a whole lot. Running the function over the entire table 
> is a big waste.
> Obviously, when a reduce step is necessary, the whole table will have to be 
> pushed through mappers and then copied and then sorted--but in those cases, 
> as soon as 10 total rows have been output by any reducer(s), all reduce 
> tasks should be killed.

