[
https://issues.apache.org/jira/browse/HIVE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Namit Jain resolved HIVE-588.
-----------------------------
Resolution: Duplicate
Duplicate of http://issues.apache.org/jira/browse/HIVE-908
> LIMIT n is slower than it needs to be
> -------------------------------------
>
> Key: HIVE-588
> URL: https://issues.apache.org/jira/browse/HIVE-588
> Project: Hadoop Hive
> Issue Type: Improvement
> Reporter: Adam Kramer
>
> SELECT a FROM t LIMIT 10;
> ...simply prints the output of the first 10 lines of the first file in the
> table. That's good.
> However,
> SELECT function(a) FROM t LIMIT 10;
> appears to send all of t to the mappers, run the function, and then
> return the first 10 rows from whatever mapper(s) finish first. This is very
> slow in some cases!
> Appropriate behavior for LIMIT would be to use ONE mapper, and to push files
> from the table into that mapper, and then auto-kill the mapper once it has
> output 10 rows...just take the first 10 rows and kill the whole task if
> necessary. On dying, throw some informative error message like, "Dying
> intentionally; LIMIT has been reached." This should be the case even for
> TRANSFORMs in the mapper...the TRANSFORM could spit out 20 rows, but once it
> has spit out 10, the whole task should die and the 10 should be returned
> immediately.
> The purpose of LIMIT is not just to have "only one response," but it's also
> to speed up queries a whole lot. Running the function over the entire table
> is a big waste.
> Obviously, when a reduce step is necessary, the whole table will have to be
> pushed through mappers and then copied and then sorted--but in those cases,
> as soon as 10 total rows have been output by any reducer(s), all reduce
> tasks should be killed.
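
The early-termination behavior requested above can be illustrated with a minimal sketch. This is Python, not Hive's operator code, and it is not a description of how HIVE-908 addresses the problem. The names run_limited, files, and transform are hypothetical: files stands in for the table's files, and transform stands in for a per-row function or TRANSFORM that may emit several output rows per input row.

```python
from itertools import islice


def run_limited(files, transform, limit):
    """Stream rows through transform and stop once `limit` outputs exist.

    files:     iterable of iterables of rows (hypothetical stand-in for
               the files of table t, fed to a single mapper in order)
    transform: callable taking one row and returning an iterable of
               output rows (a plain function or a TRANSFORM)
    limit:     the n in LIMIT n
    """
    def outputs():
        # Rows are pulled lazily, so input that lies beyond the point
        # where the limit is reached is never read or transformed.
        for rows in files:
            for row in rows:
                yield from transform(row)

    # islice stops requesting output after `limit` items, which plays
    # the role of killing the task once LIMIT has been reached.
    return list(islice(outputs(), limit))
```

With a transform that emits two rows per input row and a limit of 10, only the first five input rows are ever processed, regardless of how large the input is. The sketch covers only the map-only case; it does not model the reduce case, where the whole table still has to pass through the mappers.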