LIMIT n is slower than it needs to be
-------------------------------------
Key: HIVE-588
URL: https://issues.apache.org/jira/browse/HIVE-588
Project: Hadoop Hive
Issue Type: Improvement
Reporter: Adam Kramer
SELECT a FROM t LIMIT 10;
...simply prints the output of the first 10 lines of the first file in the
database. That's good.
However,
SELECT function(a) FROM t LIMIT 10;
appears to send all of t to the mappers, run the function on every row, and
then return the first 10 rows from whichever mapper(s) finish first. This is
very slow in some cases!
Appropriate behavior for LIMIT would be to use ONE mapper, and to push files
from the table into that mapper, and then auto-kill the mapper once it has
output 10 rows...just take the first 10 rows and kill the whole task if
necessary. On dying, throw some informative error message like, "Dying
intentionally; LIMIT has been reached." This should be the case even for
TRANSFORMs in the mapper...the TRANSFORM could spit out 20 rows, but once it
has spit out 10, the whole task should die and the 10 should be returned
immediately.
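The early-termination idea above can be sketched outside of Hive as a lazy
pipeline. This is only an illustration in Python, not Hive code: limited_map,
expensive, and the row source are hypothetical names. The point it shows is
that the function runs one row at a time and the input stops being read as
soon as the limit is reached.

```python
from itertools import islice

def limited_map(fn, rows, limit):
    """Apply fn lazily to rows and stop after `limit` outputs.

    Hypothetical helper for illustration only; it is not part of Hive.
    """
    # map() is lazy in Python 3, so fn runs only for rows that islice pulls.
    return list(islice(map(fn, rows), limit))

calls = 0

def expensive(a):
    """Stand-in for function(a); counts how often it is invoked."""
    global calls
    calls += 1
    return a * 2

# A "table" of one million rows; only the first 10 are ever touched.
result = limited_map(expensive, iter(range(1_000_000)), 10)
```

In this sketch the function is invoked exactly 10 times, no matter how large
the input is, which is the behavior being requested for the map-only case.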
The purpose of LIMIT is not just to have "only one response," but it's also to
speed up queries a whole lot. Running the function over the entire table is a
big waste.
Obviously, when a reduce step is necessary, the whole table will have to be
pushed through mappers and then copied and then sorted--but in those cases,
as soon as 10 total rows have been output by the reducer(s), all reduce tasks
should be killed.
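A rough sketch of the reduce-side case, again in Python and again only as an
illustration (shuffle, reduce_with_limit, and count_values are hypothetical
names, not Hive internals): every mapped row still has to be grouped and
sorted, but the reducer itself only needs to run until the limit is met.

```python
from collections import defaultdict
from itertools import islice

def shuffle(pairs):
    """Group (key, value) pairs by key, as the copy-and-sort step would."""
    groups = defaultdict(list)
    for key, value in pairs:  # every mapped row must be consumed here
        groups[key].append(value)
    return sorted(groups.items())

def reduce_with_limit(reducer, groups, limit):
    """Run reducer per key lazily, stopping after `limit` output rows."""
    return list(islice((reducer(k, vs) for k, vs in groups), limit))

reduced_keys = []

def count_values(key, values):
    """Hypothetical reducer: records the key it ran on, returns a count."""
    reduced_keys.append(key)
    return (key, len(values))

pairs = [(i % 100, i) for i in range(1000)]  # 100 distinct keys
top = reduce_with_limit(count_values, shuffle(pairs), 10)
```

Here all 1000 rows pass through the grouping step, but the reducer runs for
only 10 of the 100 keys before the output is cut off.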