[ https://issues.apache.org/jira/browse/HIVE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731263#action_12731263 ]
Adam Kramer commented on HIVE-588:
----------------------------------
This is because * allows you to output whole rows at a time, while specifying
columns requires that rows be split and then certain indices returned, hence a
map job. That's reasonable, but really, this could be optimized as well for
straight-up selects with no transform necessary.
But at least, when any mapper has printed 10 rows, Hive should print those 10
rows and kill the rest of the job.
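
To make the idea concrete, here is a rough sketch of the early-exit behavior I have in mind. It is plain Python, not Hive code: `scan`, `select_with_limit`, and the list-of-files table layout are all hypothetical names made up for illustration. The transform may emit several rows per input row, and reading stops as soon as the limit is reached.

```python
from itertools import chain, islice


def scan(files, rows_read):
    """Lazily yield rows file by file, counting each row actually read."""
    for rows in files:
        for row in rows:
            rows_read[0] += 1
            yield row


def select_with_limit(files, transform, limit):
    """Run `transform` over rows in a single lazy pass; stop after `limit` outputs.

    `transform` returns an iterable of output rows for one input row, so it
    can model both a plain function (one output) and a TRANSFORM (many).
    Returns the output rows and the number of input rows that were read.
    """
    rows_read = [0]
    outputs = chain.from_iterable(transform(row) for row in scan(files, rows_read))
    # islice stops pulling from the pipeline once `limit` rows have been taken,
    # which plays the role of "kill the task once LIMIT has been reached".
    result = list(islice(outputs, limit))
    return result, rows_read[0]


if __name__ == "__main__":
    # Hypothetical table made of two "files" of 1000 rows each.
    table = [range(0, 1000), range(1000, 2000)]
    rows, read = select_with_limit(table, lambda a: [a * 2], 10)
    print(rows, "input rows read:", read)
```

In this sketch only as many input rows are read as are needed to produce the requested output, rather than the whole table. A real implementation would still have to decide what to do with mappers that are already running.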
> LIMIT n is slower than it needs to be
> -------------------------------------
>
> Key: HIVE-588
> URL: https://issues.apache.org/jira/browse/HIVE-588
> Project: Hadoop Hive
> Issue Type: Improvement
> Reporter: Adam Kramer
>
> SELECT a FROM t LIMIT 10;
> ...simply prints the output of the first 10 lines of the first file in the
> database. That's good.
> However,
> SELECT function(a) FROM t LIMIT 10;
> appears to send all of t to the mappers, run the function, and then
> return the first 10 rows from whichever mapper(s) finish first. This is very
> slow in some cases!
> Appropriate behavior for LIMIT would be to use ONE mapper, and to push files
> from the table into that mapper, and then auto-kill the mapper once it has
> output 10 rows...just take the first 10 rows and kill the whole task if
> necessary. On dying, throw some informative error message like, "Dying
> intentionally; LIMIT has been reached." This should be the case even for
> TRANSFORMs in the mapper...the TRANSFORM could spit out 20 rows, but once it
> has spit out 10, the whole task should die and the 10 should be returned
> immediately.
> The purpose of LIMIT is not just to have "only one response," but it's also
> to speed up queries a whole lot. Running the function over the entire table
> is a big waste.
> Obviously, when a reduce step is necessary, the whole table will have to be
> pushed through mappers and then copied and then sorted--but in those cases,
> once 10 total rows have been output by the reducer(s), all reduce tasks
> should be killed.