[jira] Commented: (HIVE-588) LIMIT n is slower than it needs to be

He Yongqiang (JIRA) Tue, 14 Jul 2009 22:25:39 -0700

    [ https://issues.apache.org/jira/browse/HIVE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731277#action_12731277 ]


He Yongqiang commented on HIVE-588:
-----------------------------------

>>when any mapper has printed 10 rows, Hive should print those 10 rows and kill 
>>the rest of the job
This is not easy to implement with current Hadoop map-reduce. You cannot kill a 
whole job from inside a mapper, and Hive does not know when enough rows have 
been collected. 
Even though Hive uses a done flag to mark that the mapper's work is finished, 
the split scan for that mapper is not stopped. New kv pairs are still fed to 
the mapper and are simply discarded once the mapper's done flag is set. 

I think a workaround would be either:
1) file an issue in mapreduce to add the ability for a mapper, and thus the 
maprunner, to terminate itself when some done flag is set
or
2) extend a maprunner in Hive to do the same, because an issue in map-reduce 
may take a long time to review, commit, release, etc.
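
To make the difference concrete, here is a minimal simulation of the two 
runner loops. It is not Hadoop or Hive code: LimitMapper, run_full_scan and 
run_until_done are hypothetical names, the split is a plain list, and 
upper() stands in for function(a). It only shows how many records each loop 
reads from the split before it returns.

```python
from typing import Iterable, List, Tuple

Record = Tuple[int, str]


class LimitMapper:
    """Hypothetical mapper that sets a done flag once `limit` rows are emitted."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.done = False
        self.output: List[str] = []

    def map(self, key: int, value: str) -> None:
        if self.done:
            return  # the record was read from the split but is discarded
        self.output.append(value.upper())  # stand-in for function(a)
        if len(self.output) >= self.limit:
            self.done = True


def run_full_scan(split: Iterable[Record], mapper: LimitMapper) -> int:
    """Runner that ignores the done flag: every record in the split is read."""
    records_read = 0
    for key, value in split:
        records_read += 1
        mapper.map(key, value)
    return records_read


def run_until_done(split: Iterable[Record], mapper: LimitMapper) -> int:
    """Runner that stops reading the split as soon as the mapper is done."""
    records_read = 0
    for key, value in split:
        records_read += 1
        mapper.map(key, value)
        if mapper.done:
            break
    return records_read


if __name__ == "__main__":
    split = [(i, f"row{i}") for i in range(1000)]
    full, early = LimitMapper(10), LimitMapper(10)
    print("full scan read:", run_full_scan(split, full))
    print("early stop read:", run_until_done(split, early))
```

Both runners produce the same 10 output rows; only the second one stops 
touching the split after the limit is reached.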

Even if the ability to terminate a mapper is added, we cannot share a 
global flag among all mappers. To compensate for that, I think one potential 
method would be to reduce the number of mappers so that each mapper is fed 
a larger split (we will only need to touch a small piece of the whole split).
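
A rough back-of-the-envelope sketch of why fewer mappers help when there is 
no shared flag. rows_scanned is a hypothetical helper, and it assumes that 
rows are divided evenly across mappers, that every input row yields one 
output row, and that every mapper runs and stops on its own after `limit` 
rows.

```python
def rows_scanned(total_rows: int, num_mappers: int, limit: int) -> int:
    """Rows read across all mappers when each stops independently at `limit`."""
    base, extra = divmod(total_rows, num_mappers)
    scanned = 0
    for i in range(num_mappers):
        # The first `extra` mappers get one additional row.
        split_size = base + (1 if i < extra else 0)
        # A mapper never reads more than its split, nor more than `limit` rows.
        scanned += min(split_size, limit)
    return scanned


if __name__ == "__main__":
    for mappers in (100, 4, 1):
        print(mappers, "mappers ->", rows_scanned(1_000_000, mappers, 10), "rows read")
```

Under these assumptions the rows read grow with the number of mappers 
(num_mappers * limit), so a single mapper over one large split reads the 
least.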

> LIMIT n is slower than it needs to be
> -------------------------------------
>
>                 Key: HIVE-588
>                 URL: https://issues.apache.org/jira/browse/HIVE-588
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Adam Kramer
>
> SELECT a FROM t LIMIT 10;
> ...simply prints the output of the first 10 lines of the first file in the 
> database. That's good.
> However,
> SELECT function(a) FROM t LIMIT 10;
> appears to send all of t to the mappers, run the function, and then 
> returns the first 10 rows from whatever mapper(s) finish first. This is very 
> slow in some cases!
> Appropriate behavior for LIMIT would be to use ONE mapper, and to push files 
> from the table into that mapper, and then auto-kill the mapper once it has 
> output 10 rows...just take the first 10 rows and kill the whole task if 
> necessary. On dying, throw some informative error message like, "Dying 
> intentionally; LIMIT has been reached." This should be the case even for 
> TRANSFORMs in the mapper...the TRANSFORM could spit out 20 rows, but once it 
> has spit out 10, the whole task should die and the 10 should be returned 
> immediately.
> The purpose of LIMIT is not just to have "only one response," but it's also 
> to speed up queries a whole lot. Running the function over the entire table 
> is a big waste.
> Obviously, when a reduce step is necessary, the whole table will have to be 
> pushed through mappers and then copied and then sorted--but in those cases, 
> once 10 total rows have been output by any reducer(s), all reduce tasks 
> should be killed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
