[
https://issues.apache.org/jira/browse/HIVE-404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699863#action_12699863
]
Zheng Shao commented on HIVE-404:
---------------------------------
I think users would expect the results of LIMIT to follow the total sort order
- if a user says "SORT BY key LIMIT 10", they probably want the global top 10,
no matter how many reducers we have.
I think the second map-reduce job is necessary for "SORT BY/CLUSTER BY", but
that job also needs the right sort columns across its map-reduce boundary so
that its single reducer can return the global top rows.
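
To make the difference concrete, here is a minimal Python simulation of the two
jobs. It is only a sketch, not Hive code: the function names, the random
partitioning, and the shuffle that stands in for "no sort columns" are all
assumptions made for illustration.

```python
import heapq
import random


def per_reducer_top_n(rows, num_reducers, n, seed=0):
    """First job (simulated): spread rows over reducers; each reducer
    sorts its share and keeps only its first n rows."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    return [sorted(p)[:n] for p in partitions]


def second_job_unsorted(partials, n, seed=0):
    """Second job without the sort columns (simulated): the single
    reducer sees the rows in arbitrary order and keeps the first n."""
    rng = random.Random(seed)
    merged = [row for p in partials for row in p]
    rng.shuffle(merged)  # stands in for an arbitrary arrival order
    return merged[:n]


def second_job_sorted(partials, n):
    """Second job with the SORT BY columns propagated (simulated): the
    single reducer receives rows in key order and keeps the first n."""
    return list(heapq.merge(*partials))[:n]


rows = list(range(1000))
random.Random(1).shuffle(rows)
partials = per_reducer_top_n(rows, num_reducers=4, n=10)

arbitrary = second_job_unsorted(partials, 10)  # some 10 of the 40 candidates
global_top = second_job_sorted(partials, 10)   # the 10 smallest keys overall
```

The sorted variant works because any row in the global top n is necessarily in
the top n of whichever reducer received it, so the per-reducer LIMIT never
discards a row that the final answer needs.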
> Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"
> ----------------------------------------------------
>
> Key: HIVE-404
> URL: https://issues.apache.org/jira/browse/HIVE-404
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.3.0, 0.4.0
> Reporter: Zheng Shao
> Assignee: Namit Jain
> Attachments: hive.404.1.patch, hive.404.2.patch
>
>
> Unless the user specifies "set mapred.reduce.tasks=1;", they will see
> unexpected results from the query "SELECT * FROM t SORT BY col1 LIMIT 100".
> Basically, in the first map-reduce job, each reducer gets sorted data and
> keeps only the first 100 rows. In the second map-reduce job, the data is
> distributed and sorted randomly before being fed into a single reducer that
> outputs the first 100 rows.
> In short, the query outputs 100 random records out of the N * 100 top records
> collected from the N reducers of the first map-reduce job.
> This contradicts what people expect.
> We should propagate the SORT BY columns to the second map-reduce job.