[ https://issues.apache.org/jira/browse/HIVE-404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699473#action_12699473 ]

Zheng Shao commented on HIVE-404:
---------------------------------

1. I think the condition on "distributedBy" is not needed: clusterBy is just 
distributeBy plus sortBy, and distributeBy by itself does not enforce any order.
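
For reference, a minimal HiveQL sketch of that equivalence (assuming a table t 
with a column col1, as in the query below):

    -- CLUSTER BY col1 is shorthand for DISTRIBUTE BY col1 plus SORT BY col1.
    SELECT * FROM t CLUSTER BY col1;

    -- DISTRIBUTE BY alone only decides which reducer a row goes to;
    -- adding SORT BY is what makes each reducer's output ordered.
    SELECT * FROM t DISTRIBUTE BY col1 SORT BY col1;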

2. We need to upgrade the FetchTask to be able to merge multiple sorted streams. 
This may not be good because there might be thousands of files that need to be 
opened by a single client. It also does NOT solve the problem when the result 
is inserted into a table.

An alternative to 2 is to propagate the sort order to the second map-reduce 
job. I think that will solve the problem.
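
To make the intent concrete, a sketch using the query from the description (the 
single-reducer setting is the workaround mentioned there; these are not from the 
patch itself):

    -- Current workaround: force a single reducer so that SORT BY gives a
    -- total order before LIMIT is applied.
    set mapred.reduce.tasks=1;
    SELECT * FROM t SORT BY col1 LIMIT 100;

    -- With the sort order propagated to the second map-reduce job, the plain
    -- query should return the true first 100 rows by col1 even with multiple
    -- reducers, because the second job also sorts by col1 before the LIMIT.
    SELECT * FROM t SORT BY col1 LIMIT 100;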



> Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"
> ----------------------------------------------------
>
>                 Key: HIVE-404
>                 URL: https://issues.apache.org/jira/browse/HIVE-404
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Zheng Shao
>            Assignee: Namit Jain
>         Attachments: hive.404.1.patch
>
>
> Unless the user specifies "set mapred.reduce.tasks=1;", they will see unexpected 
> results for the query "SELECT * FROM t SORT BY col1 LIMIT 100".
> Basically, in the first map-reduce job, each reducer will get sorted data and 
> only keep the first 100. In the second map-reduce job, we will distribute and 
> sort the data randomly, before feeding it into a single reducer that outputs the 
> first 100.
> In short, the query will output 100 random records out of the N * 100 top 
> records (100 from each of the N reducers in the first map-reduce job).
> This contradicts what people expect.
> We should propagate the SORT BY columns to the second map-reduce job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
