Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"
----------------------------------------------------

                 Key: HIVE-404
                 URL: https://issues.apache.org/jira/browse/HIVE-404
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.3.0, 0.4.0
            Reporter: Zheng Shao


Unless the user specifies "set mapred.reduce.tasks=1;", the query 
"SELECT * FROM t SORT BY col1 LIMIT 100" returns unexpected results.

Basically, in the first map-reduce job, each reducer gets sorted data and 
keeps only the first 100 rows. The second map-reduce job then distributes and 
sorts that data randomly (not by col1) before feeding it into a single reducer 
that outputs the first 100 rows.

In short, with N reducers in the first map-reduce job, the query outputs 100 
random records out of the N * 100 top records (100 from each reducer).
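
The behaviour described above can be reproduced with a small simulation. This 
is only a sketch in Python, not Hive code: the function names (first_job, 
second_job_unsorted), the random assignment of rows to reducers, and the 
choice of 4 reducers are all assumptions made for illustration.

```python
import random


def first_job(rows, num_reducers, limit, rng):
    """Job 1 (assumed model): rows are spread over the reducers at random;
    each reducer sorts its share and keeps only the first `limit` rows."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    return [sorted(part)[:limit] for part in partitions]


def second_job_unsorted(partials, limit, rng):
    """Job 2 as described in this issue: the single reducer receives the
    N * limit candidate rows in arbitrary order and keeps the first
    `limit` of them without sorting by the SORT BY column."""
    candidates = [row for part in partials for row in part]
    rng.shuffle(candidates)  # stands in for the arbitrary arrival order
    return candidates[:limit]


rng = random.Random(0)        # fixed seed so the run is repeatable
rows = list(range(1000))      # stand-in for the values of col1
partials = first_job(rows, num_reducers=4, limit=100, rng=rng)
buggy = second_job_unsorted(partials, limit=100, rng=rng)
expected = sorted(rows)[:100]

# The true top 100 always survives job 1, but job 2 picks an arbitrary
# subset of the candidates instead of the smallest 100.
candidates = {row for part in partials for row in part}
print("true top 100 present among candidates:", set(expected) <= candidates)
print("job 2 returned the true top 100:", buggy == expected)
```

Note that the correct rows are never lost in the first job; they are only 
mis-selected by the second one.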

This contradicts what people expect.


We should propagate the SORT BY columns to the second map-reduce job.
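
A rough sketch of the effect of that change, again as a Python simulation 
rather than Hive code: once the single reducer of the second job orders its 
input by the SORT BY column before applying the limit, the result matches the 
global top 100. The names first_job, second_job_sorted and col1, and the use 
of heapq.merge to combine the per-reducer outputs, are assumptions chosen for 
illustration and say nothing about how Hive would implement it.

```python
import heapq
import itertools
import random


def col1(row):
    """Sort key: the first field of a row stands in for col1."""
    return row[0]


def first_job(rows, num_reducers, limit, key, rng):
    """Job 1 (assumed model): rows are spread over the reducers at random;
    each reducer sorts its share by `key` and keeps the first `limit`."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    return [sorted(part, key=key)[:limit] for part in partitions]


def second_job_sorted(partials, limit, key):
    """Job 2 with the sort key propagated: merge the already-sorted
    per-reducer outputs by `key`, then keep the first `limit` rows."""
    merged = heapq.merge(*partials, key=key)
    return list(itertools.islice(merged, limit))


rng = random.Random(0)  # fixed seed so the run is repeatable
rows = [(v, "payload-%d" % v) for v in rng.sample(range(10000), 1000)]
partials = first_job(rows, num_reducers=4, limit=100, key=col1, rng=rng)
fixed = second_job_sorted(partials, limit=100, key=col1)

# Compare on the sort column only, so ties (if any) cannot matter.
expected_keys = sorted(col1(r) for r in rows)[:100]
print("matches global top 100:", [col1(r) for r in fixed] == expected_keys)
```

Because each reducer's output from the first job is already sorted, a merge 
is enough here; a full sort of the N * 100 candidates would give the same 
rows.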

