[ https://issues.apache.org/jira/browse/HIVE-404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-404:
--------------------------------

        Fix Version/s: 0.3.0
                           (was: 0.4.0)
    Affects Version/s:     (was: 0.3.0)
                           (was: 0.4.0)

> Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"
> ----------------------------------------------------
>
>                 Key: HIVE-404
>                 URL: https://issues.apache.org/jira/browse/HIVE-404
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Zheng Shao
>            Assignee: Namit Jain
>             Fix For: 0.3.0
>
>         Attachments: hive.404.1.patch, hive.404.2.patch
>
>
> Unless the user specifies "set mapred.reduce.tasks=1;", they will see
> unexpected results from the query "SELECT * FROM t SORT BY col1 LIMIT 100".
> In the first map-reduce job, each reducer gets sorted data and keeps only
> the first 100 rows. The second map-reduce job distributes and sorts that
> data randomly, rather than by col1, before feeding it into a single reducer
> that outputs the first 100 rows.
> In short, with N reducers in the first map-reduce job, the query outputs
> 100 random records out of the N * 100 top records kept by those reducers.
> This contradicts what people expect.
> We should propagate the SORT BY columns to the second map-reduce job.
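
The behaviour described in the issue can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Hive's actual code: the function names (first_job, second_job_random, second_job_sorted), the random assignment of rows to reducers, and the use of plain integers as col1 values are all hypothetical and chosen for illustration.

```python
import random


def first_job(rows, num_reducers, limit, rng):
    """Model of the first map-reduce job: each reducer sorts its own
    share of the rows and keeps only the first `limit` of them.
    Assumption: rows are assigned to reducers at random."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    survivors = []
    for part in partitions:
        survivors.extend(sorted(part)[:limit])
    return survivors  # at most num_reducers * limit rows


def second_job_random(survivors, limit, rng):
    """Model of the reported behaviour: the single reducer sees the
    surviving rows in random order and outputs the first `limit`."""
    shuffled = list(survivors)
    rng.shuffle(shuffled)
    return shuffled[:limit]


def second_job_sorted(survivors, limit):
    """Model of the proposed fix: the SORT BY column is carried into
    the second job, so the single reducer sees rows sorted by it."""
    return sorted(survivors)[:limit]


if __name__ == "__main__":
    rng = random.Random(0)
    rows = list(range(10000))  # col1 values only, for simplicity
    survivors = first_job(rows, num_reducers=4, limit=100, rng=rng)
    print("random second job:", sorted(second_job_random(survivors, 100, rng))[:10])
    print("sorted second job:", second_job_sorted(survivors, 100)[:10])
```

In this model every one of the 100 smallest rows survives the first job, because a row among the global top 100 is necessarily among the top 100 of whichever reducer receives it. Sorting in the second job is therefore enough to recover the expected result, whereas the random variant returns an arbitrary 100 of the surviving rows.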