[ https://issues.apache.org/jira/browse/HADOOP-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630524#action_12630524 ]
lengwuqing commented on HADOOP-3601:
------------------------------------

TITLE: I think the group-by approach has an issue

Hi, guys:

I created a table like this:

CREATE TABLE course(id INT, name STRING, course STRING, score INT, notes STRING);

And I wanted to try a group-by like this:

INSERT OVERWRITE TABLE test00 SELECT t1.name, count(DISTINCT t1.name) FROM course t1 GROUP BY t1.name;

I found that Hive NOT ONLY fails to compute a correct result, BUT ALSO takes a very long time.

I inserted some diagnostic code into ExecReducer.java and found that all of the records were processed by ONLY one or two reducers.

I noticed some comments in the Hive code saying that group-by is hash-based, and I can confirm that the values in the 'name' column are all different.

I developed a system named TING (www.sadbit.com), which uses quite different methods to implement a parallel/distributed database. Even though it is not mature yet, I estimate that in my hardware environment it would process that dataset in under 2 minutes, whereas Hive takes more than 9 minutes.

The biggest issue is that all data is processed on 1-2 nodes during the reduce phase, even though the number of reducers is 24.

Can anyone comment on why?

> Hive as a contrib project
> -------------------------
>
>                 Key: HADOOP-3601
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3601
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: contrib/hive
>   Affects Versions: 0.19.0
>         Environment: N/A
>            Reporter: Joydeep Sen Sarma
>            Assignee: Ashish Thusoo
>            Priority: Minor
>             Fix For: 0.19.0
>
>         Attachments: ant.log, hive.tgz, hive.tgz, hive.tgz, HiveTutorial.pdf
>
>   Original Estimate: 1080h
>  Remaining Estimate: 1080h
>
> Hive is a data warehouse built on top of flat files (stored primarily in
> HDFS).
> It includes:
> - Data Organization into Tables with logical and hash partitioning
> - A Metastore to store metadata about Tables/Partitions etc
> - A SQL like query language over object data stored in Tables
> - DDL commands to define and load external data into tables
> Hive's query language is executed using Hadoop map-reduce as the execution
> engine. Queries can use either single stage or multi-stage map-reduce. Hive
> has a native format for tables - but can handle any data set (for example
> json/thrift/xml) using an IO library framework.
> Hive uses Antlr for query parsing, Apache JEXL for expression evaluation and
> may use Apache Derby as an embedded database for MetaStore. Antlr has a BSD
> license and should be compatible with Apache license.
> We are currently thinking of contributing to the 0.17 branch as a contrib
> project (since that is the version under which it will get tested internally)
> - but looking for advice on the best release path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
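
For readers following the reducer-skew question in the comment, here is a minimal sketch of hash partitioning, not Hive's actual code. It assumes the rule used by Hadoop's default HashPartitioner, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and it assumes the key hash behaves like Java's String.hashCode(). The function names and the sample keys ("name0", "name1", ...) are hypothetical. Under those assumptions, distinct keys spread over many reducers, while a partitioning key with a single value sends every record to one reducer. The message does not establish which key Hive actually partitioned on for this query.

```python
from collections import Counter


def java_string_hash(s):
    """Return Java's String.hashCode() as an unsigned 32-bit value (ASCII input assumed)."""
    h = 0
    for ch in s:
        # Java computes h = 31*h + c with 32-bit overflow; emulate the wraparound.
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def reducer_for(key, num_reducers):
    """Hash partitioning: clear the sign bit, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def reducer_load(keys, num_reducers):
    """Count how many records each reducer index would receive."""
    return Counter(reducer_for(k, num_reducers) for k in keys)


if __name__ == "__main__":
    distinct = ["name%d" % i for i in range(10000)]  # hypothetical distinct names
    constant = ["same"] * 10000  # every record shares one partitioning key
    print("distinct keys ->", len(reducer_load(distinct, 24)), "reducers used")
    print("constant key  ->", len(reducer_load(constant, 24)), "reducers used")
```

Comparing the per-reducer counts from a sketch like this with the counts observed in the diagnostic output is one way to check whether the skew comes from the partitioning key or from somewhere else in the plan.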