[ 
https://issues.apache.org/jira/browse/HADOOP-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630524#action_12630524
 ] 

lengwuqing commented on HADOOP-3601:
------------------------------------

TITLE:   I think the group-by approach has an issue

Hi, guys:
    I created a table like this:
    CREATE TABLE course(id INT, name STRING, course STRING, score INT, notes STRING);
    Then I tried a group-by like this:
    INSERT OVERWRITE TABLE test00 SELECT t1.name, count(DISTINCT t1.name) FROM course t1 GROUP BY t1.name;

    I found that Hive NOT ONLY fails to compute a correct result, BUT ALSO 
takes a very long time.  I inserted some diagnostic code into 
ExecReducer.java and found that all of the records were processed in ONLY one 
or two reducers.  I noticed some comments in Hive saying that the group-by is 
based on hashing, and I have verified that the values in the 'name' column are 
all different.

    I developed a system named TING (www.sadbit.com), which uses quite 
different methods to implement a parallel/distributed database. It is not 
mature yet, but I estimate that in my hardware environment it should process 
that dataset in less than 2 minutes, whereas Hive takes more than 9 
minutes.  
    My biggest issue is: all data is processed on 1-2 nodes during the reduce 
phase, even though the number of reducers is 24.
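
    To make the expectation concrete, here is a minimal sketch of hash 
partitioning over 24 reducers. It assumes a partitioner of the form 
(hash & Integer.MAX_VALUE) % numReducers applied to a Java-style string hash; 
whether Hive actually partitioned this query on the name column is not known 
from this message. The key values and helper names are hypothetical.

```python
from collections import Counter

NUM_REDUCERS = 24  # reducer count mentioned in the comment above


def java_string_hash(s: str) -> int:
    """Mimic Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Assumed scheme: clear the sign bit, then take the modulus."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def reducer_load(keys, num_reducers: int = NUM_REDUCERS) -> Counter:
    """Count how many records each reducer index would receive."""
    return Counter(partition(k, num_reducers) for k in keys)


# Hypothetical data: many distinct names vs. one repeated partition key.
distinct_names = [f"student{i}" for i in range(10000)]
load = reducer_load(distinct_names)
constant_load = reducer_load(["same"] * 10000)

print("reducers used with distinct keys:", len(load))
print("reducers used with a constant key:", len(constant_load))
```

    Under this assumption, distinct names spread over many reducers, while a 
partition key that is constant (or nearly so) sends every record to a single 
reducer. That second pattern resembles the 1-2 busy reducers described above, 
though it is only one possible explanation.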

    Can anyone comment on why this happens?


> Hive as a contrib project
> -------------------------
>
>                 Key: HADOOP-3601
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3601
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: contrib/hive
>    Affects Versions: 0.19.0
>         Environment: N/A
>            Reporter: Joydeep Sen Sarma
>            Assignee: Ashish Thusoo
>            Priority: Minor
>             Fix For: 0.19.0
>
>         Attachments: ant.log, hive.tgz, hive.tgz, hive.tgz, HiveTutorial.pdf
>
>   Original Estimate: 1080h
>  Remaining Estimate: 1080h
>
> Hive is a data warehouse built on top of flat files (stored primarily in 
> HDFS). It includes:
> - Data Organization into Tables with logical and hash partitioning
> - A Metastore to store metadata about Tables/Partitions etc
> - A SQL-like query language over object data stored in Tables
> - DDL commands to define and load external data into tables
> Hive's query language is executed using Hadoop map-reduce as the execution 
> engine. Queries can use either single stage or multi-stage map-reduce. Hive 
> has a native format for tables - but can handle any data set (for example 
> json/thrift/xml) using an IO library framework.
> Hive uses Antlr for query parsing, Apache JEXL for expression evaluation and 
> may use Apache Derby as an embedded database for MetaStore. Antlr has a BSD 
> license and should be compatible with Apache license.
> We are currently thinking of contributing to the 0.17 branch as a contrib 
> project (since that is the version under which it will get tested internally) 
> - but looking for advice on the best release path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
