[jira] [Updated] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

wangmeng (JIRA) Thu, 26 Jun 2014 02:03:36 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


wangmeng updated HIVE-7296:
---------------------------

    Description: 
For big data analysis, we often need to run the following kinds of queries
and statistics:

1. Cardinality Estimation: count the number of distinct elements in a
collection, such as Unique Visitors (UV).

Currently we can use this Hive query:
select count(distinct id) from TestTable;

2. Frequency Estimation: estimate how many times an element is repeated, such
as the number of site visits by a user.

Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy Hitters (top-k elements): such as the top 100 shops.

Hive query: select count(1), name from TestTable group by name;
(this needs a UDF)

4. Range Query: for example, find the number of users aged between 20 and 30.

Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership Query: for example, is this user name already registered?
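
Item 5 is listed without a query. One well-known structure for approximate
membership tests is a Bloom filter: it can wrongly report that a name is
present, but it never reports an added name as missing. The Python sketch
below is only an illustration under that assumption. The class name, method
names and sizes are made up here; they are not part of Hive or of the
proposal.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit array probed at k positions per item."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not added"; True means "probably added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


registered = BloomFilter(num_bits=16384, num_hashes=5)  # hypothetical sizes
registered.add("wangmeng")
print(registered.might_contain("wangmeng"))  # True: added names are never missed
```

The memory used is fixed by num_bits regardless of how long the names are,
which is the trade made in exchange for a small false-positive rate.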

Because of how Hive executes these queries, they consume too much memory and
take a long time to run.

However, in many cases we do not need exact results and a small error can be
tolerated. In such cases, approximate processing can greatly improve time and
space efficiency.
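
To make the trade-off concrete for item 1 (cardinality), here is a minimal
Python sketch of a K-Minimum-Values estimator. The technique is chosen here
for illustration and is not named in the description; the function name and
the default k are hypothetical. It keeps only the k smallest hash values
seen, so memory stays bounded no matter how many rows are scanned, and the
answer is an estimate rather than an exact count.

```python
import hashlib
import heapq


def estimate_distinct(items, k=256):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []        # max-heap (negated values) holding the k smallest hashes
    members = set()  # hash values currently in the heap, to skip repeats
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest one kept: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: count is exact
    kth_smallest = -heap[0] / 2**64  # scale the k-th smallest hash into (0, 1)
    return (k - 1) / kth_smallest


visits = [f"user{i % 5000}" for i in range(20000)]
print(estimate_distinct(visits))  # an estimate near 5000, not an exact count
```

A larger k gives a tighter estimate at the cost of more memory; the relative
error shrinks roughly in proportion to 1/sqrt(k).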

Based on some theoretical analysis material, I would very much like to work
on these new features.

I am familiar with Hive and Hadoop, and I have implemented an efficient
storage format based on Hive
(https://github.com/sjtufighter/----Data---Storage--).

So, is there anything I can do? Many thanks.



> big data approximate processing  at a very  low cost  based on hive sql 
> ------------------------------------------------------------------------
>
>                 Key: HIVE-7296
>                 URL: https://issues.apache.org/jira/browse/HIVE-7296
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: wangmeng



--
This message was sent by Atlassian JIRA
(v6.2#6252)
