[
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
wangmeng updated HIVE-7296:
---------------------------
Description:
For big data analysis, we often need to run the following kinds of queries
and statistics:
1. Cardinality Estimation: count the number of distinct elements in a
collection, such as unique visitors (UV).
Hive query: select count(distinct id) from TestTable;
2. Frequency Estimation: estimate how many times an element is repeated, such
as the number of site visits by one user.
Hive query: select count(1) from TestTable where name = 'wangmeng';
3. Heavy Hitters (top-k elements): such as the top 100 shops.
Hive query: select count(1), name from TestTable group by name; (needs a UDF ...)
4. Range Query: for example, find the number of users aged between 20 and 30.
Hive query: select count(1) from TestTable where age > 20 and age < 30;
5. Membership Query: for example, is this user name already registered?
Because of the way Hive executes these exact queries, they use too much
memory and take a long time.
However, in many cases we do not need exact results, and a small error can be
tolerated. In such cases, we can use approximate processing to greatly
improve time and space efficiency.
Now, based on some theoretical analysis material, I would very much like to
work on these new features.
I am familiar with Hive and Hadoop, and I have implemented an efficient
storage format based on Hive
(https://github.com/sjtufighter/----Data---Storage--).
So, is there anything I can do? Many thanks.
> big data approximate processing at a very low cost based on hive sql
> ------------------------------------------------------------------------
>
> Key: HIVE-7296
> URL: https://issues.apache.org/jira/browse/HIVE-7296
> Project: Hive
> Issue Type: New Feature
> Reporter: wangmeng
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)