[ https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
wangmeng updated HIVE-7296:
---------------------------
    Description:
For big data analysis, we often need the following queries and statistics:
1. Cardinality estimation: count the number of distinct elements in a collection, such as unique visitors (UV).
   Currently we can use the Hive query: SELECT COUNT(DISTINCT id) FROM TestTable;
2. Frequency estimation: estimate how many times an element is repeated, such as the number of site visits by a user.
   Hive query: SELECT COUNT(1) FROM TestTable WHERE name = 'wangmeng';
3. Heavy hitters (top-k elements): such as the top 100 shops.
   Hive query: SELECT COUNT(1), name FROM TestTable GROUP BY name; need UDF……
4. Range query: for example, find the number of users aged between 20 and 30.
   Hive query: SELECT COUNT(1) FROM TestTable WHERE age > 20 AND age < 30;
5. Membership query: for example, is a given user name already registered?

Given how Hive executes these queries, they consume a large amount of memory and take a long time to run. However, in many cases we do not need exact results and a small error can be tolerated. In such cases, approximate processing can greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work on these new features. I am familiar with Hive and Hadoop, and I have implemented an efficient storage format based on Hive (https://github.com/sjtufighter/----Data---Storage--).
So, is there anything I can do? Many thanks.
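
The description does not name specific algorithms, so the following is only an illustrative sketch of what "approximate processing" could look like for two of the listed cases. It assumes a Bloom filter for the membership query and a Count-Min sketch for frequency estimation; both are stand-in techniques chosen for this example, not something the issue proposes, and the class names, sizes, and hashing scheme are hypothetical. It is written in plain Python rather than as a Hive UDF.

```python
import hashlib


def _hashes(item: str, count: int, width: int) -> list:
    """Derive `count` bucket indexes in [0, width) from salted SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big") % width
        for seed in range(count)
    ]


class BloomFilter:
    """Approximate membership: false positives are possible, false negatives are not."""

    def __init__(self, bits: int = 1 << 16, hashes: int = 4) -> None:
        self.bits = bits
        self.hashes = hashes
        # One byte per bit position, chosen for readability over compactness.
        self.array = bytearray(bits)

    def add(self, item: str) -> None:
        for i in _hashes(item, self.hashes, self.bits):
            self.array[i] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.array[i] for i in _hashes(item, self.hashes, self.bits))


class CountMinSketch:
    """Approximate frequency: estimates never undercount, but may overcount."""

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item: str, count: int = 1) -> None:
        for row, col in enumerate(_hashes(item, self.depth, self.width)):
            self.rows[row][col] += count

    def estimate(self, item: str) -> int:
        # The smallest counter across rows is the one least inflated by collisions.
        return min(
            self.rows[row][col]
            for row, col in enumerate(_hashes(item, self.depth, self.width))
        )


# Hypothetical sample data standing in for the `name` column of TestTable.
names = ["wangmeng"] * 5 + ["alice"] * 2

registered = BloomFilter()
visits = CountMinSketch()
for name in names:
    registered.add(name)
    visits.add(name)
```

Both structures use a fixed amount of memory regardless of how many rows are scanned, which is the trade-off described above: `registered.might_contain(name)` may occasionally report an unregistered name as present, and `visits.estimate(name)` may return a count somewhat higher than the true one, but neither needs to hold every distinct value in memory.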
> big data approximate processing at a very low cost based on hive sql
> ---------------------------------------------------------------------
>
>                 Key: HIVE-7296
>                 URL: https://issues.apache.org/jira/browse/HIVE-7296
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: wangmeng