[ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050265#comment-14050265
 ] 

Carter Shanklin commented on HIVE-7296:
---------------------------------------

[~sjtufighter]

I have spoken to some Hive users who implemented their own UDF to compute 
approximate counts and ranks using lossy counting:
http://www.vldb.org/conf/2002/S10P03.pdf

They had tried some other approaches but settled on this because it allows 
tunable error and deals with skew fairly well.
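
For anyone unfamiliar with the technique, here is a rough sketch of lossy 
counting as described in the paper linked above. It is written in plain Python 
rather than as a Hive UDF, and the class and method names (LossyCounter, add, 
estimate, frequent) are made up for illustration; this is not the code those 
users wrote. The epsilon parameter is the tunable error: a reported count 
undercounts the true count by at most epsilon * N, where N is the number of 
items seen so far.

```python
import math


class LossyCounter:
    """Illustrative lossy counting sketch (names are hypothetical)."""

    def __init__(self, epsilon):
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be between 0 and 1")
        self.epsilon = epsilon
        # Bucket width w = ceil(1 / epsilon).
        self.width = math.ceil(1 / epsilon)
        self.n = 0          # number of items seen so far
        self.entries = {}   # item -> [count, delta]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # current bucket id
        entry = self.entries.get(item)
        if entry is None:
            # delta bounds how many occurrences may have been missed
            # before this entry was created.
            self.entries[item] = [1, bucket - 1]
        else:
            entry[0] += 1
        # At each bucket boundary, drop entries that are provably rare.
        if self.n % self.width == 0:
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def estimate(self, item):
        """Return the tracked count, or 0 if the item was pruned or never seen."""
        entry = self.entries.get(item)
        return entry[0] if entry else 0

    def frequent(self, support):
        """Return items whose tracked count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {
            key: value[0]
            for key, value in self.entries.items()
            if value[0] >= threshold
        }
```

With skewed data, the heavy keys stay in the table while most rare keys are 
pruned at each bucket boundary, which is why memory stays small even when one 
key dominates the stream.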

This could be implemented in Hive using partitioned table functions, and I think 
there are some users who would like this functionality. This sounds similar to 
your number (3). I've spoken to a few people on the Hive team and they think it 
sounds like a good idea. Any interest in building this?

> big data approximate processing  at a very  low cost  based on hive sql 
> ------------------------------------------------------------------------
>
>                 Key: HIVE-7296
>                 URL: https://issues.apache.org/jira/browse/HIVE-7296
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: wangmeng
>
> For big data analysis, we often need to run the following kinds of queries 
> and statistics:
> 1. Cardinality estimation: count the number of distinct elements in a 
> collection, such as unique visitors (UV).
> Hive query: select count(distinct id) from TestTable;
> 2. Frequency estimation: estimate how many times an element is repeated, such 
> as the number of site visits by a user.
> Hive query: select count(1) from TestTable where name='wangmeng'
> 3. Heavy hitters (top-k elements): such as the top 100 shops.
> Hive query: select count(1), name from TestTable group by name; need UDF……
> 4. Range query: for example, find the number of users aged between 20 and 30.
> Hive query: select count(1) from TestTable where age>20 and age<30
> 5. Membership query: for example, is the user name already 
> registered?
> With Hive's current implementation, these queries consume too much 
> memory and take a long time.
> However, in many cases we do not need exact results and a small 
> error can be tolerated. In such cases, we can use approximate processing 
> to greatly improve time and space efficiency.
> Based on some theoretical analysis materials, I would like to work on 
> these new features if possible.
> So, is there anything I can do? Many thanks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
