[ 
https://issues.apache.org/jira/browse/PHOENIX-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056597#comment-16056597
 ] 

Ethan Wang commented on PHOENIX-3390:
-------------------------------------

Had a discussion with [~swapna] today. Thanks to her time we clarified some 
items regarding this feature. Now I'd like to add some more detail to the 
description of this ticket.

First, HyperLogLog is an algorithm that is only good at one particular task: 
Count of distinct items in a multi-set. It does not help on generic 
user-defined aggregation function, such as APPROX_SUM, APPROX_MAX etc. (The 
APPROX_SUM example Swapna provided above is actually the count of the union of 
two multi-sets, but it's still count)

Secondly, HyperLogLog is optimizing on "Space usage", so that it is able to do 
the count on a huge data set (before it is impossible do so, because you have 
to remember each distinct item in memory). So HLL is a "can't-do" to "can-do" 
improvement, not a "slow" to "faster" improvement: It still has to full scan 
through every item, there is no "sampling" going on in the process!

Besides HLL, there do exist other algorithms and their implementations that 
will achieve other types of user defined approximate aggregation. Those works 
such as SuperLogLog, KLL(sketching algorithm by Zohar et al), the work by 
Manku, Rajagopalan et al (1), and the work by Munro and Paterson et al (2).  
Among which, Yahoo's library DataSketches(https://datasketches.github.io. 
thanks for recommending @Rahul G) provides the following Approximate 
aggregations that may be particularly interesting for Phoenix. Besides HLL 
based APPROX_COUNT, there are :
    APPROX_MEDIAN
    APPROX_PERCENTILE (like95 Percentile)
    APPROX_MIN/MAX
    APPROX_KMV (the famous Kth Minimum Value)


=================================================================================
Now, everything above are space-optimized approaches. We could, have a 
different set of user defined functions for approximate aggregation that is 
time-optimized based.  i.e., run faster. 
Such as a use case below: "I'm able to do select count( * ) from A, but I don't 
want to, because full scan of A take too long. Can I just count on 10% sample 
and get a approximate that way?" 
Along this line, the similar approaches would be BlinkDB, PHOENIX-153 
tablesample.

Phoenix community please advice the direction we are going. 

[~jamestaylor][~gjacoby]





(1). Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random 
sampling tech-niques for space efficient online computation of order statistics 
of large datasets.
(2). J.I. Munro and M.S. Paterson. Selection and sorting with limited storage.

> Custom UDAF for HyperLogLogPlus
> -------------------------------
>
>                 Key: PHOENIX-3390
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3390
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Swapna Kasula
>            Priority: Minor
>
> With ref # PHOENIX-2069
> Custome UDAF to aggregate/union of Hyperloglog's of a column and returns a 
> Hyperloglog.
> select hllUnion(col1) from table;  //returns a Hyperloglog, which is the 
> union of all hyperloglog's from all rows for column 'col1'



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to