[
https://issues.apache.org/jira/browse/PHOENIX-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056597#comment-16056597
]
Ethan Wang commented on PHOENIX-3390:
-------------------------------------
Had a discussion with [~swapna] today. Thanks to her time we clarified some
items regarding this feature. Now I'd like to add some more detail to the
description of this ticket.
First, HyperLogLog is an algorithm that is only good at one particular task:
Count of distinct items in a multi-set. It does not help on generic
user-defined aggregation function, such as APPROX_SUM, APPROX_MAX etc. (The
APPROX_SUM example Swapna provided above is actually the count of the union of
two multi-sets, but it's still count)
Secondly, HyperLogLog is optimizing on "Space usage", so that it is able to do
the count on a huge data set (before it is impossible do so, because you have
to remember each distinct item in memory). So HLL is a "can't-do" to "can-do"
improvement, not a "slow" to "faster" improvement: It still has to full scan
through every item, there is no "sampling" going on in the process!
Besides HLL, there do exist other algorithms and their implementations that
will achieve other types of user defined approximate aggregation. Those works
such as SuperLogLog, KLL(sketching algorithm by Zohar et al), the work by
Manku, Rajagopalan et al (1), and the work by Munro and Paterson et al (2).
Among which, Yahoo's library DataSketches(https://datasketches.github.io.
thanks for recommending @Rahul G) provides the following Approximate
aggregations that may be particularly interesting for Phoenix. Besides HLL
based APPROX_COUNT, there are :
APPROX_MEDIAN
APPROX_PERCENTILE (like95 Percentile)
APPROX_MIN/MAX
APPROX_KMV (the famous Kth Minimum Value)
=================================================================================
Now, everything above are space-optimized approaches. We could, have a
different set of user defined functions for approximate aggregation that is
time-optimized based. i.e., run faster.
Such as a use case below: "I'm able to do select count( * ) from A, but I don't
want to, because full scan of A take too long. Can I just count on 10% sample
and get a approximate that way?"
Along this line, the similar approaches would be BlinkDB, PHOENIX-153
tablesample.
Phoenix community please advice the direction we are going.
[~jamestaylor][~gjacoby]
(1). Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random
sampling tech-niques for space efficient online computation of order statistics
of large datasets.
(2). J.I. Munro and M.S. Paterson. Selection and sorting with limited storage.
> Custom UDAF for HyperLogLogPlus
> -------------------------------
>
> Key: PHOENIX-3390
> URL: https://issues.apache.org/jira/browse/PHOENIX-3390
> Project: Phoenix
> Issue Type: New Feature
> Reporter: Swapna Kasula
> Priority: Minor
>
> With ref # PHOENIX-2069
> Custome UDAF to aggregate/union of Hyperloglog's of a column and returns a
> Hyperloglog.
> select hllUnion(col1) from table; //returns a Hyperloglog, which is the
> union of all hyperloglog's from all rows for column 'col1'
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)