[
https://issues.apache.org/jira/browse/PHOENIX-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Wang updated PHOENIX-4160:
--------------------------------
Comment: was deleted
(was: [~jamestaylor] [~lhofhansl])
> research for a proper hash size set for APPROX_COUNT_DISTINCT
> -------------------------------------------------------------
>
> Key: PHOENIX-4160
> URL: https://issues.apache.org/jira/browse/PHOENIX-4160
> Project: Phoenix
> Issue Type: Improvement
> Environment:
> Reporter: Ethan Wang
>
> Now after -PHOENIX-418- finishes, we want to study to find a proper default
> size for hll hash. Currently the hash size is hard coded as 25/16 bits by
> default (a design we follow Apache Druid. discussion see CALCITE-1588).
> Note:
> 1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e.,
> {code:java}sqrt(3\*ln(2)-1)/sqrt(2^precision){code} Detail see the page 129
> of this [paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf].
> To try on a bigger size, the performance of hll under different bucket/hash
> size has been studied here:
> https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html
> 2, When the estimate cardinalities is large enough, as Timok and Flajolet et
> al found out, this performance of hll will become problematic because the
> hash collisions (saturation). In fact, Timok proposed that any number larger
> than {code}2^{32}/30{code} should consider "to large" for a 32 bit hash. See
> study [Google’s Take On Engineering
> HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
> and suggested by the Figure 8 of this
> [paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf]
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)