GitHub user myui opened a pull request:
https://github.com/apache/incubator-hivemall/pull/125
approx_distinct_count UDAF using HyperLogLog++
## What changes were proposed in this pull request?
This PR introduces `approx_distinct_count` using
[HyperLogLog++](https://en.wikipedia.org/wiki/HyperLogLog#HLL.2B.2B) as
implemented in
[Oracle](https://docs.oracle.com/database/121/SQLRF/functions013.htm#SQLRF56900)
and
[BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators?hl=en#approx_count_distinct).
[stream-lib](https://github.com/addthis/stream-lib) is used as the underlying HyperLogLog++ implementation.
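
For reviewers unfamiliar with the technique, here is a minimal sketch of the idea behind the estimator. It is plain HyperLogLog, not the HyperLogLog++ variant from stream-lib that this PR uses: it omits the sparse representation and empirical bias correction, and it is not code from this patch. The class name `SimpleHyperLogLog`, the hash function, and the choice of precision are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: plain HyperLogLog with a linear-counting fallback.
// Hypothetical class, not the stream-lib implementation used by the PR.
final class SimpleHyperLogLog {
    private final int p;            // number of hash bits used as register index
    private final int m;            // number of registers (2^p)
    private final byte[] registers; // max observed rank per register

    SimpleHyperLogLog(int p) {
        if (p < 7 || p > 16) {
            throw new IllegalArgumentException("p must be in [7, 16]");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // 64-bit FNV-1a followed by a SplitMix64-style finalizer (assumed hash choice).
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 30;
        h *= 0xbf58476d1ce4e5b9L;
        h ^= h >>> 27;
        h *= 0x94d049bb133111ebL;
        h ^= h >>> 31;
        return h;
    }

    void offer(String value) {
        long h = hash64(value);
        int idx = (int) (h >>> (64 - p)); // top p bits select the register
        long rest = h << p;               // remaining bits, left-aligned
        // rank = position of the first 1-bit in the remaining (64 - p) bits
        int rank = (rest == 0) ? (64 - p + 1) : Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    // Merging two sketches takes the register-wise maximum, which is what
    // lets a UDAF combine partial aggregates.
    void merge(SimpleHyperLogLog other) {
        if (other.p != p) {
            throw new IllegalArgumentException("precision mismatch");
        }
        for (int i = 0; i < m; i++) {
            if (other.registers[i] > registers[i]) {
                registers[i] = other.registers[i];
            }
        }
    }

    long cardinality() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // constant for m >= 128
        double estimate = alpha * m * m / sum;
        if (estimate <= 2.5 * m && zeros > 0) {
            // Small-range correction: fall back to linear counting.
            estimate = m * Math.log((double) m / zeros);
        }
        return Math.round(estimate);
    }
}
```

The sketch keeps only `2^p` small registers no matter how many values are offered, which is why an approximate distinct count is much cheaper than an exact `COUNT(DISTINCT ...)` on large inputs. The expected relative error is roughly `1.04 / sqrt(2^p)`.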
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-18
## How was this patch tested?
Manual tests
## How to use this feature?
As described in [this markdown
document](https://github.com/myui/incubator-hivemall/blob/e52fda9699c14687d62d5bfcd13459982f09193c/docs/gitbook/misc/approx.md).
## Checklist
- [x] Did you apply source code formatter, i.e., `mvn formatter:format`,
for your commit?
- [x] Did you run system tests on Hive (or Spark)?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/myui/incubator-hivemall sketch
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-hivemall/pull/125.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #125
----
commit 05a0d714f5199ecdd6159f1e45817b74ccb7f0b8
Author: Makoto Yui <[email protected]>
Date: 2017-11-21T12:03:01Z
Minor bugfix in fmeasure UDAF
commit 3ae28623c08550cf1e6edc99e23204f61f3df074
Author: Makoto Yui <[email protected]>
Date: 2017-11-21T12:03:31Z
Added approx_count_distinct using HyperLogLogPlus
commit a75285383bd4190cf7ba156d8968e4872666382b
Author: Makoto Yui <[email protected]>
Date: 2017-11-21T12:40:35Z
Added TOC
commit e52fda9699c14687d62d5bfcd13459982f09193c
Author: Makoto Yui <[email protected]>
Date: 2017-11-21T12:40:57Z
Added documentation about Hyperloglog
----