[
https://issues.apache.org/jira/browse/METRON-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752516#comment-15752516
]
ASF GitHub Bot commented on METRON-627:
---------------------------------------
GitHub user mmiklavc opened a pull request:
https://github.com/apache/incubator-metron/pull/397
METRON-627: Add HyperLogLogPlus implementation to Stellar
This PR addresses https://issues.apache.org/jira/browse/METRON-627
Leverages the HyperLogLogPlus (HLLP) implementation from stream-lib:
https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java
Four new Stellar functions have been added that allow a user to initialize a
cardinality estimator, add items, merge estimators, and calculate cardinality
estimates.
### `HLLP_CARDINALITY`
* Description: Returns HyperLogLogPlus-estimated cardinality for this set
* Input:
    * hyperLogLogPlus - the hllp set
* Returns: Long value representing the cardinality for this set
### `HLLP_INIT`
* Description: Initializes the set
* Input:
    * p (required) - the precision value for the normal set
    * sp - the precision value for the sparse set. If sp is not specified
      the sparse set will be disabled.
* Returns: A new HyperLogLogPlus set
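
For a rough sense of what `p` controls: in the standard HyperLogLog analysis, a precision of `p` corresponds to `2^p` registers and a relative standard error of roughly `1.04 / sqrt(2^p)`. The sketch below is plain Python, not Stellar and not part of this PR; the helper name `hll_precision_stats` is hypothetical. It ignores the sparse representation, which operates at precision `sp` while the set is still small, so treat the numbers as a ballpark for the normal (dense) set only.

```python
import math


def hll_precision_stats(p):
    """Return (register_count, approx_relative_std_error) for precision p.

    Uses the textbook HyperLogLog approximation 1.04 / sqrt(m), m = 2**p.
    This is an illustration only; it does not call stream-lib.
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    m = 1 << p
    return m, 1.04 / math.sqrt(m)


# Tabulate the memory/accuracy trade-off for a few precisions.
for p in (5, 10, 14):
    m, err = hll_precision_stats(p)
    print(f"p={p:2d}  registers={m:6d}  approx. std error={err:.2%}")
```

Higher `p` costs more registers but tightens the estimate, which is the main thing to weigh when picking the first argument to `HLLP_INIT`.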
### `HLLP_MERGE`
* Description: Merge hllp sets together
* Input:
    * hllp1 - first hllp set
    * hllp2 - second hllp set
    * hllpn - additional sets to merge
* Returns: A new merged HyperLogLogPlus estimator set
### `HLLP_OFFER`
* Description: Add value to the set
* Input:
    * hyperLogLogPlus - the hllp set
    * o - Object to add to the set
* Returns: The HyperLogLogPlus set with a new object added
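
To make the four operations concrete, here is a minimal sketch of a dense HyperLogLog estimator in plain Python. It is an assumption-laden illustration, not the stream-lib code this PR wraps: the class name `TinyHLL` is hypothetical, the hash (first 8 bytes of SHA-1) is an arbitrary choice, and it implements classic HyperLogLog with a linear-counting fallback rather than HyperLogLogPlus (no sparse mode, no HLL++ bias correction). The method names mirror `HLLP_INIT`, `HLLP_OFFER`, `HLLP_MERGE`, and `HLLP_CARDINALITY`.

```python
import hashlib
import math


class TinyHLL:
    """Dense HyperLogLog sketch (hypothetical; not HyperLogLogPlus)."""

    def __init__(self, p):
        if not 4 <= p <= 16:
            raise ValueError("p must be in [4, 16] for this sketch")
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def offer(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (arbitrary choice).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)
        return self

    def merge(self, *others):
        # Merging takes the register-wise maximum and returns a new sketch.
        merged = TinyHLL(self.p)
        merged.registers = list(self.registers)
        for other in others:
            if other.p != self.p:
                raise ValueError("cannot merge sketches with different p")
            merged.registers = [
                max(a, b) for a, b in zip(merged.registers, other.registers)
            ]
        return merged

    def cardinality(self):
        m = self.m
        alpha = {16: 0.673, 32: 0.697, 64: 0.709}.get(
            m, 0.7213 / (1 + 1.079 / m)
        )
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return round(m * math.log(m / zeros))
        return round(raw)


# Same composition as the Stellar functions: init, offer, merge, estimate.
a = TinyHLL(5).offer("runnings").offer("cool")
b = TinyHLL(5).offer("bobsled").offer("team")
print(a.merge(b).cardinality())
```

Because a merge is just a register-wise maximum, merging two sketches gives the same registers as offering every item to a single sketch, which is why per-source estimators can be combined after the fact.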
**Note:** Added new library to metron-common pom and added 3 new items to
dependencies_with_url.csv.
**Testing**
Spun up the Stellar REPL in quick-dev and verified that function
composition works as expected and returns correct cardinality estimates
for simple sparse set cases. For example:
```
[Stellar]>>> HLLP_CARDINALITY(HLLP_MERGE(
HLLP_OFFER(HLLP_OFFER(HLLP_INIT(5, 6), "runnings"), "cool"),
HLLP_OFFER(HLLP_OFFER(HLLP_INIT(5, 6), "bobsled"), "team")))
4
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mmiklavc/incubator-metron hyperloglog
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-metron/pull/397.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #397
----
commit afce30539f6996a607e85d3fd35aac5fcb5c19aa
Author: Michael Miklavcic
Date: 2016-12-15T20:55:39Z
METRON-627: Add HyperLogLogPlus implementation to Stellar
----
> Add HyperLogLogPlus implementation to Stellar
> ---------------------------------------------
>
> Key: METRON-627
> URL: https://issues.apache.org/jira/browse/METRON-627
> Project: Metron
> Issue Type: Improvement
> Reporter: Michael Miklavcic
>
> Calculating set cardinality can be a useful tool for a security analyst. For
> instance, a large volume of non-unique source IP addresses hitting your network
> may be an indication that you are currently under attack. There have been
> many advancements in distinct value (DV) estimation over the years. We have
> seen implementations evolve from K-Minimum-Values (KMV), to LogLog, to
> HyperLogLog, and now to Google's much-improved HyperLogLogPlus algorithm. The
> key improvements in this latest manifestation of the algorithm are:
> * moves to a 64-bit hash
> * handles sparse sets
> * is more accurate with small cardinality
> This Jira tracks the effort to add a HyperLogLogPlus implementation to Metron.
> References:
> https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf