[
https://issues.apache.org/jira/browse/METRON-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15754809#comment-15754809
]
ASF GitHub Bot commented on METRON-627:
---------------------------------------
Github user cestella commented on a diff in the pull request:
https://github.com/apache/incubator-metron/pull/397#discussion_r92835008
--- Diff:
metron-platform/metron-common/src/main/java/org/apache/metron/common/dsl/functions/DataStructureFunctions.java
---
@@ -137,6 +138,105 @@ public Object apply(List<Object> args) {
}
}
+ @Stellar( namespace="HLLP"
+ , name="CARDINALITY"
+ , description="Returns HyperLogLogPlus-estimated cardinality for this set"
+ , params = { "hyperLogLogPlus - the hllp set" }
+ , returns = "Long value representing the cardinality for this set"
+ )
+ public static class HLLPCardinality extends BaseStellarFunction {
+
+ @Override
+ public Object apply(List<Object> args) {
+ if (args.size() < 1) {
+ throw new IllegalArgumentException("Must pass an hllp set to get the cardinality for");
+ }
+ return ((HyperLogLogPlus) args.get(0)).cardinality();
+ }
+ }
+
+ @Stellar( namespace="HLLP"
+ , name="INIT"
+ , description="Initializes the set"
+ , params = {
+ "p (required) - the precision value for the normal set"
--- End diff --
Are there any documents we can link to that describe the tradeoffs for
these values? I'm thinking of something like
[here](https://en.wikipedia.org/wiki/File:Bloom_filter_fp_probability.svg)
discussing the tradeoff between accuracy and size.
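For a rough sense of what such a document might show, here is a minimal sketch of the accuracy/size tradeoff as a function of `p`. It is not taken from this pull request and does not call the HyperLogLogPlus library. It assumes the commonly cited HyperLogLog approximations: `m = 2^p` registers, a relative standard error of about `1.04 / sqrt(m)`, and 6 bits per register in the dense ("normal") representation. The class and method names are hypothetical, and the real in-memory footprint of any particular implementation may differ.

```java
// Back-of-the-envelope accuracy vs. size for a HyperLogLog-style sketch.
// Assumptions (not from the pull request): m = 2^p registers, relative
// standard error ~ 1.04 / sqrt(m), 6 bits per register when dense.
final class HllpTradeoff {

    // Number of registers for precision p.
    static long registers(int p) {
        return 1L << p;
    }

    // Approximate relative standard error of the cardinality estimate.
    static double standardError(int p) {
        return 1.04 / Math.sqrt((double) registers(p));
    }

    // Approximate dense size in bytes, assuming 6 bits per register.
    static long denseBytes(int p) {
        return (registers(p) * 6 + 7) / 8;
    }

    public static void main(String[] args) {
        System.out.printf("%3s %10s %12s %12s%n",
                "p", "registers", "std error", "dense bytes");
        for (int p = 4; p <= 18; p += 2) {
            System.out.printf("%3d %10d %11.2f%% %12d%n",
                    p, registers(p), 100.0 * standardError(p), denseBytes(p));
        }
    }
}
```

Under these assumptions, raising `p` by 2 quadruples the dense size and roughly halves the expected error.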
> Add HyperLogLogPlus implementation to Stellar
> ---------------------------------------------
>
> Key: METRON-627
> URL: https://issues.apache.org/jira/browse/METRON-627
> Project: Metron
> Issue Type: Improvement
> Reporter: Michael Miklavcic
>
> Calculating set cardinality can be a useful tool for a security analyst. For
> instance, a large volume of non-unique src ip addresses hitting your network
> may be an indication that you are currently under attack. There have been
> many advancements in distinct value (DV) estimation over the years. We have
> seen implementations evolve from K-Minimum-Values (KMV), to LogLog, to
> HyperLogLog, and now to Google's much-improved HyperLogLogPlus algorithm. The
> key improvements in this latest manifestation of the algorithm are:
> * moves to a 64-bit hash
> * handles sparse sets
> * is more accurate with small cardinality
> This Jira tracks the effort to add a HyperLogLogPlus implementation to Metron.
> References:
> https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)