[
https://issues.apache.org/jira/browse/METRON-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15770287#comment-15770287
]
ASF GitHub Bot commented on METRON-637:
---------------------------------------
Github user cestella commented on the issue:
https://github.com/apache/incubator-metron/pull/401
Hmm, well, the idea is to get the statistical bin for a given value. The
sampling portion is happening in the profiler as you construct the statistical
distribution. This is just querying the already constructed distribution.
This is intended to be an input into a model and to give a sense of where a
value falls in the distribution of values that have preceded it, so if you want
a message to be used by your model, then the bin will need to be computed for
that message and passed along with it.
I'll give you an example that may help clarify. Let's say I'm building a
really, really naive model over DNS requests to determine if a DNS request is
being made for a synthetic domain created by a botnet. One of the features I
might be interested in may be the length of the domain. However, I may also
want to get a sense of whether this domain's length is outside the norm.
To do that, I'd create a profile that captures the statistical distribution of
the length of the domains that have been requested in the past and when I'm
calling the model (which is deployed via Model as a Service), I can pass the
statistical bin that the length falls into (e.g. min to 25th percentile,
25th to 50th, 50th to 75th, 75th to 95th, or 95th to max) by using this
function. So, every DNS request really
needs to be scored in this scenario.
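To make the DNS scenario concrete, here is a rough Python sketch of the feature computation, not the Stellar implementation. It assumes the profile is just a sorted list of past domain lengths (the real profiler keeps a statistical sketch instead), and the names `percentile_rank` and `length_bin`, the sample history, and the boundary handling are all made up for illustration.

```python
from bisect import bisect_right

# Hypothetical stand-in for the profile: sorted lengths of previously seen domains.
HISTORY = [6, 7, 8, 9, 10, 10, 11, 11, 12, 12,
           13, 13, 14, 14, 15, 15, 16, 17, 18, 20]


def percentile_rank(history, value):
    """Percentage (0-100) of past observations that are <= value.

    `history` must be sorted in ascending order.
    """
    return 100.0 * bisect_right(history, value) / len(history)


def length_bin(history, domain, bounds=(25, 50, 75, 95)):
    """Index of the percentile bin that the domain's length falls into.

    0 means below the 25th percentile and len(bounds) means above the 95th.
    A rank exactly equal to a bound goes to the upper bin (an assumption).
    """
    rank = percentile_rank(history, len(domain))
    return bisect_right(bounds, rank)


if __name__ == "__main__":
    # Each incoming DNS request gets its own bin, which is then passed to the model.
    for domain in ("ab.io", "example.com", "a" * 25):
        print(domain, length_bin(HISTORY, domain))
```

The model then receives a small integer per request rather than the raw length, which is why the bin has to be computed for every message that is scored.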
I hope that makes sense.
> Add a STATS_BIN function to Stellar.
> ------------------------------------
>
> Key: METRON-637
> URL: https://issues.apache.org/jira/browse/METRON-637
> Project: Metron
> Issue Type: Improvement
> Reporter: Casey Stella
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> When passing parameters to models, it's often useful to pass the binned
> representation of a variable based on an empirical statistical distribution,
> rather than the actual variable. This function should accept a set of
> percentile bins, a statistical sketch, and a value. It should return the
> index of the bin into which the value's percentile falls.
> For instance, consider the value 17, which is at the 27th percentile. If we
> use 25, 75, 95 to define our bins, this function would return 1, because
> its percentile, 27, is between 25 and 75.
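The index lookup in the description can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming the value's percentile has already been read from the sketch; `bin_index` is a hypothetical name, and sending a percentile that equals a bound to the upper bin is an assumption the issue does not specify.

```python
from bisect import bisect_right


def bin_index(percentile, bounds):
    """Return the index of the bin that `percentile` falls into.

    `bounds` are ascending percentile cut points, e.g. [25, 75, 95].
    Index 0 is below the first bound; len(bounds) is above the last.
    """
    return bisect_right(bounds, percentile)


if __name__ == "__main__":
    # The example from the issue: percentile 27 with bins 25, 75, 95.
    print(bin_index(27, [25, 75, 95]))  # 1
```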
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)