[
https://issues.apache.org/jira/browse/METRON-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15678076#comment-15678076
]
ASF GitHub Bot commented on METRON-562:
---------------------------------------
Github user james-sirota commented on a diff in the pull request:
https://github.com/apache/incubator-metron/pull/352#discussion_r88757519
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -0,0 +1,346 @@
+# Statistics and Mathematical Functions
+
+A variety of non-trivial and advanced analytics make use of statistics
+and advanced mathematical functions. In particular, capturing
+statistical snapshots in a scalable way can open the door to more
+advanced analytics such as outlier analysis. As such, this project is
+aimed at capturing a robust set of statistical functions and
+statistical-based algorithms in the form of Stellar functions. These
+functions can be used anywhere Stellar is used.
+
+## Stellar Functions
+
+### Mathematical Functions
+* `ABS`
+  * Description: Returns the absolute value of a number.
+  * Input:
+    * number - The number to take the absolute value of
+  * Returns: The absolute value of the number passed in.
+
+
+### Distributional Statistics
+
+* `STATS_ADD`
+  * Description: Adds one or more input values to those that are used to calculate the summary statistics.
+  * Input:
+    * stats - The Stellar statistics object. If null, then a new one is initialized.
+    * value+ - One or more numbers to add
+  * Returns: A Stellar statistics object
+* `STATS_COUNT`
+  * Description: Calculates the count of the values accumulated (or in the window if a window is used).
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The count of the values in the window or NaN if the statistics object is null.
+* `STATS_GEOMETRIC_MEAN`
+  * Description: Calculates the geometric mean of the accumulated values (or in the window if a window is used). See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The geometric mean of the values in the window or NaN if the statistics object is null.
+* `STATS_INIT`
+  * Description: Initializes a statistics object
+  * Input:
+    * window_size - The number of input data values to maintain in a rolling window in memory. If window_size is equal to 0, then no rolling window is maintained. Using no rolling window is less memory intensive, but cannot calculate certain statistics like percentiles and kurtosis.
+  * Returns: A Stellar statistics object
+* `STATS_KURTOSIS`
+  * Description: Calculates the kurtosis of the accumulated values (or in the window if a window is used). See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The kurtosis of the values in the window or NaN if the statistics object is null.
+* `STATS_MAX`
+  * Description: Calculates the maximum of the accumulated values (or in the window if a window is used).
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The maximum of the accumulated values in the window or NaN if the statistics object is null.
+* `STATS_MEAN`
+  * Description: Calculates the mean of the accumulated values (or in the window if a window is used).
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The mean of the values in the window or NaN if the statistics object is null.
+* `STATS_MERGE`
+  * Description: Merges statistics objects.
+  * Input:
+    * statistics - A list of statistics objects
+  * Returns: A Stellar statistics object
+* `STATS_MIN`
+  * Description: Calculates the minimum of the accumulated values (or in the window if a window is used).
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The minimum of the accumulated values in the window or NaN if the statistics object is null.
+* `STATS_PERCENTILE`
+  * Description: Computes the p'th percentile of the accumulated values (or in the window if a window is used).
+  * Input:
+    * stats - The Stellar statistics object
+    * p - a double where 0 <= p < 1 representing the percentile
+  * Returns: The p'th percentile of the data or NaN if the statistics object is null
+* `STATS_POPULATION_VARIANCE`
+  * Description: Calculates the population variance of the accumulated values (or in the window if a window is used). See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The population variance of the values in the window or NaN if the statistics object is null.
+* `STATS_QUADRATIC_MEAN`
+  * Description: Calculates the quadratic mean of the accumulated values (or in the window if a window is used). See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The quadratic mean of the values in the window or NaN if the statistics object is null.
+* `STATS_SD`
+  * Description: Calculates the standard deviation of the accumulated values (or in the window if a window is used). See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The standard deviation of the values in the window or NaN if the statistics object is null.
+* `STATS_SKEWNESS`
+  * Description: Calculates the skewness of the accumulated values (or in the window if a window is used). See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The skewness of the values in the window or NaN if the statistics object is null.
+* `STATS_SUM`
+  * Description: Calculates the sum of the accumulated values (or in the window if a window is used).
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The sum of the values in the window or NaN if the statistics object is null.
+* `STATS_SUM_LOGS`
+  * Description: Calculates the sum of the (natural) log of the accumulated values (or in the window if a window is used). See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The sum of the (natural) log of the values in the window or NaN if the statistics object is null.
+* `STATS_SUM_SQUARES`
+  * Description: Calculates the sum of the squares of the accumulated values (or in the window if a window is used).
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The sum of the squares of the values in the window or NaN if the statistics object is null.
+* `STATS_VARIANCE`
+  * Description: Calculates the variance of the accumulated values (or in the window if a window is used). See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+  * Input:
+    * stats - The Stellar statistics object
+  * Returns: The variance of the values in the window or NaN if the statistics object is null.
+
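The trade-off described for `STATS_INIT` can be illustrated outside of Stellar. The following is a minimal Python sketch, not Metron's implementation: the class `RunningStats` and its methods are hypothetical names chosen for illustration. With a `window_size` of 0 it keeps only running totals, so order statistics such as the median are unavailable; with a positive window it retains the most recent values and can compute them.

```python
from collections import deque


class RunningStats(object):
    """Hypothetical illustration of windowed vs. unwindowed statistics."""

    def __init__(self, window_size=0):
        # A window of 0 means no values are retained, only running totals.
        self.window = deque(maxlen=window_size) if window_size > 0 else None
        self.count = 0
        self.total = 0.0

    def add(self, *values):
        for v in values:
            if self.window is not None:
                # The deque silently drops the oldest value once it is full.
                self.window.append(float(v))
            else:
                self.count += 1
                self.total += float(v)
        return self

    def mean(self):
        if self.window is not None:
            if not self.window:
                return float('nan')
            return sum(self.window) / len(self.window)
        return self.total / self.count if self.count else float('nan')

    def median(self):
        # Order statistics need the retained values, so a window is required.
        if self.window is None:
            raise ValueError("median requires a rolling window")
        ordered = sorted(self.window)
        mid = len(ordered) // 2
        if len(ordered) % 2 == 1:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2.0


windowed = RunningStats(window_size=3).add(1, 2, 3, 4)   # retains 2, 3, 4
unwindowed = RunningStats().add(1, 2, 3, 4)              # retains totals only
```

Here `windowed.mean()` averages only the last three values, while `unwindowed.mean()` averages everything seen so far but cannot answer `median()`.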
+### Statistical Outlier Detection
+
+* `OUTLIER_MAD_STATE_MERGE`
+  * Description: Update the statistical state required to compute the Median Absolute Deviation.
+  * Input:
+    * [state] - A list of Median Absolute Deviation States to merge. Generally these are states across time.
+    * currentState? - The current state (optional)
+  * Returns: The Median Absolute Deviation state
+* `OUTLIER_MAD_ADD`
+  * Description: Add a piece of data to the state.
+  * Input:
+    * state - The MAD state
+    * value - The numeric value to add
+  * Returns: The MAD state
+* `OUTLIER_MAD_SCORE`
+  * Description: Get the modified z-score normalized by the MAD: scale * | x_i - median(X) | / MAD. See the first page of http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
+  * Input:
+    * state - The MAD state
+    * value - The numeric value to score
+    * scale? - Optionally the scale to use when computing the modified z-score. Default is `0.6745`, see the first page of http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
+  * Returns: The modified z-score
+
+# Outlier Analysis
+
+A common desire is to find anomalies in numerical data. To that end,
+we have some simple statistical anomaly detectors.
+
+## Median Absolute Deviation
+
+Much has been written about this robust estimator. See the first page
+of http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
+for good coverage of the good and the bad of MAD. The usage, however,
+is fairly straightforward:
+* Gather the statistical state required to compute the MAD
+  * The distribution of the values of a univariate random variable over time.
+  * The distribution of the absolute deviations of the values from the median.
+* Use this statistical state to score unseen values. The higher the score, the more unlike the previously seen data the value is.
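
The two steps above can be sketched in plain Python 3. This is an exact, in-memory illustration of the modified z-score `scale * | x_i - median(X) | / MAD`, not the Stellar implementation; the function name `mad_score` is hypothetical, and returning NaN when the MAD is zero is an assumption made only for this sketch.

```python
from statistics import median


def mad_score(seen, value, scale=0.6745):
    """Modified z-score of `value` relative to previously seen data."""
    # Step 1: gather the state (the median and the median absolute deviation).
    med = median(seen)
    mad = median([abs(x - med) for x in seen])
    # Assumption for this sketch: an undefined score is reported as NaN.
    if mad == 0:
        return float('nan')
    # Step 2: score the unseen value against that state.
    return scale * abs(value - med) / mad


seen = [1.0, 2.0, 3.0, 4.0, 5.0]
typical = mad_score(seen, 3.0)    # the median itself
unusual = mad_score(seen, 10.0)   # far from anything seen so far
```

A value equal to the median scores zero, and the score grows linearly with the distance from the median.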
+
+There are a couple of issues which make MAD a bit hard to compute.
+First, the statistical state requires computing the median, which can be
+expensive to compute exactly. To get around this, we
+use the OnlineStatisticalProvider to compute a sketch rather than the
+exact median. Second, the statistical state for seasonal data should
+be limited to a fixed, trailing window. We do this by ensuring that the
+MAD state is mergeable and able to be queried from within the Profiler.
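
The idea of a mergeable state limited to a trailing window can be sketched as follows. This Python 3 example is only an illustration under simplifying assumptions: the hypothetical class `ExactMadState` stores raw values rather than a sketch, and a `deque` stands in for the per-period states that would be retrieved from the Profiler.

```python
from collections import deque
from statistics import median


class ExactMadState(object):
    """Hypothetical stand-in for a MAD state; keeps raw values, not a sketch."""

    def __init__(self, values=()):
        self.values = [float(v) for v in values]

    def add(self, value):
        self.values.append(float(value))
        return self

    @staticmethod
    def merge(states):
        # Merging is just pooling the values captured in each period.
        merged = ExactMadState()
        for state in states:
            merged.values.extend(state.values)
        return merged

    def score(self, value, scale=0.6745):
        med = median(self.values)
        mad = median([abs(x - med) for x in self.values])
        return scale * abs(value - med) / mad if mad else float('nan')


# Keep only the three most recent per-period states (the trailing window).
recent = deque(maxlen=3)
for period in ([100.0, 101.0], [1.0, 2.0, 3.0], [2.0, 4.0], [3.0, 5.0]):
    recent.append(ExactMadState(period))

# The oldest period has aged out, so it no longer influences the score.
merged = ExactMadState.merge(recent)
```

Because each period's state is kept separately and merged only at query time, old periods can be dropped without recomputing anything from the raw stream.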
+
+### Example
+
+We will create a dummy data stream of Gaussian noise to illustrate how
+to use the MAD functionality along with the profiler to tag messages as
+outliers or not.
+
+To do this, we will create a
+* data generator
+* parser
+* profiler profile
+* enrichment and threat triage
+
+#### Data Generator
+
+We can create a simple Python script to generate a stream of Gaussian
+noise at a fixed interval, for example one message per second. Save it
+at `~/rand_gen.py`:
+```
+#!/usr/bin/python
+import random
+import sys
+import time
+
+def main():
+  # Mean and standard deviation of the generated values
+  mu = float(sys.argv[1])
+  sigma = float(sys.argv[2])
+  # Number of seconds to wait between messages
+  freq_s = int(sys.argv[3])
+  while True:
+    print(str(random.gauss(mu, sigma)))
+    # Flush so that a downstream consumer sees each value immediately
+    sys.stdout.flush()
+    time.sleep(freq_s)
+
+if __name__ == '__main__':
+  main()
+```
+
+This script will take the following as arguments:
+* The mean of the data generated
+* The standard deviation of the data generated
+* The interval (in seconds) between generated messages
+
+#### The Parser
+
+We will create a parser, using the `CSVParser`, that takes each single
+number in and creates a message with a field called `value`.
+
+Add the following file to
+`$METRON_HOME/config/zookeeper/parsers/mad.json`:
+```
+{
+ "parserClassName" : "org.apache.metron.parsers.csv.CSVParser"
+ ,"sensorTopic" : "mad"
+ ,"parserConfig" : {
+ "columns" : {
+ "value_str" : 0
+ }
+ }
+ ,"fieldTransformations" : [
+ {
+ "transformation" : "STELLAR"
+ ,"output" : [ "value" ]
+ ,"config" : {
+ "value" : "TO_DOUBLE(value_str)"
+ }
+ }
+ ]
+}
+```
+
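To make the intent of this configuration concrete, the following Python 3 sketch mimics what it asks for: column 0 of each input line becomes `value_str`, and `value` is derived from it as a double. The function `parse_line` is hypothetical and only approximates the behaviour; whether the real parser keeps `value_str` in the output message, and what other fields it adds, is not shown here.

```python
import csv


def parse_line(line):
    """Hypothetical approximation of the parser configuration above."""
    # "columns": {"value_str": 0} maps the first CSV column to value_str.
    row = next(csv.reader([line]))
    message = {"value_str": row[0]}
    # The STELLAR field transformation: value = TO_DOUBLE(value_str)
    message["value"] = float(message["value_str"])
    return message


message = parse_line("0.25")
```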
+#### Enrichment and Threat Intel
+
+We will set a threat triage level of `10` if a message generates an outlier score of more than 3.5.
+This cutoff will depend on your data and should be adjusted based on the
+assumed underlying distribution. Note that under the assumption of
+normality, MAD will act as a robust estimator of the standard deviation, so the cutoff
+should be read as the number of standard deviations away from the median. For other
+distributions, there are other interpretations which will make sense in
+the context of measuring the "degree different". See
+http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/
+for a brief discussion of this.
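
The claim about normality can be checked numerically. The following Python 3 sketch draws Gaussian samples and compares the MAD, divided by the default scale of `0.6745`, to the true standard deviation. The function name `robust_sigma`, the seed, and the sample size are arbitrary choices made for this illustration.

```python
import random
from statistics import median


def robust_sigma(values, scale=0.6745):
    """Estimate the standard deviation from the MAD, assuming normality."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    return mad / scale


rng = random.Random(42)                                  # fixed seed for repeatability
sample = [rng.gauss(10.0, 2.0) for _ in range(20000)]    # mean 10, sigma 2
estimate = robust_sigma(sample)                          # should be close to 2
```

With this scaling, a score of 3.5 corresponds to a value roughly 3.5 estimated standard deviations away from the median, which is why the cutoff should be revisited when the data are not approximately normal.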
+
+Create the following in
+`$METRON_HOME/config/zookeeper/enrichments/mad.json`:
+
+```
+{
+ "index": "mad",
+ "batchSize": 1,
+ "enrichment": {
+ "fieldMap": {
+ "stellar" : {
+ "config" : {
+ "parser_score" : "OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(
+PROFILE_GET( 'sketchy_mad', 'global', 10, 'MINUTES') ), value)"
+ ,"is_alert" : "if parser_score > 3.5 then true else is_alert"
--- End diff --
Can 3.5 be pulled out into a hyperparameter?
> Add rudimentary statistical outlier detection
> ---------------------------------------------
>
> Key: METRON-562
> URL: https://issues.apache.org/jira/browse/METRON-562
> Project: Metron
> Issue Type: New Feature
> Reporter: Casey Stella
> Assignee: Casey Stella
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> With the advent of the profiler, we can now capture state. Furthermore, with
> Stellar, we can capture statistical summaries. We should provide rudimentary
> outlier detection functionality in the form of Stellar functions that can
> operate on captured state from the profiler.
> To begin, we should enable simple outlier tests using distance from a central
> measure such as Median Absolute Deviation (see
> http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm).