Github user james-sirota commented on a diff in the pull request:
https://github.com/apache/incubator-metron/pull/352#discussion_r88757519
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -0,0 +1,346 @@
+# Statistics and Mathematical Functions
+
+A variety of non-trivial and advanced analytics make use of statistics
+and advanced mathematical functions. In particular, capturing
+statistical snapshots in a scalable way can open up doors for more
+advanced analytics such as outlier analysis. As such, this project is
+aimed at capturing a robust set of statistical functions and
+statistics-based algorithms in the form of Stellar functions. These
+functions can be used anywhere Stellar is used.
+
+## Stellar Functions
+
+### Mathematical Functions
+* `ABS`
+ * Description: Returns the absolute value of a number.
+ * Input:
+ * number - The number to take the absolute value of
+ * Returns: The absolute value of the number passed in.
+
+
+### Distributional Statistics
+
+* `STATS_ADD`
+ * Description: Adds one or more input values to those that are used to
calculate the summary statistics.
+ * Input:
+ * stats - The Stellar statistics object. If null, then a new one is
initialized.
+ * value+ - One or more numbers to add
+ * Returns: A Stellar statistics object
+* `STATS_COUNT`
+ * Description: Calculates the count of the values accumulated (or in the
window if a window is used).
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The count of the values in the window or NaN if the
statistics object is null.
+* `STATS_GEOMETRIC_MEAN`
+ * Description: Calculates the geometric mean of the accumulated values
(or in the window if a window is used). See
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The geometric mean of the values in the window or NaN if the
statistics object is null.
+* `STATS_INIT`
+ * Description: Initializes a statistics object
+ * Input:
+ * window_size - The number of input data values to maintain in a
rolling window in memory. If window_size is equal to 0, then no rolling window
is maintained. Using no rolling window is less memory intensive, but cannot
calculate certain statistics like percentiles and kurtosis.
+ * Returns: A Stellar statistics object
+* `STATS_KURTOSIS`
+ * Description: Calculates the kurtosis of the accumulated values (or in
the window if a window is used). See
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The kurtosis of the values in the window or NaN if the
statistics object is null.
+* `STATS_MAX`
+ * Description: Calculates the maximum of the accumulated values (or in
the window if a window is used).
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The maximum of the accumulated values in the window or NaN if
the statistics object is null.
+* `STATS_MEAN`
+ * Description: Calculates the mean of the accumulated values (or in the
window if a window is used).
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The mean of the values in the window or NaN if the statistics
object is null.
+* `STATS_MERGE`
+ * Description: Merges statistics objects.
+ * Input:
+ * statistics - A list of statistics objects
+ * Returns: A Stellar statistics object
+* `STATS_MIN`
+ * Description: Calculates the minimum of the accumulated values (or in
the window if a window is used).
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The minimum of the accumulated values in the window or NaN if
the statistics object is null.
+* `STATS_PERCENTILE`
+ * Description: Computes the p'th percentile of the accumulated values
(or in the window if a window is used).
+ * Input:
+ * stats - The Stellar statistics object
+ * p - a double where 0 <= p < 1 representing the percentile
+ * Returns: The p'th percentile of the data or NaN if the statistics
object is null
+* `STATS_POPULATION_VARIANCE`
+ * Description: Calculates the population variance of the accumulated
values (or in the window if a window is used). See
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The population variance of the values in the window or NaN if
the statistics object is null.
+* `STATS_QUADRATIC_MEAN`
+ * Description: Calculates the quadratic mean of the accumulated values
(or in the window if a window is used). See
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The quadratic mean of the values in the window or NaN if the
statistics object is null.
+* `STATS_SD`
+ * Description: Calculates the standard deviation of the accumulated
values (or in the window if a window is used). See
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The standard deviation of the values in the window or NaN if
the statistics object is null.
+* `STATS_SKEWNESS`
+ * Description: Calculates the skewness of the accumulated values (or in
the window if a window is used). See
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The skewness of the values in the window or NaN if the
statistics object is null.
+* `STATS_SUM`
+ * Description: Calculates the sum of the accumulated values (or in the
window if a window is used).
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The sum of the values in the window or NaN if the statistics
object is null.
+* `STATS_SUM_LOGS`
+ * Description: Calculates the sum of the (natural) log of the
accumulated values (or in the window if a window is used). See
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The sum of the (natural) log of the values in the window or
NaN if the statistics object is null.
+* `STATS_SUM_SQUARES`
+ * Description: Calculates the sum of the squares of the accumulated
values (or in the window if a window is used).
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The sum of the squares of the values in the window or NaN if
the statistics object is null.
+* `STATS_VARIANCE`
+ * Description: Calculates the variance of the accumulated values (or in
the window if a window is used). See
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
+ * Input:
+ * stats - The Stellar statistics object
+ * Returns: The variance of the values in the window or NaN if the
statistics object is null.
+
+### Statistical Outlier Detection
+
+* `OUTLIER_MAD_STATE_MERGE`
+ * Description: Update the statistical state required to compute the
Median Absolute Deviation.
+ * Input:
+ * [state] - A list of Median Absolute Deviation States to merge.
Generally these are states across time.
+ * currentState? - The current state (optional)
+ * Returns: The Median Absolute Deviation state
+* `OUTLIER_MAD_ADD`
+ * Description: Add a piece of data to the state.
+ * Input:
+ * state - The MAD state
+ * value - The numeric value to add
+ * Returns: The MAD state
+* `OUTLIER_MAD_SCORE`
+ * Description: Get the modified z-score normalized by the MAD:
`scale * | x_i - median(X) | / MAD`. See the first page of
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
+ * Input:
+ * state - The MAD state
+ * value - The numeric value to score
+ * scale? - Optionally the scale to use when computing the modified
z-score. Default is `0.6745`, see the first page of
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
+ * Returns: The modified z-score
+
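To make the `OUTLIER_MAD_SCORE` formula above concrete, here is a minimal Python sketch of the modified z-score computed exactly over a small list of values. The helper name `mad_score` is hypothetical and this is not Metron's implementation, which works from a sketch-based state rather than raw values. Returning NaN when the MAD is zero is an assumption made for this illustration only.

```python
from statistics import median


def mad_score(history, value, scale=0.6745):
    """Modified z-score of `value` relative to the values in `history`."""
    center = median(history)
    # The MAD is the median of the absolute deviations from the median.
    mad = median([abs(x - center) for x in history])
    if mad == 0:
        # Assumption for this sketch: the score is undefined when MAD is zero.
        return float("nan")
    return scale * abs(value - center) / mad
```

For example, `mad_score([1, 2, 3, 4, 5], 3)` is zero because the value equals the median, and values further from the median score progressively higher.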
+# Outlier Analysis
+
+A common desire is to find anomalies in numerical data. To that end,
+we have some simple statistical anomaly detectors.
+
+## Median Absolute Deviation
+
+Much has been written about this robust estimator. See the first page
+of
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
+for good coverage of the strengths and weaknesses of MAD. The usage, however,
+is fairly straightforward:
+* Gather the statistical state required to compute the MAD
+ * The distribution of the values of a univariate random variable over
time.
+ * The distribution of the absolute deviations of the values from the
median.
+* Use this statistical state to score unseen values. The higher the
score, the more unlike the previously seen data the value is.
+
+There are a couple of issues which make MAD a bit hard to compute.
+First, the statistical state requires the median, which can be
+expensive to compute exactly. To get around this, we
+use the OnlineStatisticalProvider to compute a sketch rather than the
+exact median. Second, the statistical state for seasonal data should
+be limited to a fixed, trailing window. We do this by ensuring that the
+MAD state is mergeable and able to be queried from within the Profiler.
+
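The idea of a mergeable state over a trailing window can be sketched in Python as follows. This is a naive illustration that keeps every raw value and computes exact medians, which is exactly what the sketch-based approach described above avoids. The class `NaiveMadState` and its methods are hypothetical names for this example and do not correspond to Metron classes.

```python
from statistics import median


class NaiveMadState:
    """Exact, memory-hungry stand-in for a sketch-backed MAD state."""

    def __init__(self, values=()):
        self.values = list(values)

    def add(self, value):
        self.values.append(value)
        return self

    @classmethod
    def merge(cls, states):
        # Merging is just pooling the values captured by each per-period state.
        merged = cls()
        for state in states:
            merged.values.extend(state.values)
        return merged

    def score(self, value, scale=0.6745):
        center = median(self.values)
        mad = median([abs(v - center) for v in self.values])
        if mad == 0:
            # Assumption for this sketch: undefined score when MAD is zero.
            return float("nan")
        return scale * abs(value - center) / mad
```

Under these assumptions, a trailing window is modeled by keeping one state per time period and merging only the most recent ones before scoring, for instance `NaiveMadState.merge(states[-10:]).score(x)`.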
+### Example
+
+We will create a dummy data stream of Gaussian noise to illustrate how
+to use the MAD functionality along with the Profiler to tag messages as
+outliers or not.
+
+To do this, we will create the following:
+* data generator
+* parser
+* profiler profile
+* enrichment and threat triage
+
+#### Data Generator
+
+We can create a simple Python script that generates a stream of Gaussian
+noise at a fixed interval, such as one message per second. The script
+should be saved at `~/rand_gen.py`:
+```
+#!/usr/bin/python
+import random
+import sys
+import time
+
+def main():
+    # Arguments: mean, standard deviation, seconds to sleep between messages
+    mu = float(sys.argv[1])
+    sigma = float(sys.argv[2])
+    freq_s = int(sys.argv[3])
+    while True:
+        # print() with a single argument works under both Python 2 and 3
+        print(str(random.gauss(mu, sigma)))
+        sys.stdout.flush()
+        time.sleep(freq_s)
+
+if __name__ == '__main__':
+    main()
+```
+
+This script will take the following as arguments:
+* The mean of the data generated
+* The standard deviation of the data generated
+* The interval (in seconds) between generated messages
+
+#### The Parser
+
+We will create a parser that takes each number in and creates
+a message with a field called `value` using the `CSVParser`.
+
+Add the following file to
+`$METRON_HOME/config/zookeeper/parsers/mad.json`:
+```
+{
+  "parserClassName" : "org.apache.metron.parsers.csv.CSVParser"
+ ,"sensorTopic" : "mad"
+ ,"parserConfig" : {
+    "columns" : {
+      "value_str" : 0
+    }
+  }
+ ,"fieldTransformations" : [
+    {
+      "transformation" : "STELLAR"
+     ,"output" : [ "value" ]
+     ,"config" : {
+        "value" : "TO_DOUBLE(value_str)"
+      }
+    }
+  ]
+}
+```
+
+#### Enrichment and Threat Intel
+
+We will set a threat triage level of `10` if a message generates an outlier
score of more than 3.5.
+This cutoff will depend on your data and should be adjusted based on the
+assumed underlying distribution. Note that under the assumption of
+normality, MAD will act as a robust estimator of the standard deviation,
so the cutoff
+should be considered the number of standard deviations away. For other
+distributions, there are other interpretations which will make sense in
+the context of measuring the "degree different". See
+http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/
+for a brief discussion of this.
+
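The claim that MAD acts as a robust estimator of the standard deviation under normality can be checked with a short Python sketch. The helper `robust_sigma` is a hypothetical name for this illustration. It divides the MAD by the default scale of `0.6745` and compares the result against the true sigma of simulated Gaussian data. The sample size and seed are arbitrary choices.

```python
import random
from statistics import median


def robust_sigma(values, scale=0.6745):
    """Estimate the standard deviation as MAD / scale (valid under normality)."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return mad / scale


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = [rng.gauss(10.0, 2.0) for _ in range(10000)]
estimate = robust_sigma(sample)  # expected to land near the true sigma of 2.0
```

Because the estimate tracks sigma for Gaussian data, a cutoff of 3.5 on the modified z-score corresponds roughly to 3.5 standard deviations from the median. For other distributions that correspondence does not hold and the cutoff needs tuning.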
+Create the following in
+`$METRON_HOME/config/zookeeper/enrichments/mad.json`:
+
+```
+{
+ "index": "mad",
+ "batchSize": 1,
+ "enrichment": {
+ "fieldMap": {
+ "stellar" : {
+ "config" : {
+ "parser_score" : "OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(
+PROFILE_GET( 'sketchy_mad', 'global', 10, 'MINUTES') ), value)"
+ ,"is_alert" : "if parser_score > 3.5 then true else is_alert"
--- End diff --
Can the 3.5 threshold be pulled out into a hyperparameter?