Repository: incubator-hivemall
Updated Branches:
  refs/heads/master 0c1447f45 -> 85f8e173a
Close #40: Add documentation for SST and ChangeFinder

Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/85f8e173
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/85f8e173
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/85f8e173

Branch: refs/heads/master
Commit: 85f8e173a2a97005c00b84140f4b9150060c4a56
Parents: 0c1447f
Author: Takuya Kitazawa <[email protected]>
Authored: Wed Feb 8 18:18:38 2017 +0900
Committer: myui <[email protected]>
Committed: Wed Feb 8 18:18:38 2017 +0900

----------------------------------------------------------------------
 .../resources/hivemall/anomaly/synthetic5d.t.gz | Bin 0 -> 92896 bytes
 docs/gitbook/SUMMARY.md                         |   2 +
 docs/gitbook/anomaly/changefinder.md            | 146 ++++++++++++++++++
 docs/gitbook/anomaly/lof.md                     |   2 +-
 docs/gitbook/anomaly/sst.md                     | 154 +++++++++++++++++++
 5 files changed, 303 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/85f8e173/core/src/test/resources/hivemall/anomaly/synthetic5d.t.gz
----------------------------------------------------------------------
diff --git a/core/src/test/resources/hivemall/anomaly/synthetic5d.t.gz b/core/src/test/resources/hivemall/anomaly/synthetic5d.t.gz
new file mode 100644
index 0000000..f077a81
Binary files /dev/null and b/core/src/test/resources/hivemall/anomaly/synthetic5d.t.gz differ

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/85f8e173/docs/gitbook/SUMMARY.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 76f7924..5b080d2 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -144,6 +144,8 @@
 ## Part IX - Anomaly Detection
 
 * [Outlier Detection using Local Outlier Factor (LOF)](anomaly/lof.md)
+* [Change-Point Detection using Singular Spectrum Transformation (SST)](anomaly/sst.md)
+* [ChangeFinder: Detecting Outlier and Change-Point Simultaneously](anomaly/changefinder.md)
 
 ## Part X - Hivemall on Spark
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/85f8e173/docs/gitbook/anomaly/changefinder.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/anomaly/changefinder.md b/docs/gitbook/anomaly/changefinder.md
new file mode 100644
index 0000000..7157e5d
--- /dev/null
+++ b/docs/gitbook/anomaly/changefinder.md
@@ -0,0 +1,146 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+In the context of anomaly detection, there are two types of anomalies, ***outliers*** and ***change-points***, as discussed in [this section](sst.md#outlier-vs-change-point). Hivemall provides a technique for each: [Local Outlier Factor](lof.md) detects outliers, and [Singular Spectrum Transformation](sst.md) detects change-points.
+
+In some cases, we might want to detect outliers and change-points simultaneously in order to characterize a time series on both a local and a global scale. **ChangeFinder** is an anomaly detection technique which enables us to detect both outliers and change-points in a single framework. A key reference for the technique is:
+
+* K. Yamanishi and J. Takeuchi. [A Unifying Framework for Detecting Outliers and Change Points from Non-Stationary Time Series Data](http://dl.acm.org/citation.cfm?id=775148). KDD'02.
+
+<!-- toc -->
+
+# Outlier and Change-Point Detection using ChangeFinder
+
+Let us try ChangeFinder on Hivemall, using the Twitter time series data we prepared in [this section](sst.md#data-preparation).
+
+```
+use twitter;
+```
+
+The `changefinder()` function can be used in a very similar way to `sst()`, the UDF for [Singular Spectrum Transformation](sst.md). The following query detects outliers and change-points with separate thresholds:
+
+```sql
+SELECT
+  num,
+  changefinder(value, "-outlier_threshold 0.03 -changepoint_threshold 0.0035") AS result
+FROM
+  timeseries
+ORDER BY num ASC
+;
+```
+
+The result makes it easy to find outliers and change-points among the data points:
+
+| num | result |
+|:---:|:---|
+|...|...|
+|16 | {"outlier_score":0.051287243859365894,"changepoint_score":0.003292139657059704,"is_outlier":true,"is_changepoint":false}|
+|17 | {"outlier_score":0.03994335565212781,"changepoint_score":0.003484242549446824,"is_outlier":true,"is_changepoint":false}|
+|18 | {"outlier_score":0.9153515196592132,"changepoint_score":0.0036439645550477373,"is_outlier":true,"is_changepoint":true}|
+|19 | {"outlier_score":0.03940593403992665,"changepoint_score":0.0035825157392152134,"is_outlier":true,"is_changepoint":true}|
+|20 | {"outlier_score":0.27172093630215555,"changepoint_score":0.003542822324886785,"is_outlier":true,"is_changepoint":true}|
+|21 | {"outlier_score":0.006784031454620809,"changepoint_score":0.0035029441620275975,"is_outlier":false,"is_changepoint":true}|
+|22 | {"outlier_score":0.011838969816513334,"changepoint_score":0.003519599336202336,"is_outlier":false,"is_changepoint":true}|
+|23 | {"outlier_score":0.09609857927656007,"changepoint_score":0.003478729798944702,"is_outlier":true,"is_changepoint":false}|
+|24 | {"outlier_score":0.23927000145081978,"changepoint_score":0.0034338476757061237,"is_outlier":true,"is_changepoint":false}|
+|25 | {"outlier_score":0.04645945042821564,"changepoint_score":0.0034052091926036914,"is_outlier":true,"is_changepoint":false}|
+|...|...|
+
+# ChangeFinder for Multi-Dimensional Data
+
+ChangeFinder additionally supports multi-dimensional data. Let us try this on synthetic data.
+
+## Data preparation
+
+First, download the synthetic 5-dimensional data from [HERE](https://github.com/apache/incubator-hivemall/blob/master/core/src/test/resources/hivemall/anomaly/synthetic5d.t.gz?raw=true) and uncompress it into a `synthetic5d.t` file:
+
+```
+$ head synthetic5d.t
+0#71.45185411564131#54.456141290891466#71.78932846605129#76.73002575911214#81.71265594077099
+1#58.374230566196786#57.9798651697631#75.65793151143754#73.76101930504493#69.50315805346253
+2#66.3595943896099#52.866595973073295#76.7987325026338#78.95890786682095#74.67527753118893
+3#58.242560151043236#52.449574430621226#73.20383710416358#77.81502394558085#76.59077723631032
+4#55.89878019680371#52.69611781315756#75.02482987204824#74.11154526135637#75.86881583921179
+5#56.93554246767561#56.55687136423391#74.4056583421317#73.82419594611444#71.3017150863033
+6#65.55704393868689#52.136347983404974#71.14213602046532#72.87394198561904#73.40278960429114
+7#56.65735280596217#57.293605941063035#75.36713340281246#80.70254745535183#75.32423746923857
+8#61.22095211566127#53.47603728473668#77.48215321523912#80.7760107465893#74.43951386292905
+9#52.47574856682803#52.03250504263378#77.59550963025158#76.16623830860391#76.98394610743863
+```
+
+The first column indicates a dummy timestamp, and the following five columns are the values in each dimension.
+
+Second, the following Hive operations create a Hive table for the data:
+
+```
+create database synthetic;
+use synthetic;
+```
+
+```sql
+CREATE EXTERNAL TABLE synthetic5d (
+  num INT,
+  value1 DOUBLE,
+  value2 DOUBLE,
+  value3 DOUBLE,
+  value4 DOUBLE,
+  value5 DOUBLE
+) ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '#'
+STORED AS TEXTFILE
+LOCATION '/dataset/synthetic/synthetic5d';
+```
+
+Finally, you can load the synthetic data into the table by:
+
+```
+$ hadoop fs -put synthetic5d.t /dataset/synthetic/synthetic5d
+```
+
+## Detecting outliers and change-points of the 5-dimensional data
+
+Using `changefinder()` for multi-dimensional data requires us to pass the first argument as an array. In our case, the data is 5-dimensional, so the first argument should be an array with 5 elements. Except for that point, basic usage of the function is the same as in the previous 1-dimensional example:
+
+```sql
+SELECT
+  num,
+  changefinder(array(value1, value2, value3, value4, value5),
+               "-outlier_threshold 0.015 -changepoint_threshold 0.0045") AS result
+FROM
+  synthetic5d
+ORDER BY num ASC
+;
+```
+
+Output might be:
+
+| num | result |
+|:---:|:---|
+|...|...|
+|90 | {"outlier_score":0.014014718350674471,"changepoint_score":0.004520174906936474,"is_outlier":false,"is_changepoint":true}|
+|91 | {"outlier_score":0.013145554693405614,"changepoint_score":0.004480713237042799,"is_outlier":false,"is_changepoint":false}|
+|92 | {"outlier_score":0.011631759675989617,"changepoint_score":0.004442031415725316,"is_outlier":false,"is_changepoint":false}|
+|93 | {"outlier_score":0.012140065235943798,"changepoint_score":0.004404170732687428,"is_outlier":false,"is_changepoint":false}|
+|94 | {"outlier_score":0.012555903663657997,"changepoint_score":0.0043670553008087355,"is_outlier":false,"is_changepoint":false}|
+|95 | {"outlier_score":0.013503247137325314,"changepoint_score":0.0043306667027628466,"is_outlier":false,"is_changepoint":false}|
+|96 | {"outlier_score":0.013896893553710932,"changepoint_score":0.004294969164345527,"is_outlier":false,"is_changepoint":false}|
+|97 | {"outlier_score":0.01322874844578159,"changepoint_score":0.004259994590721001,"is_outlier":false,"is_changepoint":false}|
+|98 | {"outlier_score":0.019383618511936707,"changepoint_score":0.004225604978710543,"is_outlier":true,"is_changepoint":false}|
+|99 | {"outlier_score":0.01121758589038846,"changepoint_score":0.004191881992962213,"is_outlier":false,"is_changepoint":false}|
+|...|...|

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/85f8e173/docs/gitbook/anomaly/lof.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/anomaly/lof.md b/docs/gitbook/anomaly/lof.md
index 39a6e9f..c2d396b 100644
--- a/docs/gitbook/anomaly/lof.md
+++ b/docs/gitbook/anomaly/lof.md
@@ -17,7 +17,7 @@
   under the License.
 -->
 
-This article introduce how to find outliers using [Local Outlier Detection (LOF)](http://en.wikipedia.org/wiki/Local_outlier_factor) on Hivemall.
+This article introduces how to find outliers using [Local Outlier Detection (LOF)](http://en.wikipedia.org/wiki/Local_outlier_factor) on Hivemall.
 
 <!-- toc -->
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/85f8e173/docs/gitbook/anomaly/sst.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/anomaly/sst.md b/docs/gitbook/anomaly/sst.md
new file mode 100644
index 0000000..7268eda
--- /dev/null
+++ b/docs/gitbook/anomaly/sst.md
@@ -0,0 +1,154 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+This page explains how to find change-points using **Singular Spectrum Transformation** (SST) on Hivemall. The following papers describe the details of this technique:
+
+* T. Idé and K. Inoue. [Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations](http://epubs.siam.org/doi/abs/10.1137/1.9781611972757.63). SDM'05.
+* T. Idé and K. Tsuda. [Change-Point Detection using Krylov Subspace Learning](http://epubs.siam.org/doi/abs/10.1137/1.9781611972771.54). SDM'07.
+
+<!-- toc -->
+
+# Outlier vs Change-Point
+
+Anomaly detectors are generally categorized into outlier detectors and change-point detectors, and it is important to distinguish the two. Outliers are spiky "local" data points which are suddenly observed in a series of normal samples, and [Local Outlier Factor](lof.md) is an algorithm to detect them. Change-points, on the other hand, indicate a "global" change in the characteristics of the data points on a wider scale.
+
+On this page, we focus specifically on change-point detection. More concretely, the following sections introduce a way to detect change-points on Hivemall using a technique named Singular Spectrum Transformation (SST).
+
+# Data Preparation
+
+## Get Twitter's data
+
+We use time series data points provided by Twitter in the following article: [Introducing practical and robust anomaly detection in a time series](https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series). The dataset was originally created for R, but a CSV version of the same data is available [HERE](https://github.com/apache/incubator-hivemall/blob/master/core/src/test/resources/hivemall/anomaly/twitter.csv.gz?raw=true).
+
+Once you have uncompressed the downloaded `.gz` file, you can see a CSV file:
+
+```
+$ head twitter.csv
+182.478
+176.231
+183.917
+177.798
+165.469
+181.878
+184.502
+183.303
+177.578
+171.641
+```
+
+These values are sequential data points. Our goal is to detect change-points in the samples. Here, let us insert a dummy timestamp into each line as follows:
+
+```
+$ awk '{printf "%d#%s\n", NR, $0}' < twitter.csv > twitter.t
+```
+
+```
+$ head twitter.t
+1#182.478
+2#176.231
+3#183.917
+4#177.798
+5#165.469
+6#181.878
+7#184.502
+8#183.303
+9#177.578
+10#171.641
+```
+
+Now, Hive can understand the order of the samples just by looking at the dummy timestamp.
+
+## Importing data as a Hive table
+
+### Create a Hive table
+
+You first need to launch a Hive console and run the following operations:
+
+```
+create database twitter;
+use twitter;
+```
+
+```sql
+CREATE EXTERNAL TABLE timeseries (
+  num INT,
+  value DOUBLE
+) ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '#'
+STORED AS TEXTFILE
+LOCATION '/dataset/twitter/timeseries';
+```
+
+### Load data into the table
+
+Next, the `.t` file generated above can be loaded into the table by:
+
+```
+$ hadoop fs -put twitter.t /dataset/twitter/timeseries
+```
+
+The `timeseries` table in the `twitter` database should look like:
+
+| num | value |
+|:---:|:---:|
+|1|182.478|
+|2|176.231|
+|3|183.917|
+|4|177.798|
+|5|165.469|
+|...|...|
+
+# Change-Point Detection using SST
+
+We are now ready to detect change-points. The `sst()` UDF takes a `double` value as the first argument, and options can be set in the second argument.
+
+The following query detects change-points in the `value` column of the `timeseries` table. The option `"-threshold 0.005"` means that a data point is detected as a change-point if its score is greater than 0.005.
+
+```
+use twitter;
+```
+
+```sql
+SELECT
+  num,
+  sst(value, "-threshold 0.005") AS result
+FROM
+  timeseries
+ORDER BY num ASC
+;
+```
+
+For instance, part of the output of this query is:
+
+| num | result |
+|:---:|:---|
+|...|...|
+|7551 | {"changepoint_score":0.00453049288071683,"is_changepoint":false}|
+|7552 | {"changepoint_score":0.004711244102524104,"is_changepoint":false}|
+|7553 | {"changepoint_score":0.004814871928978115,"is_changepoint":false}|
+|7554 | {"changepoint_score":0.004968089640799422,"is_changepoint":false}|
+|7555 | {"changepoint_score":0.005709056330104878,"is_changepoint":true}|
+|7556 | {"changepoint_score":0.0044279766655132,"is_changepoint":false}|
+|7557 | {"changepoint_score":0.0034694956722586268,"is_changepoint":false}|
+|7558 | {"changepoint_score":0.002549056569322694,"is_changepoint":false}|
+|7559 | {"changepoint_score":0.0017395109108403473,"is_changepoint":false}|
+|7560 | {"changepoint_score":0.0010629833145070489,"is_changepoint":false}|
+|...|...|
+
+In this example, the 7555th sample is detected as a change-point.
\ No newline at end of file
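
As a rough illustration of the `-threshold` rule described in sst.md (a data point is flagged when its score is greater than the threshold), the shell sketch below applies the same comparison to three `changepoint_score` values copied from the output table. The helper name `flag_changepoints` and the space-separated `num score` input format are assumptions made for this example only; they are not part of Hivemall, and the sketch does not reproduce how `sst()` computes its scores.

```shell
# Hypothetical helper (not a Hivemall command): reads "num score" pairs from
# stdin and prints each num whose score is strictly greater than the threshold
# given as the first argument.
flag_changepoints() {
  awk -v threshold="$1" '$2 + 0 > threshold + 0 { print $1 }'
}

# Three rows copied from the sst() output table, checked against 0.005.
printf '%s\n' \
  '7554 0.004968089640799422' \
  '7555 0.005709056330104878' \
  '7556 0.0044279766655132' |
  flag_changepoints 0.005
```

With these three rows, only `7555` is printed, which agrees with the `is_changepoint` column in the table.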
