Repository: incubator-hivemall
Updated Branches:
  refs/heads/master 0c1447f45 -> 85f8e173a


Close #40: Add documentation for SST and ChangeFinder


Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/85f8e173
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/85f8e173
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/85f8e173

Branch: refs/heads/master
Commit: 85f8e173a2a97005c00b84140f4b9150060c4a56
Parents: 0c1447f
Author: Takuya Kitazawa <[email protected]>
Authored: Wed Feb 8 18:18:38 2017 +0900
Committer: myui <[email protected]>
Committed: Wed Feb 8 18:18:38 2017 +0900

----------------------------------------------------------------------
 .../resources/hivemall/anomaly/synthetic5d.t.gz | Bin 0 -> 92896 bytes
 docs/gitbook/SUMMARY.md                         |   2 +
 docs/gitbook/anomaly/changefinder.md            | 146 ++++++++++++++++++
 docs/gitbook/anomaly/lof.md                     |   2 +-
 docs/gitbook/anomaly/sst.md                     | 154 +++++++++++++++++++
 5 files changed, 303 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/85f8e173/core/src/test/resources/hivemall/anomaly/synthetic5d.t.gz
----------------------------------------------------------------------
diff --git a/core/src/test/resources/hivemall/anomaly/synthetic5d.t.gz b/core/src/test/resources/hivemall/anomaly/synthetic5d.t.gz
new file mode 100644
index 0000000..f077a81
Binary files /dev/null and b/core/src/test/resources/hivemall/anomaly/synthetic5d.t.gz differ

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/85f8e173/docs/gitbook/SUMMARY.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 76f7924..5b080d2 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -144,6 +144,8 @@
 ## Part IX - Anomaly Detection
 
 * [Outlier Detection using Local Outlier Factor (LOF)](anomaly/lof.md)
+* [Change-Point Detection using Singular Spectrum Transformation (SST)](anomaly/sst.md)
+* [ChangeFinder: Detecting Outlier and Change-Point Simultaneously](anomaly/changefinder.md)
 
 ## Part X - Hivemall on Spark
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/85f8e173/docs/gitbook/anomaly/changefinder.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/anomaly/changefinder.md b/docs/gitbook/anomaly/changefinder.md
new file mode 100644
index 0000000..7157e5d
--- /dev/null
+++ b/docs/gitbook/anomaly/changefinder.md
@@ -0,0 +1,146 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+In the context of anomaly detection, there are two types of anomalies, ***outliers*** and ***change-points***, as discussed in [this section](sst.md#outlier-vs-change-point). Hivemall has two functions that detect outliers and change-points respectively: the former is [Local Outlier Factor](lof.md), and the latter is [Singular Spectrum Transformation](sst.md).
+
+In some cases, we might want to detect outliers and change-points simultaneously in order to figure out the characteristics of a time series on both a local and a global scale. **ChangeFinder** is an anomaly detection technique that enables us to detect both outliers and change-points in a single framework. A key reference for the technique is:
+
+* K. Yamanishi and J. Takeuchi. [A Unifying Framework for Detecting Outliers and Change Points from Non-Stationary Time Series Data](http://dl.acm.org/citation.cfm?id=775148). KDD'02.
+
+<!-- toc -->
+
+# Outlier and Change-Point Detection using ChangeFinder
+
+Using the Twitter time series data we prepared in [this section](sst.md#data-preparation), let us try ChangeFinder on Hivemall.
+
+```
+use twitter;
+```
+
+The function `changefinder()` can be used in much the same way as `sst()`, the UDF for [Singular Spectrum Transformation](sst.md). The following query detects outliers and change-points with different thresholds:
+
+```sql
+SELECT
+  num,
+  changefinder(value, "-outlier_threshold 0.03 -changepoint_threshold 0.0035") AS result
+FROM
+  timeseries
+ORDER BY num ASC
+;
+```
+
+The result makes it easy to find outliers and change-points among the data points:
+
+| num | result |
+|:---:|:---|
+|...|...|
+|16|{"outlier_score":0.051287243859365894,"changepoint_score":0.003292139657059704,"is_outlier":true,"is_changepoint":false}|
+|17|{"outlier_score":0.03994335565212781,"changepoint_score":0.003484242549446824,"is_outlier":true,"is_changepoint":false}|
+|18|{"outlier_score":0.9153515196592132,"changepoint_score":0.0036439645550477373,"is_outlier":true,"is_changepoint":true}|
+|19|{"outlier_score":0.03940593403992665,"changepoint_score":0.0035825157392152134,"is_outlier":true,"is_changepoint":true}|
+|20|{"outlier_score":0.27172093630215555,"changepoint_score":0.003542822324886785,"is_outlier":true,"is_changepoint":true}|
+|21|{"outlier_score":0.006784031454620809,"changepoint_score":0.0035029441620275975,"is_outlier":false,"is_changepoint":true}|
+|22|{"outlier_score":0.011838969816513334,"changepoint_score":0.003519599336202336,"is_outlier":false,"is_changepoint":true}|
+|23|{"outlier_score":0.09609857927656007,"changepoint_score":0.003478729798944702,"is_outlier":true,"is_changepoint":false}|
+|24|{"outlier_score":0.23927000145081978,"changepoint_score":0.0034338476757061237,"is_outlier":true,"is_changepoint":false}|
+|25|{"outlier_score":0.04645945042821564,"changepoint_score":0.0034052091926036914,"is_outlier":true,"is_changepoint":false}|
+|...|...|
+
+# ChangeFinder for Multi-Dimensional Data
+
+ChangeFinder additionally supports multi-dimensional data. Let us try this on synthetic data.
+
+## Data preparation
+
+You first need to get the synthetic 5-dimensional data from [HERE](https://github.com/apache/incubator-hivemall/blob/master/core/src/test/resources/hivemall/anomaly/synthetic5d.t.gz?raw=true) and uncompress it to a `synthetic5d.t` file:
+
+```
+$ head synthetic5d.t
+0#71.45185411564131#54.456141290891466#71.78932846605129#76.73002575911214#81.71265594077099
+1#58.374230566196786#57.9798651697631#75.65793151143754#73.76101930504493#69.50315805346253
+2#66.3595943896099#52.866595973073295#76.7987325026338#78.95890786682095#74.67527753118893
+3#58.242560151043236#52.449574430621226#73.20383710416358#77.81502394558085#76.59077723631032
+4#55.89878019680371#52.69611781315756#75.02482987204824#74.11154526135637#75.86881583921179
+5#56.93554246767561#56.55687136423391#74.4056583421317#73.82419594611444#71.3017150863033
+6#65.55704393868689#52.136347983404974#71.14213602046532#72.87394198561904#73.40278960429114
+7#56.65735280596217#57.293605941063035#75.36713340281246#80.70254745535183#75.32423746923857
+8#61.22095211566127#53.47603728473668#77.48215321523912#80.7760107465893#74.43951386292905
+9#52.47574856682803#52.03250504263378#77.59550963025158#76.16623830860391#76.98394610743863
+```
+
+The first column is a dummy timestamp, and the following five columns are the values in each dimension.
+
+Second, the following Hive operations create a Hive table for the data:
+
+```
+create database synthetic;
+use synthetic;
+```
+
+```sql
+CREATE EXTERNAL TABLE synthetic5d (
+  num INT,
+  value1 DOUBLE,
+  value2 DOUBLE,
+  value3 DOUBLE,
+  value4 DOUBLE,
+  value5 DOUBLE
+) ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '#'
+STORED AS TEXTFILE
+LOCATION '/dataset/synthetic/synthetic5d';
+```
+
+Finally, you can load the synthetic data into the table by:
+
+```
+$ hadoop fs -put synthetic5d.t /dataset/synthetic/synthetic5d
+```
+
+## Detecting outliers and change-points of the 5-dimensional data
+
+Using `changefinder()` for multi-dimensional data requires us to pass the first argument as an array. In our case, the data is 5-dimensional, so the first argument should be an array with 5 elements. Apart from that, the basic usage of the function is the same as in the previous 1-dimensional example:
+
+```sql
+SELECT
+  num,
+  changefinder(array(value1, value2, value3, value4, value5),
+               "-outlier_threshold 0.015 -changepoint_threshold 0.0045") AS result
+FROM
+  synthetic5d
+ORDER BY num ASC
+;
+```
+
+Output might be:
+
+| num | result |
+|:---:|:---|
+|...|...|
+|90|{"outlier_score":0.014014718350674471,"changepoint_score":0.004520174906936474,"is_outlier":false,"is_changepoint":true}|
+|91|{"outlier_score":0.013145554693405614,"changepoint_score":0.004480713237042799,"is_outlier":false,"is_changepoint":false}|
+|92|{"outlier_score":0.011631759675989617,"changepoint_score":0.004442031415725316,"is_outlier":false,"is_changepoint":false}|
+|93|{"outlier_score":0.012140065235943798,"changepoint_score":0.004404170732687428,"is_outlier":false,"is_changepoint":false}|
+|94|{"outlier_score":0.012555903663657997,"changepoint_score":0.0043670553008087355,"is_outlier":false,"is_changepoint":false}|
+|95|{"outlier_score":0.013503247137325314,"changepoint_score":0.0043306667027628466,"is_outlier":false,"is_changepoint":false}|
+|96|{"outlier_score":0.013896893553710932,"changepoint_score":0.004294969164345527,"is_outlier":false,"is_changepoint":false}|
+|97|{"outlier_score":0.01322874844578159,"changepoint_score":0.004259994590721001,"is_outlier":false,"is_changepoint":false}|
+|98|{"outlier_score":0.019383618511936707,"changepoint_score":0.004225604978710543,"is_outlier":true,"is_changepoint":false}|
+|99|{"outlier_score":0.01121758589038846,"changepoint_score":0.004191881992962213,"is_outlier":false,"is_changepoint":false}|
+|...|...|

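Note on the data-preparation step in changefinder.md above: the table definition expects one timestamp plus five values per line, so before uploading `synthetic5d.t` it may be worth confirming that every line has exactly six `#`-separated fields. The following is only a minimal sketch, assuming a POSIX shell with `awk`; the helper name `check_fields` and the file `sample5d.t` are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper (not part of Hivemall): succeed only if every line of
# the file has exactly the expected number of '#'-separated fields.
check_fields() {
  # $1 = file to check, $2 = expected number of fields per line
  awk -F'#' -v n="$2" '
    NF != n { printf "line %d has %d fields\n", NR, NF; bad = 1 }
    END     { exit (bad ? 1 : 0) }
  ' "$1"
}

# Two rows in the same layout as synthetic5d.t (values shortened for brevity).
printf '0#71.4#54.4#71.7#76.7#81.7\n1#58.3#57.9#75.6#73.7#69.5\n' > sample5d.t
check_fields sample5d.t 6 && echo "ok"
```

Running `check_fields synthetic5d.t 6` on the real file would follow the same pattern; a non-zero exit status indicates at least one malformed line.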
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/85f8e173/docs/gitbook/anomaly/lof.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/anomaly/lof.md b/docs/gitbook/anomaly/lof.md
index 39a6e9f..c2d396b 100644
--- a/docs/gitbook/anomaly/lof.md
+++ b/docs/gitbook/anomaly/lof.md
@@ -17,7 +17,7 @@
   under the License.
 -->
         
-This article introduce how to find outliers using [Local Outlier Detection (LOF)](http://en.wikipedia.org/wiki/Local_outlier_factor) on Hivemall.
+This article introduces how to find outliers using [Local Outlier Detection (LOF)](http://en.wikipedia.org/wiki/Local_outlier_factor) on Hivemall.
 
 <!-- toc -->
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/85f8e173/docs/gitbook/anomaly/sst.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/anomaly/sst.md b/docs/gitbook/anomaly/sst.md
new file mode 100644
index 0000000..7268eda
--- /dev/null
+++ b/docs/gitbook/anomaly/sst.md
@@ -0,0 +1,154 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+        
+This page introduces how to find change-points using **Singular Spectrum Transformation** (SST) on Hivemall. The following papers describe the details of this technique:
+
+* T. Idé and K. Inoue. [Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations](http://epubs.siam.org/doi/abs/10.1137/1.9781611972757.63). SDM'05.
+* T. Idé and K. Tsuda. [Change-Point Detection using Krylov Subspace Learning](http://epubs.siam.org/doi/abs/10.1137/1.9781611972771.54). SDM'07.
+
+<!-- toc -->
+
+# Outlier vs Change-Point
+
+Anomaly detectors are generally categorized into outlier detectors and change-point detectors. Outliers are spiky "local" data points that are suddenly observed in a series of normal samples, and [Local Outlier Factor](lof.md) is an algorithm to detect outliers. On the other hand, change-points indicate a "global" change on a wider scale in terms of the characteristics of data points.
+
+On this page, we focus specifically on change-point detection. More concretely, the following sections introduce a way to detect change-points on Hivemall by using a specific technique named Singular Spectrum Transformation (SST).
+
+# Data Preparation
+
+## Get Twitter's data
+
+We use time series data points provided by Twitter in the following article: [Introducing practical and robust anomaly detection in a time series](https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series). The dataset was originally created for R, but we can get a CSV version of the same data from [HERE](https://github.com/apache/incubator-hivemall/blob/master/core/src/test/resources/hivemall/anomaly/twitter.csv.gz?raw=true).
+
+Once you have uncompressed the downloaded `.gz` file, you can see a CSV file:
+
+```
+$ head twitter.csv
+182.478
+176.231
+183.917
+177.798
+165.469
+181.878
+184.502
+183.303
+177.578
+171.641
+```
+
+These values are sequential data points. Our goal is to detect change-points in the samples. Here, let us insert a dummy timestamp into each line as follows:
+
+```
+$ awk '{printf "%d#%s\n", NR, $0}' < twitter.csv > twitter.t
+```
+
+```
+$ head twitter.t
+1#182.478
+2#176.231
+3#183.917
+4#177.798
+5#165.469
+6#181.878
+7#184.502
+8#183.303
+9#177.578
+10#171.641
+```
+
+Now, Hive can understand the order of the samples just by looking at the dummy timestamp.
+
+## Importing data as a Hive table
+
+### Create a Hive table
+
+You first need to launch a Hive console and run the following operations:
+
+```
+create database twitter;
+use twitter;
+```
+
+```sql
+CREATE EXTERNAL TABLE timeseries (
+  num INT,
+  value DOUBLE
+) ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '#'
+STORED AS TEXTFILE
+LOCATION '/dataset/twitter/timeseries';
+```
+
+### Load data into the table
+
+Next, the `.t` file we generated earlier can be loaded into the table by:
+
+```
+$ hadoop fs -put twitter.t /dataset/twitter/timeseries
+```
+
+The `timeseries` table in the `twitter` database should look like:
+
+| num | value |
+|:---:|:---:|
+|1|182.478|
+|2|176.231|
+|3|183.917|
+|4|177.798|
+|5|165.469|
+|...|...|
+
+# Change-Point Detection using SST
+
+We are now ready to detect change-points. The UDF `sst()` takes a `double` value as the first argument, and you can set options in the second argument.
+
+The following query detects change-points in the `value` column of the `timeseries` table. The option `"-threshold 0.005"` means that a data point is detected as a change-point if its score is greater than 0.005.
+
+```
+use twitter;
+```
+
+```sql
+SELECT
+  num,
+  sst(value, "-threshold 0.005") AS result
+FROM
+  timeseries
+ORDER BY num ASC
+;
+```
+
+For instance, part of the output of this query is:
+
+| num | result |
+|:---:|:---|
+|...|...|
+|7551  |  {"changepoint_score":0.00453049288071683,"is_changepoint":false}|
+|7552 |   {"changepoint_score":0.004711244102524104,"is_changepoint":false}|
+|7553  |  {"changepoint_score":0.004814871928978115,"is_changepoint":false}|
+|7554 |   {"changepoint_score":0.004968089640799422,"is_changepoint":false}|
+|7555 |   {"changepoint_score":0.005709056330104878,"is_changepoint":true}|
+|7556   | {"changepoint_score":0.0044279766655132,"is_changepoint":false}|
+|7557  |  {"changepoint_score":0.0034694956722586268,"is_changepoint":false}|
+|7558  |  {"changepoint_score":0.002549056569322694,"is_changepoint":false}|
+|7559  |  {"changepoint_score":0.0017395109108403473,"is_changepoint":false}|
+|7560  |  {"changepoint_score":0.0010629833145070489,"is_changepoint":false}|
+|...|...|
+
+As shown, the 7555th sample is detected as a change-point in this example.
\ No newline at end of file

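Note on the `-threshold` option described in sst.md above: the rule that a point is flagged when its score is greater than the threshold can be reproduced outside Hive for a quick sanity check. This is only a sketch, assuming a POSIX shell with `awk` and input given as whitespace-separated `num score` pairs; the helper name `flag_changepoints` is hypothetical, and the three scores are copied from the sample output in the diff.

```shell
# Hypothetical helper (not part of Hivemall): print the num of every row
# whose score is strictly greater than the given threshold.
flag_changepoints() {
  # $1 = threshold; reads "num score" pairs from standard input
  awk -v t="$1" '$2 + 0 > t + 0 { print $1 }'
}

# Scores for samples 7554 to 7556, taken from the sst.md output table.
printf '7554 0.004968089640799422\n7555 0.005709056330104878\n7556 0.0044279766655132\n' \
  | flag_changepoints 0.005
# prints 7555
```

Only sample 7555 exceeds 0.005, which matches the `is_changepoint` flags in the table.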