Hi,

I am one of the interns working on the "Streaming Machine Learning on WSO2
CEP" project. I have built a Siddhi extension for CEP that uses the Apache
SAMOA machine learning library.

“SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining
big data streams. It is currently an Apache Incubator project. SAMOA is
written in Java, is open source, and is available at
http://samoa-project.net under the Apache Software License version 2.0.

As a framework: it allows algorithm developers to abstract from the
underlying execution engine, and therefore reuse their code on different
engines. It features a pluggable architecture that allows it to run on
several distributed stream processing engines (DSPEs) such as Storm, S4,
and Samza. This capability is achieved by designing a minimal API that
captures the essence of modern DSPEs. This API also makes it easy to write
new bindings to port SAMOA to new execution engines.

As a library: SAMOA contains implementations of state-of-the-art algorithms
for distributed machine learning on streams. Currently, SAMOA implements
the vertical Hoeffding tree for classification, a distributed k-means
algorithm for clustering, and adaptive model rules (with two
implementations) for regression, as well as programming abstractions for
developing new algorithms. The library also includes meta-algorithms such
as bagging and boosting (ensemble techniques) to improve predictive power.”

I created a Siddhi extension that uses SAMOA as its machine learning
library. It contains classification, regression, and clustering extensions
and runs SAMOA in local mode (not the distributed version), so no cluster
is needed. Each of these extensions provides its own API calls.

[image: Streaming Machine learning SAMOA integrate to CEP (Abstract).jpg]

Main architecture



After creating the extensions, I compared the accuracy of streaming machine
learning using SAMOA with the accuracy of batch processing using the Weka
machine learner.

Classification (Vertical Hoeffding Tree) Using MAGIC Gamma Telescope Data
Set <https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope>

Data points: 18000

             Batch Process (Using WSO2 ML)    Streaming
             Class 1        Class 2           Class 1        Class 2
Accuracy             82.72                             73.4
F1-Score     87.09          73.86             80.41          58.53
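
The table has one accuracy value per process but one F1-score per class.
As a minimal sketch of why, the snippet below derives both from a confusion
matrix. The counts are hypothetical and chosen only for illustration; they
are not the counts behind the results above, and the function names are my
own.

```python
def accuracy(confusion):
    """Fraction of instances on the diagonal of confusion[actual][predicted]."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c][c] for c in confusion)
    return correct / total


def f1_per_class(confusion):
    """Per-class F1 = harmonic mean of that class's precision and recall."""
    labels = list(confusion)
    scores = {}
    for c in labels:
        tp = confusion[c][c]
        fp = sum(confusion[o][c] for o in labels if o != c)  # predicted c, was other
        fn = sum(confusion[c][o] for o in labels if o != c)  # was c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Hypothetical counts (NOT the MAGIC results): confusion[actual][predicted]
confusion = {
    "class1": {"class1": 80, "class2": 20},
    "class2": {"class1": 10, "class2": 40},
}
acc = accuracy(confusion)
f1 = f1_per_class(confusion)
print(acc, f1)
```

With an imbalanced data set the minority class can have a much lower
F1-score than the overall accuracy suggests, which is why both are worth
reporting.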

The accuracy of the batch process is higher than that of the SAMOA
streaming process. If the stream does not drift, the accuracy of the
streaming process increases over time and eventually reaches a stable
state.
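
To illustrate how streaming accuracy can rise and then level off on a
stream without drift, here is a minimal sketch of test-then-train
(prequential) evaluation. It is only an illustration under assumptions: the
toy centroid classifier stands in for the Vertical Hoeffding Tree, the data
is synthetic, and none of the names below come from SAMOA or from my
extension.

```python
import random


class OnlineCentroidClassifier:
    """Toy incremental learner: keeps a running mean (centroid) per class."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        def sq_dist(label):
            n = self.counts[label]
            return sum((v - s / n) ** 2 for v, s in zip(x, self.sums[label]))
        return min(self.counts, key=sq_dist)

    def learn(self, x, y):
        if y not in self.counts:
            self.sums[y] = [0.0] * len(x)
            self.counts[y] = 0
        self.sums[y] = [s + v for s, v in zip(self.sums[y], x)]
        self.counts[y] += 1


def synthetic_stream(n, seed=0):
    """Hypothetical stationary stream: two well-separated Gaussian classes."""
    rng = random.Random(seed)
    for _ in range(n):
        y = rng.randint(0, 1)
        centre = 0.0 if y == 0 else 5.0
        yield [rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], y


def prequential_accuracy(stream, model, window=100):
    """Predict each instance first, then train on it; record cumulative accuracy."""
    correct = 0
    history = []
    for i, (x, y) in enumerate(stream, start=1):
        if model.predict(x) == y:
            correct += 1
        model.learn(x, y)
        if i % window == 0:
            history.append(correct / i)
    return history


history = prequential_accuracy(synthetic_stream(1000), OnlineCentroidClassifier())
print(history)
```

Because every instance is scored before the model has seen it, the early
mistakes stay in the cumulative figure, so a streaming learner is at a
disadvantage against a batch learner evaluated after full training.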

Regression (AMRules) Using Combined Cycle Power Plant Data Set (CCPP)
<https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant>

Data points: 9500

                               Samoa (Adaptive Model    Weka                Weka
                               Rules Regressor)         linearRegression    M5Rules
Mean absolute error            3.68                     3.63                3.06
Root mean squared error        6.69                     4.56                3.99
Relative absolute error        24.7                     24.43               20.61
Root relative squared error    37.8                     26.7                23.4
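
For readers unfamiliar with the four error measures in the table, here is a
minimal sketch of how they are commonly defined. One assumption to note: the
relative errors below use the mean of the supplied actual values as the
baseline predictor, which may differ slightly from the baseline a given tool
uses. The sample values are hypothetical, not taken from the CCPP data set.

```python
import math


def regression_errors(actual, predicted):
    """Return MAE, RMSE, and the two relative errors (as percentages)."""
    n = len(actual)
    mean_actual = sum(actual) / n

    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))

    # Baseline: a predictor that always outputs the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)

    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100.0 * abs_err / base_abs,
        "rrse_percent": 100.0 * math.sqrt(sq_err / base_sq),
    }


# Hypothetical values for illustration only.
actual = [430.0, 450.0, 470.0, 490.0]
predicted = [435.0, 445.0, 475.0, 480.0]
errors = regression_errors(actual, predicted)
print(errors)
```

The relative measures compare the model with the trivial mean predictor, so
a value below 100 means the model does better than always predicting the
mean.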

I ran the regression test on 2 data sets and the classification test on 2
data sets. According to those results, there is no huge difference in error
between the streaming and batch processes. Compared with classification and
clustering, streaming regression and batch regression have the most similar
error rates. Therefore I think streaming ML is well suited to regression.

Clustering (k-means) Using 3D Road Network (North Jutland, Denmark) Data Set
<https://archive.ics.uci.edu/ml/datasets/3D+Road+Network+%28North+Jutland,+Denmark%29>

Data points: 434874

            Attribute_1                   Attribute_2       Attribute_3       Attribute_4
            Samoa          Weka           Samoa    Weka     Samoa    Weka     Samoa    Weka
Center_0    100098819.2    111598410.7    9.77     10.2     57.16    57.37    21.23    19.4
Center_1    36598276.23    35877429.78    9.72     9.88     57.05    56.87    21.87    22.47
Center_2    138161280.2    116561030.9    9.57     9.35     57.09    57.15    23.15    23.17
Mean        97869870.26                   9.7318            57.0838           22.1854

10 Iterations, K-Means algorithm

In streaming clustering, the range spanned by the cluster centers is
narrower than the range of the batch process cluster centers for
Attribute_2 to Attribute_4; Attribute_1 is the exception.
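
The comparison of ranges can be checked directly from the cluster centers
reported above. This is a small sketch that computes, per attribute, the
spread (maximum minus minimum) of the three centers for each tool; the
variable and function names are my own.

```python
# Cluster centers copied from the results table (Center_0, Center_1, Center_2).
centers = {
    "Attribute_1": {"samoa": [100098819.2, 36598276.23, 138161280.2],
                    "weka": [111598410.7, 35877429.78, 116561030.9]},
    "Attribute_2": {"samoa": [9.77, 9.72, 9.57],
                    "weka": [10.2, 9.88, 9.35]},
    "Attribute_3": {"samoa": [57.16, 57.05, 57.09],
                    "weka": [57.37, 56.87, 57.15]},
    "Attribute_4": {"samoa": [21.23, 21.87, 23.15],
                    "weka": [19.4, 22.47, 23.17]},
}


def spread(values):
    """Range covered by the centers along one attribute."""
    return max(values) - min(values)


# True where the streaming (Samoa) centers are packed more tightly than Weka's.
narrower = {
    attr: spread(tools["samoa"]) < spread(tools["weka"])
    for attr, tools in centers.items()
}

for attr, tools in centers.items():
    print(attr, spread(tools["samoa"]), spread(tools["weka"]), narrower[attr])
```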

References

[1] - Samoa research paper
http://www.jmlr.org/papers/volume16/morales15a/morales15a.pdf

[2] - Samoa docs  http://samoa.incubator.apache.org/

[3] - Git repository  https://github.com/Jayancv/streamingML

[4] - Statistics of tests
https://docs.google.com/a/wso2.com/spreadsheets/d/1uROw0gGIu_Ht0J0YnSOHoH600ZnJG9ejp9ztMaXA09s/edit?usp=sharing



-- 

Regards,

Jayan Vidanapathirana
Intern Software Engineer,
WSO2.
mobile +94715594516
www.linkedin.com/in/jayancv
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture