Hi,
I am one of the interns working on the "Streaming Machine Learning on WSO2 CEP" project. I have built a Siddhi extension for CEP that uses the Apache SAMOA machine learning library.

“SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. Currently, it is an Apache Incubator project. SAMOA is written in Java, is open source, and is available at http://samoa-project.net under the Apache Software License version 2.0.

As a framework: it allows algorithm developers to abstract from the underlying execution engine, and therefore reuse their code on different engines. It features a pluggable architecture that allows it to run on several distributed stream processing engines (DSPEs) such as Storm, S4, and Samza. This capability is achieved by designing a minimal API that captures the essence of modern DSPEs. This API also makes it easy to write new bindings to port SAMOA to new execution engines.

As a library: SAMOA contains implementations of state-of-the-art algorithms for distributed machine learning on streams. Currently, SAMOA implements the Vertical Hoeffding Tree for classification, a distributed k-means algorithm for clustering, and Adaptive Model Rules (with two implementations) for regression, as well as programming abstractions to develop new algorithms. The library also includes meta-algorithms such as bagging and boosting (ensemble techniques) to improve predictive power.”

I created a Siddhi extension that uses SAMOA as its machine learning algorithm library. It contains classification, regression, and clustering extensions, and runs SAMOA in local mode (not the distributed version), so no cluster is needed. The extensions also expose different API calls.

[image: Streaming Machine learning SAMOA integrate to CEP (Abstract).jpg]
Main architecture

After creating the extensions, I compared the accuracy of streaming machine learning using SAMOA with the accuracy of batch processing using the Weka machine learner.
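For readers unfamiliar with how a streaming learner is usually scored, here is a minimal sketch of test-then-train (prequential) evaluation, where each instance is first used to test the model and then to train it. This is an assumption for illustration only: the mail does not say which evaluation scheme the tests used, and `MajorityClassLearner` and `prequential_accuracy` are hypothetical Python stand-ins, not SAMOA or Siddhi APIs.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy online learner (hypothetical stand-in, not a SAMOA class).

    It always predicts the label it has seen most often so far.
    """

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training instance has arrived there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        # Update the model with one labelled instance.
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Test-then-train: return the running accuracy after each instance."""
    correct = 0
    history = []
    for seen, (x, y) in enumerate(stream, start=1):
        # 1. Test on the new instance before the model has seen its label.
        if model.predict(x) == y:
            correct += 1
        # 2. Only then train on it.
        model.learn(x, y)
        history.append(correct / seen)
    return history


if __name__ == "__main__":
    # Made-up toy stream of (features, label) pairs, not taken from any data set.
    toy_stream = [((0.1,), "g"), ((0.2,), "g"), ((0.3,), "h"), ((0.4,), "g")]
    for step, acc in enumerate(prequential_accuracy(toy_stream, MajorityClassLearner()), 1):
        print(step, round(acc, 3))
```

Because every instance is scored before the model learns from it, the early part of the stream drags the running accuracy down. A batch learner evaluated after seeing the full training set does not pay that cost, which is worth keeping in mind when reading streaming and batch figures side by side.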
Classification (Vertical Hoeffding Tree)
Using the MAGIC Gamma Telescope Data Set <https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope>
Data points: 18000

            Batch process (using WSO2 ML)    Streaming
            Class 1      Class 2             Class 1      Class 2
Accuracy          82.72                            73.4
F1-Score    87.09        73.86               80.41        58.53

The accuracy of the batch process is higher than that of the SAMOA streaming process. If the stream does not drift, the accuracy of the streaming process increases over time and eventually reaches a stable state.

Regression (AMRules)
Using the Combined Cycle Power Plant Data Set (CCPP) <https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant>
Data points: 9500

                               Samoa (Adaptive Model    Weka                Weka
                               Rules Regressor)         linearRegression    M5Rules
Mean absolute error            3.68                     3.63                3.06
Root mean squared error        6.69                     4.56                3.99
Relative absolute error        24.7                     24.43               20.61
Root relative squared error    37.8                     26.7                23.4

I ran the regression test on 2 data sets and the classification test on 2 data sets. In those results I saw no large difference in error between the streaming and batch processes. Compared with classification and clustering, streaming regression and batch regression have similar error rates. Therefore, I think streaming ML is well suited to regression.

Clustering (k-means)
Using the 3D Road Network (North Jutland, Denmark) Data Set <https://archive.ics.uci.edu/ml/datasets/3D+Road+Network+%28North+Jutland,+Denmark%29>
Data points: 434874
10 iterations, k-means algorithm

            Attribute_1                  Attribute_2      Attribute_3      Attribute_4
            Samoa         Weka           Samoa   Weka     Samoa   Weka     Samoa   Weka
Center_0    100098819.2   111598410.7    9.77    10.2     57.16   57.37    21.23   19.4
Center_1    36598276.23   35877429.78    9.72    9.88     57.05   56.87    21.87   22.47
Center_2    138161280.2   116561030.9    9.57    9.35     57.09   57.15    23.15   23.17
Mean        97869870.26                  9.7318           57.0838          22.1854

In streaming clustering, the range spanned by the cluster centers is narrower than the range spanned by the batch-process cluster centers.
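The remark about cluster-center ranges can be checked directly against the center values reported above. The sketch below is only an illustration: `CENTERS`, `spread`, and `compare_spreads` are names introduced here, and "range" is assumed to mean the maximum minus the minimum center value per attribute.

```python
# Cluster centers copied from the clustering results above
# (Center_0, Center_1, Center_2 for each attribute and tool).
CENTERS = {
    "Attribute_1": {
        "samoa": [100098819.2, 36598276.23, 138161280.2],
        "weka": [111598410.7, 35877429.78, 116561030.9],
    },
    "Attribute_2": {"samoa": [9.77, 9.72, 9.57], "weka": [10.2, 9.88, 9.35]},
    "Attribute_3": {"samoa": [57.16, 57.05, 57.09], "weka": [57.37, 56.87, 57.15]},
    "Attribute_4": {"samoa": [21.23, 21.87, 23.15], "weka": [19.4, 22.47, 23.17]},
}


def spread(values):
    """Range of a list of center values: largest minus smallest."""
    return max(values) - min(values)


def compare_spreads(centers):
    """Map each attribute to (Samoa spread, Weka spread)."""
    return {
        attr: (spread(by_tool["samoa"]), spread(by_tool["weka"]))
        for attr, by_tool in centers.items()
    }


if __name__ == "__main__":
    for attr, (samoa, weka) in compare_spreads(CENTERS).items():
        verdict = "narrower" if samoa < weka else "wider"
        print(f"{attr}: Samoa {samoa:.2f} vs Weka {weka:.2f} -> Samoa is {verdict}")
```

On these numbers the Samoa centers are closer together than the Weka centers for Attribute_2, Attribute_3, and Attribute_4, while for Attribute_1 the Samoa centers are further apart.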
References
[1] - Samoa research paper http://www.jmlr.org/papers/volume16/morales15a/morales15a.pdf
[2] - Samoa docs http://samoa.incubator.apache.org/
[3] - Git repository https://github.com/Jayancv/streamingML
[4] - Statistics of tests https://docs.google.com/a/wso2.com/spreadsheets/d/1uROw0gGIu_Ht0J0YnSOHoH600ZnJG9ejp9ztMaXA09s/edit?usp=sharing

--
Regards,
Jayan Vidanapathirana
Intern Software Engineer, WSO2.
mobile +94715594516
www.linkedin.com/in/jayancv
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture