Hi Jayan,

Is there a way to output the accuracy of a specific model within a Siddhi execution plan?
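For context, accuracy on a stream is commonly reported as a running test-then-train (prequential) figure: each event is predicted first, the prediction is scored against the true label, and only then is the event used for training. The sketch below shows that bookkeeping in plain Java. It assumes integer class labels, and the class name `PrequentialAccuracy` and its methods are hypothetical; it does not use any Siddhi or SAMOA API, and whether the extension can expose such a value is the open question above.

```java
/** Hypothetical running-accuracy counter for illustration; not a Siddhi or SAMOA class. */
final class PrequentialAccuracy {
    private long seen;     // number of scored events
    private long correct;  // number of events whose prediction matched the true label

    /** Score one event. The prediction must have been made before the label was used for training. */
    void record(int predictedClass, int actualClass) {
        seen++;
        if (predictedClass == actualClass) {
            correct++;
        }
    }

    /** Number of events scored so far. */
    long eventsSeen() {
        return seen;
    }

    /** Accuracy so far in percent, or 0.0 before any event has been scored. */
    double accuracyPercent() {
        return seen == 0 ? 0.0 : 100.0 * correct / seen;
    }

    public static void main(String[] args) {
        // Toy labels, not taken from any data set in this thread.
        int[] predicted = {1, 0, 1, 0};
        int[] actual = {1, 1, 1, 0};
        PrequentialAccuracy acc = new PrequentialAccuracy();
        for (int i = 0; i < actual.length; i++) {
            acc.record(predicted[i], actual[i]);
            System.out.println(acc.eventsSeen() + " events, accuracy " + acc.accuracyPercent() + "%");
        }
    }
}
```

A value like `accuracyPercent()` could in principle be emitted as one more attribute on the output stream, but that depends on how the extension is written.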
On Wed, Nov 30, 2016 at 4:38 PM, Jayan Vidanapathirana <[email protected]> wrote:

> Hi,
>
> I am one of the interns working on the "Streaming Machine Learning on WSO2
> CEP" project. I have built a Siddhi extension for CEP using the Apache SAMOA
> machine learning library.
>
> "SAMOA (Scalable Advanced Massive Online Analysis) is a platform for
> mining big data streams. Currently, it is an Apache incubator project.
> SAMOA is written in Java, is open source, and is available at
> http://samoa-project.net under the Apache Software License version 2.0.
>
> As a framework: it allows algorithm developers to abstract from the
> underlying execution engine, and therefore reuse their code on different
> engines. It features a pluggable architecture that allows it to run on
> several distributed stream processing engines such as Storm, S4, and Samza.
> This capability is achieved by designing a minimal API that captures the
> essence of modern DSPEs. This API also makes it easy to write new bindings
> to port SAMOA to new execution engines.
>
> As a library: SAMOA contains implementations of state-of-the-art
> algorithms for distributed machine learning on streams. Currently, SAMOA
> implements the Vertical Hoeffding Tree for classification, a distributed
> k-means algorithm for clustering, and adaptive model rules (two
> implementations) for regression, as well as programming abstractions to
> develop new algorithms. The library also includes meta-algorithms such as
> bagging and boosting (ensemble techniques) to improve predictive power."
>
> I created a Siddhi extension using SAMOA as the machine learning algorithm
> library. It contains classification, regression and clustering extensions
> and runs SAMOA in local mode (not the distributed version), without a
> cluster. These extensions also provide different API calls.
>
> [image: Streaming Machine learning SAMOA integrate to CEP (Abstract).jpg]
>
> Main architecture
>
> After creating the extensions, I tested streaming machine learning accuracy
> using SAMOA and batch processing accuracy using the Weka machine learner.
>
> Classification (Vertical Hoeffding Tree) using the MAGIC Gamma Telescope
> Data Set <https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope>
>
> 18000 data points
>
>          | Batch process (using WSO2 ML) | Streaming
>          | Class 1       | Class 2       | Class 1 | Class 2
> ---------+---------------+---------------+---------+--------
> Accuracy | 82.72                         | 73.4
> F1-Score | 87.09         | 73.86         | 80.41   | 58.53
>
> The accuracy of the batch process is higher than that of the SAMOA streaming
> process. If the stream does not drift, the accuracy of the streaming process
> increases over time and reaches a stable state.
>
> Regression (AMRules) using the Combined Cycle Power Plant Data Set (CCPP)
> <https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant>
>
> Data points: 9500
>
>                             | Samoa                          | Weka             | Weka
>                             | Adaptive Model Rules Regressor | linearRegression | M5Rules
> ----------------------------+--------------------------------+------------------+--------
> Mean absolute error         | 3.68                           | 3.63             | 3.06
> Root mean squared error     | 6.69                           | 4.56             | 3.99
> Relative absolute error     | 24.7                           | 24.43            | 20.61
> Root relative squared error | 37.8                           | 26.7             | 23.4
>
> I ran the regression test on 2 data sets and the classification test on 2
> data sets. According to those results, there is no large difference in
> error between the streaming and batch processes. Compared with
> classification and clustering, streaming regression and batch regression
> have similar error rates. Therefore I think streaming ML is really suitable
> for regression.
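Inline note on the four regression metrics quoted above: the sketch below shows how they are conventionally defined, as a way to sanity-check numbers coming from either tool. It is plain Java with a hypothetical class name (`RegressionErrors`) and does not call SAMOA or Weka. One assumption to flag: the relative errors here use the mean of the actual values as the baseline predictor, and Weka's own baseline may be computed differently (for example from the training data), so small differences against the table are expected.

```java
/** Hypothetical helper for illustration; not part of SAMOA, Weka, or Siddhi. */
final class RegressionErrors {

    private static void check(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    private static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Mean absolute error: average of |actual - predicted|. */
    static double mae(double[] actual, double[] predicted) {
        check(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Root mean squared error: square root of the average squared error. */
    static double rmse(double[] actual, double[] predicted) {
        check(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double e = actual[i] - predicted[i];
            sum += e * e;
        }
        return Math.sqrt(sum / actual.length);
    }

    /**
     * Relative absolute error in percent: total absolute error divided by the
     * total absolute error of always predicting the mean of the actual values.
     * Assumption: baseline is the mean of `actual`; undefined if all actual values are equal.
     */
    static double rae(double[] actual, double[] predicted) {
        check(actual, predicted);
        double m = mean(actual);
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < actual.length; i++) {
            num += Math.abs(actual[i] - predicted[i]);
            den += Math.abs(actual[i] - m);
        }
        return 100.0 * num / den;
    }

    /** Root relative squared error in percent, with the same mean-of-actual baseline as rae. */
    static double rrse(double[] actual, double[] predicted) {
        check(actual, predicted);
        double m = mean(actual);
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double e = actual[i] - predicted[i];
            double d = actual[i] - m;
            num += e * e;
            den += d * d;
        }
        return 100.0 * Math.sqrt(num / den);
    }

    public static void main(String[] args) {
        // Toy values, not taken from the CCPP data set.
        double[] actual = {1, 2, 3, 4};
        double[] predicted = {1, 2, 3, 8};
        System.out.printf("MAE=%.2f RMSE=%.2f RAE=%.2f%% RRSE=%.2f%%%n",
                mae(actual, predicted), rmse(actual, predicted),
                rae(actual, predicted), rrse(actual, predicted));
    }
}
```

Because RMSE squares each error before averaging, a few large misses raise it much more than they raise MAE, which is one possible reading of why the SAMOA RMSE in the table is further from the Weka values than its MAE is.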
>
> Clustering (k-means) using the 3D Road Network (North Jutland, Denmark)
> Data Set
> <https://archive.ics.uci.edu/ml/datasets/3D+Road+Network+%28North+Jutland,+Denmark%29>
>
> Data points: 434874
>
>          | Attribute_1               | Attribute_2   | Attribute_3   | Attribute_4
>          | Samoa       | Weka        | Samoa | Weka  | Samoa | Weka  | Samoa | Weka
> ---------+-------------+-------------+-------+-------+-------+-------+-------+------
> Center_0 | 100098819.2 | 111598410.7 | 9.77  | 10.2  | 57.16 | 57.37 | 21.23 | 19.4
> Center_1 | 36598276.23 | 35877429.78 | 9.72  | 9.88  | 57.05 | 56.87 | 21.87 | 22.47
> Center_2 | 138161280.2 | 116561030.9 | 9.57  | 9.35  | 57.09 | 57.15 | 23.15 | 23.17
> Mean     | 97869870.26               | 9.7318        | 57.0838       | 22.1854
>
> 10 iterations, k-means algorithm
>
> In streaming clustering, the range of the cluster centers is narrower than
> the range of the batch process cluster centers.
>
> References
>
> [1] - Samoa research paper
> http://www.jmlr.org/papers/volume16/morales15a/morales15a.pdf
>
> [2] - Samoa docs http://samoa.incubator.apache.org/
>
> [3] - Git repository https://github.com/Jayancv/streamingML
>
> [4] - Statistics of tests
> https://docs.google.com/a/wso2.com/spreadsheets/d/1uROw0gGIu_Ht0J0YnSOHoH600ZnJG9ejp9ztMaXA09s/edit?usp=sharing
>
> --
>
> Regards,
>
> Jayan Vidanapathirana
> Intern Software Engineer,
> WSO2.
> mobile +94715594516
> www.linkedin.com/in/jayancv
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

--
Thanks & Regards,

Fazlan Nazeem
*Software Engineer*
*WSO2 Inc*
Mobile : +94772338839
[email protected]
