Thanks Ashen! A few diagrams would help readers understand the algorithm better.

On Wed, Sep 23, 2015 at 6:03 PM, Ashen Weerathunga <[email protected]> wrote:
> Hi all,
>
> I am currently integrating an anomaly detection feature into WSO2 ML.
> There are some anomaly/fraud detection features already implemented in
> CEP/DAS using different approaches, but this one uses a machine
> learning approach, namely K-means clustering. Basically, I have used
> the K-means algorithm provided by Apache Spark MLlib, which is already
> used in WSO2 ML.
>
> This feature supports both labeled and unlabeled data. The user can
> build a model from existing data and use it for prediction.
>
> The main steps of this feature are as follows:
>
>    - After the preprocessing steps, the user selects the algorithm.
>    There will be two algorithms under the Anomaly Detection category:
>       - K Means with Unlabeled data
>       - K Means with Labeled data - users who have labeled data can
>       choose this option
>    - If the user selects the K Means with Labeled data option, they
>    must also input the normal label value(s) and the train data
>    fraction.
>    - In the next step the user inputs three parameters:
>       - Maximum number of iterations
>       - Number of normal clusters
>       - Percentile value
>    - Then the model is built using those parameters.
>    - For the labeled data option, a model summary is provided which
>    shows the model accuracy measures, confusion matrix, etc.
>    - In the prediction part the user has two options: upload new data
>    as a CSV or TSV file, or enter new data values manually. The
>    prediction shows whether each new data point is an anomaly or not.
>
> The methodology used is as follows:
>
>    - First the dataset is clustered using the K-means algorithm
>    according to the hyperparameters that the user provided.
>    - Since positive (anomaly) instances are very rare in real-world
>    anomaly detection scenarios, we assume that those anomalies lie
>    outside the clusters.
>    - So we can detect them by calculating the cluster boundaries. The
>    cluster boundaries are identified as follows:
>       - First calculate the distances between all data points and
>       their respective cluster centers.
>       - Then, for each cluster, take the given percentile of its
>       distances as that cluster's boundary.
>    - When a new data point arrives, its closest cluster center is
>    found using the K-means predict function.
>    - Then the distance between the new data point and its cluster
>    center is calculated. If it is less than the percentile distance
>    value, the point is considered normal. If it is greater than the
>    percentile distance value, the point is considered an anomaly,
>    since it lies outside the cluster.
>
> Most of the work has been completed by now. Please let me know if
> there are any issues or improvements to be made.
> https://github.com/ashensw/carbon-ml/tree/fraud_detection
>
> Thanks and Regards,
> Ashen
>
> --
> *Ashen Weerathunga*
> Software Engineer - Intern
> WSO2 Inc.: http://wso2.com
> lean.enterprise.middleware
>
> Email: [email protected]
> Mobile: +94 716042995
> LinkedIn: http://lk.linkedin.com/in/ashenweerathunga

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
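
The methodology quoted above could be sketched roughly as follows. This is a minimal, self-contained Python illustration and not the carbon-ml code: the cluster centers are hard-coded instead of being fitted by Spark MLlib's K-means, the helper names (closest_center, cluster_boundaries, is_anomaly) are hypothetical, and both the nearest-rank percentile definition and the choice to treat a distance exactly on the boundary as normal are assumptions, since the original post does not specify them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_center(point, centers):
    """Index of the nearest center (the role played by K-means predict)."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))


def percentile(sorted_values, pct):
    """Nearest-rank percentile (assumed definition, not from the post)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100.0))
    return sorted_values[rank - 1]


def cluster_boundaries(points, centers, pct):
    """Per-cluster boundary: the pct-th percentile of member distances."""
    distances = [[] for _ in centers]
    for p in points:
        i = closest_center(p, centers)
        distances[i].append(euclidean(p, centers[i]))
    # An empty cluster gets boundary 0.0 (assumption for this sketch).
    return [percentile(sorted(d), pct) if d else 0.0 for d in distances]


def is_anomaly(point, centers, boundaries):
    """True if the point lies beyond the boundary of its closest cluster."""
    i = closest_center(point, centers)
    return euclidean(point, centers[i]) > boundaries[i]


# Illustrative data only; the centers stand in for a fitted K-means model.
centers = [(0.0, 0.0), (10.0, 10.0)]
train = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -2),
         (10, 10), (11, 10), (10, 11), (9, 10), (10, 13)]
boundaries = cluster_boundaries(train, centers, 80)

print(boundaries)
print(is_anomaly((0.5, 0.5), centers, boundaries))
print(is_anomaly((5.0, 5.0), centers, boundaries))
```

With an 80th percentile, the farthest training point of each cluster ends up outside its own boundary, which is the intended effect of the percentile parameter: a lower value gives tighter boundaries and flags more points as anomalies.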

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
