Variables in the above diagram:

- Cc1, Cc2, Cc3 - cluster centers
- r1 - the ith percentile of the distances from all the points of cluster 1
  to their cluster center (Cc1); this is taken as the boundary of cluster 1
- d1 - the distance between a particular data point and its closest cluster
  center (Cc1)

On Thu, Sep 24, 2015 at 11:25 AM, Ashen Weerathunga <[email protected]> wrote:

> Thanks for the suggestion!
>
> This diagram shows how the algorithm detects anomalous behavior. As the
> diagram shows, K-means clustering produces a set of clusters of normal
> data plus some deviating points that behave as anomalies. Since we use a
> percentile distance to identify the cluster boundaries, we can exclude
> those anomalous points from the clusters. When a new data point arrives,
> its closest cluster center is calculated, and by comparing distances we
> can tell whether the point belongs to that cluster. If it does not, the
> algorithm flags it as an anomaly.
>
> [image: Inline image 3]
>
> I hope this gives a clearer view of the algorithm.
>
> Thanks,
> Ashen
>
> On Wed, Sep 23, 2015 at 6:11 PM, Nirmal Fernando <[email protected]> wrote:
>
>> Thanks Ashen! A few diagrams will help readers understand the algorithm
>> better.
>>
>> On Wed, Sep 23, 2015 at 6:03 PM, Ashen Weerathunga <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am currently integrating an anomaly detection feature into WSO2 ML.
>>> Some anomaly/fraud detection features are already implemented in
>>> CEP/DAS using different approaches, but this one uses a machine
>>> learning approach: K-means clustering. I have used the K-means
>>> algorithm provided by Apache Spark MLlib, which is already used in
>>> WSO2 ML.
>>>
>>> This feature supports both labeled and unlabeled data. The user can
>>> build a model from existing data and use it for prediction.
>>>
>>> The main steps of this feature are as follows:
>>>
>>>    - After the preprocessing steps, the user has to select the algorithm.
>>>      There are two algorithms under the Anomaly Detection category:
>>>       - K Means with Unlabeled data
>>>       - K Means with Labeled data (for users who have labeled data)
>>>    - If the user selects the K Means with Labeled data option, the user
>>>      must also input the normal label value(s) and the train data
>>>      fraction.
>>>    - In the next step the user has to input three parameters:
>>>       - Maximum number of iterations
>>>       - Number of normal clusters
>>>       - Percentile value
>>>    - The model is then built using those parameters.
>>>    - For the labeled data option, a model summary is provided that
>>>      shows the model accuracy measures, confusion matrix, etc.
>>>    - In the prediction part the user has two options: upload new data
>>>      as a CSV or TSV file, or enter new data values manually. The
>>>      prediction shows whether each new data point is an anomaly or not.
>>>
>>> The methodology used is as follows:
>>>
>>>    - First the dataset is clustered using the K-means algorithm
>>>      according to the hyperparameters the user provided.
>>>    - Since in real-world anomaly detection the positive (anomaly)
>>>      instances are very rare, we assume that those anomalies lie
>>>      outside the clusters.
>>>    - So we can detect them by calculating the cluster boundaries. The
>>>      cluster boundaries are identified as follows:
>>>       - First calculate the distances between all data points and
>>>         their respective cluster centers.
>>>       - Then, for each cluster, take the chosen percentile of its
>>>         distances as that cluster's boundary.
>>>    - When a new data point arrives, its closest cluster center is
>>>      found using the K-means predict function.
>>>    - Then the distance between the new data point and its cluster
>>>      center is calculated. If it is less than the percentile distance
>>>      value, the point is considered normal. If it is greater than the
>>>      percentile distance value, the point is considered an anomaly,
>>>      since it lies outside the cluster.
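
A minimal sketch of the boundary and prediction steps quoted above, in plain Python rather than the Spark MLlib code the feature is built on. It assumes the cluster centers have already been produced by a K-means run. The function names, the nearest-rank percentile definition, and the choice to treat a distance exactly equal to the boundary as normal are assumptions made for illustration, not details taken from the carbon-ml branch.

```python
import math


def closest_center(point, centers):
    """Return (index, distance) of the center nearest to point (Euclidean)."""
    distances = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=distances.__getitem__)
    return idx, distances[idx]


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, for 0 < pct <= 100.

    Assumption: the thread does not say which percentile definition is used.
    """
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def cluster_boundaries(points, centers, pct):
    """Per-cluster boundary: pct-th percentile of member-to-center distances."""
    per_cluster = [[] for _ in centers]
    for p in points:
        idx, d = closest_center(p, centers)
        per_cluster[idx].append(d)
    # An empty cluster gets boundary 0.0 (illustrative choice only).
    return [percentile(sorted(ds), pct) if ds else 0.0 for ds in per_cluster]


def is_anomaly(point, centers, boundaries):
    """True if the point lies farther from its closest center than the boundary."""
    idx, d = closest_center(point, centers)
    return d > boundaries[idx]


# Toy data: two hypothetical centers, each with four close points and one far one.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [
    (0.0, 1.0), (1.0, 0.0), (0.0, -1.0), (-1.0, 0.0), (0.0, 3.0),
    (10.0, 11.0), (11.0, 10.0), (10.0, 9.0), (9.0, 10.0), (10.0, 14.0),
]
boundaries = cluster_boundaries(points, centers, pct=80)

for new_point in [(0.5, 0.5), (0.0, 3.0), (5.0, 5.0)]:
    label = "anomaly" if is_anomaly(new_point, centers, boundaries) else "normal"
    print(new_point, label)
```

With these toy points and an 80th percentile, both boundaries come out at 1.0, so (0.5, 0.5) is treated as normal while (0, 3) and (5, 5) fall outside their closest cluster and are flagged.
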
>>>
>>> Most of the work has been completed by now. Please let me know if there
>>> are any issues or improvements to be made.
>>> https://github.com/ashensw/carbon-ml/tree/fraud_detection
>>>
>>> Thanks and Regards,
>>> Ashen
>>>
>>> --
>>> *Ashen Weerathunga*
>>> Software Engineer - Intern
>>> WSO2 Inc.: http://wso2.com
>>> lean.enterprise.middleware
>>>
>>> Email: [email protected]
>>> Mobile: +94 716042995
>>> LinkedIn: http://lk.linkedin.com/in/ashenweerathunga
>>
>> --
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>

--
*Ashen Weerathunga*
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
