Thanks Ashen! A few diagrams would help readers understand the algorithm
better.

On Wed, Sep 23, 2015 at 6:03 PM, Ashen Weerathunga <[email protected]> wrote:

> Hi all,
>
> I am currently integrating an anomaly detection feature into WSO2 ML.
> There are some anomaly/fraud detection features already implemented in
> CEP/DAS using different approaches, but this one uses a machine learning
> approach, namely K-means clustering. I have used the K-means algorithm
> provided by Apache Spark MLlib, which is already used in WSO2 ML.
>
> This feature supports both labeled and unlabeled data. Users can build a
> model from existing data and use it for prediction.
>
> The main steps of this feature are as follows:
>
>    - After the preprocessing steps, the user selects the algorithm. There
>    will be two algorithms under the Anomaly Detection category:
>       - K-means with unlabeled data
>       - K-means with labeled data, for users who have labeled data
>       - If the user selects the K-means with labeled data option, the
>       normal label value(s) and the training data fraction must also be
>       provided.
>    - In the next step, the user enters three parameters:
>       - Maximum number of iterations
>       - Number of normal clusters
>       - Percentile value
>       - The model is then built using those parameters.
>    - A model summary is provided for the labeled data option, showing the
>    model accuracy measures, confusion matrix, etc.
>    - For prediction, the user has two options: upload new data as a CSV or
>    TSV file, or enter new data values manually. The prediction shows
>    whether each new data point is an anomaly or not.
>
> The methodology used is as follows:
>
>    - First, the dataset is clustered using the K-means algorithm
>    according to the hyperparameters the user provided.
>    - Since positive (anomaly) instances are very rare in real-world
>    anomaly detection scenarios, we assume that anomalies lie outside the
>    clusters.
>    - So we can detect them by calculating the cluster boundaries. The
>    cluster boundaries are identified as follows:
>       - First, calculate the distances between all data points and their
>       respective cluster centers.
>       - Then, for each cluster, take the distance at the given percentile
>       as that cluster's boundary.
>    - When a new data point arrives, its closest cluster center is found
>    using the K-means predict function.
>    - Then the distance between the new data point and its cluster center
>    is calculated. If it is less than the percentile distance value, the
>    point is considered normal. If it is greater than the percentile
>    distance value, the point is considered an anomaly, since it lies
>    outside the cluster.
>
> Most of the work is complete by now. Please let me know if there are
> any issues or improvements to be made.
> https://github.com/ashensw/carbon-ml/tree/fraud_detection
>
> Thanks and Regards,
> Ashen
>
> --
> *Ashen Weerathunga*
> Software Engineer - Intern
> WSO2 Inc.: http://wso2.com
> lean.enterprise.middleware
>
> Email: [email protected]
> Mobile: +94 716042995
> LinkedIn: http://lk.linkedin.com/in/ashenweerathunga
>

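To make the quoted methodology easier to follow, here is a rough sketch of the
percentile-boundary check in plain Python. It is not the carbon-ml code: it
assumes the cluster centers have already been produced by K-means, uses
Euclidean distance, and uses a nearest-rank percentile. The function names
(nearest_center, cluster_boundaries, is_anomaly) are hypothetical and chosen
only for illustration.

```python
import math


def nearest_center(point, centers):
    """Return (index, distance) of the closest center (Euclidean distance)."""
    distances = [math.dist(point, center) for center in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, for 0 < pct <= 100.

    Assumption: the thread does not say which percentile definition is used.
    """
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def cluster_boundaries(points, centers, pct):
    """For each cluster, take the pct-th percentile of the distances between
    its member points and its center as the cluster boundary."""
    per_cluster = [[] for _ in centers]
    for point in points:
        index, distance = nearest_center(point, centers)
        per_cluster[index].append(distance)
    # An empty cluster gets a boundary of 0.0 (an arbitrary choice here).
    return [percentile(sorted(d), pct) if d else 0.0 for d in per_cluster]


def is_anomaly(point, centers, boundaries):
    """A point is an anomaly if it lies beyond its cluster's boundary."""
    index, distance = nearest_center(point, centers)
    return distance > boundaries[index]


# Centers are assumed to come from an earlier K-means run.
centers = [(0.0, 0.0), (10.0, 10.0)]
training = [(0, 1), (1, 0), (0, -1), (-1, 0),
            (10, 12), (12, 10), (10, 8), (8, 10)]
boundaries = cluster_boundaries(training, centers, 95)
print(is_anomaly((0.5, 0.5), centers, boundaries))  # inside first cluster
print(is_anomaly((4.0, 4.0), centers, boundaries))  # far from both centers
```

A lower percentile value tightens every boundary, so more points get flagged
as anomalies; a higher value does the opposite.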

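For the labeled data option, the quoted mail mentions a model summary with
accuracy measures and a confusion matrix. A minimal sketch of how such a
summary could be computed is below. Which measures the feature actually
reports is not stated in the thread, so accuracy, precision, recall and F1
are assumptions, as are the function names. "Anomaly" is treated as the
positive class, and any label not listed as a normal label counts as an
anomaly.

```python
def confusion_counts(actual_labels, predicted_anomaly, normal_labels):
    """Return (tp, fp, tn, fn) with 'anomaly' as the positive class.

    actual_labels: the true label of each row.
    predicted_anomaly: True where the model flagged the row as an anomaly.
    normal_labels: the set of label values the user marked as normal.
    """
    tp = fp = tn = fn = 0
    for label, flagged in zip(actual_labels, predicted_anomaly):
        truly_anomalous = label not in normal_labels
        if flagged and truly_anomalous:
            tp += 1
        elif flagged:
            fp += 1
        elif truly_anomalous:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Derive common measures from the confusion counts (0.0 if undefined)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels and predictions, purely for illustration.
actual = ["normal", "normal", "fraud", "fraud", "normal"]
flagged = [False, True, True, False, False]
counts = confusion_counts(actual, flagged, {"normal"})
print(counts)
print(summarize(*counts))
```

Because anomalies are rare, accuracy alone can look good even when most
anomalies are missed, which is why precision and recall are included here.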

-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
