Hi all,

I am currently integrating an anomaly detection feature into WSO2 ML.
Some anomaly/fraud detection features are already implemented in CEP/DAS
using different approaches, but this one takes a machine learning
approach: K-means clustering. I have used the K-means algorithm provided
by Apache Spark MLlib, which is already used in WSO2 ML.

This feature supports both labeled and unlabeled data. The user can build
a model using existing data and use it for prediction.

The main steps of this feature are as follows:

   - After the preprocessing steps, the user has to select the algorithm.
   There will be two algorithms under the Anomaly Detection category:
      - K Means with Unlabeled data
      - K Means with Labeled data - for users who have labeled data
   - If the user selects the K Means with Labeled data option, the user
   should also input the normal label value(s) and the train data fraction.
   - In the next step the user has to input three parameters:
      - Maximum number of iterations
      - Number of normal clusters
      - Percentile value
   - The model will then be built using those parameters.
   - For the labeled data option, a model summary will be provided, showing
   the model accuracy measures, confusion matrix, etc.
   - In the prediction step the user has two options: upload new data as a
   CSV or TSV file, or enter new data values manually. The prediction shows
   whether each new data point is an anomaly or not.
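
To make the model summary for the labeled data option more concrete, here
is a minimal Python sketch of a confusion matrix and a few accuracy
measures, treating "anomaly" as the positive class. This is only an
illustration: the function names and the choice of measures (accuracy,
precision, recall) are assumptions, not the actual carbon-ml code or its
exact summary output.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes with 'anomaly' as the positive class.

    Both arguments are lists of booleans (True = anomaly).
    """
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a and p),          # anomaly, flagged
        "fn": sum(1 for a, p in pairs if a and not p),      # anomaly, missed
        "fp": sum(1 for a, p in pairs if not a and p),      # normal, flagged
        "tn": sum(1 for a, p in pairs if not a and not p),  # normal, passed
    }


def summary(cm):
    """Derive accuracy, precision and recall from the four counts."""
    total = cm["tp"] + cm["fn"] + cm["fp"] + cm["tn"]
    flagged = cm["tp"] + cm["fp"]
    anomalies = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total if total else 0.0,
        "precision": cm["tp"] / flagged if flagged else 0.0,
        "recall": cm["tp"] / anomalies if anomalies else 0.0,
    }


# Made-up labels for eight test points (True = anomaly).
actual = [True, True, False, False, False, False, True, False]
predicted = [True, False, False, False, True, False, True, False]

cm = confusion_matrix(actual, predicted)
measures = summary(cm)
```

Because anomalies are rare, accuracy alone can look good even when most
anomalies are missed, so precision and recall are worth reading alongside
it.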

The methodology used is as follows:

   - First, the dataset is clustered using the K-means algorithm according
   to the hyperparameters that the user provided.
   - Since in real-world anomaly detection the positive (anomaly) instances
   are very rare, we assume that the anomalies lie outside the clusters.
   - So we can detect them by calculating the cluster boundaries. This is
   how we identify the cluster boundaries:
      - First, calculate the distances between all data points and their
      respective cluster centers.
      - Then, for each cluster, take the value at the given percentile of
      its distances as that cluster's boundary.
   - When a new data point arrives, its closest cluster center is found
   using the K-means predict function.
   - Then the distance between the new data point and its cluster center
   is calculated. If it is less than the cluster's percentile distance
   value, the point is considered normal. If it is greater, the point is
   considered an anomaly, since it lies outside the cluster.
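
The boundary calculation and the prediction step above can be sketched in
plain Python as follows. This is a minimal illustration, not the carbon-ml
implementation (which uses Spark MLlib). It assumes the cluster centers
have already been produced by K-means, uses Euclidean distance, and uses a
nearest-rank percentile; all function names here are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_center(point, centers):
    """Index of the nearest center (the role of the K-means predict step)."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))


def percentile(sorted_values, pct):
    """Nearest-rank percentile; other definitions would also be reasonable."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def cluster_boundaries(points, centers, pct):
    """Per-cluster boundary: the pct-th percentile of member distances."""
    distances = [[] for _ in centers]
    for p in points:
        i = closest_center(p, centers)
        distances[i].append(euclidean(p, centers[i]))
    # An empty cluster gets a boundary of 0.0 (an arbitrary choice here).
    return [percentile(sorted(d), pct) if d else 0.0 for d in distances]


def is_anomaly(point, centers, boundaries):
    """True if the point lies beyond the boundary of its closest cluster.

    A point exactly on the boundary is treated as normal (an assumption;
    the description above does not cover the equal case).
    """
    i = closest_center(point, centers)
    return euclidean(point, centers[i]) > boundaries[i]


# Toy data: two assumed centers and eight training points.
centers = [(0.0, 0.0), (10.0, 10.0)]
train = [(0.0, 1.0), (1.0, 0.0), (0.0, -1.0), (-2.0, 0.0),
         (10.0, 11.0), (11.0, 10.0), (10.0, 9.0), (7.0, 10.0)]

boundaries = cluster_boundaries(train, centers, pct=75)
near = is_anomaly((0.5, 0.5), centers, boundaries)  # close to the first center
far = is_anomaly((5.0, 0.0), centers, boundaries)   # well outside both clusters
```

Note that a lower percentile value tightens the boundaries and flags more
points as anomalies, while a higher value does the opposite.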

Most of the work has been completed by now. Please let me know if there
are any issues or improvements to be made.
https://github.com/ashensw/carbon-ml/tree/fraud_detection

Thanks and Regards,
Ashen

-- 
*Ashen Weerathunga*
Software Engineer - Intern
WSO2 Inc.: http://wso2.com
lean.enterprise.middleware

Email: [email protected]
Mobile: +94 716042995
LinkedIn: http://lk.linkedin.com/in/ashenweerathunga
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
