Variables in the above diagram:

   - Cc1, Cc2, Cc3 - Cluster centers


   - r1 - the ith percentile of the distances from all points of cluster 1
   to their cluster center (Cc1); this is treated as the boundary of
   cluster 1


   - d1 - distance between a particular data point and its closest cluster
   center (Cc1)
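
The check described by these variables can be sketched roughly as follows.
This is only a minimal illustration in plain Python, not the code in the
feature branch: the cluster centers are fixed by hand instead of coming from
Spark MLlib's K means, the function names are hypothetical, the percentile
uses a simple nearest-rank rule, and a point lying exactly on the boundary
(d1 equal to r1) is assumed to be normal, which the description above does
not specify.

```python
import math

# Hypothetical helper names; none of these come from WSO2 ML or Spark MLlib.

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def closest_center(point, centers):
    """Return (index, distance) of the center nearest to point."""
    distances = [euclidean(point, c) for c in centers]
    idx = min(range(len(centers)), key=distances.__getitem__)
    return idx, distances[idx]

def percentile(values, pct):
    """Nearest-rank percentile (an assumed definition), pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def cluster_boundaries(points, centers, pct):
    """Per-cluster boundary r: the pct-th percentile of member distances."""
    per_cluster = [[] for _ in centers]
    for p in points:
        idx, d = closest_center(p, centers)
        per_cluster[idx].append(d)
    # An empty cluster gets boundary 0.0 (assumption for this sketch).
    return [percentile(ds, pct) if ds else 0.0 for ds in per_cluster]

def is_anomaly(point, centers, boundaries):
    """True when d (distance to closest center) exceeds that cluster's r."""
    idx, d = closest_center(point, centers)
    return d > boundaries[idx]

# Toy data: two hand-picked centers (Cc1, Cc2) and a few training points,
# each cluster containing one deviating point.
centers = [(0.0, 0.0), (10.0, 10.0)]
training_points = [
    (0.0, 1.0), (1.0, 0.0), (0.0, -1.0), (-1.0, 0.0), (0.0, 5.0),
    (10.0, 11.0), (11.0, 10.0), (10.0, 9.0), (9.0, 10.0), (10.0, 14.0),
]
boundaries = cluster_boundaries(training_points, centers, 80)

inside = is_anomaly((0.5, 0.5), centers, boundaries)
outside = is_anomaly((3.0, 3.0), centers, boundaries)
```

With the 80th percentile, the deviating training points (0, 5) and (10, 14)
fall outside their cluster boundaries, so they no longer stretch the
boundary that new points are compared against.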


On Thu, Sep 24, 2015 at 11:25 AM, Ashen Weerathunga <[email protected]> wrote:

> Thanks for the suggestion!
>
> This diagram shows how the algorithm detects anomalous behavior. As in the
> diagram, when we do the K means clustering there will be a set of clusters
> of normal data and some deviating points that behave as anomalies. Since we
> use a percentile distance to identify the cluster boundaries, we can
> exclude those anomalous points from the clusters. When a new data point
> arrives, its closest cluster center is found, and by comparing its distance
> to that center with the cluster's boundary distance we can tell whether it
> belongs to the cluster. If it does not, the algorithm flags it as an
> anomaly.
>
> [image: Inline image 3]
> I hope this gives a clearer view of the algorithm.
>
> Thanks,
> Ashen
>
> On Wed, Sep 23, 2015 at 6:11 PM, Nirmal Fernando <[email protected]> wrote:
>
>> Thanks Ashen! A few diagrams will help readers understand the algorithm
>> better.
>>
>> On Wed, Sep 23, 2015 at 6:03 PM, Ashen Weerathunga <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am currently integrating an anomaly detection feature into WSO2 ML.
>>> Some anomaly/fraud detection features are already implemented in CEP/DAS
>>> using different approaches, but this one uses a machine learning
>>> approach, K means clustering. I have used the K means algorithm provided
>>> by Apache Spark MLlib, which is already used in WSO2 ML.
>>>
>>> This feature supports both labeled and unlabeled data. A user can build a
>>> model from existing data and use it for prediction.
>>>
>>> The main steps of this feature are as follows:
>>>
>>>    - After the preprocessing steps, the user selects the algorithm. There
>>>    are two algorithms under the Anomaly Detection category:
>>>       - K Means with Unlabeled data
>>>       - K Means with Labeled data - for users who have labeled data
>>>    - If the user selects the K Means with Labeled data option, the user
>>>    must also provide the normal label value(s) and the train data fraction.
>>>    - In the next step the user provides three parameters:
>>>       - Maximum number of iterations
>>>       - Number of normal clusters
>>>       - Percentile value
>>>    - The model is then built using those parameters.
>>>    - For the labeled data option, a model summary is provided that shows
>>>    the model accuracy measures, confusion matrix, etc.
>>>    - In the prediction step the user has two options: upload new data as a
>>>    CSV or TSV file, or enter new data values manually. The prediction
>>>    shows whether each new data point is an anomaly or not.
>>>
>>> The methodology used is as follows:
>>>
>>>    - First the dataset is clustered using the K means algorithm
>>>    according to the hyperparameters that the user provided.
>>>    - Since in real-world anomaly detection scenarios the positive
>>>    (anomaly) instances are very rare, we assume that those anomalies lie
>>>    outside the clusters.
>>>    - So we can detect them by calculating the cluster boundaries. This
>>>    is how we identify the cluster boundaries:
>>>       - First calculate the distances between all data points and their
>>>       respective cluster centers.
>>>       - Then, for each cluster, take the chosen percentile of its
>>>       distances as that cluster's boundary.
>>>    - When a new data point arrives, its closest cluster center is found
>>>    using the K means predict function.
>>>    - Then the distance between the new data point and its cluster center
>>>    is calculated. If it is less than the percentile distance value, the
>>>    point is considered normal. If it is greater than the percentile
>>>    distance value, the point is considered an anomaly, since it lies
>>>    outside the cluster.
>>>
>>> Most of the work has been completed by now. Please let me know if there are
>>> any issues or improvements to be made.
>>> https://github.com/ashensw/carbon-ml/tree/fraud_detection
>>>
>>> Thanks and Regards,
>>> Ashen
>>>
>>> --
>>> *Ashen Weerathunga*
>>> Software Engineer - Intern
>>> WSO2 Inc.: http://wso2.com
>>> lean.enterprise.middleware
>>>
>>> Email: [email protected]
>>> Mobile: +94 716042995 <94716042995>
>>> LinkedIn:
>>> *http://lk.linkedin.com/in/ashenweerathunga
>>> <http://lk.linkedin.com/in/ashenweerathunga>*
>>>
>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>
>
>



_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
