Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

Ashen Weerathunga Tue, 25 Aug 2015 02:50:25 -0700

Hi all,

This is the source code of the project.
https://github.com/ashensw/Spark-KMeans-fraud-detection


Best Regards,
Ashen

On Tue, Aug 25, 2015 at 2:00 PM, Ashen Weerathunga <as...@wso2.com> wrote:

> Thanks all for the suggestions.
>
> These are the assumptions I have made:
>
>    - Clusters are uniform
>    - Fraud data will always be outliers to the normal clusters
>    - Clusters do not intersect each other
>    - I have set the number of iterations to 100, so I assume that 100
>    iterations will be enough to produce almost stable clusters
>
> @Maheshakya,
>
> This dataset contains more anomaly data than normal data, but in
> practice it will be the other way around. Because of that, it would be
> unrealistic to use that 65% of the anomaly data to evaluate the model.
> The amount of normal data I used to build the model is also less than
> that 65% of the anomaly data. But yes, since our purpose is to detect
> anomalies, it would be good to evaluate the model with more anomaly
> data. Thanks, I'll try to use them as well.
>
> Best Regards,
>
> Ashen
>
> On Tue, Aug 25, 2015 at 12:35 PM, Maheshakya Wijewardena <
> mahesha...@wso2.com> wrote:
>
>> Is there any particular reason why you are putting aside 65% of the
>> anomalous data at the evaluation? Since there is an obvious imbalance
>> between the numbers of normal and abnormal cases, you will get greater
>> accuracy at the evaluation, because a model tends to produce more
>> accurate results for the larger class. That is not the case for the
>> smaller class: with fewer records, it won't have much impact on the
>> accuracy. Hence IMO, it would be better if you could evaluate with more
>> anomalous data, i.e. the number of records of each class needs to be
>> roughly equal.
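To make the imbalance point concrete, here is a small sketch in plain Python. The counts are hypothetical and are not results from this thread; the function name `evaluate` is also just an illustration. It shows the same detector (catching 10% of anomalies, no false alarms) scoring very differently on overall accuracy depending on the class mix of the evaluation set.

```python
def evaluate(tp, fn, tn, fp):
    """Overall accuracy plus the detection rate of each class.

    tp/fn count anomalies (caught / missed); tn/fp count normal
    records (passed / wrongly flagged).
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "anomaly_recall": tp / (tp + fn),
        "normal_recall": tn / (tn + fp),
    }


# Hypothetical counts, chosen only to illustrate the effect.
# Skewed evaluation set: 100 anomalies, 9900 normal records.
skewed = evaluate(tp=10, fn=90, tn=9900, fp=0)
# Balanced evaluation set: 5000 anomalies, 5000 normal records.
balanced = evaluate(tp=500, fn=4500, tn=5000, fp=0)

print(skewed)
print(balanced)
```

In both cases the anomaly recall is the same, but the skewed set hides it behind a high overall accuracy. Reporting per-class rates alongside accuracy is one way to keep the smaller class visible.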
>>
>> Best regards
>>
>> On Tue, Aug 25, 2015 at 12:05 PM, CD Athuraliya <chathur...@wso2.com>
>> wrote:
>>
>>> Hi Ashen,
>>>
>>> It would be better if you could add the assumptions you make in this
>>> process (uniform clusters, etc.). It would make the process clearer, IMO.
>>>
>>> Regards,
>>> CD
>>>
>>> On Tue, Aug 25, 2015 at 11:39 AM, Nirmal Fernando <nir...@wso2.com>
>>> wrote:
>>>
>>>> Can we see the code too?
>>>>
>>>> On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga <as...@wso2.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am currently working on a fraud detection project. I was able to
>>>>> cluster the KDD Cup 99 network anomaly detection dataset using the
>>>>> Apache Spark k-means algorithm. So far I have been able to achieve a
>>>>> 99% accuracy rate on this dataset. The steps I followed during the
>>>>> process are listed below.
>>>>>
>>>>>    - Separate the dataset into two parts (normal data and anomaly
>>>>>    data) by filtering on the label
>>>>>    - Split each of the two parts as follows
>>>>>       - normal data
>>>>>          - 65% - to train the model
>>>>>          - 15% - to optimize the model by adjusting hyperparameters
>>>>>          - 20% - to evaluate the model
>>>>>       - anomaly data
>>>>>          - 65% - not used
>>>>>          - 15% - to optimize the model by adjusting hyperparameters
>>>>>          - 20% - to evaluate the model
>>>>>    - Preprocess the dataset
>>>>>       - Drop non-numerical features, since k-means can only handle
>>>>>       numerical values
>>>>>       - Normalize all the values to the 0-1 range
>>>>>    - Cluster the 65% of normal data using Apache Spark k-means and
>>>>>    build the model (15% of both normal and anomaly data were used to
>>>>>    tune hyperparameters such as k, percentile, etc. to get an
>>>>>    optimized model)
>>>>>    - Finally, evaluate the model using 20% of both normal and anomaly
>>>>>    data.
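The split and normalization steps above can be sketched roughly as follows. This is plain Python rather than the Spark API, and it is not taken from the linked repository; the function names, the fixed seed, and the choice to learn the min/max from the training part only are all assumptions made for illustration.

```python
import random


def split_65_15_20(rows, seed=42):
    """Shuffle rows and split them into 65% / 15% / 20% parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    a = n * 65 // 100          # end of the training part
    b = a + n * 15 // 100      # end of the tuning part
    return shuffled[:a], shuffled[a:b], shuffled[b:]


def fit_min_max(rows):
    """Per-feature minimum and maximum of the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(row, mins, maxs):
    """Scale one row to the 0-1 range; a constant feature maps to 0.0."""
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(row, mins, maxs)
    ]


# Toy numeric rows standing in for the numeric KDD features.
rows = [[float(i), float(i % 7)] for i in range(100)]
train, tune, test = split_65_15_20(rows)

# Assumption: the scaling bounds come from the training part only.
mins, maxs = fit_min_max(train)
train_scaled = [apply_min_max(r, mins, maxs) for r in train]
```

With bounds learned from the training part, values in the tuning or evaluation parts that fall outside the training range will map slightly outside 0-1. Whether that is acceptable depends on how the distance threshold is used later.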
>>>>>
>>>>> The method of identifying a fraud is as follows:
>>>>>
>>>>>    - When a new data point comes in, get the closest cluster center
>>>>>    using the k-means predict function.
>>>>>    - I have calculated the 98th percentile distance for each cluster.
>>>>>    (98 was the best value I got by tuning the model with different
>>>>>    values.)
>>>>>    - Then I check whether the distance of the new data point from the
>>>>>    given cluster center is less than or greater than the 98th
>>>>>    percentile of that cluster. If it is less than the percentile, it
>>>>>    is considered normal data. If it is greater than the percentile,
>>>>>    it is considered a fraud, since it lies outside the cluster.
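A minimal sketch of this thresholding idea, in plain Python and independent of Spark. It assumes the cluster centers are already available (for example, from a trained k-means model), uses Euclidean distance, and uses a nearest-rank percentile. The function names are hypothetical, and the description above does not say how a distance exactly equal to the threshold is handled; here it is treated as normal.

```python
import math


def nearest_center(point, centers):
    """Return (index, distance) of the closest center (Euclidean)."""
    distances = [math.dist(point, c) for c in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]


def percentile(values, q):
    """Nearest-rank percentile: smallest value with >= q% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def fit_thresholds(train_points, centers, q=98):
    """Per-cluster q-th percentile of training distances to the center."""
    per_cluster = [[] for _ in centers]
    for p in train_points:
        i, d = nearest_center(p, centers)
        per_cluster[i].append(d)
    # An empty cluster gets threshold 0.0, so any point assigned to it
    # at a nonzero distance is flagged.
    return [percentile(ds, q) if ds else 0.0 for ds in per_cluster]


def is_anomaly(point, centers, thresholds):
    """Flag a point whose distance exceeds its cluster's threshold."""
    i, d = nearest_center(point, centers)
    return d > thresholds[i]


# Toy data: two centers and a few "normal" training points.
centers = [[0.0, 0.0], [10.0, 10.0]]
train_points = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
                [10.0, 10.5], [10.0, 11.0]]
thresholds = fit_thresholds(train_points, centers, q=98)
```

For example, `is_anomaly([0.0, 5.0], centers, thresholds)` flags the point because it lies farther from its nearest center than that cluster's threshold, while `is_anomaly([2.0, 0.0], centers, thresholds)` does not.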
>>>>>
>>>>> Our next step is to integrate this feature into the ML product and try
>>>>> it out with a more realistic dataset. A summary of the results I obtained
>>>>> using the 98th percentile during the process is attached.
>>>>>
>>>>>
>>>>> https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing
>>>>>
>>>>> Thanks and Regards,
>>>>> Ashen
>>>>> --
>>>>> *Ashen Weerathunga*
>>>>> Software Engineer - Intern
>>>>> WSO2 Inc.: http://wso2.com
>>>>> lean.enterprise.middleware
>>>>>
>>>>> Email: as...@wso2.com
>>>>> Mobile: +94 716042995 <94716042995>
>>>>> LinkedIn:
>>>>> *http://lk.linkedin.com/in/ashenweerathunga
>>>>> <http://lk.linkedin.com/in/ashenweerathunga>*
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Thanks & regards,
>>>> Nirmal
>>>>
>>>> Team Lead - WSO2 Machine Learner
>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>> Mobile: +94715779733
>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *CD Athuraliya*
>>> Software Engineer
>>> WSO2, Inc.
>>> lean . enterprise . middleware
>>> Mobile: +94 716288847 <94716288847>
>>> LinkedIn <http://lk.linkedin.com/in/cdathuraliya> | Twitter
>>> <https://twitter.com/cdathuraliya> | Blog
>>> <http://cdathuraliya.tumblr.com/>
>>>
>>
>>
>>
>> --
>> Pruthuvi Maheshakya Wijewardena
>> Software Engineer
>> WSO2 : http://wso2.com/
>> Email: mahesha...@wso2.com
>> Mobile: +94711228855
>>
>>
>>
>
>



-- 
*Ashen Weerathunga*
Software Engineer - Intern
WSO2 Inc.: http://wso2.com
lean.enterprise.middleware

Email: as...@wso2.com
Mobile: +94 716042995 <94716042995>
LinkedIn:
*http://lk.linkedin.com/in/ashenweerathunga
<http://lk.linkedin.com/in/ashenweerathunga>*

_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
