Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

Maheshakya Wijewardena Tue, 25 Aug 2015 02:47:24 -0700

Is there any particular reason why you are putting aside 65% of the anomalous
data at the evaluation? Since there is an obvious imbalance between the
numbers of normal and anomalous cases, you will get a high overall accuracy
at the evaluation, because a model tends to produce more accurate results for
the larger class. That is not the case for the smaller class: with fewer
records, its errors won't have much impact on the overall accuracy. Hence,
IMO, it would be better if you could evaluate with more anomalous data,
i.e. the number of records in each class needs to be roughly equal.
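
To illustrate the point about imbalance, here is a minimal sketch in plain Python. The class sizes (990 normal, 10 anomalous) and the "always predict normal" model are made-up assumptions for illustration only, not numbers from your experiment. It shows how overall accuracy can look high while the smaller class is missed entirely.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, cls):
    """Fraction of the records of class `cls` that were predicted as `cls`."""
    idx = [i for i, t in enumerate(y_true) if t == cls]
    return sum(y_pred[i] == cls for i in idx) / len(idx)


# Hypothetical, heavily imbalanced evaluation set (assumed sizes).
y_true = ["normal"] * 990 + ["anomaly"] * 10

# A degenerate model that labels every record as normal.
y_pred = ["normal"] * 1000

overall = accuracy(y_true, y_pred)                # 0.99
normal_recall = recall(y_true, y_pred, "normal")  # 1.0
anomaly_recall = recall(y_true, y_pred, "anomaly")  # 0.0

# Averaging the per-class recalls weights both classes equally.
balanced = (normal_recall + anomaly_recall) / 2   # 0.5

print(overall, normal_recall, anomaly_recall, balanced)
```

Reporting per-class recall (or evaluating on roughly equal class sizes) makes this kind of failure visible, whereas overall accuracy alone hides it.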


Best regards

On Tue, Aug 25, 2015 at 12:05 PM, CD Athuraliya <chathur...@wso2.com> wrote:

> Hi Ashen,
>
> It would be better if you could add the assumptions you make in this process
> (uniform clusters etc.). It will make the process clearer, IMO.
>
> Regards,
> CD
>
> On Tue, Aug 25, 2015 at 11:39 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>> Can we see the code too?
>>
>> On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga <as...@wso2.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am currently working on a fraud detection project. I was able to cluster
>>> the KDD Cup 99 network anomaly detection dataset using the Apache Spark
>>> k-means algorithm. So far I have been able to achieve a 99% accuracy rate
>>> on this dataset. The steps I followed during the process are listed below.
>>>
>>>    - Separate the dataset into two parts (normal data and anomaly data)
>>>    by filtering on the label
>>>    - Split each of the two parts as follows
>>>       - normal data
>>>          - 65% - to train the model
>>>          - 15% - to optimize the model by adjusting hyperparameters
>>>          - 20% - to evaluate the model
>>>       - anomaly data
>>>          - 65% - not used
>>>          - 15% - to optimize the model by adjusting hyperparameters
>>>          - 20% - to evaluate the model
>>>    - Preprocess the dataset
>>>       - Drop non-numerical features, since k-means can only handle
>>>       numerical values
>>>       - Normalize all the values to the 0-1 range
>>>    - Cluster the 65% of normal data using Apache Spark k-means and
>>>    build the model (the 15% of both normal and anomaly data was used to
>>>    tune hyperparameters such as k, percentile etc. to get an optimized
>>>    model)
>>>    - Finally, evaluate the model using the 20% of both normal and anomaly
>>>    data.
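
The split and normalization steps above can be sketched roughly as follows, in plain Python rather than Spark. This is only an illustration under assumptions: the function names, the fixed seed, and the choice to compute the min/max bounds on the training portion only are mine, not taken from the original code (which was not posted).

```python
import random


def split_65_15_20(rows, seed=42):
    """Shuffle rows and split them into 65% / 15% / 20% parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 65 // 100  # integer arithmetic avoids float rounding
    n_tune = len(rows) * 15 // 100
    train = rows[:n_train]
    tune = rows[n_train:n_train + n_tune]
    test = rows[n_train + n_tune:]   # remaining ~20%
    return train, tune, test


def fit_min_max(rows):
    """Per-feature minimum and maximum over the given numeric rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, lows, highs):
    """Scale each feature to 0-1 using the given bounds.

    Constant features (high == low) are mapped to 0.0 to avoid division by zero.
    """
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical numeric rows standing in for the "normal" part of the dataset.
normal_rows = [[float(i), float(2 * i)] for i in range(100)]
train, tune, test = split_65_15_20(normal_rows)
lows, highs = fit_min_max(train)
train_scaled = apply_min_max(train, lows, highs)
```

Note that when the bounds come from the training part only, values in the tuning and evaluation parts can fall slightly outside the 0-1 range.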
>>>
>>> The method of identifying a fraud is as follows:
>>>
>>>    - When a new data point comes in, get the closest cluster center by
>>>    using the k-means predict function.
>>>    - I have calculated the 98th percentile distance for each cluster. (98
>>>    was the best value I got by tuning the model with different values.)
>>>    - Then I check whether the distance of the new data point from the
>>>    given cluster center is less than or greater than the 98th percentile
>>>    of that cluster. If it is less than the percentile, it is considered
>>>    normal data. If it is greater than the percentile, it is considered a
>>>    fraud, since it lies outside the cluster.
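
A rough sketch of the percentile rule described above, in plain Python. It assumes the cluster centers are already available (in Spark they would come from the trained k-means model), uses Euclidean distance and a nearest-rank percentile, and assumes every cluster received at least one training point. All function names and the sample points are hypothetical.

```python
import math


def nearest_center(point, centers):
    """Return (index, distance) of the closest center by Euclidean distance."""
    dists = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of data <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def fit_thresholds(train_points, centers, q=98):
    """Per-cluster distance threshold at the q-th percentile of training distances."""
    per_cluster = {i: [] for i in range(len(centers))}
    for p in train_points:
        idx, d = nearest_center(p, centers)
        per_cluster[idx].append(d)
    return {i: percentile(ds, q) for i, ds in per_cluster.items() if ds}


def is_anomaly(point, centers, thresholds):
    """Flag a point whose distance to its nearest center exceeds that cluster's threshold."""
    idx, d = nearest_center(point, centers)
    return d > thresholds[idx]


# Hypothetical centers and normal training points around them.
centers = [(0.0, 0.0), (10.0, 10.0)]
train_points = (
    [(0.1 * i, 0.0) for i in range(1, 11)]
    + [(10.0, 10.0 + 0.1 * i) for i in range(1, 11)]
)
thresholds = fit_thresholds(train_points, centers, q=98)

print(is_anomaly((0.5, 0.0), centers, thresholds))  # close to the first center
print(is_anomaly((5.0, 0.0), centers, thresholds))  # far from both centers
```

Because each cluster gets its own threshold, tight clusters flag points at smaller distances than spread-out ones.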
>>>
>>> Our next step is to integrate this feature into the ML product and try it
>>> out with a more realistic dataset. A summary of the results I obtained
>>> using the 98th percentile during the process is attached.
>>>
>>>
>>> https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing
>>>
>>> Thanks and Regards,
>>> Ashen
>>> --
>>> *Ashen Weerathunga*
>>> Software Engineer - Intern
>>> WSO2 Inc.: http://wso2.com
>>> lean.enterprise.middleware
>>>
>>> Email: as...@wso2.com
>>> Mobile: +94 716042995 <94716042995>
>>> LinkedIn:
>>> *http://lk.linkedin.com/in/ashenweerathunga
>>> <http://lk.linkedin.com/in/ashenweerathunga>*
>>>
>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>
>
> --
> *CD Athuraliya*
> Software Engineer
> WSO2, Inc.
> lean . enterprise . middleware
> Mobile: +94 716288847 <94716288847>
> LinkedIn <http://lk.linkedin.com/in/cdathuraliya> | Twitter
> <https://twitter.com/cdathuraliya> | Blog
> <http://cdathuraliya.tumblr.com/>
>



-- 
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 : http://wso2.com/
Email: mahesha...@wso2.com
Mobile: +94711228855

_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
