Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

Nirmal Fernando Tue, 25 Aug 2015 02:45:13 -0700

Can we see the code too?

On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga <as...@wso2.com> wrote:


> Hi all,
>
> I am currently working on a fraud detection project. I was able to cluster
> the KDD Cup 99 network anomaly detection dataset using the Apache Spark
> k-means algorithm. So far I have been able to achieve a 99% accuracy rate on
> this dataset. The steps I followed during the process are listed below.
>
>    - Separate the dataset into two parts (normal data and anomaly data)
>    by filtering on the label
>    - Split each of the two parts as follows
>       - normal data
>          - 65% - to train the model
>          - 15% - to optimize the model by adjusting hyperparameters
>          - 20% - to evaluate the model
>       - anomaly data
>          - 65% - not used
>          - 15% - to optimize the model by adjusting hyperparameters
>          - 20% - to evaluate the model
>    - Preprocess the dataset
>       - Drop non-numerical features, since k-means can only handle
>       numerical values
>       - Normalize all the values to the 0-1 range
>    - Cluster the 65% of normal data using Apache Spark k-means and
>    build the model (the 15% of both normal and anomaly data was used to tune
>    hyperparameters such as k, the percentile, etc. to get an optimized model)
>    - Finally, evaluate the model using the 20% of both normal and anomaly data.
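
A minimal sketch of the split and normalization steps above, written in plain Python rather than against the Spark API so that it stays self-contained. The function names (`split_65_15_20`, `fit_min_max`, `apply_min_max`) and the fixed seed are assumptions made for illustration and do not come from the original code. It also assumes the non-numerical columns have already been dropped.

```python
import random


def split_65_15_20(rows, seed=42):
    """Shuffle rows and split them into train / tune / test parts (65/15/20)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(rows) * 65 // 100
    n_tune = len(rows) * 15 // 100
    train = rows[:n_train]
    tune = rows[n_train:n_train + n_tune]
    test = rows[n_train + n_tune:]
    return train, tune, test


def fit_min_max(rows):
    """Return the per-column (min, max) pairs of the given numeric rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(row, bounds):
    """Min-max scale one row using previously fitted bounds.

    Values inside the fitted range map to [0, 1]; a constant column maps
    to 0.0. Values outside the fitted range fall outside [0, 1].
    """
    return [
        0.0 if hi == lo else (x - lo) / (hi - lo)
        for x, (lo, hi) in zip(row, bounds)
    ]
```

One choice the description leaves open is which rows the normalization bounds are fitted on. Fitting them on the training part only and reusing them for the tuning and evaluation parts keeps information from the held-out data out of the model.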
>
> The method of identifying a fraud is as follows:
>
>    - When a new data point arrives, get the closest cluster center by using
>    the k-means predict function.
>    - I have calculated the 98th percentile distance for each cluster. (98 was
>    the best value I got by tuning the model with different values.)
>    - Then I check whether the distance from the new data point to the given
>    cluster center is less than or greater than the 98th percentile of that
>    cluster. If it is less than the percentile, the point is considered
>    normal. If it is greater than the percentile, it is considered a fraud,
>    since it lies outside the cluster.
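
The detection rule above can be sketched as follows. This is an illustration in plain Python, not the original implementation. `centers` stands in for the cluster centers of an already trained k-means model, and all function names are hypothetical. It assumes Euclidean distance and a linearly interpolated percentile, which may differ from what the original code uses.

```python
import math


def nearest_center(point, centers):
    """Return (index, distance) of the closest center by Euclidean distance."""
    dists = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def cluster_thresholds(train_points, centers, q=98):
    """Per-cluster distance threshold: the q-th percentile of the distances
    from the training points to their assigned cluster center."""
    per_cluster = {i: [] for i in range(len(centers))}
    for p in train_points:
        idx, d = nearest_center(p, centers)
        per_cluster[idx].append(d)
    # Clusters with no assigned training points get no threshold.
    return {i: percentile(d, q) for i, d in per_cluster.items() if d}


def is_anomaly(point, centers, thresholds):
    """Flag a point whose distance to its nearest center exceeds that
    cluster's threshold."""
    idx, d = nearest_center(point, centers)
    # Assumption: a point assigned to a cluster without a threshold is flagged.
    return d > thresholds.get(idx, float("-inf"))
```

With `q=98`, roughly 2% of the normal training points lie beyond their own cluster's threshold, so the percentile trades false alarms on normal data against missed anomalies. That is why it is tuned on the 15% hold-out.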
>
> Our next step is to integrate this feature into the ML product and try it
> out with a more realistic dataset. A summary of the results I obtained using
> the 98th percentile during the process is attached.
>
>
> https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing
>
> Thanks and Regards,
> Ashen
> --
> *Ashen Weerathunga*
> Software Engineer - Intern
> WSO2 Inc.: http://wso2.com
> lean.enterprise.middleware
>
> Email: as...@wso2.com
> Mobile: +94 716042995
> LinkedIn: http://lk.linkedin.com/in/ashenweerathunga
>



-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/

_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
