Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

Sinnathamby Mahesan Tue, 25 Aug 2015 20:26:09 -0700

Hi Ashen
Thank you for sharing the results.
Looking at the last column (anomaly data %),
the best value (99.04%) is for 3 clusters with 100 iterations
and
the worst (28.12%) is for 100 clusters with 100 iterations.


This can happen as k increases (with a fixed number of iterations).

As I understand it,
for some k, 100 iterations may be too many, and
for some other k, 100 may not be enough(?)
(with the number of data points fixed in all cases).

You could limit the number of iterations by adding a stopping condition
so that it iterates until there is no change in the centroids.
(You could try it and see whether it makes any difference to the results.)
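
As a rough illustration of such a stopping condition, here is a minimal pure-Python sketch of Lloyd's k-means that stops as soon as no centroid moves by more than a small tolerance, keeping max_iterations only as a safety cap. The names kmeans, tol and seed and the random initialisation are assumptions made for this example; this is not Spark's API or the code used for the results. Spark MLlib's KMeans has an epsilon convergence setting that plays a similar role, though whether it is exposed may depend on the version.

```python
import math
import random


def kmeans(points, k, max_iterations=100, tol=1e-6, seed=0):
    """Lloyd's k-means with early stopping once centroids stop moving.

    points is a list of equal-length tuples. Returns the final centroids
    and the number of iterations actually run.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # hypothetical, simple initialisation
    for iteration in range(1, max_iterations + 1):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])
        # Largest distance any centroid moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break  # converged: no meaningful change in the centroids
    return centroids, iteration
```

Comparing the returned iteration count for different k would show whether 100 iterations is more than needed or still not enough.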

Perhaps 100 clusters may not be necessary.

What percentage of the data points falls in each cluster?
That may help in deciding on the number of clusters.
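
A small sketch of how those percentages could be computed, assuming you already have one predicted cluster id per data point (for example, collected from the model's predict output). The function name cluster_shares is made up for this illustration.

```python
from collections import Counter


def cluster_shares(assignments):
    """Return {cluster_id: percentage of points assigned to that cluster}."""
    counts = Counter(assignments)
    total = len(assignments)
    return {cid: 100.0 * n / total for cid, n in sorted(counts.items())}


# Toy example: 8 points spread over 3 clusters.
shares = cluster_shares([0, 0, 1, 2, 2, 2, 2, 0])
for cid, pct in shares.items():
    print(f"cluster {cid}: {pct:.1f}%")
```

If many clusters hold only a tiny fraction of the points, that would suggest k is larger than the data needs.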

Good Luck
Mahesan

On 25 August 2015 at 11:36, Ashen Weerathunga <as...@wso2.com> wrote:

> Hi all,
>
> I am currently working on a fraud detection project. I was able to cluster
> the KDD Cup 99 network anomaly detection dataset using the Apache Spark k-means
> algorithm. So far I have been able to achieve a 99% accuracy rate on this
> dataset. The steps I followed during the process are given below.
>
>    - Separate the dataset into two parts (normal data and anomaly data)
>    by filtering on the label
>    - Split each of the two parts as follows
>       - normal data
>          - 65% - to train the model
>          - 15% - to optimize the model by adjusting hyperparameters
>          - 20% - to evaluate the model
>       - anomaly data
>          - 65% - not used
>          - 15% - to optimize the model by adjusting hyperparameters
>          - 20% - to evaluate the model
>    - Preprocess the dataset
>       - Drop non-numerical features, since k-means can only handle
>       numerical values
>       - Normalize all the values to the 0-1 range
>    - Cluster the 65% of normal data using Apache Spark k-means and
>    build the model (15% of both normal and anomaly data were used to tune the
>    hyperparameters such as k, percentile etc. to get an optimized model)
>    - Finally, evaluate the model using 20% of both normal and anomaly data.
>
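
The split and normalization steps quoted above could be sketched roughly as follows in plain Python. This is only an illustration under assumptions: the function names, the shuffle seed, and the choice to learn the min/max on the training part and reuse it for the other parts are mine, not taken from the original pipeline.

```python
import random


def split_65_15_20(rows, seed=42):
    """Shuffle rows and split them into 65% / 15% / 20% parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    a = n * 65 // 100  # end of the training part
    b = n * 80 // 100  # end of the tuning part
    return shuffled[:a], shuffled[a:b], shuffled[b:]


def fit_min_max(rows):
    """Learn the per-feature minimum and maximum from numeric rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, lows, highs):
    """Scale each feature to the 0-1 range using the learned min/max.

    A constant feature (max == min) is mapped to 0.0. Values outside the
    learned range will fall outside 0-1.
    """
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


train, tune, test = split_65_15_20([(float(i), 10.0) for i in range(100)])
lows, highs = fit_min_max(train)
train_scaled = apply_min_max(train, lows, highs)
```
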
> The method of identifying a fraud is as follows:
>
>    - When a new data point arrives, get the closest cluster center by using
>    the k-means predict function.
>    - I calculated the 98th percentile distance for each cluster. (98 was
>    the best value I got by tuning the model with different values.)
>    - Then I check whether the distance of the new data point from the given
>    cluster center is less than or greater than the 98th percentile of that
>    cluster. If it is less than the percentile, it is considered normal
>    data. If it is greater than the percentile, it is considered a fraud since
>    it lies outside the cluster.
>
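
To make the thresholding rule quoted above concrete, here is a minimal pure-Python sketch. It assumes the cluster centers are already known, uses Euclidean distance, and uses a nearest-rank percentile; the helper names (nearest, percentile, fit_thresholds, is_anomaly) are hypothetical and the percentile definition may differ from the one used for the reported results.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of values <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def fit_thresholds(points, centroids, q=98):
    """Per-cluster distance threshold learned from (normal) training points."""
    distances = [[] for _ in centroids]
    for p in points:
        i = nearest(p, centroids)
        distances[i].append(math.dist(p, centroids[i]))
    # A cluster with no training points gets threshold 0.0 (an assumption).
    return [percentile(d, q) if d else 0.0 for d in distances]


def is_anomaly(point, centroids, thresholds):
    """True if point is farther from its nearest centroid than that cluster's threshold."""
    i = nearest(point, centroids)
    return math.dist(point, centroids[i]) > thresholds[i]
```

With thresholds fitted on the normal training data, is_anomaly(p, centroids, thresholds) flags any new point lying beyond the 98th percentile distance of its closest cluster.
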
> Our next step is to integrate this feature into the ML product and try it out
> with a more realistic dataset. A summary of the results I obtained using the
> 98th percentile during the process is attached.
>
>
> https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing
>
> Thanks and Regards,
> Ashen
> --
> *Ashen Weerathunga*
> Software Engineer - Intern
> WSO2 Inc.: http://wso2.com
> lean.enterprise.middleware
>
> Email: as...@wso2.com
> Mobile: +94 716042995 <94716042995>
> LinkedIn: http://lk.linkedin.com/in/ashenweerathunga
>



-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sinnathamby Mahesan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
