Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Ashen Weerathunga
@Nirmal, okay, I'll arrange it today.

@Mahesan

Thanks for the suggestion. Yes, 100 may be too high for some cases. I
thought that within 100 iterations it would most probably converge to stable
clusters; that's why I chose 100. And yes, in cases like k = 100 it might
not be enough. Thanks, and I'll try different numbers of iterations as well.
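
To illustrate the point about iteration counts, here is a minimal plain-Python sketch of k-means that treats the iteration count as an upper bound and stops early once the centers stop moving. It is not the Spark code from the project; the function name `kmeans`, the parameters `max_iter` and `tol`, the naive "first k points" initialization and the toy data are all assumptions made for illustration.

```python
import math


def kmeans(points, k, max_iter=100, tol=1e-6):
    """Lloyd's algorithm; returns (centers, iterations_used, converged)."""
    # Naive initialization for the sketch: take the first k points as centers.
    centers = [list(p) for p in points[:k]]
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group (keep it if the group is empty).
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append([sum(col) / len(group) for col in zip(*group)])
            else:
                new_centers.append(center)
        # Largest distance any center moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            return centers, iteration, True
    return centers, max_iter, False


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
centers, used, converged = kmeans(points, k=2, max_iter=100)
print(used, converged)
```

Checking the `converged` flag for each tested k shows whether the iteration limit was actually reached, which is more informative than assuming that 100 is enough.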

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Nirmal Fernando
@Ashen let's have a code review today, if it's possible.

@Srinath Forgot to mention that I've already given some feedback to Ashen,
on how he could use Spark transformations effectively in his code.

On Tue, Aug 25, 2015 at 4:33 PM, Ashen Weerathunga as...@wso2.com wrote:

 Okay sure.

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Ashen Weerathunga
Thanks all for the suggestions,

There are a few assumptions I have made:

   - Clusters are uniform
   - Fraud data will always be outliers to the normal clusters
   - Clusters do not intersect with each other
   - I have set the number of iterations to 100, so I assume that 100
   iterations will be enough to produce almost stable clusters

@Maheshakya,

This dataset contains more anomaly data than normal data, but in a
practical scenario it will be the other way around. Because of that, it
would be unrealistic to use that 65% of the anomaly data to evaluate the
model. The amount of normal data I used to build the model is also less
than that 65% of the anomaly data. But yes, since our purpose is to detect
anomalies, it would be good to evaluate the model with more anomaly data.
Thanks, and I'll try to use them as well.

Best Regards,

Ashen

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Srinath Perera
Nirmal, Seshika, shall we do a code review? This code should go into ML
after the UI part is done.

Thanks
Srinath

On Tue, Aug 25, 2015 at 2:20 PM, Ashen Weerathunga as...@wso2.com wrote:

 Hi all,

 This is the source code of the project.
 https://github.com/ashensw/Spark-KMeans-fraud-detection

 Best Regards,
 Ashen

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Nirmal Fernando
Sure. @Ashen, can you please arrange one?

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread CD Athuraliya
Hi Ashen,

It would be better if you could add the assumptions you make in this process
(uniform clusters, etc.). It will make the process clearer, IMO.

Regards,
CD

-- 
*CD Athuraliya*
Software Engineer
WSO2, Inc.
lean . enterprise . middleware
Mobile: +94 716288847 94716288847
LinkedIn http://lk.linkedin.com/in/cdathuraliya | Twitter
https://twitter.com/cdathuraliya | Blog http://cdathuraliya.tumblr.com/
___
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev


Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Maheshakya Wijewardena
Is there any particular reason why you are setting aside 65% of the anomalous
data at evaluation? Since there is an obvious imbalance between the numbers
of normal and abnormal cases, you will get greater accuracy at evaluation,
because a model tends to produce more accurate results for the larger class.
But that is not the case for the smaller class: with fewer records, it won't
have much impact on the overall accuracy. Hence, IMO, it would be better if
you could evaluate with more anomalous data, i.e. the number of records in
each class should be roughly equal.
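
The effect of the imbalance can be made concrete with a small plain-Python sketch. The labels and predictions below are made up for illustration and are not results from the project; `per_class_recall` is a hypothetical helper name.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up evaluation set: 90 "normal" records and 10 "anomaly" records.
y_true = ["normal"] * 90 + ["anomaly"] * 10
# Hypothetical predictions: every normal record is right, half the anomalies are missed.
y_pred = ["normal"] * 90 + ["anomaly"] * 5 + ["normal"] * 5

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
balanced_accuracy = sum(recall.values()) / len(recall)

print(accuracy)           # 0.95
print(recall)             # {'normal': 1.0, 'anomaly': 0.5}
print(balanced_accuracy)  # 0.75
```

Overall accuracy looks high even though half of the smaller class is misclassified, so reporting per-class recall (or evaluating on roughly equal class sizes) gives a fairer picture.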

Best regards

 On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga as...@wso2.com wrote:

 Hi all,

 I am currently working on a fraud detection project. I was able to cluster
 the KDD Cup 99 network anomaly detection dataset using the Apache Spark
 k-means algorithm. So far I have been able to achieve a 99% accuracy rate
 on this dataset. The steps I followed during the process are listed below.

- Separate the dataset into two parts (normal data and anomaly data)
by filtering on the label.
- Split each of the two parts as follows:
   - normal data
      - 65% - to train the model
      - 15% - to optimize the model by adjusting hyperparameters
      - 20% - to evaluate the model
   - anomaly data
      - 65% - not used
      - 15% - to optimize the model by adjusting hyperparameters
      - 20% - to evaluate the model
- Preprocess the dataset:
   - Drop non-numerical features, since k-means can only handle
   numerical values
   - Normalize all the values to the 0-1 range
- Cluster the 65% of normal data using Apache Spark k-means and build the
model (15% of both normal and anomaly data were used to tune the
hyperparameters such as k, percentile, etc. to get an optimized model).
- Finally, evaluate the model using 20% of both normal and anomaly
data.
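
The preprocessing step above can be sketched in plain Python as follows. This is only an illustration, not the project's Spark code: the helper names `numeric_columns` and `min_max_scale` and the three sample records are made up.

```python
def numeric_columns(rows):
    """Indices of columns whose values all parse as floats."""
    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False
    return [i for i in range(len(rows[0])) if all(is_number(r[i]) for r in rows)]


def min_max_scale(rows):
    """Scale every column to [0, 1]; a constant column becomes all zeros."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Made-up records (duration, protocol, service, bytes), not real KDD Cup 99 rows.
raw = [
    ["0", "tcp", "http", "181"],
    ["2", "udp", "domain_u", "45"],
    ["4", "tcp", "smtp", "1000"],
]
keep = numeric_columns(raw)  # drops the two text columns
features = [[float(row[i]) for i in keep] for row in raw]
scaled = min_max_scale(features)
print(keep)
print(scaled)
```

When scoring new data points, the minimum and maximum learned from the training data would normally be reused rather than recomputed, so that training and test points share the same scale.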

 The method of identifying fraud is as follows:

- When a new data point arrives, get the closest cluster center by
using the k-means predict function.
- I have calculated the 98th percentile distance for each cluster. (98
was the best value I got by tuning the model with different values.)
- Then I check whether the distance from the new data point to the
given cluster center is less than or greater than the 98th percentile of
that cluster. If it is less than the percentile, it is considered
normal data. If it is greater than the percentile, it is considered
fraud, since it lies outside the cluster.
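
The detection rule above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the project's implementation: the cluster centers are taken as given (as if already fitted), the percentile uses the nearest-rank definition (the thread does not say which definition was used), and all function names and data are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def nearest_center(point, centers):
    """Return (index, distance) of the closest center."""
    distances = [math.dist(point, c) for c in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]


def fit_thresholds(train_points, centers, q=98):
    """Per-cluster distance threshold at the q-th percentile of training distances."""
    per_cluster = {i: [] for i in range(len(centers))}
    for p in train_points:
        index, distance = nearest_center(p, centers)
        per_cluster[index].append(distance)
    # Clusters that received no training points get no threshold.
    return {i: percentile(d, q) for i, d in per_cluster.items() if d}


def is_anomaly(point, centers, thresholds):
    """True if the point lies farther from its nearest center than that cluster's threshold."""
    index, distance = nearest_center(point, centers)
    return distance > thresholds[index]


# Assumed, already-fitted centers and made-up "normal" training points.
centers = [(0.0, 0.0), (1.0, 1.0)]
train = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.95)]
thresholds = fit_thresholds(train, centers, q=98)

print(is_anomaly((0.02, 0.03), centers, thresholds))  # close to a center
print(is_anomaly((0.5, 0.5), centers, thresholds))    # far from both centers
```

Because the thresholds are computed per cluster, a tight cluster flags points at a smaller distance than a loose one would.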

 Our next step is to integrate this feature into the ML product and try it
 out with a more realistic dataset. A summary of the results I obtained using
 the 98th percentile during the process is attached.


 https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing

 Thanks and Regards,
 Ashen

-- 
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 : http://wso2.com/
Email: mahesha...@wso2.com
Mobile: +94711228855
___
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev


Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Nirmal Fernando
Can we see the code too?

On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga as...@wso2.com wrote:

 Hi all,

 I am currently working on a fraud detection project. I was able to cluster
 the KDD Cup 99 network anomaly detection dataset using the Apache Spark
 K-means algorithm. So far I was able to achieve a 99% accuracy rate on this
 dataset. The steps I have followed during the process are mentioned below.

- Separate the dataset into two parts (normal data and anomaly data)
by filtering on the label
- Split each of the two parts as follows
   - normal data
      - 65% - to train the model
      - 15% - to optimize the model by adjusting hyperparameters
      - 20% - to evaluate the model
   - anomaly data
      - 65% - not used
      - 15% - to optimize the model by adjusting hyperparameters
      - 20% - to evaluate the model
- Preprocess the dataset
   - Drop the non-numerical features since K-means can only handle
   numerical values
   - Normalize all the values to the 0-1 range
- Cluster the 65% of normal data using Apache Spark K-means and
build the model (15% of both normal and anomaly data were used to tune the
hyperparameters such as k, percentile etc. to get an optimized model)
- Finally evaluate the model using 20% of both normal and anomaly data.
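
The split and normalization steps above can be sketched roughly as follows. This is a minimal plain-Python illustration, not the project's code: the function names are hypothetical, and it assumes the split is a random shuffle and that the min/max values are learned from the training part only (the thread does not say either way).

```python
import random

def split_65_15_20(rows, seed=0):
    """Shuffle rows and split them into train / tune / test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # assumption: a random split
    n_train = len(rows) * 65 // 100
    n_tune = len(rows) * 15 // 100
    train = rows[:n_train]
    tune = rows[n_train:n_train + n_tune]
    test = rows[n_train + n_tune:]  # the remaining ~20%
    return train, tune, test

def fit_min_max(rows):
    """Per-column minimum and maximum of the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def scale_row(row, mins, maxs):
    """Min-max scale one row; values inside the fitted range land in [0, 1].

    Constant columns (max == min) are mapped to 0.0 to avoid dividing by zero.
    """
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(row, mins, maxs)
    ]
```

The same `mins` and `maxs` would then be reused for the tuning and evaluation parts, so that all three parts are on the same scale as the data the clusters were built from.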

 The method of identifying a fraud is as follows,

- When a new data point comes, get the closest cluster center by using
the K-means predict function.
- I have calculated the 98th percentile of the distances to the center for
each cluster. (98 was the best value I got by tuning the model with
different values)
- Then I checked whether the distance of the new data point to the given
cluster center is less than or greater than the 98th percentile of that
cluster. If it is less than the percentile it is considered normal
data. If it is greater than the percentile it is considered a fraud since
it is outside the cluster.
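
A rough sketch of this percentile check, in plain Python rather than Spark. The helper names are hypothetical, the cluster centers are assumed to come from an already trained K-means model, and the percentile uses the nearest-rank definition, which may differ from what the actual code does.

```python
import math

def nearest_center(point, centers):
    """Return (index, distance) of the closest cluster center."""
    distances = [math.dist(point, c) for c in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]

def percentile(sorted_values, q):
    """Nearest-rank percentile (integer q in 1..100) of an ascending, non-empty list."""
    rank = max(1, (q * len(sorted_values) + 99) // 100)  # ceil(q * n / 100)
    return sorted_values[rank - 1]

def fit_thresholds(train_points, centers, q=98):
    """Per-cluster q-th percentile of training distances to the assigned center."""
    per_cluster = [[] for _ in centers]
    for p in train_points:
        index, distance = nearest_center(p, centers)
        per_cluster[index].append(distance)
    # A cluster with no training points gets threshold 0.0 (assumption),
    # so any point at a non-zero distance from that center is flagged.
    return [percentile(sorted(d), q) if d else 0.0 for d in per_cluster]

def is_fraud(point, centers, thresholds):
    """True when the point lies beyond its cluster's percentile distance."""
    index, distance = nearest_center(point, centers)
    return distance > thresholds[index]
```

A point exactly on the threshold is treated as normal here; the description above only covers the "less than" and "greater than" cases, so that tie-break is a choice made for the sketch.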

 Our next step is to integrate this feature into the ML product and try it out
 with a more realistic dataset. A summary of the results I have obtained using
 the 98th percentile during the process is attached with this.


 https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing

 Thanks and Regards,
 Ashen

-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Ashen Weerathunga
Hi all,

This is the source code of the project.
https://github.com/ashensw/Spark-KMeans-fraud-detection

Best Regards,
Ashen

On Tue, Aug 25, 2015 at 2:00 PM, Ashen Weerathunga as...@wso2.com wrote:

 Thanks all for the suggestions,

 There are a few assumptions I have made,

- Clusters are uniform
- Fraud data will always be outliers to the normal clusters
- Clusters do not intersect with each other
- I have set the number of iterations to 100, so I assume that 100
iterations will be enough to produce almost stable clusters
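
On the last assumption, one way to check it instead of assuming it is to record whether the centers actually stopped moving before the iteration cap was reached. Below is a minimal plain-Python sketch of Lloyd-style K-means iterations with such a check. The function name and the tolerance value are hypothetical, and this is not Spark's implementation (Spark MLlib's KMeans has its own maximum-iterations and convergence-tolerance settings, as far as I know).

```python
import math

def lloyd_kmeans(points, centers, max_iterations=100, tolerance=1e-4):
    """Plain Lloyd iterations; returns (centers, iterations_run, converged)."""
    centers = [tuple(c) for c in centers]
    for iteration in range(1, max_iterations + 1):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        # Largest distance any center moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            return centers, iteration, True
    return centers, max_iterations, False
```

If the returned flag is false for some value of k, the cap of 100 was hit before the clusters stabilized, and a higher cap (or a looser tolerance) is worth trying for that k.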

 @Maheshakya,

 This dataset consists of a higher amount of anomaly data than normal
 data, but in a practical scenario it will be the other way around. Because
 of that, it would be unrealistic if I used those 65% of anomaly data to
 evaluate the model. The amount of normal data I used to build the model is
 also less than those 65% of anomaly data. Yes, since our purpose is to detect
 anomalies, it would be good to try with more anomaly data to evaluate the
 model. Thanks, and I'll try to use them also.

 Best Regards,

 Ashen

 On Tue, Aug 25, 2015 at 12:35 PM, Maheshakya Wijewardena 
 mahesha...@wso2.com wrote:

 Is there any particular reason why you are putting aside 65% of the anomalous
 data at the evaluation? Since there is an obvious imbalance when the
 numbers of normal and abnormal cases are taken into account, you will get
 greater accuracy at the evaluation because a model tends to produce more
 accurate results for the class with the greater size. But that is not the
 case for the class of smaller size. With fewer records, it won't make
 much impact on the accuracy. Hence IMO, it would be better if you could
 evaluate with more anomalous data,
 i.e. the number of records of each class needs to be roughly equal.
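
To illustrate the point about imbalance, here is a small plain-Python sketch that reports per-class rates next to the overall accuracy. The function name and the counts in the usage example are made up for illustration; they are not results from this project.

```python
def evaluate(actual, predicted):
    """Compare two equal-length lists of booleans, where True means anomaly."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # anomalies caught
    tn = sum(1 for a, p in pairs if not a and not p)  # normal kept as normal
    fp = sum(1 for a, p in pairs if not a and p)      # normal flagged as anomaly
    fn = sum(1 for a, p in pairs if a and not p)      # anomalies missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "anomaly_recall": tp / (tp + fn) if tp + fn else 0.0,
        "normal_recall": tn / (tn + fp) if tn + fp else 0.0,
    }

# Hypothetical test set: 980 normal records and 20 anomalies,
# scored by a "model" that labels everything as normal.
actual = [False] * 980 + [True] * 20
predicted = [False] * 1000
report = evaluate(actual, predicted)
```

In this made-up case the overall accuracy is 0.98 even though not a single anomaly is detected, which is why the per-class figures are worth reporting alongside it.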

 Best regards

 On Tue, Aug 25, 2015 at 12:05 PM, CD Athuraliya chathur...@wso2.com
 wrote:

 Hi Ashen,

 It would be better if you could add the assumptions you make in this
 process (uniform clusters etc.). It will make the process clearer IMO.

 Regards,
 CD

 On Tue, Aug 25, 2015 at 11:39 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Can we see the code too?


Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Ashen Weerathunga
Okay sure.

On Tue, Aug 25, 2015 at 3:55 PM, Nirmal Fernando nir...@wso2.com wrote:

 Sure. @Ashen, can you please arrange one?

 On Tue, Aug 25, 2015 at 2:35 PM, Srinath Perera srin...@wso2.com wrote:

 Nirmal, Seshika, shall we do a code review? This code should go into ML
 after UI part is done.

 Thanks
 Srinath
