Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset
@Nirmal, okay, I'll arrange it today.

@Mahesan, thanks for the suggestion. Yes, 100 must be too high for some cases. I thought that within 100 iterations it would most probably converge to stable clusters; that's why I put 100. Yes, for cases like k = 100 it might not be enough. Thanks, and I'll try different numbers of iterations as well.

On Wed, Aug 26, 2015 at 9:31 AM, Nirmal Fernando nir...@wso2.com wrote:
> @Ashen let's have a code review today, if it's possible.
> @Srinath Forgot to mention that I've already given some feedback to Ashen, on how he could use Spark transformations effectively in his code.
Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset
@Ashen let's have a code review today, if it's possible.

@Srinath Forgot to mention that I've already given some feedback to Ashen, on how he could use Spark transformations effectively in his code.

On Tue, Aug 25, 2015 at 4:33 PM, Ashen Weerathunga as...@wso2.com wrote:
> Okay sure.
Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset
Thanks all for the suggestions.

There are a few assumptions I have made:
- Clusters are uniform
- Fraud data will always be outliers to the normal clusters
- Clusters do not intersect with each other
- I have set the number of iterations to 100, so I assume that 100 iterations will be enough to produce almost stable clusters

@Maheshakya, this dataset contains more anomaly data than normal data, but in a practical scenario it will be the other way around. Because of that, it would be more unrealistic if I used that 65% of the anomaly data to evaluate the model. The amount of normal data I used to build the model is also less than that 65% of anomaly data. Yes, since our purpose is to detect anomalies, it would be good to try evaluating the model with more anomaly data. Thanks, and I'll try to use it as well.

Best Regards,
Ashen

On Tue, Aug 25, 2015 at 12:35 PM, Maheshakya Wijewardena mahesha...@wso2.com wrote:
> Is there any particular reason why you are putting aside 65% of the anomalous data at the evaluation? Since there is an obvious imbalance when the numbers of normal and abnormal cases are taken into account, you will get greater accuracy at the evaluation, because a model tends to produce more accurate results for the class with the greater size. But that is not the case for the class of smaller size: with a smaller number of records, it won't make much impact on the accuracy. Hence, IMO, it would be better if you could evaluate with more anomalous data, i.e. the number of records of each class needs to be roughly equal.
>
> Best regards
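The assumption in this thread that 100 iterations is enough for almost stable clusters can be checked empirically by counting how many iterations the algorithm actually needs before the centers stop moving. The following is a minimal sketch in plain Python, not the Spark implementation used in the project; the function name `lloyd_kmeans` and the `tolerance` parameter are assumptions made for illustration.

```python
import math
import random


def lloyd_kmeans(points, k, max_iterations=100, tolerance=1e-6, seed=0):
    """Plain Lloyd's k-means.

    Returns (centers, iterations_used, converged). `converged` is False
    when the iteration cap was hit before the centers stopped moving.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for iteration in range(1, max_iterations + 1):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                new_centers.append([sum(col) / len(group) for col in zip(*group)])
            else:
                new_centers.append(centers[i])  # an empty cluster keeps its center
        # Largest distance any center moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            return centers, iteration, True
    return centers, max_iterations, False
```

If `converged` comes back False for a given k, the cap was too low for that run; if `iterations_used` is far below the cap, the cap is not the limiting factor. As far as I know, Spark MLlib's KMeans exposes comparable settings (a maximum iteration count and a convergence epsilon), but the sketch above does not call Spark.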
Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset
Nirmal, Seshika, shall we do a code review? This code should go into ML after the UI part is done.

Thanks
Srinath

On Tue, Aug 25, 2015 at 2:20 PM, Ashen Weerathunga as...@wso2.com wrote:
> Hi all,
>
> This is the source code of the project.
> https://github.com/ashensw/Spark-KMeans-fraud-detection
>
> Best Regards,
> Ashen
Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset
Sure. @Ashen, can you please arrange one?

On Tue, Aug 25, 2015 at 2:35 PM, Srinath Perera srin...@wso2.com wrote:
> Nirmal, Seshika, shall we do a code review? This code should go into ML after the UI part is done.
>
> Thanks
> Srinath
Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset
Hi Ashen,

It would be better if you can add the assumptions you make in this process (uniform clusters etc.). It will make the process clearer, IMO.

Regards,
CD

On Tue, Aug 25, 2015 at 11:39 AM, Nirmal Fernando nir...@wso2.com wrote:
> Can we see the code too?

--
CD Athuraliya
Software Engineer
WSO2, Inc.
lean . enterprise . middleware
Mobile: +94 716288847
LinkedIn http://lk.linkedin.com/in/cdathuraliya | Twitter https://twitter.com/cdathuraliya | Blog http://cdathuraliya.tumblr.com/

___
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset
Is there any particular reason why you are putting aside 65% of the anomalous data at the evaluation? Since there is an obvious imbalance when the numbers of normal and abnormal cases are taken into account, you will get greater accuracy at the evaluation, because a model tends to produce more accurate results for the class with the greater size. But that is not the case for the class of smaller size: with a smaller number of records, it won't make much impact on the accuracy. Hence, IMO, it would be better if you could evaluate with more anomalous data, i.e. the number of records of each class needs to be roughly equal.

Best regards

On Tue, Aug 25, 2015 at 12:05 PM, CD Athuraliya chathur...@wso2.com wrote:
> Hi Ashen,
>
> It would be better if you can add the assumptions you make in this process (uniform clusters etc.). It will make the process clearer, IMO.
>
> Regards,
> CD

--
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 : http://wso2.com/
Email: mahesha...@wso2.com
Mobile: +94711228855
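The point about class imbalance can be made concrete by reporting per-class rates next to overall accuracy. This is a small sketch in plain Python, not taken from the project's code; the function name `evaluate`, the label strings and the example counts are made up for illustration.

```python
def evaluate(actual, predicted, positive="anomaly"):
    """Overall accuracy plus per-class recall for a two-class detector."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    result = {"accuracy": (tp + tn) / len(pairs)}
    # Recall is only defined for a class that actually occurs in `actual`.
    if tp + fn:
        result["anomaly_recall"] = tp / (tp + fn)
    if tn + fp:
        result["normal_recall"] = tn / (tn + fp)
    return result


# Hypothetical evaluation set: 950 normal records and 50 anomalies, scored by
# a detector that labels every record "normal".
actual = ["normal"] * 950 + ["anomaly"] * 50
predicted = ["normal"] * 1000
scores = evaluate(actual, predicted)
```

In this made-up case the accuracy is 0.95 even though no anomaly is detected (`anomaly_recall` is 0.0), which is why the per-class numbers, or an evaluation set with roughly equal class sizes, say more than accuracy alone.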
Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset
Can we see the code too?

On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga as...@wso2.com wrote:
> Hi all,
>
> I am currently working on a fraud detection project. I was able to cluster the KDD Cup 99 network anomaly detection dataset using the Apache Spark k-means algorithm. So far I have been able to achieve a 99% accuracy rate on this dataset. The steps I followed during the process are listed below.
>
> - Separate the dataset into two parts (normal data and anomaly data) by filtering on the label
> - Split each of the two parts as follows
>   - normal data
>     - 65% - to train the model
>     - 15% - to optimize the model by adjusting hyperparameters
>     - 20% - to evaluate the model
>   - anomaly data
>     - 65% - no use
>     - 15% - to optimize the model by adjusting hyperparameters
>     - 20% - to evaluate the model
> - Preprocess the dataset
>   - Drop non-numerical features, since k-means can only handle numerical values
>   - Normalize all the values to the 0-1 range
> - Cluster the 65% of normal data using Apache Spark k-means and build the model (15% of both normal and anomaly data were used to tune hyperparameters such as k, the percentile etc. to get an optimized model)
> - Finally, evaluate the model using 20% of both normal and anomaly data.
>
> The method of identifying a fraud is as follows:
>
> - When a new data point comes in, get the closest cluster center using the k-means predict function.
> - I have calculated the 98th percentile distance for each cluster. (98 was the best value I got by tuning the model with different values.)
> - Then I check whether the distance of the new data point from the given cluster center is less than or greater than the 98th percentile of that cluster. If it is less than the percentile, the point is considered normal data. If it is greater than the percentile, it is considered a fraud, since it lies outside the cluster.
>
> Our next step is to integrate this feature into the ML product and try it out with a more realistic dataset.
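The detection rule described in the original post (nearest center, per-cluster percentile distance, threshold comparison) can be sketched as follows. This is plain Python under stated assumptions, not the project's Spark code: the helper names are hypothetical, distances are Euclidean, and the percentile uses the nearest-rank definition, which is only one of several common definitions.

```python
import math


def nearest_center(point, centers):
    """Index of the closest center (the role of the k-means predict call)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def cluster_thresholds(train_points, centers, q=98):
    """Per-cluster q-th percentile of the training distances to the center."""
    distances = {i: [] for i in range(len(centers))}
    for p in train_points:
        i = nearest_center(p, centers)
        distances[i].append(math.dist(p, centers[i]))
    return {i: percentile(d, q) for i, d in distances.items() if d}


def is_anomaly(point, centers, thresholds):
    """True when the point lies farther from its center than the threshold."""
    i = nearest_center(point, centers)
    # Assumption: a cluster with no training points gets threshold 0.0,
    # so anything assigned to it (other than the center itself) is flagged.
    return math.dist(point, centers[i]) > thresholds.get(i, 0.0)
```

With q = 98, about 2% of the normal training points fall outside their own cluster's threshold by construction, so the percentile trades missed anomalies against false alarms on normal traffic.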
A summery of results I have obtained using 98th percentile during the process is attached with this. https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing Thanks and Regards, Ashen -- *Ashen Weerathunga* Software Engineer - Intern WSO2 Inc.: http://wso2.com lean.enterprise.middleware Email: as...@wso2.com Mobile: +94 716042995 94716042995 LinkedIn: *http://lk.linkedin.com/in/ashenweerathunga http://lk.linkedin.com/in/ashenweerathunga* -- Thanks regards, Nirmal Team Lead - WSO2 Machine Learner Associate Technical Lead - Data Technologies Team, WSO2 Inc. Mobile: +94715779733 Blog: http://nirmalfdo.blogspot.com/ ___ Dev mailing list Dev@wso2.org http://wso2.org/cgi-bin/mailman/listinfo/dev
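The fraud-identification method described in this message can be sketched in a few lines. This is a minimal plain-Python illustration, not the project's Spark code: the function names (`nearest_center`, `cluster_thresholds`, `is_fraud`), the nearest-rank percentile, and the toy data are all assumptions made for the example. It also assumes the cluster centers already come from a trained k-means model and that all points are already normalized to the 0-1 range.

```python
import math
from collections import defaultdict


def nearest_center(point, centers):
    """Return (index, distance) of the closest center by Euclidean distance."""
    distances = [math.dist(point, c) for c in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in 0-100) of a non-empty, sorted list."""
    rank = max(1, math.ceil(q / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def cluster_thresholds(train_points, centers, q=98.0):
    """Per-cluster q-th percentile of training distances to the assigned center."""
    by_cluster = defaultdict(list)
    for p in train_points:
        idx, d = nearest_center(p, centers)
        by_cluster[idx].append(d)
    return {idx: percentile(sorted(ds), q) for idx, ds in by_cluster.items()}


def is_fraud(point, centers, thresholds):
    """Flag a point whose distance to its nearest center exceeds that cluster's threshold."""
    idx, d = nearest_center(point, centers)
    # Assumption: a cluster that received no training points gets threshold 0.0,
    # so any point at a nonzero distance from its center is flagged.
    return d > thresholds.get(idx, 0.0)


# Toy data (hypothetical): two centers and a handful of "normal" training points.
centers = [(0.0, 0.0), (1.0, 1.0)]
train = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
         (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
thresholds = cluster_thresholds(train, centers, q=98.0)

inside = is_fraud((0.02, 0.03), centers, thresholds)   # close to the first center
outside = is_fraud((0.5, 0.4), centers, thresholds)    # far from both centers
```

Keeping one threshold per cluster, rather than one global cut-off, lets tight and loose clusters each define their own notion of "outside", which matches the per-cluster percentile described in the message.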
Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset
Hi all,

This is the source code of the project.
https://github.com/ashensw/Spark-KMeans-fraud-detection

Best Regards,
Ashen

On Tue, Aug 25, 2015 at 2:00 PM, Ashen Weerathunga as...@wso2.com wrote:

Thanks all for the suggestions.

There are a few assumptions I have made:

- Clusters are uniform
- Fraud data will always be outliers to the normal clusters
- Clusters do not intersect with each other
- I have set the number of iterations to 100, so I assume that 100 iterations will be enough to produce almost stable clusters

@Maheshakya, this dataset contains more anomaly data than normal data, but in a practical scenario it will be the other way around. Because of that, it would be unrealistic to use that 65% of the anomaly data to evaluate the model. The amount of normal data I used to build the model is also less than that 65% of anomaly data. Yes, since our purpose is to detect anomalies, it would be good to try evaluating the model with more anomaly data. Thanks, and I'll try to use them as well.

Best Regards,
Ashen

On Tue, Aug 25, 2015 at 12:35 PM, Maheshakya Wijewardena mahesha...@wso2.com wrote:

Is there any particular reason why you are setting aside 65% of the anomalous data at evaluation? Since there is an obvious imbalance between the numbers of normal and abnormal cases, you will get a higher accuracy at evaluation, because a model tends to produce more accurate results for the larger class. That is not the case for the smaller class: with fewer records, it won't have much impact on the accuracy. Hence, IMO, it would be better to evaluate with more anomalous data, i.e. the number of records in each class should be roughly equal.

Best regards

On Tue, Aug 25, 2015 at 12:05 PM, CD Athuraliya chathur...@wso2.com wrote:

Hi Ashen,

It would be better if you could add the assumptions you make in this process (uniform clusters etc.). It will make the process clearer IMO.

Regards,
CD
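The point raised in this thread about class imbalance can be made concrete with a little arithmetic. The sketch below is only an illustration: the `evaluate` helper and every count in it are hypothetical and are not taken from the attached spreadsheet. It shows how overall accuracy can stay high while half of the smaller class is missed, and how the same detector looks once both classes are roughly the same size.

```python
def evaluate(tp, fn, tn, fp):
    """Return accuracy and per-class recall from confusion-matrix counts.

    tp / fn: anomalies flagged / missed.
    tn / fp: normal records passed / wrongly flagged.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "anomaly_recall": tp / (tp + fn),
        "normal_recall": tn / (tn + fp),
    }


# Hypothetical evaluation set: 10,000 normal records and only 100 anomalies.
# The detector misses half of the anomalies, yet accuracy is about 0.985.
imbalanced = evaluate(tp=50, fn=50, tn=9900, fp=100)

# Same per-class behaviour, but with 10,000 anomalies:
# accuracy drops to 0.745, exposing the weak anomaly recall.
balanced = evaluate(tp=5000, fn=5000, tn=9900, fp=100)
```

Reporting recall per class alongside accuracy is one way to keep the comparison meaningful whichever class ratio is used for evaluation.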
Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset
Okay sure.

On Tue, Aug 25, 2015 at 3:55 PM, Nirmal Fernando nir...@wso2.com wrote:

Sure. @Ashen, can you please arrange one?

On Tue, Aug 25, 2015 at 2:35 PM, Srinath Perera srin...@wso2.com wrote:

Nirmal, Seshika, shall we do a code review? This code should go into ML after the UI part is done.

Thanks
Srinath

On Tue, Aug 25, 2015 at 2:20 PM, Ashen Weerathunga as...@wso2.com wrote:

Hi all,

This is the source code of the project.
https://github.com/ashensw/Spark-KMeans-fraud-detection

Best Regards,
Ashen
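One of the assumptions stated earlier in this thread is that 100 iterations are enough to reach almost stable clusters. Rather than assuming it, the number of iterations actually needed can be measured. The following is a minimal plain-Python sketch of Lloyd's k-means with an iteration cap and a convergence tolerance; it is not Spark's implementation, and the function name, the `tol` parameter, and the toy data are assumptions made for illustration.

```python
import math


def kmeans(points, centers, max_iterations=100, tol=1e-6):
    """Lloyd's algorithm; returns (centers, iterations_used, converged)."""
    centers = [tuple(c) for c in centers]
    for iteration in range(1, max_iterations + 1):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)),
                      key=lambda i: math.dist(p, centers[i]))
            groups[idx].append(p)
        # Update step: move each center to the mean of its group;
        # a center with an empty group stays where it is.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        # Largest distance any center moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            return centers, iteration, True
    return centers, max_iterations, False


# Toy data (hypothetical): two well-separated groups and two initial centers.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
initial = [(0.0, 0.0), (5.0, 5.0)]

final_centers, used, converged = kmeans(points, initial)
_, used_capped, converged_capped = kmeans(points, initial, max_iterations=1)
```

If `converged` is false when the cap is hit, the cap was too low for that value of k; if `iterations_used` is far below the cap, the cap is simply generous and costs nothing extra.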