Hi all,

This is the source code of the project:
https://github.com/ashensw/Spark-KMeans-fraud-detection

Best Regards,
Ashen

On Tue, Aug 25, 2015 at 2:00 PM, Ashen Weerathunga <as...@wso2.com> wrote:

> Thanks all for the suggestions,
>
> These are the assumptions I have made:
>
>    - Clusters are uniform
>    - Fraud data will always be outliers to the normal clusters
>    - Clusters do not intersect with each other
>    - I have given the number of iterations as 100, so I assume that 100
>    iterations will be enough to produce almost stable clusters
>
> @Maheshakya,
>
> This dataset contains more anomaly data than normal data, but the
> practical scenario will be the opposite. Because of that, it would be
> unrealistic to use that 65% of the anomaly data to evaluate the model.
> The amount of normal data I used to build the model is also less than
> that 65% of the anomaly data. Yes, since our purpose is to detect
> anomalies, it would be good to evaluate the model with more anomaly
> data. Thanks, and I'll try to use it as well.
>
> Best Regards,
>
> Ashen
>
> On Tue, Aug 25, 2015 at 12:35 PM, Maheshakya Wijewardena <
> mahesha...@wso2.com> wrote:
>
>> Is there any particular reason why you are putting aside 65% of the
>> anomalous data at the evaluation? Since there is an obvious imbalance
>> when the numbers of normal and abnormal cases are taken into account,
>> you will get greater accuracy at the evaluation because a model tends
>> to produce more accurate results for the class with the greater size.
>> But that is not the case for the class of smaller size. With a smaller
>> number of records, it won't have much impact on the accuracy. Hence
>> IMO, it would be better if you could evaluate with more anomalous data,
>> i.e. the number of records of each class needs to be roughly equal.
>>
>> Best regards
>>
>> On Tue, Aug 25, 2015 at 12:05 PM, CD Athuraliya <chathur...@wso2.com>
>> wrote:
>>
>>> Hi Ashen,
>>>
>>> It would be better if you can add the assumptions you make in this
>>> process (uniform clusters etc). It will make the process clearer IMO.
>>>
>>> Regards,
>>> CD
>>>
>>> On Tue, Aug 25, 2015 at 11:39 AM, Nirmal Fernando <nir...@wso2.com>
>>> wrote:
>>>
>>>> Can we see the code too?
>>>>
>>>> On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga <as...@wso2.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am currently working on a fraud detection project. I was able to
>>>>> cluster the KDD Cup 99 network anomaly detection dataset using the
>>>>> Apache Spark k-means algorithm. So far I was able to achieve a 99%
>>>>> accuracy rate on this dataset. The steps I followed during the
>>>>> process are listed below.
>>>>>
>>>>>    - Separate the dataset into two parts (normal data and anomaly
>>>>>    data) by filtering on the label
>>>>>    - Split each of the two parts as follows
>>>>>       - normal data
>>>>>          - 65% - to train the model
>>>>>          - 15% - to optimize the model by adjusting hyperparameters
>>>>>          - 20% - to evaluate the model
>>>>>       - anomaly data
>>>>>          - 65% - not used
>>>>>          - 15% - to optimize the model by adjusting hyperparameters
>>>>>          - 20% - to evaluate the model
>>>>>    - Preprocess the dataset
>>>>>       - Drop non-numerical features since k-means can only handle
>>>>>       numerical values
>>>>>       - Normalize all the values to the 0-1 range
>>>>>    - Cluster the 65% of normal data using Apache Spark k-means and
>>>>>    build the model (15% of both normal and anomaly data were used to
>>>>>    tune the hyperparameters such as k, percentile etc. to get an
>>>>>    optimized model)
>>>>>    - Finally evaluate the model using 20% of both normal and anomaly
>>>>>    data.
>>>>>
>>>>> The method of identifying a fraud is as follows:
>>>>>
>>>>>    - When a new data point comes in, get the closest cluster center
>>>>>    by using the k-means predict function.
>>>>>    - I have calculated the 98th percentile distance for each cluster.
>>>>>    (98 was the best value I got by tuning the model with different
>>>>>    values)
>>>>>    - Then I checked whether the distance of the new data point from
>>>>>    the given cluster center is less than or greater than the 98th
>>>>>    percentile of that cluster. If it is less than the percentile, it
>>>>>    is considered normal data. If it is greater than the percentile,
>>>>>    it is considered a fraud since it is outside the cluster.
>>>>>
>>>>> Our next step is to integrate this feature into the ML product and
>>>>> try it out with a more realistic dataset. A summary of the results I
>>>>> obtained using the 98th percentile during the process is attached.
>>>>>
>>>>> https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing
>>>>>
>>>>> Thanks and Regards,
>>>>> Ashen
>>>>> --
>>>>> *Ashen Weerathunga*
>>>>> Software Engineer - Intern
>>>>> WSO2 Inc.: http://wso2.com
>>>>> lean.enterprise.middleware
>>>>>
>>>>> Email: as...@wso2.com
>>>>> Mobile: +94 716042995 <94716042995>
>>>>> LinkedIn:
>>>>> *http://lk.linkedin.com/in/ashenweerathunga
>>>>> <http://lk.linkedin.com/in/ashenweerathunga>*
>>>>>
>>>>
>>>> --
>>>>
>>>> Thanks & regards,
>>>> Nirmal
>>>>
>>>> Team Lead - WSO2 Machine Learner
>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>> Mobile: +94715779733
>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>
>>> --
>>> *CD Athuraliya*
>>> Software Engineer
>>> WSO2, Inc.
>>> lean . enterprise . middleware
>>> Mobile: +94 716288847 <94716288847>
>>> LinkedIn <http://lk.linkedin.com/in/cdathuraliya> | Twitter
>>> <https://twitter.com/cdathuraliya> | Blog
>>> <http://cdathuraliya.tumblr.com/>
>>>
>>
>> --
>> Pruthuvi Maheshakya Wijewardena
>> Software Engineer
>> WSO2 : http://wso2.com/
>> Email: mahesha...@wso2.com
>> Mobile: +94711228855
>>
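
For anyone reading the thread without opening the repository, the detection rule described in the quoted message (normalize to the 0-1 range, cluster the normal data, take a per-cluster 98th percentile distance, flag anything beyond it) can be sketched roughly as below. This is only an illustration in plain Python and is not the code in the repository: the k-means loop is a simple stand-in for Spark's implementation, the farthest-point initialization and the nearest-rank percentile are my own assumptions, all function names are hypothetical, and the toy data is made up.

```python
import math


def fit_min_max(rows):
    """Learn per-feature min/max on training rows; return a 0-1 scaling function."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]

    def scale(row):
        # Constant features are mapped to 0.0 to avoid division by zero.
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]

    return scale


def closest(point, centers):
    """Return (index, distance) of the nearest cluster center."""
    dists = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]


def init_centers(points, k):
    """Deterministic farthest-point initialization (an assumption, not Spark's)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def k_means(points, k, iterations=100):
    """Plain Lloyd's algorithm run for a fixed number of iterations."""
    centers = init_centers(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[closest(p, centers)[0]].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return centers


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def fit_thresholds(points, centers, pct=98):
    """Per-cluster distance threshold at the given percentile."""
    per_cluster = [[] for _ in centers]
    for p in points:
        idx, dist = closest(p, centers)
        per_cluster[idx].append(dist)
    return [percentile(d, pct) if d else 0.0 for d in per_cluster]


def is_fraud(point, centers, thresholds):
    """Flag a point that lies beyond its nearest cluster's threshold."""
    idx, dist = closest(point, centers)
    return dist > thresholds[idx]


# Toy "normal" training data: two tight groups of 25 points each (made up).
normal = [[x + i * 0.1, x + j * 0.1] for x in (0.0, 10.0) for i in range(5) for j in range(5)]
scale = fit_min_max(normal)
train = [scale(row) for row in normal]
centers = k_means(train, k=2, iterations=100)
thresholds = fit_thresholds(train, centers, pct=98)

print(is_fraud(scale([0.2, 0.2]), centers, thresholds))  # point inside a cluster
print(is_fraud(scale([5.0, 5.0]), centers, thresholds))  # point between clusters
```

Note that the scaling function is fitted on the training data only and then reused for new points, so a new point can fall outside the 0-1 range; in this sketch that simply makes it more likely to be flagged.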
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev