@Ashen, let's have a code review today, if possible. @Srinath, I forgot to mention that I've already given Ashen some feedback on how he could use Spark transformations effectively in his code.

On Tue, Aug 25, 2015 at 4:33 PM, Ashen Weerathunga <as...@wso2.com> wrote:

> Okay sure.
>
> On Tue, Aug 25, 2015 at 3:55 PM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>> Sure. @Ashen, can you please arrange one?
>>
>> On Tue, Aug 25, 2015 at 2:35 PM, Srinath Perera <srin...@wso2.com> wrote:
>>
>>> Nirmal, Seshika, shall we do a code review? This code should go into ML
>>> after the UI part is done.
>>>
>>> Thanks
>>> Srinath
>>>
>>> On Tue, Aug 25, 2015 at 2:20 PM, Ashen Weerathunga <as...@wso2.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> This is the source code of the project.
>>>> https://github.com/ashensw/Spark-KMeans-fraud-detection
>>>>
>>>> Best Regards,
>>>> Ashen
>>>>
>>>> On Tue, Aug 25, 2015 at 2:00 PM, Ashen Weerathunga <as...@wso2.com>
>>>> wrote:
>>>>
>>>>> Thanks all for the suggestions,
>>>>>
>>>>> There are a few assumptions I have made:
>>>>>
>>>>> - Clusters are uniform
>>>>> - Fraud data will always be outliers to the normal clusters
>>>>> - Clusters do not intersect with each other
>>>>> - I have set the number of iterations to 100, so I assume that
>>>>> 100 iterations will be enough to produce almost stable clusters
>>>>>
>>>>> @Maheshakya,
>>>>>
>>>>> This dataset contains more anomaly data than normal data, but in
>>>>> practice it will be the other way around. Because of that, it would be
>>>>> less realistic if I used that 65% of the anomaly data to evaluate the
>>>>> model. The amount of normal data I used to build the model is also less
>>>>> than that 65% of the anomaly data. Yes, since our purpose is to detect
>>>>> anomalies, it would be good to evaluate the model with more anomaly
>>>>> data. Thanks, I'll try to use that data as well.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Ashen
>>>>>
>>>>> On Tue, Aug 25, 2015 at 12:35 PM, Maheshakya Wijewardena <
>>>>> mahesha...@wso2.com> wrote:
>>>>>
>>>>>> Is there any particular reason why you are putting aside 65% of the
>>>>>> anomalous data at the evaluation?
>>>>>> Since there is an obvious imbalance when the numbers of normal and
>>>>>> abnormal cases are taken into account, you will get greater accuracy
>>>>>> at the evaluation, because a model tends to produce more accurate
>>>>>> results for the class with the greater size. But that is not the case
>>>>>> for the class of smaller size. With fewer records, it won't have much
>>>>>> impact on the accuracy. Hence, IMO, it would be better if you could
>>>>>> evaluate with more anomalous data,
>>>>>> i.e. the number of records of each class needs to be roughly equal.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> On Tue, Aug 25, 2015 at 12:05 PM, CD Athuraliya <chathur...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ashen,
>>>>>>>
>>>>>>> It would be better if you could add the assumptions you make in this
>>>>>>> process (uniform clusters etc.). It will make the process clearer, IMO.
>>>>>>>
>>>>>>> Regards,
>>>>>>> CD
>>>>>>>
>>>>>>> On Tue, Aug 25, 2015 at 11:39 AM, Nirmal Fernando <nir...@wso2.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can we see the code too?
>>>>>>>>
>>>>>>>> On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga <as...@wso2.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I am currently working on a fraud detection project. I was able to
>>>>>>>>> cluster the KDD Cup 99 network anomaly detection dataset using the
>>>>>>>>> Apache Spark k-means algorithm. So far I have been able to achieve
>>>>>>>>> a 99% accuracy rate on this dataset. The steps I followed during
>>>>>>>>> the process are listed below.
>>>>>>>>>
>>>>>>>>> - Separate the dataset into two parts (normal data and anomaly
>>>>>>>>> data) by filtering on the label
>>>>>>>>> - Split each of the two parts as follows
>>>>>>>>>    - normal data
>>>>>>>>>       - 65% - to train the model
>>>>>>>>>       - 15% - to optimize the model by adjusting
>>>>>>>>>       hyperparameters
>>>>>>>>>       - 20% - to evaluate the model
>>>>>>>>>    - anomaly data
>>>>>>>>>       - 65% - not used
>>>>>>>>>       - 15% - to optimize the model by adjusting
>>>>>>>>>       hyperparameters
>>>>>>>>>       - 20% - to evaluate the model
>>>>>>>>> - Preprocess the dataset
>>>>>>>>>    - Drop non-numerical features, since k-means can only
>>>>>>>>>    handle numerical values
>>>>>>>>>    - Normalize all the values to the 0-1 range
>>>>>>>>> - Cluster the 65% of normal data using Apache Spark k-means
>>>>>>>>> and build the model (15% of both normal and anomaly data were used
>>>>>>>>> to tune the hyperparameters such as k, percentile etc. to get an
>>>>>>>>> optimized model)
>>>>>>>>> - Finally, evaluate the model using 20% of both normal and
>>>>>>>>> anomaly data.
>>>>>>>>>
>>>>>>>>> The method for identifying a fraud is as follows:
>>>>>>>>>
>>>>>>>>> - When a new data point arrives, get the closest cluster center
>>>>>>>>> using the k-means predict function.
>>>>>>>>> - I calculated the 98th percentile distance for each cluster.
>>>>>>>>> (98 was the best value I got by tuning the model with different
>>>>>>>>> values.)
>>>>>>>>> - Then I check whether the distance from the new data point to
>>>>>>>>> the given cluster center is less than or greater than the 98th
>>>>>>>>> percentile of that cluster. If it is less than the percentile, it
>>>>>>>>> is considered normal data. If it is greater than the percentile, it
>>>>>>>>> is considered a fraud, since it lies outside the cluster.
>>>>>>>>>
>>>>>>>>> Our next step is to integrate this feature into the ML product and
>>>>>>>>> try it out with a more realistic dataset.
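
The percentile-based check described in the steps above can be sketched roughly as follows. This is a minimal illustration in plain Python rather than Spark, and it assumes the cluster centers have already been produced by k-means. The function names (`closest_center`, `cluster_thresholds`, `is_fraud`), the nearest-rank percentile definition, and the toy data are assumptions made for this example; they are not taken from the project's code.

```python
import math


def closest_center(point, centers):
    """Return (index, distance) of the nearest center by Euclidean distance."""
    distances = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=distances.__getitem__)
    return idx, distances[idx]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (one of several definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cluster_thresholds(train_points, centers, pct=98):
    """For each cluster, the pct-th percentile of member-to-center distances."""
    per_cluster = {i: [] for i in range(len(centers))}
    for p in train_points:
        idx, d = closest_center(p, centers)
        per_cluster[idx].append(d)
    # An empty cluster gets threshold 0.0, so any point assigned to it is flagged.
    return {i: percentile(ds, pct) if ds else 0.0 for i, ds in per_cluster.items()}


def is_fraud(point, centers, thresholds):
    """Flag a point whose distance to its closest center exceeds that cluster's threshold."""
    idx, d = closest_center(point, centers)
    return d > thresholds[idx]


# Toy data (hypothetical): two centers and a handful of "normal" training points.
centers = [(0.0, 0.0), (10.0, 10.0)]
train = [(0.0, 1.0), (1.0, 0.0), (0.0, 2.0), (10.0, 11.0), (11.0, 10.0)]
thresholds = cluster_thresholds(train, centers, pct=98)

print(is_fraud((0.5, 0.5), centers, thresholds))  # close to the first center
print(is_fraud((4.0, 4.0), centers, thresholds))  # far from both centers
```

Because only normal data is used to compute the thresholds, roughly 2% of normal training points fall outside their own cluster's 98th percentile by construction, which is one reason the percentile is worth tuning on the held-out 15%.
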
>>>>>>>>> A summary of the results I obtained using the 98th percentile
>>>>>>>>> during the process is attached.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Ashen
>>>>>>>>> --
>>>>>>>>> *Ashen Weerathunga*
>>>>>>>>> Software Engineer - Intern
>>>>>>>>> WSO2 Inc.: http://wso2.com
>>>>>>>>> lean.enterprise.middleware
>>>>>>>>>
>>>>>>>>> Email: as...@wso2.com
>>>>>>>>> Mobile: +94 716042995 <94716042995>
>>>>>>>>> LinkedIn:
>>>>>>>>> *http://lk.linkedin.com/in/ashenweerathunga
>>>>>>>>> <http://lk.linkedin.com/in/ashenweerathunga>*
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Thanks & regards,
>>>>>>>> Nirmal
>>>>>>>>
>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>> Mobile: +94715779733
>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *CD Athuraliya*
>>>>>>> Software Engineer
>>>>>>> WSO2, Inc.
>>>>>>> lean . enterprise . middleware
>>>>>>> Mobile: +94 716288847 <94716288847>
>>>>>>> LinkedIn <http://lk.linkedin.com/in/cdathuraliya> | Twitter
>>>>>>> <https://twitter.com/cdathuraliya> | Blog
>>>>>>> <http://cdathuraliya.tumblr.com/>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Pruthuvi Maheshakya Wijewardena
>>>>>> Software Engineer
>>>>>> WSO2 : http://wso2.com/
>>>>>> Email: mahesha...@wso2.com
>>>>>> Mobile: +94711228855
>>>>>>
>>>
>>> --
>>> ============================
>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>> Site: http://people.apache.org/~hemapani/
>>> Photos: http://www.flickr.com/photos/hemapani/
>>> Phone: 0772360902
>>>

--

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev