Is there a particular reason for setting aside 65% of the anomalous data instead of using it in the evaluation? Because the normal and anomalous classes differ so much in size, the overall accuracy at evaluation will be dominated by the larger class: a model tends to produce more accurate results for the class with more records, and mistakes on the smaller class barely affect the accuracy figure because it has so few records. Hence IMO it would be better to evaluate with more anomalous data, i.e. the number of records in each class should be roughly equal.
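To illustrate the point, here is a minimal sketch of how a high overall accuracy can hide poor results on the smaller class. The counts are hypothetical (they are not taken from the KDD Cup 99 results in this thread), and the function name `evaluation_summary` is mine:

```python
def evaluation_summary(tp, fn, tn, fp):
    """Summarise a binary anomaly detector from confusion-matrix counts.

    tp / fn: anomalies correctly flagged / missed
    tn / fp: normal records correctly passed / wrongly flagged
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    anomaly_recall = tp / (tp + fn)   # fraction of anomalies detected
    normal_recall = tn / (tn + fp)    # fraction of normal records passed
    # Balanced accuracy weights both classes equally, whatever their sizes.
    balanced_accuracy = (anomaly_recall + normal_recall) / 2
    return {
        "accuracy": accuracy,
        "anomaly_recall": anomaly_recall,
        "normal_recall": normal_recall,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical evaluation set: 9,800 normal records and 200 anomalies.
# Half of the anomalies are missed, yet overall accuracy stays near 98%.
summary = evaluation_summary(tp=100, fn=100, tn=9702, fp=98)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Reporting per-class recall (or balanced accuracy) next to the overall accuracy would show whether the 99% figure holds for both classes.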
Best regards

On Tue, Aug 25, 2015 at 12:05 PM, CD Athuraliya <chathur...@wso2.com> wrote:

> Hi Ashen,
>
> It would be better if you can add the assumptions you make in this process
> (uniform clusters etc). It will make the process more clear IMO.
>
> Regards,
> CD
>
> On Tue, Aug 25, 2015 at 11:39 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>> Can we see the code too?
>>
>> On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga <as...@wso2.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am currently working on a fraud detection project. I was able to
>>> cluster the KDD cup 99 network anomaly detection dataset using the
>>> Apache Spark k-means algorithm. So far I was able to achieve a 99%
>>> accuracy rate on this dataset. The steps I have followed during the
>>> process are mentioned below.
>>>
>>>    - Separate the dataset into two parts (normal data and anomaly data)
>>>    by filtering the label
>>>    - Split each of the two parts as follows
>>>       - normal data
>>>          - 65% - to train the model
>>>          - 15% - to optimize the model by adjusting hyper parameters
>>>          - 20% - to evaluate the model
>>>       - anomaly data
>>>          - 65% - no use
>>>          - 15% - to optimize the model by adjusting hyper parameters
>>>          - 20% - to evaluate the model
>>>    - Preprocess the dataset
>>>       - Drop non-numerical features, since k-means can only handle
>>>       numerical values
>>>       - Normalize all the values to the 0-1 range
>>>    - Cluster the 65% of normal data using Apache Spark k-means and
>>>    build the model (15% of both normal and anomaly data were used to
>>>    tune the hyper parameters such as k, percentile etc. to get an
>>>    optimized model)
>>>    - Finally, evaluate the model using 20% of both normal and anomaly
>>>    data.
>>>
>>> The method of identifying a fraud is as follows,
>>>
>>>    - When a new data point comes, get the closest cluster center by
>>>    using the k-means predict function.
>>>    - I have calculated the 98th percentile distance for each cluster.
>>>    (98 was the best value I got by tuning the model with different
>>>    values)
>>>    - Then I checked whether the distance of the new data point from the
>>>    given cluster center is less than or greater than the 98th
>>>    percentile of that cluster. If it is less than the percentile, it is
>>>    considered normal data. If it is greater than the percentile, it is
>>>    considered a fraud, since it is outside the cluster.
>>>
>>> Our next step is to integrate this feature into the ML product and try
>>> it out with a more realistic dataset. A summary of the results I have
>>> obtained using the 98th percentile during the process is attached with
>>> this.
>>>
>>>
>>> https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing
>>>
>>> Thanks and Regards,
>>> Ashen
>>> --
>>> *Ashen Weerathunga*
>>> Software Engineer - Intern
>>> WSO2 Inc.: http://wso2.com
>>> lean.enterprise.middleware
>>>
>>> Email: as...@wso2.com
>>> Mobile: +94 716042995 <94716042995>
>>> LinkedIn:
>>> *http://lk.linkedin.com/in/ashenweerathunga
>>> <http://lk.linkedin.com/in/ashenweerathunga>*
>>>
>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>
>
> --
> *CD Athuraliya*
> Software Engineer
> WSO2, Inc.
> lean . enterprise . middleware
> Mobile: +94 716288847 <94716288847>
> LinkedIn <http://lk.linkedin.com/in/cdathuraliya> | Twitter
> <https://twitter.com/cdathuraliya> | Blog
> <http://cdathuraliya.tumblr.com/>
>

--
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 : http://wso2.com/
Email: mahesha...@wso2.com
Mobile: +94711228855
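
P.S. Since the code has not been shared yet, here is how I read the percentile rule described in the quoted message, as a rough sketch only. It is plain Python, not the Spark implementation: the cluster centers are assumed to come from an already trained k-means model, all function names are hypothetical, the percentile uses the nearest-rank definition (the thread does not say which definition was used), and a distance exactly equal to the threshold is treated as normal.

```python
import math


def nearest_center(point, centers):
    """Return (index, distance) of the cluster center closest to point."""
    distances = [math.dist(point, center) for center in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cluster_thresholds(train_points, centers, pct=98):
    """Per-cluster distance threshold computed from the normal training data."""
    per_cluster = [[] for _ in centers]
    for point in train_points:
        index, distance = nearest_center(point, centers)
        per_cluster[index].append(distance)
    # A cluster that received no training points gets a threshold of 0.0.
    return [percentile(d, pct) if d else 0.0 for d in per_cluster]


def is_anomaly(point, centers, thresholds):
    """Flag a point lying farther from its nearest center than the threshold."""
    index, distance = nearest_center(point, centers)
    return distance > thresholds[index]


# Toy data in the 0-1 range; the centers stand in for a trained model.
centers = [(0.0, 0.0), (1.0, 1.0)]
train = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (0.9, 1.0), (1.0, 0.95)]
thresholds = cluster_thresholds(train, centers, pct=98)

print(is_anomaly((0.02, 0.03), centers, thresholds))  # close to a center
print(is_anomaly((0.5, 0.4), centers, thresholds))    # far from both centers
```

If the actual implementation differs from this (for example in how the percentile is computed), that would be worth listing among the assumptions CD asked for.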
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev