Can we see the code too?

On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga <as...@wso2.com> wrote:
> Hi all,
>
> I am currently working on a fraud detection project. I was able to cluster
> the KDD Cup 99 network anomaly detection dataset using the Apache Spark
> k-means algorithm. So far I have achieved a 99% accuracy rate on this
> dataset. The steps I followed during the process are listed below.
>
>    - Separate the dataset into two parts (normal data and anomaly data)
>      by filtering on the label.
>    - Split each of the two parts as follows:
>       - normal data
>          - 65% - to train the model
>          - 15% - to optimize the model by adjusting hyperparameters
>          - 20% - to evaluate the model
>       - anomaly data
>          - 65% - not used
>          - 15% - to optimize the model by adjusting hyperparameters
>          - 20% - to evaluate the model
>    - Preprocess the dataset:
>       - Drop the non-numerical features, since k-means can only handle
>         numerical values
>       - Normalize all values to the 0-1 range
>    - Cluster the 65% of normal data using Apache Spark k-means and
>      build the model (15% of both the normal and anomaly data was used to
>      tune hyperparameters such as k, the percentile, etc. to get an
>      optimized model).
>    - Finally, evaluate the model using 20% of both the normal and anomaly
>      data.
>
> The method of identifying a fraud is as follows:
>
>    - When a new data point arrives, get the closest cluster center using
>      the k-means predict function.
>    - I have calculated the 98th percentile distance for each cluster. (98
>      was the best value I got by tuning the model with different values.)
>    - Then I check whether the distance between the new data point and its
>      cluster center is less than or greater than the 98th percentile of
>      that cluster. If it is less than the percentile, the point is
>      considered normal. If it is greater than the percentile, the point is
>      considered a fraud, since it lies outside the cluster.
>
> Our next step is to integrate this feature into the ML product and try it
> out with a more realistic dataset. A summary of the results I obtained
> using the 98th percentile during the process is attached.
>
> https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing
>
> Thanks and Regards,
> Ashen
> --
> *Ashen Weerathunga*
> Software Engineer - Intern
> WSO2 Inc.: http://wso2.com
> lean.enterprise.middleware
>
> Email: as...@wso2.com
> Mobile: +94 716042995
> LinkedIn: http://lk.linkedin.com/in/ashenweerathunga

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
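
The detection method described in the quoted message (scale features to the 0-1 range, cluster the normal data with k-means, then flag any point that lies farther from its nearest center than that cluster's 98th-percentile distance) can be sketched roughly as below. This is not the code from the thread: it uses a small pure-Python k-means in place of Apache Spark's implementation so that it runs without a cluster, and every function name here (fit_scaler, scale, nearest, kmeans, percentile, cluster_thresholds, is_anomaly) is hypothetical. The nearest-rank percentile definition is also an assumption, since the message does not say how the percentile was computed.

```python
import math
import random


def fit_scaler(rows):
    """Return per-column (lows, highs) learned from the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def scale(row, lows, highs):
    """Min-max scale one row; constant columns map to 0.0."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)]


def nearest(point, centers):
    """Return (index, Euclidean distance) of the closest center."""
    dists = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; a stand-in for Spark's k-means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)[0]].append(p)
        # Recompute each center as the mean of its members;
        # an empty cluster keeps its previous center.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cluster_thresholds(points, centers, pct=98):
    """Per-cluster distance threshold at the given percentile."""
    dists = [[] for _ in centers]
    for p in points:
        idx, d = nearest(p, centers)
        dists[idx].append(d)
    # A cluster with no members gets a threshold of 0.0.
    return [percentile(d, pct) if d else 0.0 for d in dists]


def is_anomaly(point, centers, thresholds):
    """True when the point lies beyond its nearest cluster's threshold."""
    idx, d = nearest(point, centers)
    return d > thresholds[idx]
```

In this sketch the scaler, the centers, and the thresholds would all be fitted on the 65% training split of normal data only. The values of k and pct would then be tuned on the 15% validation split, and is_anomaly would be applied to scaled rows from the 20% evaluation split.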