@Ashen let's have a code review today, if it's possible.

@Srinath Forgot to mention that I've already given Ashen some feedback
on how he could use Spark transformations effectively in his code.

On Tue, Aug 25, 2015 at 4:33 PM, Ashen Weerathunga <as...@wso2.com> wrote:

> Okay sure.
>
> On Tue, Aug 25, 2015 at 3:55 PM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>> Sure. @Ashen, can you please arrange one?
>>
>> On Tue, Aug 25, 2015 at 2:35 PM, Srinath Perera <srin...@wso2.com> wrote:
>>
>>> Nirmal, Seshika, shall we do a code review? This code should go into ML
>>> after the UI part is done.
>>>
>>> Thanks
>>> Srinath
>>>
>>> On Tue, Aug 25, 2015 at 2:20 PM, Ashen Weerathunga <as...@wso2.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> This is the source code of the project.
>>>> https://github.com/ashensw/Spark-KMeans-fraud-detection
>>>>
>>>> Best Regards,
>>>> Ashen
>>>>
>>>> On Tue, Aug 25, 2015 at 2:00 PM, Ashen Weerathunga <as...@wso2.com>
>>>> wrote:
>>>>
>>>>> Thanks all for the suggestions.
>>>>>
>>>>> Here are a few assumptions I have made:
>>>>>
>>>>>    - Clusters are uniform
>>>>>    - Fraud data will always be outliers to the normal clusters
>>>>>    - Clusters do not intersect each other
>>>>>    - I have set the number of iterations to 100, so I assume that
>>>>>    100 iterations will be enough to produce almost stable clusters
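
One way to check the last assumption, rather than take it on trust, is to stop k-means once the centers stop moving and see how many iterations that took. The following is a minimal pure-Python sketch of Lloyd's algorithm, not Spark's implementation; the function name `kmeans` and the `tolerance` parameter are hypothetical choices for illustration.

```python
import math


def kmeans(points, centers, max_iterations=100, tolerance=1e-4):
    """Plain Lloyd's k-means; returns (centers, iterations_used, converged)."""
    centers = [tuple(c) for c in centers]
    for iteration in range(1, max_iterations + 1):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        # A center with no points keeps its previous position.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        # Largest distance any center moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            return centers, iteration, True
    return centers, max_iterations, False
```

If `converged` comes back `False` at 100 iterations, the clusters were still moving and the iteration limit is worth revisiting.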
>>>>>
>>>>> @Maheshakya,
>>>>>
>>>>> This dataset contains more anomaly data than normal data, but in a
>>>>> practical scenario it would be the other way around. Because of that, it
>>>>> would be unrealistic to use that 65% of the anomaly data to evaluate the
>>>>> model. The amount of normal data I used to build the model is also less
>>>>> than that 65% of the anomaly data. But yes, since our purpose is to
>>>>> detect anomalies, it would be good to evaluate the model with more
>>>>> anomaly data. Thanks, I'll try to use it as well.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Ashen
>>>>>
>>>>> On Tue, Aug 25, 2015 at 12:35 PM, Maheshakya Wijewardena
>>>>> <mahesha...@wso2.com> wrote:
>>>>>
>>>>>> Is there any particular reason why you are putting aside 65% of the
>>>>>> anomalous data at the evaluation? Since there is an obvious imbalance
>>>>>> when the numbers of normal and abnormal cases are taken into account, you
>>>>>> will get greater accuracy at the evaluation, because a model tends to
>>>>>> produce more accurate results for the class with the greater size. But
>>>>>> that is not the case for the class of smaller size. With fewer records,
>>>>>> it won't make much impact on the accuracy. Hence IMO, it would be better
>>>>>> if you could evaluate with more anomalous data,
>>>>>> i.e. the number of records of each class needs to be roughly equal.
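
To illustrate the point about imbalance, here is a small sketch in plain Python that reports recall per class next to the overall accuracy. The function name `evaluate` and the label strings are assumptions made for this example, not taken from the project code.

```python
def evaluate(actual, predicted):
    """Return (overall accuracy, {label: recall}) for two label lists."""
    pairs = list(zip(actual, predicted))
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    recall = {}
    for label in set(actual):
        # Recall: share of records of this class that were labelled correctly.
        total = sum(a == label for a in actual)
        hits = sum(a == label and p == label for a, p in pairs)
        recall[label] = hits / total
    return accuracy, recall


# Hypothetical evaluation set: 95 normal records and 5 anomalies,
# scored by a model that labels everything as normal.
actual = ["normal"] * 95 + ["anomaly"] * 5
predicted = ["normal"] * 100
accuracy, recall = evaluate(actual, predicted)
```

Here the accuracy is 95% even though the recall for the anomaly class is zero, which is why per-class figures are worth reporting alongside the single accuracy number.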
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> On Tue, Aug 25, 2015 at 12:05 PM, CD Athuraliya <chathur...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ashen,
>>>>>>>
>>>>>>> It would be better if you could add the assumptions you make in this
>>>>>>> process (uniform clusters etc.). It will make the process clearer IMO.
>>>>>>>
>>>>>>> Regards,
>>>>>>> CD
>>>>>>>
>>>>>>> On Tue, Aug 25, 2015 at 11:39 AM, Nirmal Fernando <nir...@wso2.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can we see the code too?
>>>>>>>>
>>>>>>>> On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga <as...@wso2.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I am currently working on a fraud detection project. I was able to
>>>>>>>>> cluster the KDD Cup 99 network anomaly detection dataset using the
>>>>>>>>> Apache Spark K-means algorithm. So far I have been able to achieve a
>>>>>>>>> 99% accuracy rate on this dataset. The steps I followed during the
>>>>>>>>> process are listed below.
>>>>>>>>>
>>>>>>>>>    - Separate the dataset into two parts (normal data and anomaly
>>>>>>>>>    data) by filtering on the label
>>>>>>>>>    - Split each of the two parts as follows
>>>>>>>>>       - normal data
>>>>>>>>>          - 65% - to train the model
>>>>>>>>>          - 15% - to optimize the model by adjusting hyper
>>>>>>>>>          parameters
>>>>>>>>>          - 20% - to evaluate the model
>>>>>>>>>       - anomaly data
>>>>>>>>>          - 65% - not used
>>>>>>>>>          - 15% - to optimize the model by adjusting hyper
>>>>>>>>>          parameters
>>>>>>>>>          - 20% - to evaluate the model
>>>>>>>>>    - Preprocess the dataset
>>>>>>>>>       - Drop the non-numerical features, since k-means can only
>>>>>>>>>       handle numerical values
>>>>>>>>>       - Normalize all the values to the 0-1 range
>>>>>>>>>    - Cluster the 65% of normal data using Apache Spark K-means
>>>>>>>>>    and build the model (15% of both normal and anomaly data were
>>>>>>>>>    used to tune the hyper parameters such as k, percentile etc. to
>>>>>>>>>    get an optimized model)
>>>>>>>>>    - Finally, evaluate the model using 20% of both normal and
>>>>>>>>>    anomaly data.
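
The splitting and preprocessing steps above could be sketched roughly as follows. This is plain Python for illustration only, not the Spark code from the project; the helper names `split_65_15_20`, `drop_non_numeric` and `min_max_normalize` are hypothetical, and the sketch assumes every row has the same column layout.

```python
import random


def split_65_15_20(rows, seed=0):
    """Shuffle rows and split them into 65% / 15% / 20% parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 65 // 100
    n_tune = len(rows) * 15 // 100
    return (rows[:n_train],
            rows[n_train:n_train + n_tune],
            rows[n_train + n_tune:])


def drop_non_numeric(rows):
    """Keep only the columns whose value in the first row is a number."""
    keep = [i for i, v in enumerate(rows[0]) if isinstance(v, (int, float))]
    return [[row[i] for i in keep] for row in rows]


def min_max_normalize(rows):
    """Scale every column to the 0-1 range; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

Note that the sketch normalizes whatever rows it is given. When the scaling is fitted on the training part, the same minimum and maximum values would normally be reused for the tuning and evaluation parts.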
>>>>>>>>>
>>>>>>>>> The method of identifying a fraud is as follows:
>>>>>>>>>
>>>>>>>>>    - When a new data point comes in, get the closest cluster center
>>>>>>>>>    by using the k-means predict function.
>>>>>>>>>    - I have calculated the 98th percentile distance for each
>>>>>>>>>    cluster. (98 was the best value I got by tuning the model with
>>>>>>>>>    different values.)
>>>>>>>>>    - Then I check whether the distance from the new data point to
>>>>>>>>>    the given cluster center is less than or greater than the 98th
>>>>>>>>>    percentile of that cluster. If it is less than the percentile, it
>>>>>>>>>    is considered normal data. If it is greater than the percentile,
>>>>>>>>>    it is considered a fraud, since it lies outside the cluster.
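
A minimal sketch of this percentile check, in plain Python rather than Spark, might look like the code below. The function names are hypothetical, the cluster centers are assumed to come from an already trained k-means model, and the nearest-rank percentile used here is only one of several common definitions, so it may not match the project's calculation exactly.

```python
import math


def nearest_center(point, centers):
    """Return (index, distance) of the closest cluster center."""
    distances = [math.dist(point, c) for c in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cluster_thresholds(train_points, centers, pct=98):
    """Per-cluster distance threshold computed from the training points."""
    per_cluster = {i: [] for i in range(len(centers))}
    for p in train_points:
        i, d = nearest_center(p, centers)
        per_cluster[i].append(d)
    # Assumption: a cluster with no training points gets threshold 0.0,
    # so anything assigned to it is flagged.
    return {i: percentile(ds, pct) if ds else 0.0
            for i, ds in per_cluster.items()}


def is_fraud(point, centers, thresholds):
    """Flag a point that lies beyond its cluster's percentile distance."""
    i, d = nearest_center(point, centers)
    return d > thresholds[i]
```

The thresholds are computed once from the normal training data, so scoring a new point only needs one nearest-center lookup and one comparison.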
>>>>>>>>>
>>>>>>>>> Our next step is to integrate this feature into the ML product and
>>>>>>>>> try it out with a more realistic dataset. A summary of the results I
>>>>>>>>> obtained using the 98th percentile during the process is attached.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Ashen
>>>>>>>>> --
>>>>>>>>> *Ashen Weerathunga*
>>>>>>>>> Software Engineer - Intern
>>>>>>>>> WSO2 Inc.: http://wso2.com
>>>>>>>>> lean.enterprise.middleware
>>>>>>>>>
>>>>>>>>> Email: as...@wso2.com
>>>>>>>>> Mobile: +94 716042995 <94716042995>
>>>>>>>>> LinkedIn:
>>>>>>>>> *http://lk.linkedin.com/in/ashenweerathunga
>>>>>>>>> <http://lk.linkedin.com/in/ashenweerathunga>*
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Thanks & regards,
>>>>>>>> Nirmal
>>>>>>>>
>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>> Mobile: +94715779733
>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *CD Athuraliya*
>>>>>>> Software Engineer
>>>>>>> WSO2, Inc.
>>>>>>> lean . enterprise . middleware
>>>>>>> Mobile: +94 716288847 <94716288847>
>>>>>>> LinkedIn <http://lk.linkedin.com/in/cdathuraliya> | Twitter
>>>>>>> <https://twitter.com/cdathuraliya> | Blog
>>>>>>> <http://cdathuraliya.tumblr.com/>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Pruthuvi Maheshakya Wijewardena
>>>>>> Software Engineer
>>>>>> WSO2 : http://wso2.com/
>>>>>> Email: mahesha...@wso2.com
>>>>>> Mobile: +94711228855
>>>>>>
>>>>>>
>>>>>>
>>>
>>>
>>>
>>> --
>>> ============================
>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>> Site: http://people.apache.org/~hemapani/
>>> Photos: http://www.flickr.com/photos/hemapani/
>>> Phone: 0772360902
>>>



-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
