[ 
https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207925#comment-14207925
 ] 

Ashutosh Trivedi edited comment on SPARK-4038 at 11/12/14 10:53 AM:
--------------------------------------------------------------------

The questions raised are valid, and we want the community to discuss them. 

This algorithm deals with categorical data. It uses the simplest approach: 
calculating the frequency of each attribute value in the data set. Some people 
in the community are already reviewing it, and I am working on it.

I did not find any other algorithm that works on categorical data to find 
outliers. If you are aware of any other well-known algorithm, please 
share it with us.

  


was (Author: rusty):
The questions raised are valid, and we want the community to discuss them. 

This algorithm deals with categorical data. To my knowledge, it uses the 
simplest approach: calculating the frequency of each attribute value in the 
data set. Some people in the community are already reviewing it, and I am 
working on it.

I did not find any other algorithm that works on categorical data to find 
outliers. If you are aware of any other well-known algorithm, please 
share it with us.

  

> Outlier Detection Algorithm for MLlib
> -------------------------------------
>
>                 Key: SPARK-4038
>                 URL: https://issues.apache.org/jira/browse/SPARK-4038
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ashutosh Trivedi
>            Priority: Minor
>
> The aim of this JIRA is to discuss which parallel outlier detection 
> algorithms can be included in MLlib. 
> The one I am familiar with is Attribute Value Frequency (AVF). It 
> scales linearly with the number of data points and attributes, and relies on 
> a single data scan. It is not distance-based and is well suited to 
> categorical data. The original paper also gives a parallel version, which is 
> not complicated to implement. I am working on the implementation and will 
> soon submit the initial code for review.
> Here is the link to the paper:
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
> As pointed out by Xiangrui in the discussion at
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
> there are other algorithms as well. Let's discuss which would be more 
> general and easily parallelized.
>    
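As a rough illustration of the frequency-based idea in the description above, 
here is a minimal single-machine sketch in plain Python. It is not the code 
under review and not the paper's parallel version. It assumes each record is a 
tuple of categorical values of equal length, and it scores a record by the mean 
count of its attribute values, so the lowest scores are the outlier candidates. 
The names avf_scores and top_k_outliers are hypothetical.

```python
from collections import Counter


def avf_scores(rows):
    """Return one frequency-based score per row; lower suggests an outlier."""
    if not rows:
        return []
    num_attrs = len(rows[0])
    # Counting pass: how often each value occurs in each attribute (column).
    counts = [Counter() for _ in range(num_attrs)]
    for row in rows:
        for j, value in enumerate(row):
            counts[j][value] += 1
    # Scoring pass: mean frequency of the row's own attribute values.
    return [
        sum(counts[j][value] for j, value in enumerate(row)) / num_attrs
        for row in rows
    ]


def top_k_outliers(rows, k):
    """Return the indices of the k rows with the lowest scores."""
    scores = avf_scores(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i])[:k]


if __name__ == "__main__":
    data = [
        ("red", "small"),
        ("red", "small"),
        ("red", "large"),
        ("blue", "tiny"),
    ]
    print(avf_scores(data))
    print(top_k_outliers(data, 1))
```

The per-attribute counts are simple sums, which is why this kind of approach 
lends itself to a distributed implementation: partial counts from separate 
partitions can be added together before the scoring step.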



