[ 
https://issues.apache.org/jira/browse/GRIFFIN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472199#comment-16472199
 ] 

Enrico D'Urso commented on GRIFFIN-160:
---------------------------------------

Hi,

There are several ways to implement anomaly detection.

The key point is to have numerical data: if you want to apply anomaly 
detection (AD) against non-numerical data, you first have to map strings 
to numbers somehow.

However, as Griffin uses Spark as the engine, I think K-Means can be an option.

Basically, you take your data, normalise it, decide on the number of clusters, 
apply K-means, and finally check each sample's distance from the final 
centroids to search for anomalies. Spark MLlib fully supports this.
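To make the idea concrete, here is a minimal pure-Python sketch of the centroid-distance check. In Griffin this would run distributed through Spark MLlib's KMeans; the data set, the deterministic initialisation, and the 3.0 distance cut-off below are all illustrative assumptions, not anything Griffin-specific.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def kmeans(points, k, iters=20):
    """Naive k-means with deterministic init (first k points)."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Two dense clusters plus one far-away sample (the anomaly).
points = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.2), (5.1, 4.9), (-0.1, 0.1),
          (4.9, 5.1), (0.2, -0.1), (5.2, 5.0), (-0.2, 0.0), (4.8, 5.2),
          (10.0, 0.0)]
centroids = kmeans(points, k=2)

# Flag samples far from their nearest centroid; 3.0 is an assumed cut-off.
anomalies = [p for p in points
             if min(dist(p, c) for c in centroids) > 3.0]
print(anomalies)  # -> [(10.0, 0.0)]
```

One caveat with this scheme: if k is set too high, an outlier can end up as its own centroid and score a distance of zero, so the cluster count and cut-off need tuning per data set.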

Otherwise, just compute the mean and standard deviation and search for 
samples that lie more than three standard deviations from the mean.
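The three-sigma rule is a one-liner once you have the two statistics. A plain-Python sketch with made-up values (on Spark the mean and standard deviation would come from DataFrame aggregations instead):

```python
import statistics

# Invented batch of measurements; 50.0 is the suspect sample.
values = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0,
          10.1, 9.9, 10.3, 9.7, 10.0, 50.0]

mean = statistics.mean(values)
sd = statistics.pstdev(values)  # population standard deviation

# Flag anything more than three standard deviations from the mean.
anomalies = [v for v in values if abs(v - mean) > 3 * sd]
print(anomalies)  # -> [50.0]
```

Note that a large outlier inflates the standard deviation itself, so on small batches an anomaly can mask itself; with thousands of rows per table this matters less.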

More sophisticated detection can be done using the covariance matrix and a 
multivariate Gaussian distribution; more info here 
[https://www.coursera.org/learn/machine-learning/lecture/C8IJp/helpUrl]

but I am not sure whether that is doable in a distributed environment.
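A rough single-machine sketch of that idea: fit the mean vector and covariance matrix, then score each sample by its Mahalanobis distance, which is large exactly when the fitted Gaussian density is low. The 2-D data set below is invented; in practice you would pick a probability threshold rather than just ranking.

```python
import math

# Invented 2-D samples; (6.0, -1.0) is the planted anomaly.
data = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9), (1.2, 2.3), (0.8, 1.8),
        (1.0, 2.2), (1.1, 1.9), (0.9, 2.1), (1.05, 2.0), (6.0, -1.0)]

n = len(data)
mx = sum(x for x, _ in data) / n  # mean vector
my = sum(y for _, y in data) / n

# Entries of the 2x2 (population) covariance matrix.
sxx = sum((x - mx) ** 2 for x, _ in data) / n
syy = sum((y - my) ** 2 for _, y in data) / n
sxy = sum((x - mx) * (y - my) for x, y in data) / n

# Inverse of [[sxx, sxy], [sxy, syy]].
det = sxx * syy - sxy * sxy
ia, ib, ic = syy / det, sxx / det, -sxy / det

def mahalanobis(x, y):
    """Distance of (x, y) from the fitted Gaussian; large = unlikely."""
    dx, dy = x - mx, y - my
    return math.sqrt(ia * dx * dx + 2 * ic * dx * dy + ib * dy * dy)

# Rank samples from most to least anomalous.
scored = sorted(data, key=lambda p: mahalanobis(*p), reverse=True)
print(scored[0])  # -> (6.0, -1.0)
```

The fit itself parallelises naturally (means and covariances are just sums over rows), so the distributed-environment concern is mainly about inverting the covariance matrix when the number of columns is large, not about the data volume.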

 

Thanks,

Enrico

 

> Anomaly detection for thousands of tables
> -----------------------------------------
>
>                 Key: GRIFFIN-160
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-160
>             Project: Griffin (Incubating)
>          Issue Type: New Feature
>            Reporter: William Guo
>            Assignee: William Guo
>            Priority: Major
>
> Hi team,
>  
> I am trying to find the Griffin road map, and here it is 
> [https://cwiki.apache.org/confluence/display/GRIFFIN/0.+Roadmap], is this the 
> latest version? 
>  
> We have thousands of tables that need data quality validation. Is there any 
> simple machine learning algorithm that can be applied to detect data quality 
> issues, instead of building a lot of measures? Will this be added to the 
> Griffin road map if possible?
>  
> Thanks, Randy
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
