[
https://issues.apache.org/jira/browse/GRIFFIN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472199#comment-16472199
]
Enrico D'Urso commented on GRIFFIN-160:
---------------------------------------
Hi,
there are several ways to implement anomaly detection.
The main prerequisite is numerical data: if you want to apply anomaly
detection to non-numerical data, you first have to map strings to numbers somehow.
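As a hypothetical illustration of that string-to-number mapping (not Griffin code; the function names are made up for this sketch), two common options are label encoding and the hashing trick:

```python
import hashlib

def label_encode(values):
    """Map each distinct string to a small integer (label encoding)."""
    mapping = {}
    return [mapping.setdefault(v, len(mapping)) for v in values]

def hash_encode(value, buckets=1024):
    """Map a string to a fixed-range integer via hashing (hashing trick).

    Hashing needs no global dictionary, which makes it easier to apply
    across thousands of tables in a distributed job.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets
```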
However, as Griffin uses Spark as its engine, I think K-Means is an option.
Basically: normalise your data, choose the number of clusters, run K-Means,
and finally check each point's distance from the final centroids to find
anomalies. MLlib fully supports this.
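The steps above can be sketched locally with numpy (in production you would use MLlib's KMeans instead; the function name and the quantile-based cut-off are my own assumptions for this sketch, and normalisation is assumed to have happened upstream):

```python
import numpy as np

def kmeans_anomalies(X, k=2, iters=20, quantile=0.95, seed=0):
    """Flag points that are unusually far from their nearest k-means centroid."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # initialise centroids from k random points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its cluster (skip empty clusters)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    # distance of every point to its own centroid; the top tail is "anomalous"
    dist = np.linalg.norm(X - centroids[labels], axis=1)
    return dist > np.quantile(dist, quantile)
```

A quantile cut-off is one of several reasonable choices here; a fixed distance threshold per table would work too.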
Otherwise, simply compute the mean and standard deviation and search for
samples that lie more than three standard deviations from the mean.
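That three-sigma rule is a one-liner in numpy (a local sketch; the function name is mine):

```python
import numpy as np

def three_sigma_anomalies(x):
    """Flag samples more than three standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > 3 * sigma
```

Both the mean and the standard deviation are simple aggregations, so this one distributes trivially in Spark.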
More sophisticated approaches use the covariance matrix and a multivariate
Gaussian distribution; more info here
[https://www.coursera.org/learn/machine-learning/lecture/C8IJp/helpUrl]
but I am not sure whether that is doable in a distributed environment.
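For what the covariance-based approach looks like, here is a local numpy sketch using the squared Mahalanobis distance (my own framing of the idea; the default cut-off 9.21 is roughly the 0.99 quantile of a chi-square with 2 degrees of freedom, so it assumes 2 features and should be adjusted for other dimensionalities):

```python
import numpy as np

def gaussian_anomalies(X, chi2_cut=9.21):
    """Flag points whose squared Mahalanobis distance from the fitted
    Gaussian (sample mean and covariance) exceeds a chi-square cut-off."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv = np.linalg.inv(cov)
    diff = X - mu
    # d2[i] = diff[i] @ inv @ diff[i] for every row i
    d2 = np.einsum("ij,jk,ik->i", diff, inv, diff)
    return d2 > chi2_cut
```

Unlike the per-column three-sigma rule, this catches points that break the correlation between columns even when each column looks normal on its own.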
Thanks,
Enrico
> Anomaly detection for thousands of tables
> -----------------------------------------
>
> Key: GRIFFIN-160
> URL: https://issues.apache.org/jira/browse/GRIFFIN-160
> Project: Griffin (Incubating)
> Issue Type: New Feature
> Reporter: William Guo
> Assignee: William Guo
> Priority: Major
>
> Hi team,
>
> I am trying to find the Griffin road map, and here it is
> [https://cwiki.apache.org/confluence/display/GRIFFIN/0.+Roadmap], is this the
> latest version?
>
> We have thousands of tables that need data quality validation. Is there any
> simple machine learning algorithm that can be applied to detect data quality
> issues, instead of building a lot of measures? Could this be added to the
> Griffin road map if possible?
>
> Thanks, Randy
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)