Re: [jira] [Commented] (GRIFFIN-160) Anomaly detection for thousands of tables

William Guo Mon, 14 May 2018 15:59:06 -0700

Yes,

the Griffin team will write a doc about the contribution points (interfaces)
for measures.


We will let you know when it is ready.



Thanks,
William

On Mon, May 14, 2018 at 5:13 PM, Enrico D'Urso <[email protected]> wrote:

> Hi,
>
> Yes, it sounds like a very good idea. I am quite interested in the topic.
> Is there an ongoing discussion that I can start to look at?
>
> Thanks,
>
> Enrico
>
> On 5/13/18, 2:58 AM, "William Guo" <[email protected]> wrote:
>
>     Hi Enrico,
>
>     Yes. Since we released 0.2.0 recently, our next plan will include
>     enhancing measures, including support for anomaly detection.
>
>
>     Would you like to contribute to this feature together?
>
>
>     Thanks,
>     William
>
>     On Sat, May 12, 2018 at 12:22 AM, Enrico D'Urso (JIRA) <[email protected]> wrote:
>
>     >
>     >     [ https://issues.apache.org/jira/browse/GRIFFIN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472199#comment-16472199 ]
>     >
>     > Enrico D'Urso commented on GRIFFIN-160:
>     > ---------------------------------------
>     >
>     > Hi,
>     >
>     > There are several ways to implement anomaly detection.
>     >
>     > The key requirement is numerical data. If you want to apply AD to
>     > non-numerical data, you have to map strings to numbers somehow.
>     >
>     > However, as Griffin uses Spark as the engine, I think K-Means can be
>     > an option.
>     >
>     > Basically, you have your data: you normalise it, decide the number of
>     > clusters, apply K-means, and finally check each sample's distance from
>     > the final centroids to search for anomalies. MLlib fully supports it.
>     >
>     > Otherwise, just compute the mean and standard deviation and search for
>     > samples that are 3 or more standard deviations away from the mean.
>     >
>     > More sophisticated approaches can use a covariance matrix and a Gaussian
>     > distribution (more info here:
>     > [https://www.coursera.org/learn/machine-learning/lecture/C8IJp/helpUrl]),
>     > but I am not sure whether this is doable in a distributed environment.
>     >
>     >
>     >
>     > Thanks,
>     >
>     > Enrico
>     >
>     >
>     >
>     > > Anomaly detection for thousands of tables
>     > > -----------------------------------------
>     > >
>     > >                 Key: GRIFFIN-160
>     > >                 URL: https://issues.apache.org/jira/browse/GRIFFIN-160
>     > >             Project: Griffin (Incubating)
>     > >          Issue Type: New Feature
>     > >            Reporter: William Guo
>     > >            Assignee: William Guo
>     > >            Priority: Major
>     > >
>     > > Hi team,
>     > >
>     > > I am trying to find the Griffin road map, and here it is
>     > > [https://cwiki.apache.org/confluence/display/GRIFFIN/0.+Roadmap]. Is this
>     > > the latest version?
>     > >
>     > > We have thousands of tables that need data quality validation. Is there
>     > > a simple machine learning algorithm that can be applied to detect data
>     > > quality issues instead of building a lot of measures? Could this be added
>     > > to the Griffin road map?
>     > >
>     > > Thanks, Randy
>     > >
>     >
>     >
>     >
>     > --
>     > This message was sent by Atlassian JIRA
>     > (v7.6.3#76005)
>     >
>
>
>
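
The K-means recipe in the quoted JIRA comment (normalise, pick the number of
clusters, run K-means, then check distances to the final centroids) can be
sketched as below. This is only a minimal standalone sketch in plain Python,
not Griffin or Spark MLlib code. The function names, the sample points, the
evenly spaced initialisation and the "3 x median distance" cutoff are all
assumptions made for illustration.

```python
from math import dist
from statistics import median


def min_max_scale(points):
    """Rescale every feature to [0, 1] so no single column dominates the distance."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # avoid dividing by zero
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]


def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm with a deterministic, evenly spaced initialisation."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep the old
        # centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def k_means_anomalies(points, k=2, factor=3.0):
    """Flag points whose distance to the nearest centroid exceeds factor * median distance."""
    scaled = min_max_scale(points)
    centroids = k_means(scaled, k)
    distances = [min(dist(p, c) for c in centroids) for p in scaled]
    cutoff = factor * median(distances)  # illustrative cutoff, not a standard rule
    return [p for p, d in zip(points, distances) if d > cutoff]


# Hypothetical two-feature samples: two tight groups plus one stray point.
samples = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (1.0, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1), (5.0, 4.8),
    (3.0, 9.0),
]
print(k_means_anomalies(samples))
```

In a Spark setting the clustering step would presumably be delegated to MLlib,
as the comment suggests; only the distance check is specific to anomaly
detection.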

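The simpler alternative from the same comment, flagging samples that lie 3 or
more standard deviations from the mean, might look like the following. Again
this is a hedged sketch in plain Python rather than Griffin code; the function
name and the row counts are hypothetical.

```python
from statistics import mean, pstdev


def three_sigma_outliers(values, k=3.0):
    """Return the values lying k or more standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) >= k * sigma]


# Hypothetical daily row counts for one table; the last load is clearly off.
deltas = (0, 10, -10, 5, -5, 2, -2, 1, -1, 3, -3, 4, -4, 6, -6, 7, -7, 8, -8)
row_counts = [1000 + d for d in deltas] + [5000]
print(three_sigma_outliers(row_counts))
```

Note that with very few samples a single outlier inflates the standard
deviation enough that it may never reach the 3-sigma threshold, so this check
needs a reasonable history per metric.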