[
https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633601#comment-14633601
]
Rakesh Chalasani commented on SPARK-8540:
-----------------------------------------
I think tying an algorithm to a specific use case might not be a good
idea, in this case KMeans to anomaly detection. Why not just return the
distances to the KMeans centers, so the user can write a simple operation over
that column to get the anomalies? If we return distances, finding anomalies
will be just one more line of code, and we can have an example showing that.
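
To make the idea concrete, here is a minimal sketch in plain Python. It does not use Spark's API; the helper names (`distance_to_nearest_center`, `with_distances`), the centers, and the sample points are all hypothetical and only illustrate how a distance column reduces both options from the issue description, (a) a threshold and (b) a fixed number of outliers, to a single expression each.

```python
import math

def distance_to_nearest_center(point, centers):
    """Euclidean distance from `point` to the closest of `centers`."""
    return min(math.dist(point, c) for c in centers)

def with_distances(points, centers):
    """Pair each point with its distance to the nearest center,
    standing in for the proposed extra distance column."""
    return [(p, distance_to_nearest_center(p, centers)) for p in points]

# Hypothetical centers, as if already fitted by KMeans (assumption).
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (10.0, 9.0), (5.0, 5.0), (0.5, 0.0)]

rows = with_distances(points, centers)

# (a) Threshold: a single filter over the distance column.
threshold = 3.0
outliers_by_threshold = [p for p, d in rows if d > threshold]

# (b) Fixed count: sort by distance and keep the n farthest points.
n = 1
outliers_top_n = [p for p, d in sorted(rows, key=lambda r: r[1], reverse=True)[:n]]

print(outliers_by_threshold)
print(outliers_top_n)
```

With a DataFrame the same two filters would be expressed over the distance column, which is why the outlier logic need not live inside the KMeans model itself.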
> KMeans-based outlier detection
> ------------------------------
>
> Key: SPARK-8540
> URL: https://issues.apache.org/jira/browse/SPARK-8540
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Joseph K. Bradley
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Proposal for K-Means-based outlier detection:
> * Cluster data using K-Means
> * Provide prediction/filtering functionality which returns outliers/anomalies
> ** This can take some threshold parameter which specifies either (a) how far
> off a point needs to be to be considered an outlier or (b) how many outliers
> should be returned.
> Note this will require a bit of API design, which should probably be posted
> and discussed on this JIRA before implementation.