[
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120243#comment-13120243
]
Jeff Eastman commented on MAHOUT-825:
-------------------------------------
Sorry, but I continue to disagree. As you point out, clusterData is not a
canopy-specific activity: it could be factored out of Canopy and Kmeans, since
they do it identically (FuzzyK and Dirichlet can also do maximum-likelihood
classification, as an option). ClusterData simply assigns each point to the
closest, maximum-likelihood cluster given the computed centroids and the
distance measure chosen. Imposing additional semantics that cause some points
to not be classified at all is just not correct, IMHO. T1, in particular, is
unrelated to clusterData and indeed a given point may be within T1 of multiple
canopy clusters. If you want to impose additional semantics (e.g. to remove
outliers), you need to do this in a separate processing step.
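
To make the distinction concrete, here is a minimal sketch of the two steps
described above. It is not Mahout's actual code: the class name, the method
names, the Euclidean distance and the maxDistance parameter are all
hypothetical, chosen only for illustration. The first method classifies every
point to its closest centroid with no threshold; the second is an optional,
separate pass that drops outliers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: these names are hypothetical and not part of Mahout.
public class ClosestClusterSketch {

    // Euclidean distance, standing in for whichever distance measure was chosen.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Step 1: classification. Every point gets the index of its closest
    // centroid; no threshold such as T1 is consulted, so no point is left
    // unclassified. Returns -1 only if the centroid list is empty.
    static int closestCluster(double[] point, List<double[]> centroids) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double d = distance(point, centroids.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    // Step 2: an optional, separate pass. Keeps only the points that lie
    // within maxDistance (an assumed cutoff) of their closest centroid.
    // Assumes the centroid list is non-empty.
    static List<double[]> removeOutliers(List<double[]> points,
                                         List<double[]> centroids,
                                         double maxDistance) {
        List<double[]> kept = new ArrayList<>();
        for (double[] p : points) {
            int c = closestCluster(p, centroids);
            if (distance(p, centroids.get(c)) <= maxDistance) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<double[]> centroids = Arrays.asList(
            new double[] {0.0, 0.0}, new double[] {10.0, 10.0});
        List<double[]> points = Arrays.asList(
            new double[] {1.0, 1.0}, new double[] {9.0, 9.0}, new double[] {30.0, 30.0});
        for (double[] p : points) {
            System.out.println(Arrays.toString(p) + " -> cluster "
                + closestCluster(p, centroids));
        }
        System.out.println("kept after outlier pass: "
            + removeOutliers(points, centroids, 3.0).size());
    }
}
```

In this sketch the far-away point (30, 30) is still classified in step 1; it
only disappears if the caller explicitly runs the separate outlier pass.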
> Canopies grouping records outside t1
> ------------------------------------
>
> Key: MAHOUT-825
> URL: https://issues.apache.org/jira/browse/MAHOUT-825
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: windows, linux
> Reporter: Paritosh Ranjan
> Labels: features, newbie, patch
> Fix For: 0.6
>
> Attachments: canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it returns
> only canopies within distance t1 of the point. This produces an incorrect
> result, i.e. points outside t1 are grouped into canopies.