[
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119910#comment-13119910
]
Paritosh Ranjan commented on MAHOUT-825:
----------------------------------------
Canopy is primarily used to estimate the approximate number of clusters (of
size around t1, t2), and then to use that number of clusters (and their
centroids) in an algorithm such as K-means. That is already done in the
buildClusters phase.
clusterData is not specific to the Canopy algorithm. It's a utility that helps
identify/classify the clusters using the canopy centroids, even before going to
K-means, if you think that your vectors are well separated and your distances
(t1, t2) are appropriate. This makes clustering faster (because it removes the
need to go to K-means).
So assigning a remote, isolated point to a canopy whose centroid is a long
distance away simply degrades the quality of the clustered points. There is no
benefit in doing this.
Also, computeParameters() or computeCentroid() is not even called during the
clusterData phase. So assigning points in the clusterData phase has no impact
on the centroid of the canopy, which has already been calculated in the
buildClusters phase.
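
To make the point concrete, here is a minimal sketch of the check being
discussed. It is not the Mahout code or the attached patch: the class
CanopyAssignSketch, the method closestWithinT1, the plain Euclidean distance,
and the strict "less than t1" comparison are all assumptions made for
illustration. It only shows that a point farther than t1 from every centroid
is left unassigned, and that the centroids are read but never updated.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names and distance measure are assumptions.
public class CanopyAssignSketch {

    /** Euclidean distance, standing in for whatever distance measure is configured. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the nearest centroid that is closer than t1 to the
     * point, or -1 if every centroid is at distance t1 or more.
     * The centroids are only read; nothing is recomputed here.
     */
    static int closestWithinT1(double[] point, List<double[]> centroids, double t1) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double d = distance(point, centroids.get(i));
            // Assumed threshold rule: skip any canopy at distance t1 or more.
            if (d < t1 && d < bestDist) {
                best = i;
                bestDist = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> centroids =
            Arrays.asList(new double[] {0.0, 0.0}, new double[] {10.0, 10.0});
        double t1 = 3.0;
        // A point near the first centroid is assigned to it.
        System.out.println(closestWithinT1(new double[] {1.0, 1.0}, centroids, t1));
        // A remote, isolated point is not assigned to any canopy.
        System.out.println(closestWithinT1(new double[] {50.0, 50.0}, centroids, t1));
    }
}
```

With these made-up values the first call yields index 0 and the second yields
-1, so the isolated point stays out of every canopy instead of being attached
to whichever one happens to be least far away.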
> Canopies grouping records outside t1
> ------------------------------------
>
> Key: MAHOUT-825
> URL: https://issues.apache.org/jira/browse/MAHOUT-825
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: windows, linux
> Reporter: Paritosh Ranjan
> Labels: features, newbie, patch
> Fix For: 0.6
>
> Attachments: canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it
> returns canopies which are within distance t1 from the point. This produces
> incorrect results, i.e. points outside t1 are grouped into canopies.