[ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119445#comment-13119445
 ] 

Jeff Eastman commented on MAHOUT-825:
-------------------------------------

You misinterpreted my statement, but what you say is correct. FindClosestCanopy 
is called during the classification, or clustering (clusterData) phase, not 
during the buildClusters phase. The classification phase assigns each point to 
its closest canopy, irrespective of that canopy's T parameters. Introducing 
(d<T1) into this process may be satisfying from your perspective but failing to 
classify a point at all is not satisfactory from mine. 
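
To illustrate the classification behavior described above, here is a minimal sketch. It is not Mahout's actual code: the class name, the findClosest method and the Euclidean distance helper are all hypothetical. It only shows that a nearest-center search with no (d<T1) test always returns some canopy, however far away the point is.

```java
import java.util.List;

// Hypothetical illustration only; these names are not part of Mahout's API.
final class ClosestCanopySketch {

    // Assumed distance measure: plain Euclidean distance.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the nearest canopy center, or -1 if there are none.
    // No T1 or T2 check is made, so every point is classified when at least
    // one canopy exists.
    static int findClosest(double[] point, List<double[]> centers) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double d = distance(point, centers.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}
```

Adding a (d<T1) guard inside the loop would leave best at -1 for any point farther than T1 from every center, which is the unclassified case objected to above.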

The T1 argument does not imply any assignment semantics. Points within the T1 
radius will influence the computed centroid of a canopy. They will also increase 
the numPoints of the canopy because this value is used to compute the centroid.

The T2 argument may imply some assignment semantics, in that points found to be 
within T2 will not result in new canopies being output. Canopies having 
numPoints <= clusterFilter will not be output during the buildClusters phase. 
This is quite different from failing to classify a point in the clusterData 
phase.
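
The T1, T2 and clusterFilter semantics described above can be sketched roughly as follows. This is a simplified single-pass illustration under stated assumptions, not Mahout's implementation: the Canopy class, buildCanopies and the Euclidean distance helper are hypothetical names, and T3/T4 are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these names are not part of Mahout's API.
final class CanopyBuildSketch {

    // A canopy keeps its founding point as center plus a running sum and count.
    static final class Canopy {
        final double[] center;
        final double[] sum;
        int numPoints;

        Canopy(double[] first) {
            center = first.clone();
            sum = first.clone();
            numPoints = 1;
        }

        // Called for points within T1: they affect the centroid and numPoints.
        void observe(double[] p) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += p[i];
            }
            numPoints++;
        }

        double[] centroid() {
            double[] c = new double[sum.length];
            for (int i = 0; i < sum.length; i++) {
                c[i] = sum[i] / numPoints;
            }
            return c;
        }
    }

    // Assumed distance measure: plain Euclidean distance.
    static double distance(double[] a, double[] b) {
        double total = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            total += diff * diff;
        }
        return Math.sqrt(total);
    }

    static List<Canopy> buildCanopies(List<double[]> points, double t1, double t2,
                                      int clusterFilter) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (Canopy c : canopies) {
                double d = distance(p, c.center);
                if (d < t1) {
                    c.observe(p); // within T1: influences the centroid only
                }
                if (d < t2) {
                    stronglyBound = true; // within T2: suppresses a new canopy
                }
            }
            if (!stronglyBound) {
                canopies.add(new Canopy(p));
            }
        }
        // Canopies with numPoints <= clusterFilter are not output.
        List<Canopy> kept = new ArrayList<>();
        for (Canopy c : canopies) {
            if (c.numPoints > clusterFilter) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

Nothing in this build step records which canopy a point belongs to. T1 only feeds the centroid, T2 only decides whether a new canopy is started, and clusterFilter only prunes small canopies from the output.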

Canopy is intended to be a simple, approximate, fast clustering algorithm. 
We've already added a number of bells and whistles (T3, T4, clusterFilter) in 
order to make it more usable for large data sets. I just don't see the (d<T1) 
test as being necessary or even correct.
                
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it returns 
> only canopies that are within distance t1 of the point. This produces 
> incorrect results, i.e. points outside t1 are grouped in canopies.


        
