[ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120773#comment-13120773
 ] 

Sean Owen commented on MAHOUT-825:
----------------------------------

I am not an expert on this code, so take my comments as informed questions 
rather than something authoritative.

But I think the question of whether every point participates in, or generates 
its own canopy, is separate from whether every point is assigned to a cluster. 
Am I right about this? You're right that not every point will contribute to a 
canopy, or even be in a canopy. But might those points still be clustered, 
assigned to the canopy to which they are nearest?

I also understand what you're doing with the new patch, and it does seem that 
at best this is a selectable feature. But does it make sense to overload the 
clusterFilter setting to control this as well? I do get that you think they go 
hand-in-hand, and maybe they do, though a separate flag probably still makes 
more sense IMHO.
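
To make the distinction concrete, here is a minimal sketch of nearest-canopy 
assignment with a separate switch for the t1 check. It is not Mahout's actual 
code: the class name, the method names, and the restrictToT1 flag are all 
hypothetical, and Euclidean distance is assumed purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CanopyAssignSketch {

    // Euclidean distance between two points of equal dimension (assumed metric).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the nearest canopy center.
     * restrictToT1 is a hypothetical flag: when true, centers farther than t1
     * from the point are skipped, and -1 is returned if none qualifies.
     */
    static int nearestCanopy(double[] point, List<double[]> centers,
                             double t1, boolean restrictToT1) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double d = distance(point, centers.get(i));
            if (restrictToT1 && d > t1) {
                continue; // outside t1: not eligible when the check is enabled
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> centers =
            Arrays.asList(new double[] {0, 0}, new double[] {10, 0});
        double[] distant = {4, 0}; // 4 away from the first center, 6 from the second

        // Without the check the point still gets its nearest canopy.
        System.out.println(nearestCanopy(distant, centers, 3.0, false)); // prints 0
        // With the check, no center is within t1 = 3, so it stays unassigned.
        System.out.println(nearestCanopy(distant, centers, 3.0, true));  // prints -1
    }
}
```

Keeping the switch independent of clusterFilter would let either behaviour be 
chosen without changing how small canopies are filtered.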

I'm trying to identify the downside of clustering a distant point. The cost of 
emitting the point is trivial. Providing an answer, instead of no answer, can't 
be worse, I think. What does it slow down or degrade? I can only come up with 
the possibility that it degrades accuracy in iterative algorithms like k-means.
                
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: canopy-clusterFilter-t1, canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it returns 
> only canopies that are within distance t1 of the point. This produces an 
> incorrect result, i.e. points outside t1 are grouped into canopies.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
