[ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120316#comment-13120316
 ] 

Sean Owen commented on MAHOUT-825:
----------------------------------

As far as I understand: t1 and t2 are used in coming up with canopies. Then, 
points are assigned to their nearest canopy, regardless of t1 and t2. That's 
how it works in Mahout, and is certainly *a* way to do it, and probably the 
simplest.
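
To make the second step concrete, here is a minimal sketch of "assign to the 
nearest canopy, regardless of t1 and t2". It is not Mahout's actual code: the 
class and method names are hypothetical, canopy centers are assumed to be 
already computed, and Euclidean distance is assumed.

```java
// Hypothetical sketch, not Mahout's implementation.
final class NearestCanopySketch {

    // Euclidean distance between two points of equal dimension (assumed measure).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the closest canopy center, or -1 if there are none.
    // Neither t1 nor t2 is consulted here, so even a far-away point gets a canopy.
    static int nearestCanopy(double[] point, double[][] centers) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = distance(point, centers[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 0.0}};
        double[] outlier = {100.0, 0.0};
        // The outlier is 90 away from the nearer center but is still assigned.
        System.out.println(nearestCanopy(outlier, centers)); // prints 1
    }
}
```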

It seems that the difference of opinion is just whether t1 should also be 
applied in the second step. Should a point not within t1 of any canopy be 
considered "unclusterable"? I think there's a logic to that. Perhaps I'm just 
influenced by having stared at Mahout, but I don't find that the most expected 
behavior.

Every clustering algorithm I know of puts everything in a cluster. This 
canopy-generation business is a heuristic speed-up. I would not expect it to 
mean some things don't get clustered.

In this sense I tend to agree with Jeff, though I don't see anything wrong with 
the additional filtering either; it just seems like an additional step, or 
logic, you could impose afterwards.
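
As a rough illustration of that additional step, the sketch below finds the 
nearest canopy first and only then applies t1 as a filter. Everything here is 
an assumption for illustration: the names are hypothetical, the distance is 
Euclidean, and "within t1" is read as distance < t1.

```java
import java.util.OptionalInt;

// Hypothetical sketch, not Mahout's implementation or the attached patch.
final class T1FilterSketch {

    // Euclidean distance between two points of equal dimension (assumed measure).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Finds the nearest canopy, then keeps the assignment only if the point
    // lies within t1 of that canopy's center. An empty result means the point
    // is treated as "unclusterable".
    static OptionalInt assignWithinT1(double[] point, double[][] centers, double t1) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = distance(point, centers[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        // The extra filtering step: reject assignments farther than t1.
        if (best >= 0 && bestDistance < t1) {
            return OptionalInt.of(best);
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 0.0}};
        double t1 = 3.0;
        System.out.println(assignWithinT1(new double[]{9.0, 0.0}, centers, t1).isPresent());   // prints true
        System.out.println(assignWithinT1(new double[]{100.0, 0.0}, centers, t1).isPresent()); // prints false
    }
}
```

Because the filter runs after the nearest-canopy lookup, dropping it gives back 
the current behavior unchanged.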

But Paritosh remains correct that these outlier points are probably bad for 
clustering: they stick around across iterations of k-means, for example, and 
affect centroids. I have no experience quantifying that effect, though it must 
be non-zero.
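
A toy calculation shows the direction of the effect, though not its size on 
real data. The points below are made up for illustration, and the class and 
method names are hypothetical; the centroid is just the per-dimension mean, as 
in a k-means update.

```java
import java.util.Arrays;

// Hypothetical toy example with made-up points; not a measurement.
final class OutlierCentroidSketch {

    // Centroid as the per-dimension arithmetic mean of the given points.
    static double[] centroid(double[][] points) {
        double[] mean = new double[points[0].length];
        for (double[] p : points) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += p[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= points.length;
        }
        return mean;
    }

    public static void main(String[] args) {
        double[][] tight = {{0.0, 0.0}, {1.0, 0.0}, {2.0, 0.0}};
        double[][] withOutlier = {{0.0, 0.0}, {1.0, 0.0}, {2.0, 0.0}, {101.0, 0.0}};
        System.out.println(Arrays.toString(centroid(tight)));       // prints [1.0, 0.0]
        System.out.println(Arrays.toString(centroid(withOutlier))); // prints [26.0, 0.0]
    }
}
```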
                
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that the 
> returned canopy is within distance t1 of the point. This produces an 
> incorrect result, i.e. points outside t1 are grouped into canopies.

