[ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121031#comment-13121031
 ] 

Jeff Eastman commented on MAHOUT-825:
-------------------------------------

cf. "But I think the question of whether every point participates in, or 
generates its own canopy, is separate from whether every point is assigned to a 
cluster. Am I right about this?"

Yes, precisely correct. This issue confuses the generation of canopies with 
the classification of points given existing canopies. T1 et al. have a role to 
play in the generation phase but none in the classification phase. 
Classification does what it advertises: it assigns each point to its most 
likely (closest) cluster. This behavior is common to all clustering algorithms. 
Failing to classify a point is not correct.
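
To illustrate the classification step described above, here is a minimal 
sketch. It is not Mahout's code: the class name NearestCanopySketch, the 
double[] point representation, and the Euclidean distance are all assumptions 
made for the example. The point to notice is that no T1 or T2 threshold appears 
anywhere, so every point receives an assignment.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from the Mahout code base.
final class NearestCanopySketch {

    // Euclidean distance, standing in for whatever distance measure is configured.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the closest center, or -1 if there are no centers.
    // There is deliberately no threshold check: distant points are still classified.
    static int classify(double[] point, List<double[]> centers) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double d = distance(point, centers.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> centers = Arrays.asList(new double[] {0, 0}, new double[] {10, 10});
        // A nearby point and a very distant point both get a cluster index.
        System.out.println(classify(new double[] {1, 1}, centers));
        System.out.println(classify(new double[] {100, 100}, centers));
    }
}
```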

The clusterFilter option was introduced to improve the performance of the 
MapReduce canopy implementation by allowing the user to eliminate generation of 
very small clusters. The filter does have a role to play in priming k-means or 
other iterative clustering algorithms, where it has the effect of reducing k. 
It is unrelated to the classification step.
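
As a rough sketch of where T1, T2, and a small-cluster filter act during 
generation, consider the following. The names CanopyFilterSketch, Canopy, 
generate, and filter are hypothetical, the distance is Euclidean, and the 
"keep only canopies with more than clusterFilter points" comparison is an 
assumption; the exact rule in Mahout may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from the Mahout code base.
final class CanopyFilterSketch {

    // Minimal canopy: a fixed center plus a count of points seen within T1 of it.
    static final class Canopy {
        final double[] center;
        int numPoints = 1; // the founding point itself

        Canopy(double[] center) {
            this.center = center;
        }
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Single-pass generation, assuming t1 > t2. T1 decides membership,
    // T2 decides whether a point is close enough that it should not found a new canopy.
    static List<Canopy> generate(List<double[]> points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (Canopy c : canopies) {
                double d = distance(p, c.center);
                if (d < t1) {
                    c.numPoints++;
                }
                if (d < t2) {
                    stronglyBound = true;
                }
            }
            if (!stronglyBound) {
                canopies.add(new Canopy(p));
            }
        }
        return canopies;
    }

    // Drops very small canopies, which reduces the k handed to a later k-means run.
    static List<Canopy> filter(List<Canopy> canopies, int clusterFilter) {
        List<Canopy> kept = new ArrayList<>();
        for (Canopy c : canopies) {
            if (c.numPoints > clusterFilter) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0, 0}, new double[] {0.5, 0}, new double[] {1, 0}, new double[] {50, 50});
        List<Canopy> all = generate(points, 3.0, 1.5);
        System.out.println(all.size() + " canopies, " + filter(all, 1).size() + " after filtering");
    }
}
```

Nothing in this sketch touches classification: the filter only changes which 
canopies exist before points are assigned to them.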

I'm still -1 on this patch. Classification outliers should be removed in a 
separate processing step.
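
One way such a separate step could look is sketched below, under stated 
assumptions: the class OutlierPassSketch, the method inliers, and the 
maxDistance cutoff are hypothetical names, and the input is taken to be an 
already completed classification (one center index per point). It is not part 
of the patch or of Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from the Mahout code base.
final class OutlierPassSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // assignment[i] is the index of the center that point i was classified into.
    // Returns the indices of points within maxDistance of their assigned center.
    static List<Integer> inliers(List<double[]> points, int[] assignment,
                                 List<double[]> centers, double maxDistance) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < points.size(); i++) {
            double d = distance(points.get(i), centers.get(assignment[i]));
            if (d <= maxDistance) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<double[]> centers = Arrays.asList(new double[] {0, 0}, new double[] {10, 10});
        List<double[]> points = Arrays.asList(
            new double[] {1, 1}, new double[] {100, 100}, new double[] {9, 10});
        int[] assignment = {0, 1, 1}; // output of an earlier classification pass
        System.out.println(inliers(points, assignment, centers, 3.0));
    }
}
```

Keeping this as its own pass leaves the classifier's contract intact (every 
point is assigned) while still letting a user discard distant points afterwards.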
                
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: canopy-clusterFilter-t1, canopy-outside-t1-points-patch-1
>
>
> When finding the closest canopy, there is no check to ensure that the 
> returned canopy is within distance t1 of the point. This produces an 
> incorrect result, i.e. points outside t1 are grouped into canopies.


        
