[ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121300#comment-13121300
 ] 

Jeff Eastman commented on MAHOUT-825:
-------------------------------------

One final response:

If "Canopy Clustering is not even a clustering algorithm" (I disagree), then 
why is this patch proposing to augment its classification semantics with 
outlier elimination? If we accept that proposition, canopy should only be used 
for cluster generation, e.g. for priming k-means.

That said, outlier elimination could be considered as a new option for all of 
the clustering classification steps. The question in my mind is how best to 
specify it. Clusters, in general, have a center and a radius that are 
determined by the points assigned to them in the respective cluster generation 
steps. Offering a single, constant distance threshold to identify outliers 
wouldn't be my first choice. I'd suggest using the radius (= stdev) of each 
cluster instead: perhaps every point farther than x*radius from its cluster 
center would be considered an outlier. Even canopy clusters have these Gaussian 
statistics. Using T1 would be an especially poor choice given the arguments 
I've made earlier.
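
A minimal sketch of the per-cluster rule suggested above, written in plain Java 
rather than against Mahout's API. The class RadiusOutlierSketch, its method 
names, the Euclidean distance, and the choice of root-mean-square distance from 
the center as the "radius" are all assumptions made for illustration; this is 
not how any Mahout class is implemented.

```java
import java.util.List;

/** Hypothetical sketch: flag points farther than x * radius from their cluster center. */
public class RadiusOutlierSketch {

    /** Component-wise mean of the points assigned to one cluster. */
    static double[] center(List<double[]> points) {
        double[] c = new double[points.get(0).length];
        for (double[] p : points) {
            for (int i = 0; i < c.length; i++) {
                c[i] += p[i] / points.size();
            }
        }
        return c;
    }

    /** Euclidean distance (an assumption; any distance measure could be substituted). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Radius, taken here as the root-mean-square distance of the points from the center. */
    static double radius(List<double[]> points, double[] center) {
        double sumSq = 0.0;
        for (double[] p : points) {
            double d = distance(p, center);
            sumSq += d * d;
        }
        return Math.sqrt(sumSq / points.size());
    }

    /** A point is an outlier when it lies more than x radii from the center. */
    static boolean isOutlier(double[] point, double[] center, double radius, double x) {
        return distance(point, center) > x * radius;
    }
}
```

Because the threshold scales with each cluster's own radius, a tight cluster 
rejects points that a diffuse cluster would keep, which a single constant 
distance cannot do.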

Finally, outlier post-processing is trivial once we arrive at a solid 
definition of "outlier": read the clusters into the mapper, compare each 
clustered point with its cluster's statistics, and determine whether it is an 
outlier. If it is not, output the point. The job does not require any reducers, 
and the outlier removal can easily be folded into other post-processing steps, 
which are also application-specific.
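
The post-processing pass described above can be sketched without any Hadoop 
dependencies. In the plain-Java sketch below, OutlierFilterSketch, 
ClusterStats, AssignedPoint, and the multiplier x are hypothetical names, not 
Mahout or Hadoop classes. The constructor stands in for loading the clusters 
into the mapper, and map() stands in for the mapper's per-record method. The 
outlier test used (Euclidean distance greater than x * radius) is only one 
possible definition.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a map-only outlier filter; no Hadoop or Mahout types are used. */
public class OutlierFilterSketch {

    /** Center and radius of one cluster, as produced by an earlier clustering step. */
    static final class ClusterStats {
        final double[] center;
        final double radius;

        ClusterStats(double[] center, double radius) {
            this.center = center;
            this.radius = radius;
        }
    }

    /** A clustered point together with the id of the cluster it was assigned to. */
    static final class AssignedPoint {
        final int clusterId;
        final double[] point;

        AssignedPoint(int clusterId, double[] point) {
            this.clusterId = clusterId;
            this.point = point;
        }
    }

    private final Map<Integer, ClusterStats> clusters; // loaded once, before any record is mapped
    private final double x;                            // outlier threshold, in multiples of the radius

    OutlierFilterSketch(Map<Integer, ClusterStats> clusters, double x) {
        this.clusters = clusters;
        this.x = x;
    }

    /** Mapper body: emit the record only if it is not an outlier for its own cluster. */
    void map(AssignedPoint record, List<AssignedPoint> output) {
        ClusterStats stats = clusters.get(record.clusterId);
        if (stats == null) {
            throw new IllegalArgumentException("unknown cluster id " + record.clusterId);
        }
        double sumSq = 0.0;
        for (int i = 0; i < record.point.length; i++) {
            double d = record.point[i] - stats.center[i];
            sumSq += d * d;
        }
        if (Math.sqrt(sumSq) <= x * stats.radius) {
            output.add(record);
        }
    }
}
```

Each record is judged only against its own cluster's statistics, so no record 
depends on any other and no reduce step is needed.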

I'm still -1 on this patch.


                
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: canopy-clusterFilter-t1, canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it returns 
> canopies which are within distance t1 from the point. This produces an 
> incorrect result, i.e. points outside t1 are grouped in canopies.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
