[
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121300#comment-13121300
]
Jeff Eastman commented on MAHOUT-825:
-------------------------------------
One final response:
If "Canopy Clustering is not even a clustering algorithm" (I disagree), then
why is this patch proposing to augment its classification semantics with
outlier elimination? If we accept that proposition, canopy should be used only
for cluster generation, e.g. for priming k-means.
That said, outlier elimination could be considered as a new option for all the
various clustering classification steps. The question in my mind is how best
to specify it. Clusters, in general, have a center and a radius that
are determined by the points which have been assigned to them in the respective
cluster generation steps. Offering a single, constant distance threshold to
identify outliers wouldn't be my first choice. I'd suggest using the radius (=
stdev) of each cluster instead: perhaps every point farther than x*radius from
its cluster center would be considered an outlier. Even canopy clusters have
these Gaussian statistics. Using T1 would be an especially poor choice given
arguments I've made earlier.
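To make the x*radius idea concrete, here is a minimal sketch in plain Python.
It is not Mahout code: the function names, the scalar radius (taken here as the
root-mean-square distance of a cluster's points from its center) and the
default multiplier x=2.0 are all assumptions chosen for illustration.

```python
import math

def cluster_stats(points):
    """Return (center, radius) for a non-empty list of equal-length points.

    Assumption: the radius is a single scalar, the root-mean-square
    Euclidean distance of the assigned points from the center.
    """
    n = len(points)
    dim = len(points[0])
    center = [sum(p[i] for p in points) / n for i in range(dim)]
    radius = math.sqrt(sum(math.dist(p, center) ** 2 for p in points) / n)
    return center, radius

def is_outlier(point, center, radius, x=2.0):
    """True if the point lies farther than x * radius from the center.

    x is a hypothetical, user-chosen multiplier.
    """
    return math.dist(point, center) > x * radius
```

Because the threshold scales with each cluster's own spread, a tight cluster
rejects points that a diffuse cluster would keep, which a single constant
distance cannot do.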
Finally, outlier post-processing is trivial once we arrive at a solid
definition of "outlier": Read the clusters into the mapper, compare each of the
clustered points with its cluster statistics and determine if it is an outlier.
If it is not, output the point. The job does not require any reducers and the
outlier removal can easily be folded into other post-processing steps, which
are also application-specific.
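The map-only filter described above could look roughly like the following
sketch. It is plain Python rather than a Hadoop mapper, and the input shapes
are assumptions: `clusters` maps a cluster id to a precomputed (center, radius)
pair, and `clustered_points` yields (cluster id, point) pairs. The "farther
than x * radius" rule is just one candidate definition of "outlier".

```python
import math

def filter_outliers(clustered_points, clusters, x=2.0):
    """Yield the (cluster_id, point) pairs that are not outliers.

    clusters: dict of cluster_id -> (center, radius), loaded once up front,
              the way a mapper would read the clusters during setup.
    x: hypothetical multiplier applied to each cluster's radius.
    """
    for cluster_id, point in clustered_points:
        center, radius = clusters[cluster_id]
        # Each record is judged independently, so no reduce step is needed.
        if math.dist(point, center) <= x * radius:
            yield cluster_id, point
```

For example, with `clusters = {"c1": ((0.0, 0.0), 1.0)}`, the point (0.5, 0.5)
assigned to "c1" is kept while (5.0, 5.0) is dropped.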
I'm still -1 on this patch.
> Canopies grouping records outside t1
> ------------------------------------
>
> Key: MAHOUT-825
> URL: https://issues.apache.org/jira/browse/MAHOUT-825
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: windows, linux
> Reporter: Paritosh Ranjan
> Labels: features, newbie, patch
> Fix For: 0.6
>
> Attachments: canopy-clusterFilter-t1, canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it returns
> canopies which are within distance t1 from the point. This produces an
> incorrect result, i.e. points outside t1 are grouped in canopies.