[ https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120350#comment-13120350 ]
Jeff Eastman commented on MAHOUT-825:
-------------------------------------
Outlier points may be problematic for subsequent processing steps, but they
appear only in the clusteredPoints output and do not impact the cluster
generation itself.

K-means, for example, processes every point in every iteration and assigns each
point to exactly one cluster based upon minimum distance. The centroids are
recalculated only at the end of each iteration. There's really no concept of
outliers during cluster generation; every point contributes to the centroid of
exactly one cluster. In Canopy, a point can contribute to the centroids of
multiple clusters: every canopy whose distance d from the point satisfies
d < T1. In FuzzyK, each point contributes to the centroid of every cluster. In
Dirichlet, each point contributes to the centroid of one cluster, chosen based
upon its pdf() and the Dirichlet distribution mixture. The chosen cluster
typically differs from one iteration to the next.

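To make the k-means case concrete, here is a minimal sketch of a single
iteration in plain Java. It is an illustration only, not Mahout's
implementation: the class and method names (KMeansIterationSketch, nearest,
iterate) are hypothetical, and it assumes dense double[] vectors with squared
Euclidean distance.

```java
// Hypothetical illustration; this is not Mahout's k-means code.
final class KMeansIterationSketch {

  // Squared Euclidean distance between two vectors of equal length.
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Index of the centroid closest to the point. There is no distance cutoff,
  // so even a very distant point is assigned to some cluster.
  static int nearest(double[] point, double[][] centroids) {
    int best = 0;
    for (int c = 1; c < centroids.length; c++) {
      if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
        best = c;
      }
    }
    return best;
  }

  // One iteration: assign every point to exactly one cluster, then
  // recompute the centroids once all points have been assigned.
  static double[][] iterate(double[][] points, double[][] centroids) {
    int k = centroids.length;
    int dims = centroids[0].length;
    double[][] sums = new double[k][dims];
    int[] counts = new int[k];
    for (double[] p : points) {
      int c = nearest(p, centroids);
      counts[c]++;
      for (int i = 0; i < dims; i++) {
        sums[c][i] += p[i];
      }
    }
    double[][] next = new double[k][];
    for (int c = 0; c < k; c++) {
      if (counts[c] == 0) {
        // Assumption for this sketch: an empty cluster keeps its old centroid.
        next[c] = centroids[c].clone();
        continue;
      }
      next[c] = new double[dims];
      for (int i = 0; i < dims; i++) {
        next[c][i] = sums[c][i] / counts[c];
      }
    }
    return next;
  }
}
```

With points (0,0), (2,0), (10,0), (100,0) and starting centroids (0,0) and
(10,0), the distant point (100,0) is still assigned to the second cluster and
contributes to its new centroid like any other point.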
I really think outlier elimination is best implemented as a post processing
step.
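
One way such a post-processing step could look is sketched below. This is an
assumption about the shape of the step, not an existing Mahout API: the class
OutlierFilterSketch, the method inliers and the maxDistance threshold are all
hypothetical, and the clustered points are modeled as plain arrays rather than
the clusteredPoints output format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing filter; not part of Mahout.
final class OutlierFilterSketch {

  // Euclidean distance between two vectors of equal length.
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Returns the indices of the points that lie within maxDistance of the
  // centroid they were assigned to. Points farther away are treated as
  // outliers and left out. The clusters themselves are not modified.
  static List<Integer> inliers(double[][] points, int[] assignment,
                               double[][] centroids, double maxDistance) {
    List<Integer> kept = new ArrayList<>();
    for (int i = 0; i < points.length; i++) {
      if (distance(points[i], centroids[assignment[i]]) <= maxDistance) {
        kept.add(i);
      }
    }
    return kept;
  }
}
```

Because the filter runs after clustering has finished, it only changes which
points are reported for each cluster; the centroids stay as the clustering
algorithm produced them.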
> Canopies grouping records outside t1
> ------------------------------------
>
> Key: MAHOUT-825
> URL: https://issues.apache.org/jira/browse/MAHOUT-825
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: windows, linux
> Reporter: Paritosh Ranjan
> Labels: features, newbie, patch
> Fix For: 0.6
>
> Attachments: canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that the
> returned canopy is within distance t1 of the point. This produces an
> incorrect result, i.e. points outside t1 are grouped into canopies.