[ https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121364#comment-13121364 ]
Ted Dunning commented on MAHOUT-825:
------------------------------------
If you want to do something clever with these outliers, add a few more clusters
to the mix.
Or use Dirichlet process clustering, which inherently uses a variable number of
clusters.
Or add an option for trimming post facto as Jeff suggests (but absolutely not
with a fixed distance). Cluster radii differ widely, so a single number won't
do as a threshold. You also have to be careful about trimming because removing
a point from a cluster can decrease both the average radius and its standard
deviation, which can push further points past the threshold and lead to other
deletions. You have to make sure that the outlier detection is stable. It is
also common that the points in a cluster are not normally distributed at all.
More typically, clusters in dense areas tend to be polytopes bounded on all
sides by other clusters. Clusters on the periphery tend to be asymmetric, with
sharp boundaries on one side and a spray of potential outliers on the other
sides. These situations make simple trimming difficult.
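To make the stability concern concrete, here is a minimal Java sketch of one
possible post facto trimming rule. It is an illustration under stated
assumptions, not anything in Mahout: the class StableTrim, the threshold
"mean distance + k standard deviations" and the maxRounds cap are all
hypothetical. The threshold is recomputed per cluster after every removal
round, and the loop stops only when a round removes nothing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Mahout's API.
final class StableTrim {

    // Arithmetic mean of the given points.
    static double[] centroid(List<double[]> points) {
        double[] c = new double[points.get(0).length];
        for (double[] p : points)
            for (int d = 0; d < c.length; d++) c[d] += p[d];
        for (int d = 0; d < c.length; d++) c[d] /= points.size();
        return c;
    }

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    // Repeatedly drops points farther than (mean + k * std) of the distances
    // to the current centroid. The assumed threshold is relative to this
    // cluster's own radius, never a fixed distance. Returns the retained
    // points; the input list is not modified.
    static List<double[]> trim(List<double[]> cluster, double k, int maxRounds) {
        List<double[]> kept = new ArrayList<>(cluster);
        for (int round = 0; round < maxRounds && kept.size() > 1; round++) {
            double[] c = centroid(kept);
            int n = kept.size();
            double[] dist = new double[n];
            double sum = 0, sumSq = 0;
            for (int i = 0; i < n; i++) {
                dist[i] = distance(kept.get(i), c);
                sum += dist[i];
                sumSq += dist[i] * dist[i];
            }
            double mean = sum / n;
            double std = Math.sqrt(Math.max(0, sumSq / n - mean * mean));
            double threshold = mean + k * std;

            List<double[]> next = new ArrayList<>();
            for (int i = 0; i < n; i++)
                if (dist[i] <= threshold) next.add(kept.get(i));

            if (next.size() == n) break; // nothing removed: detection is stable
            kept = next;                 // radius and std shrink; check again
        }
        return kept;
    }
}
```

Even with the fixed-point check, this rule assumes roughly symmetric
clusters, so the polytope and periphery cases described above can still trim
badly. Small values of k can also cascade until most of the cluster is gone.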
If the impact of outliers is just too heinous to contemplate, then changing the
centroid to a trimmed centroid computation might help. The idea is that you
compute the centroid conventionally and then progressively remove the 10-30%
most distant points from the centroid computation, but NOT from the cluster.
Such a trimmed centroid is, like a trimmed mean, far less sensitive to
outliers. Trimming the centroid computation might give the desired results
without the bad effects of outlier elimination.
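The trimmed centroid idea can be sketched in Java roughly as follows. This is
a minimal illustration, not Mahout code: the class TrimmedCentroid and the
parameters trimFraction and steps are hypothetical names, and "progressively"
is read here as raising the trimmed share in equal steps, re-ranking points
by distance to the latest centroid each round. All points stay in the
cluster; only the centroid calculation ignores the most distant ones.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper for illustration; not part of Mahout's API.
final class TrimmedCentroid {

    // Arithmetic mean of the first `count` points of the array.
    static double[] mean(double[][] points, int count) {
        double[] c = new double[points[0].length];
        for (int i = 0; i < count; i++)
            for (int d = 0; d < c.length; d++) c[d] += points[i][d];
        for (int d = 0; d < c.length; d++) c[d] /= count;
        return c;
    }

    // Squared Euclidean distance; enough for ranking points by distance.
    static double squaredDistance(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    // trimFraction: share of points excluded in the final round (e.g. 0.1-0.3).
    // steps: number of rounds over which that share is reached.
    static double[] compute(double[][] cluster, double trimFraction, int steps) {
        if (cluster.length == 0)
            throw new IllegalArgumentException("empty cluster");
        if (trimFraction < 0 || trimFraction >= 1 || steps < 1)
            throw new IllegalArgumentException("bad trimFraction or steps");

        // Shallow copy: sorting must not reorder or shrink the cluster itself.
        double[][] sorted = cluster.clone();
        int n = sorted.length;

        // Start from the conventional centroid over all points.
        double[] centroid = mean(sorted, n);

        for (int s = 1; s <= steps; s++) {
            final double[] current = centroid;
            // Nearest points first, most distant points last.
            Arrays.sort(sorted,
                Comparator.comparingDouble(p -> squaredDistance(p, current)));
            // Exclude a growing share of the most distant points.
            int drop = (int) Math.floor(trimFraction * n * s / steps);
            int keep = Math.max(1, n - drop);
            centroid = mean(sorted, keep);
        }
        return centroid;
    }
}
```

For example, for the points (0,0), (1,0), (0,1), (1,1) and (100,100), a
trimFraction of 0.25 excludes the single most distant point from the
calculation, so the centroid moves from (20.4, 20.4) to (0.5, 0.5) while the
cluster still holds all five points.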
> Canopies grouping records outside t1
> ------------------------------------
>
> Key: MAHOUT-825
> URL: https://issues.apache.org/jira/browse/MAHOUT-825
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: windows, linux
> Reporter: Paritosh Ranjan
> Labels: features, newbie, patch
> Fix For: 0.6
>
> Attachments: canopy-clusterFilter-t1, canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it returns
> canopies which are within distance t1 from the point. This produces an
> incorrect result, i.e. points outside t1 are grouped in canopies.