[ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121364#comment-13121364
 ] 

Ted Dunning commented on MAHOUT-825:
------------------------------------

If you want to do something clever with these outliers, add a few more clusters 
to the mix.

Or use Dirichlet process clustering which inherently uses a variable number of 
clusters.

Or add an option for trimming post facto as Jeff suggests (but absolutely not 
with a fixed distance).  Cluster radii differ widely, so a single number 
won't do as a threshold.  You also have to be careful about trimming because 
removing a point from a cluster can decrease the average radius and the 
standard deviation, which can in turn trigger further deletions.  You have to 
make sure that the outlier detection is stable.  It is also common that the 
points in a cluster are not 
normally distributed at all.  More typically, clusters in dense areas tend to 
be polytopes bounded on all sides by other clusters.  Clusters on the periphery 
tend to be asymmetric with sharp boundaries on one side and a spray of 
potential outliers on other sides.  These situations make simple trimming 
difficult.
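
To make the stability concern concrete, here is a minimal sketch in Python 
(not Mahout code).  The function names, the cutoff of mean + k * std of the 
distances, and the default k are all assumptions chosen for illustration.  The 
cutoff is computed per cluster rather than as a fixed distance, and trimming is 
repeated until a pass removes nothing, because each removal shrinks the mean 
and the standard deviation and can expose new "outliers".

```python
import math
import statistics


def centroid(points):
    """Arithmetic mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def trim_until_stable(points, k=2.0, max_rounds=100):
    """Hypothetical helper: drop points farther than mean + k * std of the
    distances to the centroid, recomputing the statistics after every pass,
    until a pass removes nothing (or max_rounds is reached)."""
    kept = list(points)
    for _ in range(max_rounds):
        if len(kept) < 2:
            break
        center = centroid(kept)
        dists = [math.dist(p, center) for p in kept]
        # Per-cluster cutoff; assumes roughly symmetric distances, which
        # real clusters often violate.
        limit = statistics.fmean(dists) + k * statistics.pstdev(dists)
        survivors = [p for p, d in zip(kept, dists) if d <= limit]
        if len(survivors) == len(kept):
            break  # fixed point: nothing else would be removed
        kept = survivors
    return kept
```

Note that the cutoff still assumes a roughly symmetric spread of distances, 
so for polytope-shaped or one-sided clusters this sketch only shows the 
cascade; it does not solve the problem.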

If the impact of outliers is just too heinous to contemplate, then changing the 
centroid to a trimmed centroid computation might help.  The idea is that you 
compute the centroid conventionally and then progressively remove the 10-30% 
most distant points from the centroid computation, but NOT from the cluster.  
Such a trimmed centroid is, like a trimmed mean, much less sensitive to 
outliers.  Trimming the centroid computation might give the desired results 
without the bad effects of outlier elimination.
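
A trimmed centroid along these lines could be sketched as follows.  This is 
Python for illustration only, not Mahout's implementation; the function names, 
the default trim fraction of 20%, and the choice of three trimming steps are 
assumptions.  The input list is never modified, so every point stays in the 
cluster and only the centroid computation ignores the most distant ones.

```python
import math


def centroid(points):
    """Arithmetic mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def trimmed_centroid(points, trim_fraction=0.2, steps=3):
    """Hypothetical helper: start from the conventional centroid, then
    recompute it while ignoring a growing share of the most distant points,
    up to trim_fraction of the cluster."""
    if not points:
        raise ValueError("cluster has no points")
    center = centroid(points)
    for step in range(1, steps + 1):
        # Number of points to ignore grows step by step.
        drop = int(len(points) * trim_fraction * step / steps)
        if drop == 0:
            continue
        # Rank against the current centroid estimate, nearest first.
        ranked = sorted(points, key=lambda p: math.dist(p, center))
        center = centroid(ranked[:len(points) - drop])
    return center
```

For a tight group of points plus one far-away point, the conventional centroid 
is dragged toward the far point while the trimmed one stays inside the group, 
and the far point remains a member of the cluster.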

                
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: canopy-clusterFilter-t1, canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it 
> returns only canopies that are within distance t1 of the point. This 
> produces an incorrect result, i.e. points outside t1 are grouped into canopies.

