[jira] Commented: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Ted Dunning (JIRA) Wed, 09 Feb 2011 15:41:20 -0800

    [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992794#comment-12992794 ]


Ted Dunning commented on MAHOUT-563:
------------------------------------

{quote}
Here's a possibility: use Floyd's Algorithm to find all of the distances 
between points. Then, make a histogram of the distances and choose T1 and T2 
from different percentiles.
Floyd-Warshall-Roy Algorithm
{quote}

Uhh... this isn't a graph.

And we try not to do anything that starts "all distances between all points" 
because that way lies madness in terms of scalability.

Picking a thousand pairs of points and using those for your histogram might 
work.  At least you would get a reasonable approximation of the actual distance 
histogram.
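
A minimal sketch of that sampling idea, in Python for brevity rather than 
Mahout's Java. The function name, the Euclidean distance, and the percentile 
choices (median for T1, 10th percentile for T2) are assumptions made for 
illustration; they are not taken from the attached patch.

```python
import math
import random


def estimate_t1_t2(points, n_pairs=1000, t1_pct=0.5, t2_pct=0.1, seed=0):
    """Estimate canopy thresholds from a random sample of pairwise distances.

    points  : list of equal-length numeric sequences (at least two points)
    n_pairs : number of random pairs to sample (1000 as suggested above)
    t1_pct, t2_pct : ASSUMED percentiles, chosen only for illustration;
                     canopy clustering needs T1 > T2, so t1_pct > t2_pct
    """
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        a, b = rng.sample(points, 2)      # two distinct points, no self-pairs
        dists.append(math.dist(a, b))     # Euclidean distance (assumption)
    dists.sort()

    def percentile(p):
        # Nearest-rank style lookup into the sorted sample.
        idx = min(len(dists) - 1, int(p * len(dists)))
        return dists[idx]

    return percentile(t1_pct), percentile(t2_pct)


if __name__ == "__main__":
    demo_rng = random.Random(42)
    data = [(demo_rng.random(), demo_rng.random()) for _ in range(5000)]
    t1, t2 = estimate_t1_t2(data)
    print("T1 =", t1, "T2 =", t2)
```

The cost is fixed by n_pairs, not by the size of the data set, which is the 
whole point of sampling instead of computing all distances.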

My guess is that the zone of applicability for a simple heuristic like this 
will be fairly narrow.  In my experience, I have seen data that looks like 
points on a hypersphere.  Such data has all pairwise distances clustered 
around a single value, but it can often still be clustered well.  I have also 
seen data with power-law cluster membership, where the clusters have very 
different sizes.  In that case, the histogram is likely to show only the 
large clusters.
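
The hypersphere effect can be checked with a small experiment. The sketch 
below is an illustration under assumptions: it uses points drawn uniformly at 
random on a unit sphere (synthetic data, not a real data set), and the 
function names are hypothetical. It shows only the concentration of 
distances, not whether such data is clusterable.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a point uniformly on the unit sphere by normalizing a Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sampled_distance_spread(dim, n_points=200, n_pairs=1000, seed=0):
    """Return (mean, standard deviation) of sampled pairwise distances."""
    rng = random.Random(seed)
    pts = [random_unit_vector(dim, rng) for _ in range(n_points)]
    dists = [math.dist(*rng.sample(pts, 2)) for _ in range(n_pairs)]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    for dim in (2, 20, 500):
        mean, sd = sampled_distance_spread(dim)
        print(dim, round(mean, 3), round(sd, 3))
```

As the dimension grows, the spread of the sampled distances shrinks and the 
mean approaches sqrt(2), so percentiles of the histogram end up nearly 
identical and carry little information for separating T1 from T2.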

> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly 
> unsatisfying task. This class estimates T1 and T2 numbers given the original 
> set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
