[ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992794#comment-12992794 ]
Ted Dunning commented on MAHOUT-563:
------------------------------------
{quote}
Here's a possibility: use Floyd's Algorithm to find all of the distances
between points. Then, make a histogram of the distances and choose T1 and T2
from different percentiles.
Floyd-Warshall-Roy Algorithm
{quote}
Uhh... this isn't a graph.
And we try not to do anything that starts "all distances between all points"
because that way lies madness in terms of scalability.
Picking a thousand pairs of points and using those for your histogram might
work. At least you would get a reasonable approximation of the actual distance
histogram.
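A minimal sketch of the pair-sampling idea, in Python rather than Mahout's
Java, and not taken from the attached patch. The function name
estimate_thresholds, the Euclidean distance, and the 10th/30th percentile
defaults are all assumptions made for illustration; the comment only suggests
"a thousand pairs" and picking T1 and T2 from percentiles of the sampled
histogram.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_thresholds(points, num_pairs=1000, t2_pct=0.10, t1_pct=0.30,
                        distance=euclidean, seed=None):
    """Estimate (T1, T2) from a sample of pairwise distances.

    Hypothetical helper: the percentile defaults are arbitrary placeholders,
    not values recommended anywhere in the issue.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    if not 0.0 <= t2_pct < t1_pct <= 1.0:
        raise ValueError("require 0 <= t2_pct < t1_pct <= 1")

    rng = random.Random(seed)
    dists = []
    for _ in range(num_pairs):
        # Two distinct indices, so a point is never paired with itself.
        i, j = rng.sample(range(len(points)), 2)
        dists.append(distance(points[i], points[j]))
    dists.sort()

    def percentile(p):
        # Nearest-rank style lookup into the sorted sample.
        return dists[min(len(dists) - 1, int(p * len(dists)))]

    # Canopy expects T1 > T2, so T1 comes from the higher percentile.
    return percentile(t1_pct), percentile(t2_pct)
```

The cost is num_pairs distance computations regardless of how many points
there are, which is the reason for sampling instead of computing all
pairwise distances.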
My guess is that the zone of applicability for a simple heuristic like this
will be fairly narrow. I have seen data that looks like points on a
hypersphere: all pairwise distances cluster around a single value, yet the
data can often still be clustered well. I have also seen data where cluster
membership follows a power law, so the clusters have very different sizes.
In that case, the histogram is likely to show only the large clusters.
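The hypersphere case can be illustrated with a small experiment. This is only
a sketch under an assumption: uniformly random unit vectors stand in for
"data on a hypersphere", and the helper names are hypothetical. It shows the
concentration of sampled distances around one value as the dimension grows;
it says nothing about whether such data can be clustered.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a point uniformly from the unit sphere by normalizing Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sampled_distance_spread(dim, num_points=200, num_pairs=1000, seed=0):
    """Return (mean, standard deviation) of sampled pairwise distances."""
    rng = random.Random(seed)
    points = [random_unit_vector(dim, rng) for _ in range(num_points)]
    dists = []
    for _ in range(num_pairs):
        a, b = rng.sample(points, 2)  # two distinct points
        dists.append(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return mean, math.sqrt(var)
```

Comparing sampled_distance_spread(2) with sampled_distance_spread(500) shows
the standard deviation shrinking while the mean approaches sqrt(2), so
percentiles of the histogram end up nearly identical and give little
separation between T1 and T2.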
> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
> Key: MAHOUT-563
> URL: https://issues.apache.org/jira/browse/MAHOUT-563
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Reporter: Lance Norskog
> Assignee: Sean Owen
> Priority: Minor
> Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly
> unsatisfying task. This class estimates T1 and T2 numbers given the original
> set.