[
https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992776#comment-12992776
]
Lance Norskog commented on MAHOUT-563:
--------------------------------------
bq. I assume the algorithm is fine enough to commit as a start. Anyone know of
a better way, or is this already done?
The algorithm is a placeholder. Studying the problem again, I think the
algorithm should minimize overlaps.
Here's a possibility: use the Floyd-Warshall algorithm to find the distances
between all pairs of points. Then build a histogram of those distances and
choose T1 and T2 from different percentiles.
[Floyd-Warshall-Roy
Algorithm|http://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm]
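The histogram-and-percentile idea can be sketched roughly as follows. This is
only an illustration, not the code in the attached patch. The function names,
the Euclidean metric and the default percentiles are all assumptions. It also
computes the pairwise distances directly rather than running Floyd-Warshall:
with a metric that obeys the triangle inequality, the direct distance between
two points is already the shortest path between them. T1 is taken from the
higher percentile because canopy clustering expects T1 > T2.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Sorted Euclidean distances for every unordered pair of points."""
    return sorted(math.dist(a, b) for a, b in combinations(points, 2))


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("need at least one distance")
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def estimate_t1_t2(points, t1_pct=75.0, t2_pct=25.0):
    """Pick T1 (loose) and T2 (tight) from the distance distribution.

    The default percentiles are placeholders, not recommended values.
    """
    if t2_pct >= t1_pct:
        raise ValueError("T1 percentile must be above the T2 percentile")
    distances = pairwise_distances(points)
    return percentile(distances, t1_pct), percentile(distances, t2_pct)
```

Computing every pair is quadratic in the number of points, so in practice
this would run on a subsample rather than the full data set.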
bq. or should it go somewhere other than utils/?
It is not intended as a separate utility program. You would use it in code
directly. The 'make a canopy' job would use it by default, with an option to
let you specify T1/T2 yourself.
bq. Should those T1/T2 values be output, but not the canopies?
This is the point of having an object that stashes them all. The algorithm
explicitly subsamples the data merely to get the distances. It is possible that
the generated canopies will be good enough for many applications, so some way
to pull them out is needed. It should also store the sampled vectors, since
those can be reused for other purposes.
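As a rough sketch of such an object (hypothetical names throughout, not the
class in the patch), the estimator could subsample, derive T1/T2, build
canopies over the sample, and return everything together. The canopy pass is
a simplified single-pass variant, and the percentiles and seed are assumptions
chosen only for illustration.

```python
import math
import random
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class CanopyEstimate:
    """Hypothetical holder for everything the estimator produces."""
    t1: float
    t2: float
    sampled_vectors: list
    canopies: list = field(default_factory=list)  # (center, members) pairs


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def build_canopies(points, t1, t2):
    """Simplified canopy pass: members lie within T1 of a center,
    and points within T2 of a center are dropped as future centers."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center] + [p for p in remaining if math.dist(center, p) < t1]
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
        canopies.append((center, members))
    return canopies


def estimate(vectors, sample_size, t1_pct=75.0, t2_pct=25.0, seed=0):
    """Subsample, estimate T1/T2 from pairwise distances, keep everything."""
    vectors = list(vectors)
    if len(vectors) < 2 or sample_size < 2:
        raise ValueError("need at least two vectors in the sample")
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    distances = sorted(math.dist(a, b) for a, b in combinations(sample, 2))
    t1 = nearest_rank(distances, t1_pct)
    t2 = nearest_rank(distances, t2_pct)
    return CanopyEstimate(t1, t2, sample, build_canopies(sample, t1, t2))
```

A caller that only wants the thresholds reads `t1` and `t2`, while one that
is happy with the sampled canopies can take `canopies` and skip a second pass.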
> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
> Key: MAHOUT-563
> URL: https://issues.apache.org/jira/browse/MAHOUT-563
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Reporter: Lance Norskog
> Assignee: Sean Owen
> Priority: Minor
> Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly
> unsatisfying task. This class estimates T1 and T2 numbers given the original
> set.