[ 
https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992776#comment-12992776
 ] 

Lance Norskog commented on MAHOUT-563:
--------------------------------------

bq. I assume the algorithm is fine enough to commit as a start. Anyone know of 
a better way, or is this already done?
The algorithm is a placeholder. Studying the problem again, I think the 
algorithm should minimize overlap between canopies.
Here's a possibility: use the Floyd-Warshall algorithm to find the distances 
between all pairs of points. Then build a histogram of those distances and 
choose T1 and T2 from different percentiles.
[Floyd-Warshall-Roy 
Algorithm|http://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm]
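A minimal sketch of the histogram idea, under stated assumptions: it computes plain Euclidean distances between every pair of sampled points directly instead of running Floyd-Warshall, sorts them, and reads T1 and T2 off two percentiles. The class name {{CanopyThresholdSketch}}, its method names, and the 0.75/0.25 percentile choices are hypothetical and are not taken from MAHOUT-563.patch. The only convention assumed from canopy clustering is that T1 is the looser threshold, so it gets the higher percentile.

```java
import java.util.Arrays;

/** Hypothetical sketch; not the CanopyEstimator class from the patch. */
final class CanopyThresholdSketch {

    /** Euclidean distance between two equal-length vectors. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** All n*(n-1)/2 pairwise distances of the sample, sorted ascending. */
    static double[] sortedPairwiseDistances(double[][] sample) {
        int n = sample.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two points");
        }
        double[] dists = new double[n * (n - 1) / 2];
        int k = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                dists[k++] = distance(sample[i], sample[j]);
            }
        }
        Arrays.sort(dists);
        return dists;
    }

    /** Nearest-rank percentile of a sorted array, for p in (0, 1]. */
    static double percentile(double[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length);
        rank = Math.min(Math.max(rank, 1), sorted.length);
        return sorted[rank - 1];
    }

    /** Returns {T1, T2}, read from percentiles p1 and p2 (p1 >= p2 assumed). */
    static double[] estimate(double[][] sample, double p1, double p2) {
        double[] dists = sortedPairwiseDistances(sample);
        return new double[] { percentile(dists, p1), percentile(dists, p2) };
    }

    public static void main(String[] args) {
        // Toy sample; the percentiles here are illustrative guesses only.
        double[][] sample = { {0, 0}, {3, 4}, {6, 8}, {0, 1} };
        double[] t = estimate(sample, 0.75, 0.25);
        System.out.println("T1=" + t[0] + " T2=" + t[1]);
    }
}
```

The pairwise step is quadratic in the sample size, which is one reason to run it on a subsample rather than the full data set.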
bq. or it should go somewhere but utils/?
It is not intended as a separate utility program. You would use it directly in 
code. The 'make a canopy' job would use it by default, with an option to let 
you specify T1/T2 yourself.
bq. Should those T1/T2 values be output, but not the canopies?
This is the point of having an object that stashes them all. The algorithm 
subsamples the data only to measure the distances. The generated canopies may 
be good enough for many applications, so there needs to be some way to pull 
them out. It should also store the sampled vectors, which can be reused for 
other purposes.
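One way such a holder object could look, as a sketch only: an immutable value class that keeps the estimated T1/T2, the sampled vectors, and the canopy centers found on the sample. The class name {{CanopyEstimateSketch}} and its accessors are hypothetical, plain {{double[]}} arrays stand in for Mahout vectors, and the T1 >= T2 check reflects the usual canopy convention rather than anything stated in the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical result holder; not the API from MAHOUT-563.patch. */
final class CanopyEstimateSketch {
    private final double t1;
    private final double t2;
    private final List<double[]> sampledVectors;
    private final List<double[]> canopyCenters;

    CanopyEstimateSketch(double t1, double t2,
                         List<double[]> sampledVectors,
                         List<double[]> canopyCenters) {
        if (t1 < t2) {
            throw new IllegalArgumentException("T1 must not be smaller than T2");
        }
        this.t1 = t1;
        this.t2 = t2;
        // Copy the lists so later changes by the caller do not leak in.
        this.sampledVectors =
            Collections.unmodifiableList(new ArrayList<>(sampledVectors));
        this.canopyCenters =
            Collections.unmodifiableList(new ArrayList<>(canopyCenters));
    }

    double getT1() { return t1; }

    double getT2() { return t2; }

    /** The subsample used to measure distances, kept for reuse. */
    List<double[]> getSampledVectors() { return sampledVectors; }

    /** Canopy centers generated from the subsample. */
    List<double[]> getCanopyCenters() { return canopyCenters; }
}
```

A caller that only wants the thresholds reads {{getT1()}} and {{getT2()}}; one that finds the sampled canopies good enough can take {{getCanopyCenters()}} and skip a full run.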


> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly 
> unsatisfying task. This class estimates T1 and T2 numbers given the original 
> set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
