All I know is what I learned from reading the paper. Still, I continue to think that you may be trying to make Canopy do something it was not intended to do.
As I read the paper, the idea is to get a rough partitioning that is used to optimize various downstream algorithms, not to tune for a precise partitioning. The number of canopies doesn't need, as I read it, to be particularly close to the number of eventual partitions to be useful. Hence the extended discussion of how to start up and run various other algorithms (e.g. k-means).

Still, you need to get some useful number of partitions. The paper has a classic toss-off line, 'we used cross-validation,' without any details about exactly what the authors did. Presumably, that means that the authors ran many possible values and hand-examined the results. The paper reports no general results about how sensitive the T values are to particular input data sets. A pessimist would fear that, for any new input, you're going to need to go through a lengthy process to find good values for T1 and T2.

This leads me to wonder, ignorantly, why this project is so focused on Canopy. The paper describes it as a tool for speeding up various other things. Since you're hadooping all those other things, how much does it help?

Anyway, I expect that my ignorance is on comprehensive display here.

On Wed, Jun 17, 2009 at 7:16 AM, Grant Ingersoll <[email protected]> wrote:

> Shashikant asked this over on mahout-dev, but I thought I would move it to
> user so that others can benefit from the discussion.
>
>
> On Jun 17, 2009, at 1:12 AM, Shashikant Kore (JIRA) wrote:
>
>>
>> Shashikant Kore commented on MAHOUT-121:
>> ----------------------------------------
>>
>>
>> [OT] Also, was wondering how you came up with the values of t1 and t2 as
>> 1.3 & 1.0. This is voodoo for me. For the dataset I am working with has a
>> window of 0.05 in which the result changes from 0 canopies to 3,000
>> canopies.
>>
>
> I just picked some numbers based on what you did! It is voodoo to me too.
> I have not done much clustering, so I'm learning a lot here. As for
> MAHOUT-121, I just wanted something to run.
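
For anyone following the T1/T2 discussion, here is a rough sketch of the canopy step and a brute-force sweep over T2, which is roughly what "run many possible values and look at the results" would amount to. It is not Mahout's implementation and not necessarily what the paper's authors did. The names `canopy_clusters` and `sweep_t2` are made up for illustration, plain Euclidean distance stands in for the cheap distance measure, and the fixed T1/T2 ratio of 1.3 just echoes the 1.3 and 1.0 values mentioned in the thread.

```python
import math


def canopy_clusters(points, t1, t2):
    """Sketch of canopy generation with a loose threshold t1 and a tight threshold t2.

    Assumption: Euclidean distance is used as the cheap distance measure.
    Returns a list of (center, members) pairs; canopies may overlap.
    """
    if t1 < t2:
        raise ValueError("t1 (loose) must be >= t2 (tight)")
    remaining = list(points)
    canopies = []
    while remaining:
        # Take the next candidate off the list as a canopy center.
        center = remaining.pop(0)
        # Every point within t1 of the center joins this canopy.
        members = [p for p in points if math.dist(center, p) < t1]
        # Points within t2 of the center can no longer become centers.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
        canopies.append((center, members))
    return canopies


def sweep_t2(points, t2_values, ratio=1.3):
    """Count canopies for each t2, with t1 = ratio * t2 (the ratio is an assumption)."""
    return {
        t2: len(canopy_clusters(points, t1=ratio * t2, t2=t2))
        for t2 in t2_values
    }


if __name__ == "__main__":
    # Two tight pairs of points, far apart from each other.
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
    print(sweep_t2(pts, [0.05, 1.0, 100.0]))  # {0.05: 4, 1.0: 2, 100.0: 1}
```

Even on four points the canopy count swings from 4 to 1 depending on T2, so a sweep like this seems like the least one would have to do for each new data set.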
