On Wed, Jun 17, 2009 at 6:35 PM, Benson Margulies<[email protected]> wrote:
>
> As I read the paper, the idea here is to get a rough partitioning that is
> used to optimize various downstream algorithms, not to tune for a precise
> partitioning. The number of canopies doesn't need, as I read it, to be
> particularly close to the number of eventual partitions to be useful.
>
> Thus the extended discussion of how to start up and run various other
> algorithms, (e.g. k-means).
>
That's right. But here is my experience. I ran Canopy and then K-Means on 50k doc vectors. (That, by the way, is a fraction of the actual dataset.) I used the code in the 121 patch, which uses primitives for Sparse Vectors.

After some experimentation, with a t2 value of 0.9 I got only 1 cluster. When I changed it to 0.85, it generated 3000+ clusters (or canopies). As the number of canopies grows, the code starts crawling, and after some time even 2G of memory is not sufficient for it.

Canopy is one of the simplest clustering algorithms, and I still had trouble getting it to work. Maybe it's my data set. I simply didn't have the patience to hunt for the right values of t1 and t2, which are going to change anyway whenever the input changes. So, for now, I have just put a cap on the number of canopies generated. Not elegant, but the results don't seem bad at all. A rough sketch of what I mean is below.
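To be concrete, here is roughly what the cap amounts to. This is not the code from the 121 patch, just a plain-Java sketch of the textbook canopy pass with a ceiling on the number of canopies; the CappedCanopy class, the distance() function (cosine-based here) and the maxCanopies value are placeholders for illustration, not anything from my actual run.

import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of a canopy pass with a hard cap on the number of
// canopies. Not the Mahout implementation; distance() is a stand-in for
// whatever measure you actually use.
public class CappedCanopy {

    // Placeholder distance: 1 - cosine similarity, so thresholds live in [0,1].
    static double distance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // t1 > t2. A point within t1 of a center joins that canopy; a point within
    // t2 of some center is "strongly bound" and never seeds a new canopy.
    static List<List<double[]>> cluster(List<double[]> points,
                                        double t1, double t2, int maxCanopies) {
        List<double[]> centers = new ArrayList<double[]>();
        List<List<double[]>> canopies = new ArrayList<List<double[]>>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (int i = 0; i < centers.size(); i++) {
                double d = distance(p, centers.get(i));
                if (d < t1) canopies.get(i).add(p);
                if (d < t2) stronglyBound = true;
            }
            // The cap: once maxCanopies centers exist, loosely bound points
            // simply don't start new canopies.
            if (!stronglyBound && centers.size() < maxCanopies) {
                centers.add(p);
                List<double[]> canopy = new ArrayList<double[]>();
                canopy.add(p);
                canopies.add(canopy);
            }
        }
        return canopies;
    }
}

The only difference from the usual algorithm is the centers.size() < maxCanopies check, which is the crude cap I was describing.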
OK. Now, let's not focus on my ignorance. I got my hands dirty with machine learning, Mahout and Hadoop barely a few days back.

--shashi
--
http://www.bandhan.com/