On Wed, Jun 17, 2009 at 6:35 PM, Benson Margulies<[email protected]> wrote:
>
> As I read the paper, the idea here is to get a rough partitioning that is
> used to optimize various downstream algorithms, not to tune for a precise
> partitioning. The number of canopies doesn't need, as I read it, to be
> particularly close to the number of eventual partitions to be useful.
>
> Thus the extended discussion of how to start up and run various other
> algorithms, (e.g. k-means).
>
That's right. But here is my experience. I ran Canopy and then K-Means on 50k doc vectors. (That, by the way, is a fraction of the actual dataset.) I used the code in the 121 patch, which uses primitives for Sparse Vectors.

After some experimentation, with a t2 value of 0.9 I got only 1 cluster. When I changed it to 0.85, it generated 3000+ clusters (or canopies). As the number of canopies grows, the code starts crawling, and after some time even 2G of memory is not sufficient for it.

Canopy is one of the simplest clustering algorithms, and I still had trouble getting it to work. Maybe it's my data set. I simply didn't have the patience to hunt for the right values of t1 and t2, which are going to change anyway whenever the input changes. So, for now, I have just put a cap on the number of canopies generated. Not elegant, but the results don't seem bad at all. A rough sketch of what I mean is below.
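To be concrete, here is roughly what the cap amounts to. This is not the code from the 121 patch, just a plain-Java sketch of the textbook canopy pass with a ceiling on the number of canopies; the CappedCanopy class, the distance() function (cosine-based here) and the maxCanopies value are placeholders for illustration, not anything from my actual run.

import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of a canopy pass with a hard cap on the number of
// canopies. Not the Mahout implementation; distance() is a stand-in for
// whatever measure you actually use.
public class CappedCanopy {

    // Placeholder distance: 1 - cosine similarity, so thresholds live in [0,1].
    static double distance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // t1 > t2. A point within t1 of a center joins that canopy; a point within
    // t2 of some center is "strongly bound" and never seeds a new canopy.
    static List<List<double[]>> cluster(List<double[]> points,
                                        double t1, double t2, int maxCanopies) {
        List<double[]> centers = new ArrayList<double[]>();
        List<List<double[]>> canopies = new ArrayList<List<double[]>>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (int i = 0; i < centers.size(); i++) {
                double d = distance(p, centers.get(i));
                if (d < t1) canopies.get(i).add(p);
                if (d < t2) stronglyBound = true;
            }
            // The cap: once maxCanopies centers exist, loosely bound points
            // simply don't start new canopies.
            if (!stronglyBound && centers.size() < maxCanopies) {
                centers.add(p);
                List<double[]> canopy = new ArrayList<double[]>();
                canopy.add(p);
                canopies.add(canopy);
            }
        }
        return canopies;
    }
}

The only difference from the usual algorithm is the centers.size() < maxCanopies check, which is the crude cap I was describing.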
OK. Now, let's not focus on my ignorance. I got my hands dirty with machine learning, Mahout and Hadoop barely a few days back.

--shashi
--
http://www.bandhan.com/