Re: MeanShift Clustering duplicating vectors in canopies?

Christoph Hermann Tue, 26 Jan 2010 04:25:28 -0800

On Monday, 25 January 2010, you wrote:

Hello,


> The Meanshift canopy keeps copies of all the input points it has
> accreted. It does this for bookkeeping purposes, so that points can
>  be associated with each canopy when it is done, but this clearly
>  does not scale and is currently a showstopper for its utility in
>  large problems (despite the M/R implementation, a large number of
>  points will converge to a smaller number of very large cluster
>  descriptions). I've considered two ways to improve this situation:
>  1) associate identifiers with each point and just store the ids
>  instead of the whole point; 2) write out the accreted/merged
>  canopies to a separate log file so that final cluster membership can
>  be calculated after the fact. Option 1 would be the easiest to
>  implement but would only give an order-constant improvement in
>  space. Option 2 would solve the cluster space problem but would
>  introduce another post-processing step to track the cluster merges.

OK, that's good to know. Thanks for the explanation.
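
To check that I understand option 1, here is a minimal sketch of a canopy
that keeps only point ids plus a running sum, instead of full copies of
the points. IdOnlyCanopy and all of its methods are hypothetical names of
mine, not Mahout's MeanShiftCanopy, and plain double[] arrays stand in for
Mahout vectors.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration only; not part of Mahout.
class IdOnlyCanopy {
  private final double[] sum;                                // running sum of all accreted points
  private final List<Integer> pointIds = new ArrayList<>(); // ids instead of full point copies

  IdOnlyCanopy(int dimensions) {
    this.sum = new double[dimensions];
  }

  // Accrete one point: update the running sum and remember only its id.
  void add(int id, double[] point) {
    for (int i = 0; i < sum.length; i++) {
      sum[i] += point[i];
    }
    pointIds.add(id);
  }

  // Absorb another canopy: sums and id lists are combined.
  void merge(IdOnlyCanopy other) {
    for (int i = 0; i < sum.length; i++) {
      sum[i] += other.sum[i];
    }
    pointIds.addAll(other.pointIds);
  }

  // Mean of all accreted points (NaN components if the canopy is empty).
  double[] centroid() {
    double[] center = new double[sum.length];
    for (int i = 0; i < sum.length; i++) {
      center[i] = sum[i] / pointIds.size();
    }
    return center;
  }

  // Copy of the ids of all points accreted so far.
  List<Integer> pointIds() {
    return new ArrayList<>(pointIds);
  }
}
```

This still stores one id per input point, so the memory use still grows
with the number of points; only the per-point cost shrinks.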

I'm currently running some small experiments with the different
clustering implementations in Mahout and, to be honest, although I think
I understand how the algorithms work, I have some problems with the code.

I chose MeanShift first because it lets me start without specifying the
number of clusters, and because I easily found out how to check which
vector belongs to which canopy after running the algorithm.

> Unlike the other clustering algorithms, which define symmetrical
>  regions of n-space for each cluster, Meanshift clusters are
>  asymmetric and so points cannot be clustered after the fact using
>  just the cluster centers and distance measure.

OK. Since a 'Cluster' (using k-means) does not seem to hold references
to the vectors it contains, how do I get all the vectors belonging to
one cluster?

Do I have to iterate over all points and calculate the nearest cluster
again? That looks very inefficient to me; maybe you can point me in the
right direction.

What I need to do is locate a vector in the list of clusters and list
all the other vectors belonging to the same cluster.
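
For k-means, the brute-force approach I have in mind looks roughly like
the sketch below: one pass over all points, assigning each to its nearest
center, then a lookup of the points that share a cluster with a given
one. NearestClusterAssignment and its methods are hypothetical names, not
Mahout classes; double[] arrays stand in for vectors, squared Euclidean
distance is assumed as the measure, and the list of centers is assumed to
be non-empty.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Mahout.
class NearestClusterAssignment {

  // Squared Euclidean distance between two vectors of equal length.
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Index of the center closest to the given point.
  static int nearestCenter(double[] point, List<double[]> centers) {
    int best = -1;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int c = 0; c < centers.size(); c++) {
      double dist = squaredDistance(point, centers.get(c));
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }

  // One pass over all points: cluster index -> indices of its member points.
  static Map<Integer, List<Integer>> assign(List<double[]> points, List<double[]> centers) {
    Map<Integer, List<Integer>> members = new HashMap<>();
    for (int p = 0; p < points.size(); p++) {
      int c = nearestCenter(points.get(p), centers);
      members.computeIfAbsent(c, k -> new ArrayList<>()).add(p);
    }
    return members;
  }

  // Indices of all other points in the same cluster as the point at index 'query'.
  static List<Integer> coMembers(int query, List<double[]> points, List<double[]> centers) {
    Map<Integer, List<Integer>> members = assign(points, centers);
    int c = nearestCenter(points.get(query), centers);
    List<Integer> result = new ArrayList<>(members.get(c));
    result.remove(Integer.valueOf(query)); // drop the query point itself
    return result;
  }
}
```

This costs one distance computation per point and center, which is the
inefficiency I am worried about. As far as I understand your remark, it
would not work for the Meanshift canopies at all.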

> I'm not sure why you are getting duplicate copies of the same point
>  in your canopy. Your code looks like it was derived from the
> testReferenceImplementation unit test but has some minor differences.
> Why, since the code adds all the points to a new set of canopies
>  before iterating, are you passing in 'canopies' as an argument? Can
>  you say more about your input data set and the T1 & T2 values you
>  used? How many iterations occurred? What was your convergence test
>  value?

Actually, I took it from the "DisplayMeanShift" class.
I'm passing in "canopies" (which is currently just an empty list) because
I wanted to extend the code later on. For now it's useless.

My input data is a distribution of file downloads: for each day I have
the day, the id of a file, and the number of downloads.
For a certain period of time (e.g. 7 days, starting at day x) I select
the download values and try to cluster similar download patterns.
T1 and T2 vary; I'm still trying to find the best values.
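
To make the input concrete, this is roughly how the vectors are built
from the records, as a simplified sketch rather than my actual code. The
DownloadWindows class and the {day, fileId, downloads} array layout are
assumptions made for this example; each file gets one vector with one
component per day of the window.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical data layout for illustration; each record is {day, fileId, downloads}.
class DownloadWindows {

  // Builds one vector per file holding its download counts for the days
  // startDay .. startDay + length - 1. Days without a record stay at 0;
  // files with no record inside the window get no vector at all.
  static Map<Integer, double[]> buildVectors(int[][] records, int startDay, int length) {
    Map<Integer, double[]> vectors = new TreeMap<>();
    for (int[] r : records) {
      int day = r[0];
      int fileId = r[1];
      int downloads = r[2];
      int offset = day - startDay;
      if (offset < 0 || offset >= length) {
        continue; // record lies outside the selected window
      }
      vectors.computeIfAbsent(fileId, k -> new double[length])[offset] += downloads;
    }
    return vectors;
  }
}
```

With startDay = x and length = 7, each resulting vector is one 7-day
download pattern, and these vectors are what I pass to the clustering.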

> Finally, our Vector library has improved its asFormatString in a
>  number of areas but at the cost of readability. This makes debugging
>  terribly difficult and some sort of debuggable formatter is needed.

At least it shows me the ids and values of the vector; that's how I
found out the canopy contains duplicates. I'll improve that later on.

regards
Christoph

-- 
Christoph Hermann
Institut für Informatik
Tel: +49 761-203-8171 Fax: +49 761-203-8162
e-mail: [email protected]
