Re: MeanShift Clustering duplicating vectors in canopies?

Jeff Eastman Mon, 25 Jan 2010 13:25:47 -0800

Hi Christoph,

The MeanShift canopy keeps copies of all the input points it has accreted. It does this for bookkeeping purposes, so that points can be associated with each canopy when it is done, but this clearly does not scale and is currently a showstopper for its utility in large problems (despite the M/R implementation, a large number of points will converge to a smaller number of very large cluster descriptions). I've considered two ways to improve this situation: 1) associate identifiers with each point and just store the ids instead of the whole point; 2) write out the accreted/merged canopies to a separate log file so that final cluster membership can be calculated after the fact. Option 1 would be the easiest to implement but would only give a constant-factor improvement in space. Option 2 would solve the cluster space problem but would introduce another post-processing step to track the cluster merges.
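To make the post-processing step in option 2 concrete, here is a rough sketch in Python rather than Mahout's Java. Nothing in it is existing Mahout code: the function name, the (absorbed, surviving) layout of the merge log, and the use of point ids (as in option 1) are all assumptions made for illustration.

```python
# Hypothetical sketch, not Mahout code. Assumes each point has an id, each
# canopy has an id, and every merge was logged as (absorbed, surviving).

def resolve_membership(point_to_canopy, merge_log):
    """Map each surviving canopy id to the point ids that ended up in it.

    point_to_canopy: dict of point id -> id of the canopy it started in
    merge_log: list of (absorbed_canopy_id, surviving_canopy_id), in order
    """
    parent = {}  # canopy id -> canopy id it was merged into

    def find(canopy):
        # Follow merge links until reaching a canopy that was never absorbed.
        root = canopy
        while parent.get(root, root) != root:
            root = parent[root]
        # Path compression: point every canopy on the way at the root.
        while canopy != root:
            nxt = parent[canopy]
            parent[canopy] = root
            canopy = nxt
        return root

    # Replay the merges in the order they were logged.
    for absorbed, survivor in merge_log:
        a, s = find(absorbed), find(survivor)
        if a != s:
            parent[a] = s

    # Group point ids under the canopy that finally survived.
    clusters = {}
    for point_id, canopy_id in point_to_canopy.items():
        clusters.setdefault(find(canopy_id), []).append(point_id)
    return clusters
```

With this shape the canopies themselves only ever need to carry their own id during the iterations, and the per-point bookkeeping is deferred to a single pass over the log at the end.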

Unlike the other clustering algorithms, which define symmetrical regions of n-space for each cluster, MeanShift clusters are asymmetric and so points cannot be clustered after the fact using just the cluster centers and distance measure.

I'm not sure why you are getting duplicate copies of the same point in your canopy. Your code looks like it was derived from the testReferenceImplementation unit test but has some minor differences. Why, since the code adds all the points to a new set of canopies before iterating, are you passing in 'canopies' as an argument? Can you say more about your input data set and the T1 & T2 values you used? How many iterations occurred? What was your convergence test value?

Finally, our Vector library has improved its asFormatString in a number of areas but at the cost of readability. This makes debugging terribly difficult and some sort of debuggable formatter is needed.
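As an illustration of the kind of output I have in mind, here is a small Python sketch (not Mahout's Java, and not a proposal for the actual API). The function name and both parameters are hypothetical; the idea is simply fixed precision plus truncation of long vectors.

```python
# Hypothetical sketch of a human-readable vector formatter for debugging.

def format_vector(values, precision=3, max_items=8):
    """Render a sequence of numbers compactly for log or debugger output."""
    values = list(values)
    # Fixed precision keeps columns comparable between vectors.
    shown = [f"{v:.{precision}f}" for v in values[:max_items]]
    # Summarize the tail rather than printing every element.
    if len(values) > max_items:
        shown.append(f"... ({len(values) - max_items} more)")
    return "[" + ", ".join(shown) + "]"
```

For example, `format_vector([1.0, 2.5, 0.33333])` gives `[1.000, 2.500, 0.333]`.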


Jeff





Christoph Hermann wrote:

Hello,
I'm running some clustering with Mean Shift, and in my final canopy I get the same vector 5 times.
In the original input list I only had it once, and I'm wondering why duplicates are allowed within the same canopy?
Attached is a file with the method I'm using to run mean shift as well as the output (I'm iterating over the getBoundPoints() list of the canopy).
I'd be happy if someone could explain this.

regards
Christoph Hermann
