Re: MeanShift Clustering duplicating vectors in canopies?

Jeff Eastman Tue, 26 Jan 2010 16:18:16 -0800

Hi again Christoph,

Further debugging of my test code indicates I was using List.contains() to test for duplicates, and this uses o.equals() and not ==, so it was erroneously claiming the boundPoints had duplicates because the *values* of the vectors were the same and not the *identities* thereof. So, unless you can provide a dataset which exhibits the problem in the reference implementation you are running, I cannot help much more.
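To illustrate the distinction, here is a minimal sketch of the two kinds of duplicate check. It is not the actual test code: the class DuplicateCheck, the Point type, and both method names are hypothetical stand-ins, and Point only assumes a vector-like type whose equals() compares values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateCheck {

    // Hypothetical stand-in for a vector type whose equals() compares values.
    static final class Point {
        final double[] values;

        Point(double... values) {
            this.values = values.clone();
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Point && Arrays.equals(values, ((Point) o).values);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(values);
        }
    }

    // Value-based check: List.contains() compares elements with equals(),
    // so two distinct instances holding the same values count as duplicates.
    static <T> boolean hasValueDuplicates(List<T> items) {
        List<T> seen = new ArrayList<>();
        for (T item : items) {
            if (seen.contains(item)) {
                return true;
            }
            seen.add(item);
        }
        return false;
    }

    // Identity-based check: IdentityHashMap compares keys with ==,
    // so only the very same instance appearing twice counts as a duplicate.
    static <T> boolean hasIdentityDuplicates(List<T> items) {
        Map<T, Boolean> seen = new IdentityHashMap<>();
        for (T item : items) {
            if (seen.put(item, Boolean.TRUE) != null) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Point a = new Point(1.0, 2.0);
        Point b = new Point(1.0, 2.0); // same values, distinct instance

        List<Point> equalValues = Arrays.asList(a, b);
        System.out.println(hasValueDuplicates(equalValues));    // true
        System.out.println(hasIdentityDuplicates(equalValues)); // false

        List<Point> sameInstance = Arrays.asList(a, a);
        System.out.println(hasIdentityDuplicates(sameInstance)); // true
    }
}
```

The value-based check reports a duplicate for any two points with equal contents, which is the false positive described above; the identity-based check only fires when one instance really appears twice.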

PS: if your dataset happens to contain duplicate-valued points then it would be reasonable to see them all in an output cluster. Perhaps assigning names to your vectors - which are encoded in their writable state - will help to resolve this for you.

PPS: assigning unique names to the data points will be a prerequisite to either of the optimization strategies I mentioned earlier in this thread.


Jeff

Jeff Eastman wrote:

Hi Christoph,
The only unit test which exhibits this problem is the one which runs the full MR job (testCanopyEuclideanMRJob()). This is darn hard to debug and is doubly baffling since all the vectors should be read from Writable format into new, distinct instances. If you have a small dataset which exhibits the problem while running the reference implementation it would be very nice if you could share it.
Jeff

Jeff Eastman wrote:
I added some test code to detect duplicate boundPoint entries and can duplicate the issue in a unit test. I will see what is happening and let you know.
Jeff
