Hi again Christoph,

Further debugging of my test code shows that I was using List.contains() to test for duplicates. That method compares with o.equals(), not ==, so it erroneously reported that boundPoints had duplicates: the *values* of the vectors were equal, but their *identities* were not. So, unless you can provide a dataset that exhibits the problem in the reference implementation you are running, I cannot help much more.
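To illustrate the difference between the two checks, here is a minimal sketch in plain Java. The Point class and the helper names are hypothetical stand-ins for this example only; they are not the vector class or the test code from the project, and simply assume a type whose equals() compares values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DuplicateCheck {

    // Hypothetical value type: equals() compares contents, like a vector class would.
    static final class Point {
        final double[] values;

        Point(double... values) {
            this.values = values;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Point && Arrays.equals(values, ((Point) o).values);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(values);
        }
    }

    // Value-based check: List.contains() calls equals() on the elements.
    static boolean containsByValue(List<Point> points, Point p) {
        return points.contains(p);
    }

    // Identity-based check: true only if the very same instance is in the list.
    static boolean containsByIdentity(List<Point> points, Point p) {
        for (Point q : points) {
            if (q == p) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Point> boundPoints = new ArrayList<>();
        Point a = new Point(1.0, 2.0);
        Point b = new Point(1.0, 2.0); // same values, distinct instance
        boundPoints.add(a);

        System.out.println(containsByValue(boundPoints, b));    // prints "true"
        System.out.println(containsByIdentity(boundPoints, b)); // prints "false"
    }
}
```

A duplicate detector built on the value-based check flags b as already present even though it is a separate instance, which is the false positive described above.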

PS: If your dataset happens to contain duplicate-valued points, then it would be reasonable to see all of them in an output cluster. Perhaps assigning names to your vectors (the names are encoded in their Writable state) will help to resolve this for you.

PPS: Assigning unique names to the data points will be a prerequisite for either of the optimization strategies I mentioned earlier in this thread.

Jeff

Jeff Eastman wrote:
Hi Christoph,

The only unit test that exhibits this problem is the one that runs the full MR job (testCanopyEuclideanMRJob()). This is darn hard to debug and is doubly baffling, since all the vectors should be read from Writable format into new, distinct instances. If you have a small dataset that exhibits the problem while running the reference implementation, it would be very nice if you could share it.

Jeff

Jeff Eastman wrote:
I added some test code to detect duplicate boundPoint entries and can duplicate the issue in a unit test. I will see what is happening and let you know.
Jeff

