Hi again Christoph,
Further debugging of my test code shows that I was using List.contains()
to test for duplicates. That method compares with o.equals() and not ==,
so it erroneously reported duplicate boundPoints whenever two vectors had
the same *values*, even though they were distinct *instances*. So,
unless you can provide a dataset that exhibits the problem in the
reference implementation you are running, I cannot help much more.
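
To illustrate the distinction, here is a minimal, self-contained sketch. It
is not the actual test code: the Point class is a hypothetical stand-in for
a vector whose equals() compares values, and the class and method names are
made up for this example. It contrasts a List.contains() check with an
identity-based check built on IdentityHashMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class DuplicateCheck {

    // Hypothetical stand-in for a vector class whose equals() compares values.
    static final class Point {
        final double[] values;

        Point(double... values) {
            this.values = values;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Point && Arrays.equals(values, ((Point) o).values);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(values);
        }
    }

    // Value-based check: List.contains() delegates to equals(), so two
    // distinct instances holding the same values count as a duplicate.
    static boolean hasValueDuplicates(List<Point> points) {
        List<Point> seen = new ArrayList<>();
        for (Point p : points) {
            if (seen.contains(p)) {
                return true;
            }
            seen.add(p);
        }
        return false;
    }

    // Identity-based check: a set backed by IdentityHashMap compares with ==,
    // so only the very same instance appearing twice counts as a duplicate.
    static boolean hasIdentityDuplicates(List<Point> points) {
        Set<Point> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Point p : points) {
            if (!seen.add(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Point a = new Point(1.0, 2.0);
        Point b = new Point(1.0, 2.0); // same values, different instance

        List<Point> distinctInstances = Arrays.asList(a, b);
        List<Point> sameInstanceTwice = Arrays.asList(a, a);

        System.out.println(hasValueDuplicates(distinctInstances));    // true
        System.out.println(hasIdentityDuplicates(distinctInstances)); // false
        System.out.println(hasIdentityDuplicates(sameInstanceTwice)); // true
    }
}
```

With a check like the first one, a dataset that merely contains equal-valued
points looks as if the same instance had been added twice.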
PS: If your dataset happens to contain duplicate-valued points, then it
would be reasonable to see them all in an output cluster. Perhaps
assigning names to your vectors, which are encoded in their writable
state, will help to resolve this for you.
PPS: Assigning unique names to the data points will be a prerequisite
for either of the optimization strategies I mentioned earlier in this thread.
Jeff
Jeff Eastman wrote:
Hi Christoph,
The only unit test that exhibits this problem is the one that runs
the full MR job (testCanopyEuclideanMRJob()). This is darn hard to
debug and is doubly baffling, since all the vectors should be read from
Writable format into new, distinct instances. If you have a small
dataset that exhibits the problem while running the reference
implementation, it would be very nice if you could share it.
Jeff
Jeff Eastman wrote:
I added some test code to detect duplicate boundPoint entries and can
reproduce the issue in a unit test. I will see what is happening and
let you know.
Jeff