Ted, everyone, I added the multiple pass BallKMeans at the end of the reducer and have uncovered a series of issues with the searchers.
FastProjectionSearch stores "pending" vectors in a separate list from the main projections, the idea being that use arrays throughout rather than a tree multiset to reduce the overhead. Here's what's happening: - after having added a bunch of centroids to the searcher, - we iterate through each centroid and search for its closest neighbor, - when calling search(), reindex() is called internally that deals with pending entries that haven't been merged in some time. When reindex() mutates the data structures, the iterator we're using to go through the centroids becomes invalid and a ConcurrentModificationException is thrown. I'm not really sure what the best approach is to fix this. Here are some options: - expose reindex() and demand users to call it explicitly (which completely kills abstraction); - when instantiating an Iterator, create a helper ImmutableList that keeps track of the pending values for the duration of the iteration and when iterating go through these first and release the reference to the list as soon as possible; - come up with an alternative design and really think about how this searcher is useful. In particular, when running minimal tests (SearchQualityTest), this FPS is faster than ordinary ProjectionSearch but worse than LSH. Any thoughts?
