Ted, everyone,

I added the multiple pass BallKMeans at the end of the reducer and
have uncovered a series of issues with the searchers.

FastProjectionSearch stores "pending" vectors in a separate list from
the main projections, the idea being that use arrays throughout rather
than a tree multiset to reduce the overhead.

Here's what's happening:
- after having added a bunch of centroids to the searcher,
- we iterate through each centroid and search for its closest neighbor,
- when calling search(), reindex() is called internally that deals
with pending entries that haven't been merged in some time. When
reindex() mutates the data structures, the iterator we're using to go
through the centroids becomes invalid and a
ConcurrentModificationException is thrown.

I'm not really sure what the best approach is to fix this. Here are
some options:
- expose reindex() and demand users to call it explicitly (which
completely kills abstraction);
- when instantiating an Iterator, create a helper ImmutableList that
keeps track of the pending values for the duration of the iteration
and when iterating go through these first and release the reference to
the list as soon as possible;
- come up with an alternative design and really think about how this
searcher is useful.

In particular, when running minimal tests (SearchQualityTest), this
FPS is faster than ordinary ProjectionSearch but worse than LSH.

Any thoughts?

Reply via email to