I have been working on getting benchmarks running against the public
GloVe data set and spent a while chasing down what looked like a bug in
VectorValues.search but turned out to be a problem with the data (sort
of)! When comparing vectors using an angular measure (cosine
similarity), the dot product has to be normalized by the vectors'
lengths. Given that the only purpose of such vectors is to compare them
using dot product, it would be sensible to normalize them to unit
length *in advance* rather than doing so for every comparison. Yet this
is not how this dataset, at least, is distributed on the internet, and
widely referenced benchmarking software such as ann-benchmarks assumes
that code will handle such details internally.
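
To make the point concrete, here is a minimal sketch in plain Java. It is
not Lucene API: the class UnitVectors and its methods are hypothetical
names used only for illustration. It shows that once vectors are scaled to
unit length, a plain dot product gives the same score as the cosine
computed from the raw vectors, so the division by the lengths can be paid
once per vector instead of once per comparison.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class UnitVectors {

  // Plain dot product; assumes both arrays have the same length.
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Returns a unit-length copy of v, leaving the input untouched.
  static float[] normalize(float[] v) {
    double sumSquares = 0;
    for (float x : v) {
      sumSquares += (double) x * x;
    }
    if (sumSquares == 0) {
      throw new IllegalArgumentException("cannot normalize a zero vector");
    }
    double length = Math.sqrt(sumSquares);
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / length);
    }
    return out;
  }

  // Cosine similarity on raw vectors: the normalization is redone on every call.
  static double cosine(float[] a, float[] b) {
    return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
  }
}
```

With this sketch, dot(normalize(a), normalize(b)) and cosine(a, b) agree up
to floating-point rounding.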

I'm trying to see how we should handle this use case. We could provide
a convenience function for normalizing while indexing. But should we?
Would it happen when creating an IndexableField? When flushing? It's a
little strange if you index a vector, and then retrieve it and its
value is different! Alternatively we could simply expect users to
perform such normalization, and throw an error if vectors intended for
comparison using dot product (which is specified when adding a value)
are not unit-length. But then again, such a check is a somewhat costly
operation that is only a safety measure, and users who have already
normalized their vectors would pay the cost needlessly.
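
For reference, the safety check would look roughly like the sketch below.
Again this is plain Java with hypothetical names (UnitLengthCheck,
isUnitLength, requireUnitLength), not an existing Lucene method, and the
tolerance is an assumption that callers would have to choose. The cost is
one full pass over every vector added, which is the overhead that
already-normalized input would pay for nothing.

```java
// Hypothetical validation helper for illustration only; not part of Lucene.
final class UnitLengthCheck {

  // True if the squared length of v is within `tolerance` of 1.
  // Costs one pass over the vector, i.e. O(dimension) per call.
  static boolean isUnitLength(float[] v, double tolerance) {
    double sumSquares = 0;
    for (float x : v) {
      sumSquares += (double) x * x;
    }
    return Math.abs(sumSquares - 1.0) <= tolerance;
  }

  // Returns v unchanged, or throws if it is not (approximately) unit length.
  static float[] requireUnitLength(float[] v, double tolerance) {
    if (!isUnitLength(v, tolerance)) {
      throw new IllegalArgumentException(
          "vectors compared by dot product must be normalized to unit length");
    }
    return v;
  }
}
```

An exact comparison against 1 would reject correctly normalized float
vectors because of rounding, hence the tolerance parameter.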

For now, I'm doing nothing, but I wonder if we could offer users some help here.

-Mike
