I have been getting benchmarks running on the public GloVe data set and spent a while chasing down a bug in VectorValues.search that turned out to be a bug with the data (sort of)! When comparing vectors using an angular measure, the dot product has to be divided by the vectors' lengths. Given that the only purpose of such vectors is to compare them using dot product, it would be sensible to normalize them *in advance* to unit length, rather than doing so for every comparison. Yet that is not how this dataset, at least, is distributed on the internet, and widely referenced benchmarking software such as ann-benchmarks assumes that the code under test will handle such details internally.
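To make the point concrete, here is a minimal sketch in plain Java of why normalizing in advance is enough. The class and method names (UnitVectors, dot, cosine, normalize) are made up for illustration and are not part of Lucene's API.

```java
public final class UnitVectors {
    private UnitVectors() {}

    /** Plain dot product; assumes a and b have the same dimension. */
    public static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Cosine similarity: pays for two norms on every single comparison. */
    public static float cosine(float[] a, float[] b) {
        return (float) (dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b))));
    }

    /** Returns a unit-length copy of v; the input array is left untouched. */
    public static float[] normalize(float[] v) {
        double norm = Math.sqrt(dot(v, v));
        if (norm == 0.0) {
            throw new IllegalArgumentException("cannot normalize a zero vector");
        }
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] a = {3f, 4f};
        float[] b = {4f, 3f};
        // Both lines print approximately 0.96 (24 / 25), up to float rounding.
        System.out.println(cosine(a, b));
        System.out.println(dot(normalize(a), normalize(b)));
    }
}
```

Once the vectors are unit length, the bare dot product gives the same ranking as the cosine, and the per-comparison square roots and division go away.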
I'm trying to see how we should handle this use case. We could provide a convenience function for normalizing while indexing. But should we? Would it happen when creating an IndexableField? When flushing? It's a little strange if you index a vector, then retrieve it, and its value is different! Alternatively, we could simply expect users to perform such normalization, and throw an error if vectors intended for comparison using dot product (which is specified when adding a value) are not unit length. But then again, this check is somewhat costly and is only a safety measure, so users who have already normalized their vectors would pay the cost needlessly. For now I'm doing nothing, but I wonder if we could offer users some help here.

-Mike