msokolov opened a new pull request #1930: URL: https://github.com/apache/lucene-solr/pull/1930
This adds a floating-point vector format building on the designs in LUCENE-9004 and LUCENE-9322. This patch fully supports indexing vectors and reading them through both an iterator and a random-access API. Support for search based on an NSW graph implementation is intended to follow soon, but I wanted to include the vector APIs that I needed to get that working, even though they are not yet used here. So, for example, the patch includes the definition of a scoring function and a nearest-neighbors search API, but no implementation of search yet.

My intention is to keep the ANN implementation hidden, so graphs and other supporting data structures (e.g. we might want to support LSH, k-means clustering, and so on) would be implementation details invoked by a configuration on the VectorField/VectorValues. At the moment you can specify a ScoringFunction, and it is implicit that NSW will be the result. In the future we could add another parameter to ScoringFunction and/or new functions to represent support for other algorithms.

Some open questions:

1. Should this be Lucene 9.0 only? In this patch I added a Lucene90 Codec. If we do this, it would be awkward to backport.
2. It seems messy to have the ScoringFunction implementation in the main VectorValues interface API file. I'd appreciate any better suggestion for how to organize this.
3. Vector scoring can return negative numbers. I'd like to have first-class support for dot-product distance (which can be negative), since that's what my consumers seem to have settled on. I don't think we need to be compatible with relevance scores, at least not directly in the KNN search API, but I don't know, maybe we should? We could renormalize/convert from dot-product scores to a positive score with math in the output layer where we return the scores. So far this is just a specification question, as there is no implementation of search yet.
4. I think there is room for improvement in some of the data structures used to map docids to dense vector ordinals and back.
I'd appreciate comments on that, but maybe we could revisit in a fast follow-on issue?
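
To make open questions 3 and 4 a little more concrete, here is a minimal Java sketch. It is not code from this patch: the class and method names (VectorSketch, dotProduct, toNonNegativeScore, docToOrd, ordToDoc) are hypothetical, the logistic mapping is just one possible way to convert a possibly negative dot product into a non-negative score, and the sorted-array lookup is only one simple option for mapping docids to dense vector ordinals and back.

```java
import java.util.Arrays;

/** Hypothetical sketch for discussion; none of these names come from the patch. */
final class VectorSketch {

  /** Dot product of two equal-length vectors. The result can be negative. */
  static float dotProduct(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch: " + a.length + " != " + b.length);
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /**
   * Assumed conversion: maps a raw dot product into [0, 1] with a logistic function.
   * The mapping is non-decreasing, so ranking by the converted score matches ranking
   * by the raw dot product, up to floating-point saturation at the extremes.
   */
  static double toNonNegativeScore(float dot) {
    return 1.0 / (1.0 + Math.exp(-dot));
  }

  /**
   * Maps a docid to its dense vector ordinal, or -1 if the doc has no vector.
   * Assumes docsWithVectors is sorted ascending; a docid's ordinal is its index.
   */
  static int docToOrd(int[] docsWithVectors, int docId) {
    int idx = Arrays.binarySearch(docsWithVectors, docId);
    return idx >= 0 ? idx : -1;
  }

  /** Maps a dense vector ordinal back to its docid. */
  static int ordToDoc(int[] docsWithVectors, int ord) {
    return docsWithVectors[ord];
  }

  public static void main(String[] args) {
    float dot = dotProduct(new float[] {1f, 2f, 3f}, new float[] {-1f, 0.5f, -2f});
    System.out.println("raw dot product: " + dot);
    System.out.println("converted score: " + toNonNegativeScore(dot));

    int[] docsWithVectors = {2, 5, 9}; // only these docs have a vector
    System.out.println("ord of doc 5: " + docToOrd(docsWithVectors, 5));
    System.out.println("doc of ord 2: " + ordToDoc(docsWithVectors, 2));
  }
}
```

With this layout docid-to-ordinal lookup costs a binary search and ordinal-to-docid is a plain array read; whether that trade-off suits the format is exactly the kind of thing that could be revisited in a follow-on issue.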