msokolov opened a new pull request #1930:
URL: https://github.com/apache/lucene-solr/pull/1930


   This adds a floating-point vector format building on the designs in 
LUCENE-9004 and LUCENE-9322. This patch fully supports indexing and reading 
vectors with both an iterator and a random-access API. Support for search based on 
an NSW graph implementation is intended to follow soon, but I wanted to include 
the vector APIs that I needed to get that working, even though they are not yet 
used here, so eg it includes the definition of a scoring function and a 
nearest-neighbors search API, but no implementation of search yet. My intention 
is to keep the ANN implementation hidden, so graphs and other supporting data 
structures (eg we might want to support LSH or k-means clustering and so on) 
would be implementation details invoked by a configuration on the 
VectorField/VectorValues. At the moment you can specify a ScoringFunction, and 
it is implicit that NSW will be the result. In the future we could add another 
parameter to ScoringFunction and/or new functions to represent support for 
other algorithms.
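
   To make the idea of a scoring function concrete, here is a minimal sketch 
of what such a definition could look like. It is not the code in this patch: 
the name ScoringFunctionSketch, its constants, the score() method and the 
checkDims() helper are all hypothetical, and the Euclidean variant is only an 
assumed second option. Both functions are arranged so that a larger value 
means a closer match.

```java
/** Hypothetical sketch only; names and constants are not taken from the patch. */
enum ScoringFunctionSketch {

  /** Dot product: larger means closer, and the result can be negative. */
  DOT_PRODUCT {
    @Override
    float score(float[] a, float[] b) {
      checkDims(a, b);
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
      }
      return sum;
    }
  },

  /** Negated squared Euclidean distance, so that larger still means closer. */
  EUCLIDEAN {
    @Override
    float score(float[] a, float[] b) {
      checkDims(a, b);
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        float diff = a[i] - b[i];
        sum += diff * diff;
      }
      return -sum;
    }
  };

  /** Scores two vectors of equal dimension. */
  abstract float score(float[] a, float[] b);

  /** Rejects vectors whose dimensions differ. */
  static void checkDims(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "dimension mismatch: " + a.length + " != " + b.length);
    }
  }
}
```

   For example, ScoringFunctionSketch.DOT_PRODUCT.score(new float[] {1f, 2f}, 
new float[] {3f, -4f}) evaluates 1*3 + 2*(-4) and returns -5, which shows how 
a negative score arises.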
   
   Some open questions: 
   
   1. Should this be Lucene 9.0 only? In this patch I added Lucene90 Codec. If 
we do this then it would be awkward to backport.
   2. It seems messy to have the ScoringFunction implementation in the main 
VectorValues interface API file. I'd appreciate any better suggestion for how 
to organize this.
   3. Vector scoring can return negative numbers. I'd like to have first-class 
support for dot product distance (which can be negative) since that's what my 
consumers seem to have settled on. I don't think we need to be compatible with 
relevance scores, at least not directly in the KNN search API, but IDK maybe we 
should? We could renormalize/convert from dot-product scores to a positive 
score with math in the output layer where we return the scores. So far this is 
just a specification question as there is no implementation of search yet.
   4. I think there is room for improvement in some of the data structures used 
to map docids to dense vector ordinals and back. I'd appreciate comments on 
that, but maybe we could revisit in a fast follow-on issue?
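
   On question 3, one possible conversion from dot-product scores to a 
positive score is sketched below. It assumes the vectors are normalized to 
unit length, in which case the dot product lies in [-1, 1] and (1 + dot) / 2 
maps it onto [0, 1]. The class and method names are hypothetical, and this is 
only one option, not something the patch implements.

```java
/** Hypothetical sketch of renormalizing dot-product scores; not part of the patch. */
final class ScoreNormalizationSketch {

  private ScoreNormalizationSketch() {}

  /**
   * Maps a dot product of two unit-length vectors from [-1, 1] onto [0, 1].
   * The result is clamped because floating-point rounding can push the raw
   * dot product slightly outside [-1, 1].
   */
  static float toPositiveScore(float unitDotProduct) {
    float shifted = (1f + unitDotProduct) / 2f;
    return Math.max(0f, Math.min(1f, shifted));
  }
}
```

   With this mapping, opposite vectors (dot product -1) score 0, orthogonal 
vectors score 0.5, and identical vectors score 1. The ordering of results is 
unchanged because the mapping is monotonic.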


