msokolov opened a new pull request #235: URL: https://github.com/apache/lucene/pull/235
This is a first cut at an implementation of a query based on K-nearest neighbors, with the vector search being done as part of `rewrite()`. A couple of quirks:

1. I noticed that scores for the default similarity (Euclidean) lost almost all precision as distances got large. Because Euclidean distance is a reverse measure (smaller distances mean higher scores), we need to invert it to stay compatible with Lucene search scores. The way we were handling this was to apply `exp(-distance)` to convert distances to scores. That's theoretically sound, but in practice any distance over 100 or so underflowed to zero, so those hits became indistinguishable. As a stopgap measure, I changed the behavior so that the scores returned by vector search are allowed to be negative and are set to `-distance` for the reverse-score (Euclidean distance) case. In theory it's OK for these to be negative as long as they are not used directly as Lucene result scores. I added a further conversion in the Query implementation here that simply adds an offset of the minimum score *for this query*. This is perfectly valid for a single query, but the resulting scores are not comparable across queries, and indeed not even across the same query run on multiple indexes, so it would present problems for distributed implementations. I'm not sure what to do about this yet, and I'm looking for suggestions.

2. There's a clever implementation (hack?!) that tries to minimize over-collection across multiple segments. The idea is to optimistically collect from each segment the expected proportion of the top K based on the segment's size (plus a margin), and then to re-run the query on a segment if we can't prove we searched it exhaustively. I think it's sound, but I welcome comments on that bit since it's a little exotic.

Finally, I don't know whether this ought to get pushed without some performance testing; I'll start working on that soon using luceneutil.
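To make the first quirk concrete, here is a minimal sketch of the two conversions discussed above. It is not the code in this PR: the class and method names are hypothetical, it assumes scores are held as `float`, and it assumes "adding an offset of the minimum score" means shifting every `-distance` score so that the worst hit of the query lands at zero.

```java
// Hypothetical illustration only; not taken from the PR.
final class VectorScoreSketch {

  // Original approach: exp(-distance). With float scores this collapses to 0
  // once the distance passes roughly 100, so distant hits cannot be ranked.
  static float expScore(float distance) {
    return (float) Math.exp(-distance);
  }

  // Stopgap approach (as assumed here): use -distance as the raw score, then
  // shift by the minimum raw score seen for this query so nothing is negative.
  static float[] shiftedScores(float[] distances) {
    float min = Float.POSITIVE_INFINITY;
    for (float d : distances) {
      min = Math.min(min, -d);
    }
    float[] scores = new float[distances.length];
    for (int i = 0; i < distances.length; i++) {
      scores[i] = -distances[i] - min; // worst hit of this query scores 0
    }
    return scores;
  }
}
```

With distances of 150, 200 and 300, `expScore` returns 0 for all three hits, while `shiftedScores` keeps them ordered. The offset depends on the worst hit each query happens to return, which is why the shifted scores cannot be compared across queries or across indexes.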
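The second quirk can be sketched in the same spirit. Everything below is an assumption made for illustration: the class and method names, the `margin` parameter and how it is applied, and the re-run test are hypothetical and may differ from what the PR actually does. The sketch only shows the general shape of "ask each segment for its proportional share of K, then retry when the result cannot be shown to be complete".

```java
// Hypothetical illustration only; not taken from the PR.
final class SegmentKSketch {

  // Number of hits to request from one segment: its expected share of the
  // global top k, inflated by a safety margin, and never more than k.
  static int optimisticK(int k, int segmentDocs, int totalDocs, double margin) {
    double expected = (double) k * segmentDocs / totalDocs;
    int withMargin = (int) Math.ceil(expected * (1.0 + margin));
    return Math.max(1, Math.min(k, withMargin));
  }

  // Decide whether a segment must be searched again with the full k.
  // Assumed rule: if we asked for fewer than k hits, the segment filled the
  // request, and every hit it returned made the merged global top k, then it
  // may be hiding further competitive hits that we never collected.
  static boolean mustRerun(int requested, int k, int collected, int inGlobalTopK) {
    if (requested >= k) {
      return false; // the full k was already requested
    }
    if (collected < requested) {
      return false; // assumed to mean the segment had no more candidates
    }
    return inGlobalTopK == collected;
  }
}
```

For example, with `k = 10`, a segment holding 100 of 1000 documents and a margin of 0.5 would be asked for 2 hits; if both end up in the merged top 10, the segment is searched again with the full `k`.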
