msokolov opened a new pull request #235:
URL: https://github.com/apache/lucene/pull/235


   This is a first cut at a K-nearest-neighbors query that performs the vector 
search as part of rewrite(). A couple of quirks:
   
   1. I noticed that scores for the default similarity (Euclidean) lost nearly 
all precision as distances got large. Because smaller distances must mean 
higher scores, we need to invert distances to make them compatible with Lucene 
search scores. We had been handling this by applying `exp(-distance)` to 
convert distances to scores. That's theoretically sound, but in practice any 
distance over 100 or so was underflowing to zero, making those hits 
indistinguishable. As a stopgap measure, I changed the behavior so that the 
scores returned by vector search are allowed to be negative and are set to 
`-distance` for the reverse-score (Euclidean distance) case. In theory it's OK 
for these to be negative as long as they are not used directly as Lucene result 
scores. I added a further conversion in the Query implementation here that 
simply shifts the scores by an offset derived from the minimum score *for this 
query*. This is perfectly valid for a single query, but the resulting scores 
are not comparable across queries, or even across the same query run on 
multiple indexes, so it would present problems for distributed implementations. 
I'm not sure what to do about this yet, and am looking for suggestions.
   2. There's a clever implementation (hack?!) that tries to minimize 
over-collection across multiple segments. Basically the idea is to 
optimistically collect the expected proportion of the top K based on the 
segment size (plus a margin), and then to re-run the query if we can't prove we 
exhaustively searched the segment. I think it's sound, but I'd welcome comments 
on that bit since it's a little exotic.
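
To make the first quirk concrete, here is a minimal standalone sketch (not the code in this PR). It assumes scores are held as single-precision floats and that the per-query offset is a plain subtraction of the minimum negated distance; `ScoreConversionSketch`, `expScore` and `offsetScores` are hypothetical names used only for illustration.

```java
import java.util.Arrays;

public class ScoreConversionSketch {

  // Stand-in for the exp(-distance) mapping, narrowed to float
  // (assumption: scores are stored in single precision).
  public static float expScore(float distance) {
    return (float) Math.exp(-distance);
  }

  // Hypothetical per-query conversion: negate each distance, then shift
  // by the minimum negated distance so that no score is negative.
  public static float[] offsetScores(float[] distances) {
    float[] scores = new float[distances.length];
    float min = Float.POSITIVE_INFINITY;
    for (int i = 0; i < distances.length; i++) {
      scores[i] = -distances[i];
      min = Math.min(min, scores[i]);
    }
    for (int i = 0; i < scores.length; i++) {
      scores[i] -= min;
    }
    return scores;
  }

  public static void main(String[] args) {
    // Both distances underflow to 0.0f, so the two hits cannot be ranked.
    System.out.println(expScore(150f) == expScore(200f)); // prints true

    // Offset scores keep the ordering within a single result set...
    System.out.println(Arrays.toString(offsetScores(new float[] {150f, 200f, 120f})));
    // prints [50.0, 0.0, 80.0]

    // ...but the hit at distance 150 gets a different score once the
    // farthest hit is missing (another query, or another index/shard).
    System.out.println(Arrays.toString(offsetScores(new float[] {150f, 120f})));
    // prints [0.0, 30.0]
  }
}
```

The last two calls show the comparability problem: the hit at distance 150 scores 50.0 in one result set and 0.0 in the other, purely because the minimum changed.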
   
   Finally, I don't know whether this ought to get pushed without some 
performance testing; I'll start working on that soon using luceneutil.
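
For the second quirk, the per-segment budget and the re-run check could look roughly like the sketch below. This is an illustration under stated assumptions, not the logic in this PR: the margin is taken to be a simple relative factor, and the re-run test assumes a segment must be searched again when every hit it returned made it into the merged top K (since its next, uncollected hit might have qualified too). `PerSegmentBudgetSketch`, `perSegmentK` and `needsRerun` are hypothetical names.

```java
public class PerSegmentBudgetSketch {

  // Hypothetical budget: the segment's proportional share of k, inflated by
  // a relative margin, rounded up, and clamped to the range [1, k].
  public static int perSegmentK(int k, int segmentDocs, int totalDocs, double margin) {
    double expected = (double) k * segmentDocs / totalDocs;
    int budget = (int) Math.ceil(expected * (1.0 + margin));
    return Math.max(1, Math.min(k, budget));
  }

  // Hypothetical re-run check: if the segment was searched with fewer than k
  // slots and all of them landed in the merged top K, the segment may hold
  // further competitive hits, so exhaustiveness cannot be proven.
  public static boolean needsRerun(int budget, int k, int segmentHitsInGlobalTopK) {
    return budget < k && segmentHitsInGlobalTopK >= budget;
  }

  public static void main(String[] args) {
    int k = 10;
    // A segment holding a quarter of the index, with a 50% margin.
    int budget = perSegmentK(k, 250, 1000, 0.5);
    System.out.println(budget); // prints 4

    System.out.println(needsRerun(budget, k, 4)); // prints true
    System.out.println(needsRerun(budget, k, 2)); // prints false
  }
}
```

When fewer than `budget` of a segment's hits survive the merge, its weakest collected hit was already outranked, so anything it did not collect would have been outranked as well and no second pass is needed.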


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


