[ https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378871#comment-17378871 ]
Julie Tibshirani commented on LUCENE-10016:
-------------------------------------------

I'm sorry for jumping in late -- I actually think having a parameter here to control recall makes sense and that we should keep it. I agree it'd be good to rename it to something general and not specific to HNSW though, for example in LUCENE-9322 we called it {{recallFactor}}.

Explaining my reasoning -- in the current implementation, you can indeed just scale K in order to increase recall. But many other ANN algorithms have recall-tuning parameters that can't be controlled through K. Some examples:

* ScaNN (the current leader in ann-benchmarks) is based on a quantization technique, where vectors are grouped into clusters or 'leaves'. There is a search-time parameter to control the number of leaves that are considered as candidates. This is a totally separate concept from K -- these candidates are never fully ranked against each other, to avoid unnecessary distance computations.
* Multi-probe LSH (which I think is implemented in the elastiknn plugin?) has a number of probes 'T' defining the extra number of hash buckets to check per query. This is also separate from K: it increases the initial candidate set, but not all of these vectors will be ranked and returned.

In other places we've worked hard to keep the API general enough to support other implementations, and I see keeping this parameter as part of that effort.

Not as important an example, but the HNSW algorithm also treats K as separate from its recall factor 'ef'. In the current setup, we're able to align the API with the algorithm description in the paper and its reference implementations, which I think is easier for users to understand.

> VectorReader.search needs rethought, o.a.l.search integration?
> --------------------------------------------------------------
>
>                 Key: LUCENE-10016
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10016
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Blocker
>             Fix For: 9.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> There's no search integration (e.g. queries) for the current vector values,
> no documentation/examples that I can find.
> Instead the codec has this method:
> {code}
> TopDocs search(String field, float[] target, int k, int fanout)
> {code}
> First, the "fanout" parameter needs to go, this is specific to HNSW impl, get
> it out of here.
> Second, How am I supposed to skip over deleted documents? How can I use
> filters? How should I search across multiple segments?
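
To make the comment's point concrete (a recall parameter that cannot be emulated by scaling K), here is a minimal Java sketch of a cluster-probing search. It is an illustration only: the class {{LeafProbeSketch}}, the method signature, and the {{numLeaves}} parameter are hypothetical names, not part of Lucene's API or of ScaNN. Unlike ScaNN as described above, this simplified version fully ranks every candidate in the scanned leaves; the only thing it is meant to show is that {{numLeaves}} and {{k}} are independent knobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a cluster-probing ANN search; not Lucene's API. */
class LeafProbeSketch {

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * @param centroids one centroid per leaf
     * @param leaves    leaves[i] holds the ids of the vectors assigned to leaf i
     * @param vectors   all vectors, indexed by id
     * @param k         number of results to return
     * @param numLeaves number of closest leaves to scan (the recall knob)
     */
    static int[] search(float[][] centroids, int[][] leaves, float[][] vectors,
                        float[] target, int k, int numLeaves) {
        // Order the leaves by the distance from their centroid to the target.
        Integer[] leafOrder = new Integer[centroids.length];
        for (int i = 0; i < leafOrder.length; i++) {
            leafOrder[i] = i;
        }
        Arrays.sort(leafOrder,
            Comparator.comparingDouble(i -> squaredDistance(centroids[i], target)));

        // Only vectors in the numLeaves closest leaves become candidates.
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < Math.min(numLeaves, leafOrder.length); i++) {
            for (int id : leaves[leafOrder[i]]) {
                candidates.add(id);
            }
        }

        // Rank the candidates and keep at most k of them.
        candidates.sort(
            Comparator.comparingDouble(id -> squaredDistance(vectors[id], target)));
        int n = Math.min(k, candidates.size());
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = candidates.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        float[][] vectors = {{-2f}, {2f}, {3f}, {7f}};
        float[][] centroids = {{0f}, {5f}};
        int[][] leaves = {{0, 1}, {2, 3}};
        float[] target = {2.4f};

        System.out.println(Arrays.toString(search(centroids, leaves, vectors, target, 2, 1)));
        System.out.println(Arrays.toString(search(centroids, leaves, vectors, target, 4, 1)));
        System.out.println(Arrays.toString(search(centroids, leaves, vectors, target, 2, 2)));
    }
}
```

In this toy data set the true two nearest neighbours of the target are ids 1 and 2. With {{numLeaves = 1}} the search returns ids 1 and 0 whether K is 2 or 4, because id 2 lives in a leaf that is never scanned. Only raising {{numLeaves}} to 2 recovers it.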