benwtrent opened a new pull request, #12413: URL: https://github.com/apache/lucene/pull/12413
We have some weird behavior in the HNSW searcher when finding the candidate entry point for the zeroth layer.

While trying to find the best entry point from which to gather the full candidate set, we don't filter based on the `acceptableOrds` bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset.

Luckily, since the results are marked as incomplete, the `*VectorQuery` logic switches back to an exact scan and throws away the results.

However, any user who called the leaf searcher directly, bypassing the query, could run into this bug.

I ran performance tests and there were no significant latency increases. There do seem to be observable latency decreases at higher `maxConn` levels, though.

I am getting slightly different recall. It is usually better by 0.001, but worse by `0.001` on glove 100 with

```
nDoc fanout maxConn beamWidth
100000 20 96 500 120
```

so I am digging into why that may be. Any help there is appreciated.

Data (lucene util knnPerf):

```
dim = 100
doc_vectors = constants.GLOVE_VECTOR_DOCS_FILE
query_vectors = '%s/util/tasks/vector-task-100d.vec' % constants.BASE_DIR
```

Settings run:

```
VALUES = {
    'ndoc': (100000,),
    'maxConn': (32, 96),
    'beamWidthIndex': (250, 500,),
    'fanout': (20, 100,),
}
```
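For readers unfamiliar with the failure mode, here is a toy sketch of it. It is not Lucene's HNSW searcher code and not the change made in this PR. The function name `search_upper_layers`, the dict-based graph, the `accept` set (standing in for the accepted-ordinals bitset), and the `visit_limit` parameter are all hypothetical names chosen for illustration. The sketch descends greedily through the layers above layer 0 without filtering, because an unaccepted node can still be a good entry point, but on an early exit it reports only nodes that pass the filter.

```
from typing import Callable, Dict, List, Sequence, Set, Tuple

Graph = Dict[int, List[int]]  # node -> neighbors on one layer (toy representation)


def search_upper_layers(
    upper_layers: Sequence[Graph],
    entry: int,
    distance: Callable[[int], float],
    accept: Set[int],
    visit_limit: int,
) -> Tuple[int, List[int], bool]:
    """Greedy descent through the layers above layer 0 (top layer first).

    Returns (entry_point, results, complete). Traversal itself is unfiltered;
    only the results reported on an early exit are checked against `accept`.
    """
    current = entry
    visited = 1  # the initial entry node counts as one visit
    for neighbors in upper_layers:
        improved = True
        while improved:
            improved = False
            for n in neighbors.get(current, ()):
                if visited >= visit_limit:
                    # Early exit before layer 0: never leak a node that the
                    # filter rejects, and flag the result as incomplete.
                    results = [current] if current in accept else []
                    return current, results, False
                visited += 1  # toy accounting: revisits are counted too
                if distance(n) < distance(current):
                    current = n
                    improved = True
    # Reached the bottom of the upper layers; layer 0 search would start here.
    return current, [], True


if __name__ == "__main__":
    dist = {0: 3.0, 1: 2.0, 2: 1.0}
    layer = {0: [1], 1: [0, 2], 2: [1]}
    # Node 1 is the best node seen before the limit, but it is not accepted.
    print(search_upper_layers([layer], 0, dist.__getitem__, {0, 2}, 2))
```

A caller that uses the leaf-level search directly would then see either accepted nodes or an empty, incomplete result, rather than relying on a higher-level query to discard unfiltered hits.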