[GitHub] [lucene] msokolov commented on a diff in pull request #12413: Fix HNSW graph visitation limit bug

via GitHub Wed, 05 Jul 2023 11:45:44 -0700


msokolov commented on code in PR #12413:
URL: https://github.com/apache/lucene/pull/12413#discussion_r1253495097



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -256,6 +256,72 @@ public NeighborQueue searchLevel(
     return results;
   }
 
+  /**
+   * Function to find the best entry point from which to search the zeroth graph layer.
+   *
+   * @param query vector query with which to search
+   * @param vectors random access vector values
+   * @param graph the HNSWGraph
+   * @param visitLimit How many vectors are allowed to be visited
+   * @return An integer array whose first element is the best entry point, and second is the number
+   *     of candidates visited. Entry point of `-1` indicates visitation limit exceed
+   * @throws IOException When accessing the vector fails
+   */
+  private int[] findBestEntryPoint(
+      T query, RandomAccessVectorValues<T> vectors, HnswGraph graph, int visitLimit)
+      throws IOException {
+    int size = graph.size();
+    int visitedCount = 1;
+    prepareScratchState(vectors.size());
+    final NeighborQueue results = new NeighborQueue(1, false);
+    int currentEp = graph.entryNode();
+    float currentScore = compare(query, vectors, currentEp);
+    float minAcceptedSimilarity = currentScore;
+    results.add(currentEp, currentScore);
+    for (int level = graph.numLevels() - 1; level >= 1; level--) {
+      candidates.add(currentEp, currentScore);
+      visited.set(currentEp);
+      // Keep searching the given level until we stop finding a better candidate entry point
+      while (candidates.size() > 0) {
+        // get the best candidate (closest or best scoring)
+        float topCandidateSimilarity = candidates.topScore();
+        if (topCandidateSimilarity < minAcceptedSimilarity) {
+          break;
+        }
+
+        int topCandidateNode = candidates.pop();
+        graphSeek(graph, level, topCandidateNode);
+        int friendOrd;
+        while ((friendOrd = graphNextNeighbor(graph)) != NO_MORE_DOCS) {
+          assert friendOrd < size : "friendOrd=" + friendOrd + "; size=" + size;
+          if (visited.getAndSet(friendOrd)) {
+            continue;
+          }
+          if (visitedCount >= visitLimit) {
+            return new int[] {-1, visitedCount};
+          }
+          float friendSimilarity = compare(query, vectors, friendOrd);
+          visitedCount++;
+          if (friendSimilarity >= minAcceptedSimilarity) {
+            candidates.add(friendOrd, friendSimilarity);
+            if (results.insertWithOverflow(friendOrd, friendSimilarity) && results.size() >= 1) {

Review Comment:
   It seems a little odd to preserve the ceremony of adding to a priority queue
that will always be of length 1, although I suppose this keeps open the idea of
supporting a length > 1? Maybe we would want to do that?
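
To make the alternative concrete, here is a minimal sketch of a greedy walk over a single level that tracks the best node with two scalars instead of a queue of length 1. It is not the PR's code: the class `BestEntryPointSketch`, the method `greedyLevel`, and the array-based graph and precomputed `scores` are hypothetical stand-ins for `HnswGraph` and `compare(...)`, and the loop is a plain hill climb rather than the candidate-queue traversal in the diff. It keeps the same return convention of `{entryPoint, visitedCount}` with `-1` when the visit limit is hit.

```java
/** Hypothetical sketch, not Lucene code: greedy walk on one level without a size-1 queue. */
public class BestEntryPointSketch {

  /**
   * Returns {bestNode, visitedCount}, or {-1, visitedCount} once the visit limit is reached.
   * neighbors[i] lists the neighbors of node i on this level; scores[i] is the (assumed
   * precomputed) similarity of node i to the query.
   */
  static int[] greedyLevel(int[][] neighbors, float[] scores, int entry, int visitLimit) {
    boolean[] visited = new boolean[scores.length];
    int best = entry;
    float bestScore = scores[entry];
    int visitedCount = 1; // the entry node has already been scored
    visited[entry] = true;
    boolean improved = true;
    // Keep walking until no neighbor of the current best node scores higher.
    while (improved) {
      improved = false;
      int current = best;
      for (int friend : neighbors[current]) {
        if (visited[friend]) {
          continue;
        }
        visited[friend] = true;
        if (visitedCount >= visitLimit) {
          return new int[] {-1, visitedCount};
        }
        visitedCount++;
        if (scores[friend] > bestScore) {
          // Two scalars replace results.insertWithOverflow(...) on a queue of size 1.
          best = friend;
          bestScore = scores[friend];
          improved = true;
        }
      }
    }
    return new int[] {best, visitedCount};
  }
}
```

The trade-off is the one raised in the comment: scalars are simpler, but a queue would generalize more directly if more than one entry point were ever wanted.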



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -204,26 +204,26 @@ private static <T> NeighborQueue search(
     if (initialEp == -1) {
       return new NeighborQueue(1, true);
     }
-    NeighborQueue results;
-    results = new NeighborQueue(1, false);
-    int[] eps = new int[] {graph.entryNode()};
-    int numVisited = 0;
-    for (int level = graph.numLevels() - 1; level >= 1; level--) {
-      results.clear();
-      graphSearcher.searchLevel(results, query, 1, level, eps, vectors, graph, null, visitedLimit);
-
-      numVisited += results.visitedCount();
-      visitedLimit -= results.visitedCount();
-
-      if (results.incomplete()) {
-        results.setVisitedCount(numVisited);

Review Comment:
   Wouldn't it be simpler to discard results here? There will never be more 
than one, right?
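
One way to picture "discarding results" is sketched below, under assumptions: when the visit budget runs out on an upper level, the partial entry point is dropped and only the visit count and an incomplete flag are returned. The names `VisitBudgetSketch`, `Outcome`, and `descend` are hypothetical, and each level's cost and outcome are given as plain arrays rather than computed from a real graph.

```java
/** Hypothetical sketch, not Lucene code: per-level visit budgeting that discards partial results. */
public class VisitBudgetSketch {

  /** Minimal stand-in for a search result: an entry point (or -1), a visit count, and a flag. */
  static final class Outcome {
    final int entryPoint;
    final int visitedCount;
    final boolean incomplete;

    Outcome(int entryPoint, int visitedCount, boolean incomplete) {
      this.entryPoint = entryPoint;
      this.visitedCount = visitedCount;
      this.incomplete = incomplete;
    }
  }

  /**
   * levelCosts[i] is the number of nodes level i would visit if unconstrained (highest level
   * first); levelBest[i] is the entry point that level would yield. Both are assumed inputs.
   */
  static Outcome descend(int[] levelCosts, int[] levelBest, int visitedLimit) {
    int numVisited = 0;
    int entryPoint = -1;
    for (int i = 0; i < levelCosts.length; i++) {
      int remaining = visitedLimit - numVisited;
      if (levelCosts[i] > remaining) {
        // Budget exhausted: drop the partial entry point and report only the count.
        return new Outcome(-1, visitedLimit, true);
      }
      numVisited += levelCosts[i];
      entryPoint = levelBest[i];
    }
    return new Outcome(entryPoint, numVisited, false);
  }
}
```

A caller only needs to check `incomplete` (or `entryPoint == -1`) and read `visitedCount`, so nothing from the abandoned level has to be carried forward.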



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

