benwtrent commented on code in PR #14963:
URL: https://github.com/apache/lucene/pull/14963#discussion_r2446034702


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene104/Lucene104HnswScalarQuantizedVectorsFormat.java:
##########
@@ -117,6 +152,7 @@ public Lucene104HnswScalarQuantizedVectorsFormat(
     }
     this.maxConn = maxConn;
     this.beamWidth = beamWidth;
+    this.tinySegmentsThreshold = tinySegmentsThreshold;

Review Comment:
   Could you apply this value to the writer? Seems like you missed that part 
here.



##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene99/Lucene99HnswScalarQuantizedVectorsFormat.java:
##########
@@ -63,9 +64,24 @@ public class Lucene99HnswScalarQuantizedVectorsFormat 
extends KnnVectorsFormat {
   /** The format for storing, reading, merging vectors on disk */
   private final FlatVectorsFormat flatVectorsFormat;
 
+  /**

Review Comment:
   I don't think the backward_codecs need changing. This value is only applied 
at `write` time, and consequently will never be used in backwards codecs.



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java:
##########
@@ -332,18 +332,18 @@ private void search(
     final RandomVectorScorer scorer = scorerSupplier.get();
     final KnnCollector collector =
         new OrdinalTranslatedKnnCollector(knnCollector, scorer::ordToDoc);
-    HnswGraph graph = getGraph(fieldEntry);
     // Take into account if quantized? E.g. some scorer cost?
     // Use approximate cardinality as this is good enough, but ensure we don't 
exceed the graph
     // size as that is illogical
+    HnswGraph graph = getGraph(fieldEntry);
     int filteredDocCount = Math.min(acceptDocs.cost(), graph.size());
     Bits accepted = acceptDocs.bits();
     final Bits acceptedOrds = scorer.getAcceptOrds(accepted);
     int numVectors = scorer.maxOrd();
     boolean doHnsw = knnCollector.k() < numVectors;
     // The approximate number of vectors that would be visited if we did not 
filter
     int unfilteredVisit = 
HnswGraphSearcher.expectedVisitedNodes(knnCollector.k(), graph.size());
-    if (unfilteredVisit >= filteredDocCount) {
+    if (unfilteredVisit >= filteredDocCount || graph.size() == 0) {

Review Comment:
   I love how simple this is!



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java:
##########
@@ -575,16 +612,26 @@ static int distFuncToOrd(VectorSimilarityFunction func) {
     throw new IllegalArgumentException("invalid distance function: " + func);
   }
 
+  private static boolean shouldCreateGraph(int k, int numNodes) {
+    int expectedVisitedNodes = expectedVisitedNodes(k, numNodes);
+    return (numNodes > expectedVisitedNodes && expectedVisitedNodes > 0) || k 
< 0;

Review Comment:
   I think the `|| k < 0` is unnecessary. But if you want to exit early, I 
would simply do an `if (k == 0) return true;`
   
   It seems nicer to me for our "always build graph" to be `0`. That makes more 
sense than allowing a negative number.



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##########
@@ -152,7 +177,20 @@ public Lucene99HnswVectorsFormat() {
    * @param beamWidth the size of the queue maintained during graph 
construction.
    */
   public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-    this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+    this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null, 
HNSW_GRAPH_THRESHOLD, VERSION_CURRENT);
+  }
+
+  /**
+   * Constructs a format using the given graph construction parameters.
+   *
+   * @param maxConn the maximum number of connections to a node in the HNSW 
graph
+   * @param beamWidth the size of the queue maintained during graph 
construction.
+   * @param tinySegmentsThreshold the value of k for the expectedVisitedNodes 
heuristic, used to
+   *     determine the minimum required graph nodes

Review Comment:
   Can we be clearer about the allowed values and what they mean.
   
   It seems to me that:
   
    - `0` indicates that the graph is always built
    - `> 0` indicates that the graph requires more nodes to be added before its 
built
    - Negative values aren't allowed.
   
   Also, let's not say "expectedVisitedNodes" heuristic, I am not sure that is 
actually documented anywhere. Let's simply say "the expected number of vector 
operations to return `k` nearest neighbors of the current graph size". 



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102HnswBinaryQuantizedVectorsFormat.java:
##########
@@ -64,9 +65,17 @@ public class Lucene102HnswBinaryQuantizedVectorsFormat 
extends KnnVectorsFormat
   private final int numMergeWorkers;
   private final TaskExecutor mergeExec;
 
+  /**

Review Comment:
   We should also move this one to backwards codecs (I didn't do that yet, my 
bad...). All HNSW quantized logic is now handled in the single 104 format. 
Could you just update the 104 format and leave this one alone?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to