benwtrent commented on code in PR #14963:
URL: https://github.com/apache/lucene/pull/14963#discussion_r2446034702
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene104/Lucene104HnswScalarQuantizedVectorsFormat.java:
##########
@@ -117,6 +152,7 @@ public Lucene104HnswScalarQuantizedVectorsFormat(
}
this.maxConn = maxConn;
this.beamWidth = beamWidth;
+ this.tinySegmentsThreshold = tinySegmentsThreshold;
Review Comment:
Could you apply this value to the writer? Seems like you missed that part
here.
##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene99/Lucene99HnswScalarQuantizedVectorsFormat.java:
##########
@@ -63,9 +64,24 @@ public class Lucene99HnswScalarQuantizedVectorsFormat
extends KnnVectorsFormat {
/** The format for storing, reading, merging vectors on disk */
private final FlatVectorsFormat flatVectorsFormat;
+ /**
Review Comment:
I don't think the backward_codecs need changing. This value is only applied
at `write` time, and consequently will never be used in backwards codecs.
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java:
##########
@@ -332,18 +332,18 @@ private void search(
final RandomVectorScorer scorer = scorerSupplier.get();
final KnnCollector collector =
new OrdinalTranslatedKnnCollector(knnCollector, scorer::ordToDoc);
- HnswGraph graph = getGraph(fieldEntry);
// Take into account if quantized? E.g. some scorer cost?
// Use approximate cardinality as this is good enough, but ensure we don't
exceed the graph
// size as that is illogical
+ HnswGraph graph = getGraph(fieldEntry);
int filteredDocCount = Math.min(acceptDocs.cost(), graph.size());
Bits accepted = acceptDocs.bits();
final Bits acceptedOrds = scorer.getAcceptOrds(accepted);
int numVectors = scorer.maxOrd();
boolean doHnsw = knnCollector.k() < numVectors;
// The approximate number of vectors that would be visited if we did not
filter
int unfilteredVisit =
HnswGraphSearcher.expectedVisitedNodes(knnCollector.k(), graph.size());
- if (unfilteredVisit >= filteredDocCount) {
+ if (unfilteredVisit >= filteredDocCount || graph.size() == 0) {
Review Comment:
I love how simple this is!
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java:
##########
@@ -575,16 +612,26 @@ static int distFuncToOrd(VectorSimilarityFunction func) {
throw new IllegalArgumentException("invalid distance function: " + func);
}
+ private static boolean shouldCreateGraph(int k, int numNodes) {
+ int expectedVisitedNodes = expectedVisitedNodes(k, numNodes);
+ return (numNodes > expectedVisitedNodes && expectedVisitedNodes > 0) || k
< 0;
Review Comment:
I think the `|| k < 0` is unnecessary. But if you want to exit early, I
would simply do an `if (k == 0) return true;`
It seems nicer to me for our "always build graph" to be `0`. That makes more
sense than allowing a negative number.
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##########
@@ -152,7 +177,20 @@ public Lucene99HnswVectorsFormat() {
* @param beamWidth the size of the queue maintained during graph
construction.
*/
public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
- this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+ this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null,
HNSW_GRAPH_THRESHOLD, VERSION_CURRENT);
+ }
+
+ /**
+ * Constructs a format using the given graph construction parameters.
+ *
+ * @param maxConn the maximum number of connections to a node in the HNSW
graph
+ * @param beamWidth the size of the queue maintained during graph
construction.
+ * @param tinySegmentsThreshold the value of k for the expectedVisitedNodes
heuristic, used to
+ * determine the minimum required graph nodes
Review Comment:
Can we be clearer about the allowed values and what they mean.
It seems to me that:
- `0` indicates that the graph is always built
- `> 0` indicates that the graph requires more nodes to be added before its
built
- Negative values aren't allowed.
Also, let's not say "expectedVisitedNodes" heuristic, I am not sure that is
actually documented anywhere. Let's simply say "the expected number of vector
operations to return `k` nearest neighbors of the current graph size".
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene102/Lucene102HnswBinaryQuantizedVectorsFormat.java:
##########
@@ -64,9 +65,17 @@ public class Lucene102HnswBinaryQuantizedVectorsFormat
extends KnnVectorsFormat
private final int numMergeWorkers;
private final TaskExecutor mergeExec;
+ /**
Review Comment:
We should also move this one to backwards codecs (I didn't do that yet, my
bad...). All HNSW quantized logic is now handled in the single 104 format.
Could you just update the 104 format and leave this one alone?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]