ChrisHegarty commented on code in PR #14963:
URL: https://github.com/apache/lucene/pull/14963#discussion_r2447185762
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##########
@@ -116,6 +116,18 @@ public final class Lucene99HnswVectorsFormat extends
KnnVectorsFormat {
/** Default to use single thread merge */
public static final int DEFAULT_NUM_MERGE_WORKER = 1;
+ /**
+ * Threshold which HnswGraphSearcher#expectedVisitedNodes uses as k to
determine when HNSW graph
+ * building is bypassed (useful in case if frequent flushes). It is in terms
of k for a graph i.e.
+ * number of docs to match for the query. So having a graph only helps if,
+ *
+ * <pre> k << size / log(size) </pre>
+ *
+ * i.e. k is at least 1 order less than size / log(size) where size if the
number of nodes in the
+ * graph
+ */
+ public static final int HNSW_GRAPH_THRESHOLD = 100;
Review Comment:
I struggled a little with this description. And also would find it hugely
beneficial to have a centralised location describing this optimisation. I'll
make a concrete suggestion, but please take it as just that and reword/rephrase
as you see fit. And please add corrections if I missed something.
```
/**
* Minimum estimated search effort (in terms of expected visited nodes)
* required before building an HNSW graph for a segment.
*
* <p>This threshold is compared against the value produced by
* {@link HnswGraphSearcher#expectedVisitedNodes(int, int)}, which
estimates
* how many nodes would be visited during a vector search based on the
current
* graph size and {@code k} (neighbours to find).
*
* <p>If the estimated number of visited nodes falls below this threshold,
* HNSW graph construction is skipped for that segment - typically for
small
* flushes or low document count segments - since the overhead of building
* the graph would outweigh its search benefits.
*
* <p>Conceptually, this corresponds to cases where:
*
* <pre>{@code
* k << size / log(size)
* }</pre>
*
* meaning that the target {@code k} (number of nearest neighbors) is at
least
* an order of magnitude smaller than {@code size / log(size)}, where
* {@code size} is the number of nodes (vectors) in the segment.
*
* <p>Default: {@code 100}
*/
public static final int HNSW_GRAPH_THRESHOLD = 100;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]