kaivalnp commented on code in PR #15607:
URL: https://github.com/apache/lucene/pull/15607#discussion_r2742718506
##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -470,12 +476,18 @@ static void popToScratch(GraphBuilderKnnCollector
candidates, NeighborArray scra
*/
private boolean diversityCheck(float score, NeighborArray neighbors,
RandomVectorScorer scorer)
throws IOException {
+ int bulkCount = 0;
+ final int bulkScoreChunk = Math.min((neighbors.nodes().length + 1) / 2,
bulkScoreNodes.length);
Review Comment:
IIUC
[`neighbors.nodes()`](https://github.com/apache/lucene/blob/6a6b753e3080725921a07b0a214963d8ff639eea/lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java#L206-L213)
returns the [internal
array](https://github.com/apache/lucene/blob/6a6b753e3080725921a07b0a214963d8ff639eea/lucene/core/src/java/org/apache/lucene/internal/hppc/IntArrayList.java#L45-L49)
used to store neighbor nodes, which can be larger than actual number of
neighbors and is exponentially growing -- can we have edge cases where the
length is about twice the actual number of neighbors, causing us to bulk score
everything?
I wonder if we should use `neighbors.size()` instead?
##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -470,12 +476,18 @@ static void popToScratch(GraphBuilderKnnCollector
candidates, NeighborArray scra
*/
private boolean diversityCheck(float score, NeighborArray neighbors,
RandomVectorScorer scorer)
throws IOException {
+ int bulkCount = 0;
+ final int bulkScoreChunk = Math.min((neighbors.nodes().length + 1) / 2,
bulkScoreNodes.length);
for (int i = 0; i < neighbors.size(); i++) {
Review Comment:
Looks like we don't need to perform any operation per-node of this loop, can
we reduce some iterations using something like:
```java
private boolean diversityCheck(float score, NeighborArray neighbors,
RandomVectorScorer scorer)
throws IOException {
final int chunk = Math.min(Math.ceilDiv(neighbors.size(), 2),
bulkScoreNodes.length);
for (int start = 0; start < neighbors.size(); start += chunk) {
int length = Math.min(neighbors.size() - start, chunk);
System.arraycopy(neighbors.nodes(), start, bulkScoreNodes, 0, length);
if (scorer.bulkScore(bulkScoreNodes, bulkScores, length) >= score) {
return false;
}
}
return true;
}
```
##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -67,6 +67,8 @@ public class HnswGraphBuilder implements HnswBuilder {
protected final int M; // max number of connections on upper layers
private final double ml;
+ private final int[] bulkScoreNodes; // for bulk scoring
+ private final float[] bulkScores; // for bulk scoring
Review Comment:
I was worried about thread-safety of these arrays (given we have concurrent
merging), but from [this
comment](https://github.com/apache/lucene/blob/6a6b753e3080725921a07b0a214963d8ff639eea/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L39-L47)
it looks like instances of this class are not shared across threads, but
rather multiple instances of this class (across different threads) can operate
on a single `HnswGraph`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]