kaivalnp commented on code in PR #15607:
URL: https://github.com/apache/lucene/pull/15607#discussion_r2742718506


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -470,12 +476,18 @@ static void popToScratch(GraphBuilderKnnCollector 
candidates, NeighborArray scra
    */
   private boolean diversityCheck(float score, NeighborArray neighbors, 
RandomVectorScorer scorer)
       throws IOException {
+    int bulkCount = 0;
+    final int bulkScoreChunk = Math.min((neighbors.nodes().length + 1) / 2, 
bulkScoreNodes.length);

Review Comment:
   IIUC 
[`neighbors.nodes()`](https://github.com/apache/lucene/blob/6a6b753e3080725921a07b0a214963d8ff639eea/lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java#L206-L213)
 returns the [internal 
array](https://github.com/apache/lucene/blob/6a6b753e3080725921a07b0a214963d8ff639eea/lucene/core/src/java/org/apache/lucene/internal/hppc/IntArrayList.java#L45-L49)
 used to store neighbor nodes, which can be larger than actual number of 
neighbors and is exponentially growing -- can we have edge cases where the 
length is about twice the actual number of neighbors, causing us to bulk score 
everything?
   
   I wonder if we should use `neighbors.size()` instead?



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -470,12 +476,18 @@ static void popToScratch(GraphBuilderKnnCollector 
candidates, NeighborArray scra
    */
   private boolean diversityCheck(float score, NeighborArray neighbors, 
RandomVectorScorer scorer)
       throws IOException {
+    int bulkCount = 0;
+    final int bulkScoreChunk = Math.min((neighbors.nodes().length + 1) / 2, 
bulkScoreNodes.length);
     for (int i = 0; i < neighbors.size(); i++) {

Review Comment:
   Looks like we don't need to perform any operation per-node of this loop, can 
we reduce some iterations using something like:
   
   ```java
   private boolean diversityCheck(float score, NeighborArray neighbors, 
RandomVectorScorer scorer)
       throws IOException {
     final int chunk = Math.min(Math.ceilDiv(neighbors.size(), 2), 
bulkScoreNodes.length);
     for (int start = 0; start < neighbors.size(); start += chunk) {
       int length = Math.min(neighbors.size() - start, chunk);
       System.arraycopy(neighbors.nodes(), start, bulkScoreNodes, 0, length);
       if (scorer.bulkScore(bulkScoreNodes, bulkScores, length) >= score) {
         return false;
       }
     }
     return true;
   }
   ```



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -67,6 +67,8 @@ public class HnswGraphBuilder implements HnswBuilder {
   protected final int M; // max number of connections on upper layers
   private final double ml;
 
+  private final int[] bulkScoreNodes; // for bulk scoring
+  private final float[] bulkScores; // for bulk scoring

Review Comment:
   I was worried about thread-safety of these arrays (given we have concurrent 
merging), but from [this 
comment](https://github.com/apache/lucene/blob/6a6b753e3080725921a07b0a214963d8ff639eea/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L39-L47)
 it looks like instances of this class are not shared across threads, but 
rather multiple instances of this class (across different threads) can operate 
on a single `HnswGraph`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to