krickert commented on PR #15676:
URL: https://github.com/apache/lucene/pull/15676#issuecomment-3865221804

   # Refactored Implementation: Floor-Score Sharing & Real-World Scaling Analysis
   
   Thanks for the feedback. I've refactored the `CollaborativeKnnCollector` to 
share the k-th best score (floor) instead of the absolute best, and validated 
it against real-world 1024-dimension embeddings from classic literature (73k 
sentences).
   
   ### 1. Key Change: Floor-Score Accumulation
   The previous implementation accumulated every collected doc's score into the 
`LongAccumulator(Math::max)`, which converged to the **single best score** 
across all segments. That was too aggressive: it gave a ~75% reduction in 
visited nodes but only 0.32 recall.
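   
   For reference, the shared state is just a single `java.util.concurrent.atomic.LongAccumulator` 
with a max reducer, created per query and handed to every per-leaf collector; the 
`Long.MIN_VALUE` identity shown below is illustrative:
   ```java
   // One accumulator per query, shared by all per-leaf collectors.
   // The identity just needs to sort below every encoded (doc, score) value.
   LongAccumulator minScoreAcc = new LongAccumulator(Math::max, Long.MIN_VALUE);
   ```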
   
   The fix: only accumulate the **floor score** (k-th best) once the local 
queue is full:
   ```java
    @Override
    public boolean collect(int docId, float similarity) {
        boolean collected = super.collect(docId, similarity);
        if (collected) {
            // k-th best score in the local queue; NEGATIVE_INFINITY until the queue is full.
            float floorScore = super.minCompetitiveSimilarity();
            if (floorScore > Float.NEGATIVE_INFINITY) {
                // Share the local floor (not the best score) with the other segments.
                minScoreAcc.accumulate(DocScoreEncoder.encode(docId + docBase, floorScore));
            }
        }
        return collected;
    }
   ```
   This gives a gentler global bar that maintains near-perfect recall while 
still enabling cross-segment pruning.
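   
   For context, here is the other half of the exchange, i.e. how the shared floor feeds back 
into each segment's pruning bound. This is a simplified fragment of the same collector: the 
decoder name and the empty-accumulator sentinel are shown schematically, and the Segment-0 
tie-breaking is omitted:
   ```java
   @Override
   public float minCompetitiveSimilarity() {
       // Local bound: k-th best score in this segment's own queue
       // (NEGATIVE_INFINITY until the queue is full).
       float localFloor = super.minCompetitiveSimilarity();
       // Global bound: the best floor any segment of this query has published so far.
       long packed = minScoreAcc.get();
       float globalFloor =
           packed == Long.MIN_VALUE ? Float.NEGATIVE_INFINITY : DocScoreEncoder.toScore(packed);
       // Prune against the tighter (higher) of the two bounds.
       return Math.max(localFloor, globalFloor);
   }
   ```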
   
   ### 2. Real-World Validation (73k Embeddings, 1024 Dimensions)
   
   Using real embeddings from War and Peace and other classic literature texts 
(72,969 sentences, 1024-dim, DOT_PRODUCT), we swept across segment counts with 
k=1000:
   
   | Segments | Docs/Segment | Standard Visited | Collaborative Visited | **Reduction** | Standard Recall | Collaborative Recall | Recall Delta |
   | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
   | 4 | 18,242 | 36,162 | 34,977 | **3.3%** | 0.997 | 0.995 | -0.002 |
   | 8 | 9,121 | 52,977 | 51,784 | **2.3%** | 0.999 | 0.999 | +0.000 |
   | 16 | 4,560 | 72,960 | 72,960 | 0.0% | 1.000 | 1.000 | +0.000 |
   | 32 | 2,280 | 72,960 | 72,960 | 0.0% | 1.000 | 1.000 | +0.000 |
   
   > **SIDENOTE**: Keep in mind that this test was run on a high-IOPS machine; I 
suspect slower machines would see larger reductions. I can publish the test 
data as a dependency in my Maven Central repo if you'd like, and then we could 
find a way to have the monster tests run it nightly against the right data set. 
Let me know if your team has a standard way of wiring up larger data sets for 
testing.
   
   ### 3. What This Tells Us
   
   **Within a single Lucene index, the improvement is ~3%.** This is expected: 
Lucene already has an effective per-segment optimization 
(`TopKnnCollectorManager.isOptimistic()`) that does proportional per-leaf-k 
collection with reentry, so the collaborative accumulator adds only a modest 
benefit on top of an already-optimized code path. The real test is cross-shard; 
I can work on that next.
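   
   To be concrete about what "proportional per-leaf k" means: each segment's initial 
collection budget is roughly k scaled by that segment's share of the index. The helper 
below is an illustration of that idea only, not Lucene's actual implementation (which also 
reenters a leaf when it turns out to deserve a larger share):
   ```java
   // Illustration of proportional per-leaf k; not Lucene's actual code.
   static int proportionalPerLeafK(int k, int leafDocCount, long totalDocCount) {
       // Each segment initially collects a share of k proportional to its doc count,
       // with a floor of 1 so that tiny segments still collect something.
       return Math.max(1, (int) Math.ceil(k * ((double) leafDocCount / totalDocCount)));
   }
   ```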
   
   **Next, I would like to create a cross-shard (cross-index) pruning test with 
a larger data set.** Lucene's built-in optimization stops at the index 
boundary. In OpenSearch or Solr, each shard is a separate Lucene index on a 
potentially different node, and the `CollaborativeKnnCollector` + `LongAccumulator` 
combination extends the min-competitive-score pattern across shards, which is 
exactly the gap @navneet1v identified in the feedback.
   
   ### 4. Test Data Configuration
   
   The real-world validation test (`TestCollaborativeHnswRealWorld`) requires a 
73k-embedding binary file (`sentences_1024.bin`). The test is double-gated:
   *   `@Monster` annotation - skipped unless `-Dtests.monster=true`
   *   `assumeTrue` on the system property `tests.embeddings.dir` - skipped if 
not set
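   
   In code, the gating looks roughly like this (the test method name and annotation 
message below are illustrative, and the test body is omitted):
   ```java
   import org.apache.lucene.tests.util.LuceneTestCase;
   import org.apache.lucene.tests.util.LuceneTestCase.Monster;
   
   // First gate: @Monster tests only run when -Dtests.monster=true is passed.
   @Monster("requires the external 73k-sentence embeddings file")
   public class TestCollaborativeHnswRealWorld extends LuceneTestCase {
   
       public void testSegmentSweep() throws Exception {
           // Second gate: skip (rather than fail) when the data location isn't configured.
           String dir = System.getProperty("tests.embeddings.dir");
           assumeTrue("tests.embeddings.dir not set", dir != null);
           // ... load sentences_1024.bin from dir and run the segment sweep ...
       }
   }
   ```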
   
   To run:
   ```bash
   ./gradlew :lucene:core:test \
     --tests "org.apache.lucene.util.hnsw.TestCollaborativeHnswRealWorld" \
     -Dtests.monster=true \
     -Dtests.embeddings.dir=/path/to/data \
     -Dtests.verbose=true
   ```
   
   The binary format is a header of `[int totalDocs][int dim]`, followed per doc 
by `[int textLen][byte[] text][float[] vector]`. We plan to publish this as a 
`test-lucene-embeddings` artifact under `ai.pipestream` so CI can pull it 
automatically, though I'm not sure whether there's a better way; I can get it 
into the m2 repo within the hour.
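   
   Reading that format is just a sequential scan. A reader looks roughly like the 
sketch below, assuming `DataOutputStream`-style big-endian values and UTF-8 text:
   ```java
   import java.io.BufferedInputStream;
   import java.io.DataInputStream;
   import java.io.IOException;
   import java.nio.charset.StandardCharsets;
   import java.nio.file.Files;
   import java.nio.file.Path;
   
   final class EmbeddingsFileReader {
       static void read(Path file) throws IOException {
           try (DataInputStream in =
                   new DataInputStream(new BufferedInputStream(Files.newInputStream(file)))) {
               int totalDocs = in.readInt();                  // [int totalDocs]
               int dim = in.readInt();                        // [int dim]
               for (int i = 0; i < totalDocs; i++) {
                   byte[] textBytes = new byte[in.readInt()]; // [int textLen]
                   in.readFully(textBytes);                   // [byte[] text]
                   String text = new String(textBytes, StandardCharsets.UTF_8);
                   float[] vector = new float[dim];
                   for (int j = 0; j < dim; j++) {            // [float[] vector]
                       vector[j] = in.readFloat();
                   }
                   // hand (text, vector) to the indexing code here
               }
           }
       }
   }
   ```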
   
   ### 5. Code notes
   *   `CollaborativeKnnCollector.collect()` now shares the floor score (k-th 
best) instead of every collected score.
   *   Updated Javadoc in 
`CollaborativeKnnCollector.minCompetitiveSimilarity()` to document the 
Segment-0 tie-breaking design.
   *   Both test query classes disable Lucene's optimistic collection for a fair 
apples-to-apples visited-count comparison.
   *   Fixed `forbiddenApis` violations by specifying `Locale.ROOT` in all test 
formatting.
   *   Real-world test uses `NoMergePolicy` + `setRAMBufferSizeMB(512)` for 
exact segment count control.
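   
   For anyone reproducing the segment sweep locally, the writer setup is along these 
lines; the explicit per-batch `flush()` and the analyzer choice here are illustrative 
rather than the exact test code:
   ```java
   import java.io.IOException;
   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.index.NoMergePolicy;
   import org.apache.lucene.store.Directory;
   
   final class SegmentControlledIndexer {
       static void index(Directory dir, Iterable<Document> docs, int docsPerSegment)
               throws IOException {
           IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
           iwc.setMergePolicy(NoMergePolicy.INSTANCE); // keep every flushed segment as-is
           iwc.setRAMBufferSizeMB(512);                // large enough to avoid implicit flushes
           try (IndexWriter writer = new IndexWriter(dir, iwc)) {
               int inSegment = 0;
               for (Document doc : docs) {
                   writer.addDocument(doc);
                   if (++inSegment == docsPerSegment) {
                       writer.flush(); // close out the current segment explicitly
                       inSegment = 0;
                   }
               }
               writer.commit();
           }
       }
   }
   ```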
   
   ## Next steps
   
   So, to get the cross-shard performance gains this PR is ultimately aimed at, 
the next PoC needs to be a streaming service hosting multiple indices, 
demonstrating that the shared floor improves search at very high k values. I 
plan to do this anyway, because I'd like to implement it in OpenSearch (which 
has another hurdle I need to propose, but I think I've figured it out).

