krickert commented on PR #15676:
URL: https://github.com/apache/lucene/pull/15676#issuecomment-3865221804
# Refactored Implementation: Floor-Score Sharing & Real-World Scaling Analysis
Thanks for the feedback. I've refactored the `CollaborativeKnnCollector` to
share the k-th best score (floor) instead of the absolute best, and validated
it against real-world 1024-dimension embeddings from classic literature (73k
sentences).
### 1. Key Change: Floor-Score Accumulation
The previous implementation accumulated every collected doc's score into the
`LongAccumulator(Math::max)`, which converged to the **single best score**
across all segments. That was too aggressive: it cut visits by ~75% but dropped
recall to 0.32.
The fix: only accumulate the **floor score** (k-th best) once the local
queue is full:
```java
@Override
public boolean collect(int docId, float similarity) {
  boolean collected = super.collect(docId, similarity);
  if (collected) {
    float floorScore = super.minCompetitiveSimilarity();
    if (floorScore > Float.NEGATIVE_INFINITY) {
      minScoreAcc.accumulate(DocScoreEncoder.encode(docId + docBase, floorScore));
    }
  }
  return collected;
}
```
This gives a gentler global bar that maintains near-perfect recall while
still enabling cross-segment pruning.
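For reviewers skimming the diff, the consuming side looks roughly like this. It is a simplified sketch, not the exact PR code: `DocScoreEncoder.toScore(long)` as the decode counterpart of the `encode(...)` call above and the accumulator's `Long.MIN_VALUE` identity are assumptions here.
```java
// Simplified sketch of how the shared floor is consumed; the decode helper name
// and the Long.MIN_VALUE identity are assumptions, not necessarily the PR's code.
@Override
public float minCompetitiveSimilarity() {
  float localFloor = super.minCompetitiveSimilarity();
  long shared = minScoreAcc.get();
  if (shared == Long.MIN_VALUE) {
    // No segment has filled its local queue yet; only the local bar applies.
    return localFloor;
  }
  // A candidate must clear whichever bar is higher, local or shared.
  return Math.max(localFloor, DocScoreEncoder.toScore(shared));
}
```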
### 2. Real-World Validation (73k Embeddings, 1024 Dimensions)
Using real embeddings from War and Peace and other classic literature texts
(72,969 sentences, 1024-dim, DOT_PRODUCT), we swept across segment counts with
k=1000:
| Segments | Docs/Seg | Std Visited | Col Visited | **Reduction** | Std Recall | Col Recall | Recall Delta |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 4 | 18,242 | 36,162 | 34,977 | **3.3%** | 0.997 | 0.995 | -0.002 |
| 8 | 9,121 | 52,977 | 51,784 | **2.3%** | 0.999 | 0.999 | +0.000 |
| 16 | 4,560 | 72,960 | 72,960 | 0.0% | 1.000 | 1.000 | +0.000 |
| 32 | 2,280 | 72,960 | 72,960 | 0.0% | 1.000 | 1.000 | +0.000 |
> **SIDENOTE**: Keep in mind this test ran on a high-IOPS machine; slower machines might see larger reductions. I can publish the test data as a dependency on Maven Central if you'd like, and then we could find a way to have the monster tests run against it nightly. Let me know if your team has a standard approach for wiring a larger data set into testing.
### 3. What This Tells Us
**Within a single Lucene index, the improvement is ~3%.** That is expected:
Lucene already has an effective per-segment optimization
(`TopKnnCollectorManager.isOptimistic()`) that does proportional per-leaf-k
collection with reentry, so the collaborative accumulator adds only a modest
benefit on top of an already-optimized code path. The real test is cross-shard,
not cross-segment.
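For context, the rough idea behind `isOptimistic()` collection (illustration only, not Lucene's exact formula) is proportional per-leaf k, with a reentry pass when a leaf's local floor turns out to still be competitive after the global merge:
```java
// Illustration of the idea only (not Lucene's exact formula): each leaf initially
// collects a k proportional to its share of the index, and leaves whose local floor
// is still competitive after the global merge are revisited ("reentry").
static int proportionalPerLeafK(int k, int leafDocCount, int totalDocCount) {
  return Math.max(1, (int) Math.ceil((double) k * leafDocCount / totalDocCount));
}
```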
**I would next like to create a cross-shard (cross-index) pruning test with
a larger data set.** Lucene's built-in optimization stops at the index
boundary. In OpenSearch or Solr, each shard is a separate Lucene index on a
potentially different node. The `CollaborativeKnnCollector` + `LongAccumulator`
extends the min-competitive-score pattern across shards, which is exactly the
gap @navneet1v identified in the feedback.
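To make that concrete, the PoC wiring would look roughly like this. Sketch only: the `CollaborativeKnnCollectorManager` constructor shape is illustrative, and a real cross-node setup would need the floor propagated over the wire rather than an in-process `LongAccumulator`.
```java
import java.util.concurrent.atomic.LongAccumulator;

class CrossShardPocSketch {
  // One max-accumulator shared by every "shard" (each shard is its own Lucene index).
  // Math::max with a Long.MIN_VALUE identity means accumulated floors only raise the bar.
  static final LongAccumulator SHARED_FLOOR = new LongAccumulator(Math::max, Long.MIN_VALUE);

  // Illustrative wiring (constructor shape is not final): every shard builds its KNN
  // collectors around the same accumulator, so a floor reached on shard A immediately
  // raises the minimum competitive similarity seen by shard B.
  static CollaborativeKnnCollectorManager managerFor(int k) {
    return new CollaborativeKnnCollectorManager(k, SHARED_FLOOR);
  }
}
```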
### 4. Test Data Configuration
The real-world validation test (`TestCollaborativeHnswRealWorld`) requires a
73k-embedding binary file (`sentences_1024.bin`). The test is double-gated
(sketched below):
* `@Monster` annotation: skipped unless `-Dtests.monster=true`
* `assumeTrue` on the `tests.embeddings.dir` system property: skipped if it is not set
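For reference, the gating looks roughly like this (simplified sketch; the annotation message text is illustrative):
```java
import org.apache.lucene.tests.util.LuceneTestCase;
import org.junit.BeforeClass;

// Simplified sketch of the double gate; the @Monster message text is illustrative.
@LuceneTestCase.Monster("requires the 73k-sentence real-world embeddings file")
public class TestCollaborativeHnswRealWorld extends LuceneTestCase {

  @BeforeClass
  public static void requireEmbeddingsDir() {
    // Second gate: skip (not fail) when the data directory is not configured.
    assumeTrue(
        "set -Dtests.embeddings.dir to the directory containing sentences_1024.bin",
        System.getProperty("tests.embeddings.dir") != null);
  }
}
```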
To run:
```bash
./gradlew :lucene:core:test \
--tests "org.apache.lucene.util.hnsw.TestCollaborativeHnswRealWorld" \
-Dtests.monster=true \
-Dtests.embeddings.dir=/path/to/data \
-Dtests.verbose=true
```
The binary format is `[int totalDocs][int dim]`, then per doc: `[int
textLen][byte[] text][float[] vector]`. We plan to publish this as a
`test-lucene-embeddings` artifact under `ai.pipestream` so CI can pull it
automatically, though I'm not sure whether there's a better way; I can get it
up in the m2 repo within the hour.
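For anyone who wants to sanity-check the file before the artifact is published, a minimal reader looks roughly like this (sketch only; big-endian `DataInputStream` encoding and UTF-8 text are assumptions about the format):
```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of a sentences_1024.bin reader; big-endian DataInputStream encoding and
// UTF-8 text are assumptions about the format, not guaranteed by this PR.
class EmbeddingsFileReader {
  static void read(Path file) throws IOException {
    try (DataInputStream in =
        new DataInputStream(new BufferedInputStream(Files.newInputStream(file)))) {
      int totalDocs = in.readInt();
      int dim = in.readInt();
      for (int doc = 0; doc < totalDocs; doc++) {
        int textLen = in.readInt();
        byte[] textBytes = new byte[textLen];
        in.readFully(textBytes);
        String text = new String(textBytes, StandardCharsets.UTF_8);
        float[] vector = new float[dim];
        for (int i = 0; i < dim; i++) {
          vector[i] = in.readFloat();
        }
        // index `text` and `vector` here
      }
    }
  }
}
```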
### 5. Code notes
* `CollaborativeKnnCollector.collect()` now shares the floor score (k-th
best) instead of every collected score.
* Updated Javadoc in
`CollaborativeKnnCollector.minCompetitiveSimilarity()` to document the
Segment-0 tie-breaking design.
* Both test query classes disable Lucene's optimistic collection for fair
apples-to-apples visited-count comparison.
* Fixed `forbiddenApis` violations by specifying `Locale.ROOT` in all test
formatting.
* Real-world test uses `NoMergePolicy` + `setRAMBufferSizeMB(512)` for exact
segment-count control (see the sketch after this list).
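The segment-control setup mentioned in the last bullet is roughly this (sketch; the analyzer choice and the `docsPerSegment` flush cadence are placeholders, not the test's exact code):
```java
import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch of the exact-segment-count setup: NoMergePolicy keeps flushed segments
// from being merged away, and a large RAM buffer avoids size-triggered flushes,
// so an explicit writer.flush() every docsPerSegment docs cuts a new segment.
class SegmentControlSketch {
  static IndexWriter open(Path indexPath) throws IOException {
    IndexWriterConfig iwc =
        new IndexWriterConfig(new StandardAnalyzer())
            .setMergePolicy(NoMergePolicy.INSTANCE)
            .setRAMBufferSizeMB(512);
    Directory dir = FSDirectory.open(indexPath);
    return new IndexWriter(dir, iwc);
  }
}
```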
## Next steps
To get the cross-shard performance this PR is ultimately aimed at, the next
PoC needs to be a streaming service that hosts multiple indices, showing that
sharing the floor score improves search for very high k values. I plan to do
that anyway, because I'd like to implement this in OpenSearch (which has
another hurdle I need to propose, but I think I've figured it out).