krickert commented on PR #15676: URL: https://github.com/apache/lucene/pull/15676#issuecomment-3874904257
# Summary

There is still a lot of trial and error here, but real-world tests with properly measured recall show that considerably more testing is required to improve HNSW search. Finding the optimal collaborative search parameters requires a systematic sweep across multiple variables. The test harness will run a full combinatorial grid over K-values, shard counts, slack thresholds, and datasets to discover a "best configuration" for each scenario within this data set. The goal is to keep the formula simple: we expect K and numShards to produce a plottable pattern that yields a clean dynamic slack function.

The primary dataset is built from Wikipedia embeddings. Suggestions for additional large-scale datasets are welcome, as effective testing requires indices large enough to stress the shard boundaries. Current tests are on a 16GB index. Before running the full grid, single-index recall is being validated against [luceneutil](https://github.com/mikemccand/luceneutil) to confirm that the collaborative mechanism introduces no regression at the individual shard level.

Below is the matrix of tests I'm going to plot against the data (a sketch of the sweep loop follows the table):

| Variable  | Values                            |
|-----------|-----------------------------------|
| K         | 10, 25, 50, 100, 250, 500, 1000   |
| numShards | 2, 4, 8, 16                       |
| slack     | 0, 0.001, 0.005, 0.01, 0.02, 0.05 |
| dataset   | sentences-1024, paragraphs-1024   |
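For reference, a minimal sketch of what that combinatorial sweep looks like. This is not the actual harness; `runTrial` and the `Trial` record are hypothetical stand-ins for whatever the real benchmark driver does:

```java
import java.util.List;

/** Hypothetical sketch of the parameter sweep; runTrial stands in for the real harness. */
public class CollaborativeSearchSweep {

  record Trial(int k, int numShards, double slack, String dataset) {}

  public static void main(String[] args) {
    List<Integer> ks = List.of(10, 25, 50, 100, 250, 500, 1000);
    List<Integer> shardCounts = List.of(2, 4, 8, 16);
    List<Double> slacks = List.of(0.0, 0.001, 0.005, 0.01, 0.02, 0.05);
    List<String> datasets = List.of("sentences-1024", "paragraphs-1024");

    // Full combinatorial grid: 7 * 4 * 6 * 2 = 336 configurations.
    for (String dataset : datasets) {
      for (int numShards : shardCounts) {
        for (int k : ks) {
          for (double slack : slacks) {
            runTrial(new Trial(k, numShards, slack, dataset));
          }
        }
      }
    }
  }

  static void runTrial(Trial t) {
    // Placeholder: index the dataset across t.numShards() shards, run the query set,
    // and record merged recall, average latency, and lookups saved for plotting.
  }
}
```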
# Test results so far

Discovered that duplicate documents were behind the low recall values, which invalidated the earlier multi-shard tests; I deduped the index and redid the baselines. The results so far suggest a classic trade-off: we lose recall as the pruning gets more aggressive. The results below show what happened when I loosened it up too much: a slower search that visits more nodes instead of fewer.

## 8-Shard Simulation Results

| K    | Mode          | Merged Recall | Avg Latency | Lookups Saved (Higher is Better) |
|------|---------------|---------------|-------------|----------------------------------|
| 10   | Baseline      | 0.9690        | 55 ms       | 1,446,967                        |
| 10   | Collaborative | 0.8864        | 50 ms       | 1,455,295 (+8k saved)            |
| 100  | Baseline      | 0.9885        | 84 ms       | 1,405,451                        |
| 100  | Collaborative | 0.9900        | 96 ms       | 1,354,091 (-51k saved)           |
| 1000 | Baseline      | 0.9969        | 279 ms      | 1,251,298                        |
| 1000 | Collaborative | 0.9995        | 620 ms      | 884,881 (-366k saved)            |

## Key Insights from the Run

1. **Recall Ceiling:** On deduped data, the baseline recall is excellent (0.97+). The collaborative version at $K=10$ still sees a drop of ~8 points, but at $K \ge 100$ it maintains (and slightly exceeds) baseline recall.
2. **The "Exploration Booster" Confirmed:** At $K=1000$, the collaborative search is doing significantly more work (620 ms vs 279 ms). This is the direct result of the $0.05$ safety slack and $2k$ warm-up guard being too generous for high-density searches: the search is "over-exploring" past the natural convergence point of standard HNSW.
3. **K=10 Sweet Spot:** For small $K$, the mechanism is actually working as intended for performance (faster than baseline), but the $0.05$ slack is still occasionally cutting off critical "bridge" paths needed to match the baseline's 0.97 recall.

## Analysis and Next Steps

I'm going to focus on changing the slack value. The fixed $0.05$ slack is the primary culprit for the performance regression at large $K$; it creates an "infinite budget" effect.

**Proposal:** We should now move to the **Dynamic Slack** implementation to tighten the pruning at high $K$ and find the "knee" in the recall curve:

```
Slack = BaseSlack * sqrt(numShards / K)
```

With $BaseSlack = 0.05$ and the 8-shard setup above, this makes the slack $\approx 0.04$ at $K=10$ but drops it to $\approx 0.004$ at $K=1000$, which should drastically cut the latency for large queries while keeping the recall gains we've seen.
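For concreteness, here is a minimal sketch of that dynamic slack function under the assumptions above (BaseSlack = 0.05; the class and method names are hypothetical, not the PR's actual API):

```java
/** Hypothetical sketch of the proposed dynamic slack; names are illustrative only. */
public final class DynamicSlack {

  private static final double BASE_SLACK = 0.05; // the fixed slack used in the runs above

  /** Shrinks the slack budget as K grows, scaled up gently with shard count. */
  public static double slackFor(int numShards, int k) {
    return BASE_SLACK * Math.sqrt((double) numShards / k);
  }

  public static void main(String[] args) {
    // Reproduces the values quoted above for the 8-shard setup:
    System.out.printf("K=10:   %.4f%n", slackFor(8, 10));   // ~0.0447
    System.out.printf("K=1000: %.4f%n", slackFor(8, 1000)); // ~0.0045
  }
}
```

The square root keeps the decay gradual: small-$K$ queries retain most of their exploration budget, while large-$K$ queries, which already converge well under standard HNSW, get pruned much harder.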
