atris opened a new pull request, #15613: URL: https://github.com/apache/lucene/pull/15613
Adds Lucene99SpannVectorsFormat, which implements disk-resident HNSW-IVF (SPANN) to support vector indices larger than the available heap. This PR separates the index into a coarse quantizer (centroids indexed in HNSW) and the actual data (disk-resident inverted lists).

Writer flow: Buffers vectors on the heap, then runs K-Means++ on flush (using reservoir sampling for amortised performance). Writes the centroids to the delegate format and the vector data sequentially to .spad files.

Reader flow: Performs a two-phase search. The first phase queries the HNSW graph of centroids for the nprobe nearest partitions. The second phase scans the partitions selected in the first phase on disk.

Testing strategy: Unit tests for integrity and clustering correctness, plus a recall validation test confirming that a higher nprobe retrieves better results.

Note: Merging currently uses the default implementation and requires heap proportional to segment size. Disk-based merging is a follow-up.
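
The two-phase read path described above can be sketched in a few lines. This is a minimal illustration in Python, not the PR's Java code. A brute-force scan over the centroids stands in for the HNSW lookup, in-memory lists stand in for the .spad inverted lists, and the function names (`build_partitions`, `two_phase_search`, `recall`) are hypothetical. It assumes squared Euclidean distance and takes the centroids as given rather than running K-Means++.

```python
import heapq
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_partitions(vectors, centroids):
    """Assign every vector id to its nearest centroid.

    Returns one list of ids per centroid. These lists play the role of
    the on-disk inverted lists (assumption: kept in memory here).
    """
    partitions = [[] for _ in centroids]
    for doc_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
        partitions[nearest].append(doc_id)
    return partitions


def two_phase_search(query, vectors, centroids, partitions, k, nprobe):
    """Return the ids of the k nearest vectors found in nprobe partitions."""
    # Phase 1: choose the nprobe centroids closest to the query.
    # (Brute force here; the PR delegates this step to HNSW.)
    probe = heapq.nsmallest(nprobe, range(len(centroids)),
                            key=lambda c: sq_dist(query, centroids[c]))
    # Phase 2: scan only the vectors stored in the chosen partitions.
    candidates = (doc_id for c in probe for doc_id in partitions[c])
    return heapq.nsmallest(k, candidates,
                           key=lambda d: sq_dist(query, vectors[d]))


def recall(found, exact):
    """Fraction of the exact top-k ids that the search returned."""
    return len(set(found) & set(exact)) / len(exact)


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [[rng.random() for _ in range(8)] for _ in range(1000)]
    # Assumption: randomly sampled vectors act as centroids in this demo.
    centroids = rng.sample(vectors, 16)
    partitions = build_partitions(vectors, centroids)
    query = [rng.random() for _ in range(8)]
    exact = heapq.nsmallest(10, range(len(vectors)),
                            key=lambda d: sq_dist(query, vectors[d]))
    for nprobe in (1, 4, 16):
        found = two_phase_search(query, vectors, centroids, partitions,
                                 10, nprobe)
        print(nprobe, recall(found, exact))
```

For a fixed query, raising nprobe only enlarges the candidate set, so recall in this sketch never decreases, and probing every partition reproduces the exact result. That is the property the recall validation test checks. The cost is more partitions scanned per query.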
