atris opened a new pull request, #15613:
URL: https://github.com/apache/lucene/pull/15613

   Adds Lucene99SpannVectorsFormat, implementing Disk-Resident HNSW-IVF (SPANN) 
to support vector indices larger than available heap.
   
   This PR separates the index into a Coarse Quantizer (Centroids in HNSW) and 
the actual Data (Disk-resident inverted lists).
   
   Writer flow: Buffers vectors in heap, then runs K-Means++ on flush (using 
reservoir sampling for amortised performance). Writes centroids to the delegate 
format and vector data sequentially to .spad files.
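
   A minimal sketch of the sampling and seeding steps described above, not the PR's actual code. The class and method names (SpannWriterSketch, reservoirSample, kMeansPlusPlusSeeds) are hypothetical, and squared Euclidean distance is assumed. Only the K-Means++ seeding is shown. The refinement iterations that would follow, and the write to the delegate format and .spad files, are omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper; names are illustrative and not taken from the PR.
final class SpannWriterSketch {

  /** Algorithm R: keeps a uniform random sample of at most sampleSize buffered vectors. */
  public static List<float[]> reservoirSample(List<float[]> vectors, int sampleSize, Random rnd) {
    List<float[]> sample = new ArrayList<>(sampleSize);
    for (int i = 0; i < vectors.size(); i++) {
      if (i < sampleSize) {
        sample.add(vectors.get(i)); // fill the reservoir first
      } else {
        int j = rnd.nextInt(i + 1); // uniform in [0, i]
        if (j < sampleSize) {
          sample.set(j, vectors.get(i)); // replace with probability sampleSize / (i + 1)
        }
      }
    }
    return sample;
  }

  /** Squared Euclidean distance (an assumption; the PR may use another similarity). */
  public static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int d = 0; d < a.length; d++) {
      float diff = a[d] - b[d];
      sum += diff * diff;
    }
    return sum;
  }

  /** K-Means++ seeding: each new seed is drawn with probability proportional to D(x)^2. */
  public static List<float[]> kMeansPlusPlusSeeds(List<float[]> sample, int k, Random rnd) {
    List<float[]> centroids = new ArrayList<>(k);
    centroids.add(sample.get(rnd.nextInt(sample.size())).clone());
    float[] minDist = new float[sample.size()];
    Arrays.fill(minDist, Float.MAX_VALUE);
    while (centroids.size() < k) {
      // Update each point's distance to its nearest seed using the latest seed only.
      float[] last = centroids.get(centroids.size() - 1);
      double total = 0;
      for (int i = 0; i < sample.size(); i++) {
        minDist[i] = Math.min(minDist[i], squaredDistance(sample.get(i), last));
        total += minDist[i];
      }
      // Walk the cumulative weights until the random target is passed.
      double target = rnd.nextDouble() * total;
      int chosen = sample.size() - 1; // fallback when every weight is zero
      double cumulative = 0;
      for (int i = 0; i < sample.size(); i++) {
        cumulative += minDist[i];
        if (cumulative > target) {
          chosen = i;
          break;
        }
      }
      centroids.add(sample.get(chosen).clone());
    }
    return centroids;
  }
}
```

   Sampling before clustering keeps the cost of seeding tied to the sample size rather than to the number of buffered vectors, which is presumably the motivation for the reservoir step.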
   
   Reader flow: Performs a two-phase search. The first phase queries HNSW for the 
nprobe nearest centroids. The second phase scans the on-disk partitions that 
belong to those centroids.
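
   The two phases can be sketched as follows. This is an illustration under stated assumptions, not the PR's reader. A brute-force scan over the centroids stands in for the HNSW lookup, in-memory arrays stand in for the disk-resident partitions, squared Euclidean distance is assumed, and all names (SpannSearchSketch, nearestPartitions, search) are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch; names and data layout are illustrative, not the PR's API.
final class SpannSearchSketch {

  /** A candidate document and its distance to the query. */
  static final class Hit {
    final int docId;
    final float distance;

    Hit(int docId, float distance) {
      this.docId = docId;
      this.distance = distance;
    }
  }

  public static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int d = 0; d < a.length; d++) {
      float diff = a[d] - b[d];
      sum += diff * diff;
    }
    return sum;
  }

  /** Phase 1: the nprobe partitions whose centroids are closest (brute force, standing in for HNSW). */
  public static int[] nearestPartitions(float[][] centroids, float[] query, int nprobe) {
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(i -> squaredDistance(centroids[i], query)));
    int n = Math.max(0, Math.min(nprobe, order.length));
    int[] result = new int[n];
    for (int i = 0; i < n; i++) {
      result[i] = order[i];
    }
    return result;
  }

  /** Phase 2: scan only the selected partitions and return the k closest doc IDs, nearest first. */
  public static int[] search(
      float[][] centroids, int[][] postings, float[][] vectors, float[] query, int k, int nprobe) {
    if (k <= 0) {
      return new int[0];
    }
    // Max-heap on distance: the worst of the current top-k sits at the head.
    PriorityQueue<Hit> topK = new PriorityQueue<>((a, b) -> Float.compare(b.distance, a.distance));
    for (int partition : nearestPartitions(centroids, query, nprobe)) {
      for (int docId : postings[partition]) { // in the PR this would read from disk
        float dist = squaredDistance(vectors[docId], query);
        if (topK.size() < k) {
          topK.add(new Hit(docId, dist));
        } else if (dist < topK.peek().distance) {
          topK.poll();
          topK.add(new Hit(docId, dist));
        }
      }
    }
    int[] result = new int[topK.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = topK.poll().docId; // heap yields worst first, so fill from the back
    }
    return result;
  }
}
```

   Only the probed partitions are read, so nprobe trades disk reads against recall: a vector that lies in a partition outside the probed set cannot be returned.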
   
   Testing strategy includes unit tests for integrity, clustering correctness, 
and a recall validation test confirming that a higher nprobe yields higher 
recall.
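
   For reference, recall in a check like this is usually measured as the fraction of the exact top-k results that the approximate search also returns. A small sketch of that metric follows. The helper name (RecallSketch.recall) is hypothetical, and the PR's test may compute recall differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not taken from the PR's tests.
final class RecallSketch {

  /** Fraction of the exact top-k doc IDs that also appear in the approximate result. */
  public static double recall(int[] exactTopK, int[] approxTopK) {
    if (exactTopK.length == 0) {
      return 1.0; // nothing to find
    }
    Set<Integer> exact = new HashSet<>();
    for (int id : exactTopK) {
      exact.add(id);
    }
    int found = 0;
    for (int id : approxTopK) {
      if (exact.remove(id)) { // remove so a duplicated ID is counted only once
        found++;
      }
    }
    return (double) found / exactTopK.length;
  }
}
```

   A validation of the kind described would run the same queries at increasing nprobe values and compare this number against a brute-force baseline.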
   
   Note: Merging currently uses the default implementation and requires heap 
proportional to segment size. Disk-based merging is a follow-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

