benwtrent commented on issue #15612: URL: https://github.com/apache/lucene/issues/15612#issuecomment-3800009611
I like the idea and agree with the incremental steps for getting it developed. Doing something like this will take time.

One thing I think is critical: the vectors in the postings lists must be quantized, and I would think significantly so (e.g. 1 or 2 bit). To gain disk locality the vectors must be inline with the postings lists, and without aggressive quantization we would be doubling the index size.

As for SPFresh, the algorithm is nice for incremental updates, but for merges in Apache Lucene we will need to do something that isn't incremental. Likely bulk updates. But I suspect rebuilding from scratch will still end up being way faster than HNSW is now :)

As for using HNSW as the centroid indexing structure, this feels natural. However, one other thing to consider is a "hybrid HNSW", where the higher levels are coarser centroids and the bottom layer is the finest centroids. But that is probably for after moving on from regular k-means (to hierarchical clustering or something).

For postings overspilling, we need to be careful there. Naively spilling to the next-nearest centroid has been shown to perform pretty badly (https://arxiv.org/pdf/2601.07183).
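To make the 1-bit case concrete, here is a minimal sketch of what sign-based 1-bit quantization could look like; the class and method names are hypothetical and illustrative, not Lucene APIs. Each dimension's sign is packed into one bit (32x smaller than float32), and vectors are compared with a Hamming-style score:

```java
// Illustrative sketch only, not Lucene code: 1-bit (sign) quantization of a
// float vector, one way postings-resident vectors could be kept small.
public class OneBitQuantizer {

  /** Pack each dimension's sign into a bit: 8 floats become 1 byte. */
  public static byte[] quantize(float[] vector) {
    byte[] packed = new byte[(vector.length + 7) / 8];
    for (int i = 0; i < vector.length; i++) {
      if (vector[i] > 0f) {
        packed[i >> 3] |= (byte) (1 << (i & 7));
      }
    }
    return packed;
  }

  /** Number of agreeing bits between two packed vectors (higher = more similar). */
  public static int similarity(byte[] a, byte[] b) {
    int differing = 0;
    for (int i = 0; i < a.length; i++) {
      differing += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return a.length * 8 - differing;
  }
}
```

At 1 bit per dimension, inlining the quantized vectors into the postings lists adds only a small fraction of the original float32 storage, which is what makes the "vectors next to postings" layout affordable.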
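For contrast with the naive next-nearest spill, one alternative is a distance-ratio rule in the spirit of SPANN's closure assignment: replicate a vector into every centroid whose distance is within a (1 + epsilon) factor of its nearest centroid, so spilling only happens when the assignment is genuinely ambiguous. A hedged sketch (class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not Lucene code: SPANN-style closure assignment.
// A vector is assigned to all centroids within (1 + epsilon) of its nearest
// distance, instead of blindly overspilling to the next-nearest centroid.
public class ClosureAssignment {

  static float squaredDistance(float[] a, float[] b) {
    float d = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      d += diff * diff;
    }
    return d;
  }

  /** Indices of all centroids whose distance is within (1 + epsilon) of the closest. */
  public static List<Integer> assign(float[] vector, float[][] centroids, float epsilon) {
    float[] dists = new float[centroids.length];
    float best = Float.MAX_VALUE;
    for (int i = 0; i < centroids.length; i++) {
      dists[i] = squaredDistance(vector, centroids[i]);
      best = Math.min(best, dists[i]);
    }
    // We work in squared distances, so square the (1 + epsilon) factor too.
    float limit = best * (1f + epsilon) * (1f + epsilon);
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < centroids.length; i++) {
      if (dists[i] <= limit) {
        result.add(i);
      }
    }
    return result;
  }
}
```

A vector sitting halfway between two centroids would land in both postings lists, while a vector clearly inside one cluster stays unreplicated, which bounds the storage overhead of the spill.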
