Re: [I] Cluster Based ANN Vector Search for Lucene [lucene]

via GitHub Mon, 26 Jan 2026 09:53:42 -0800


vigyasharma commented on issue #15612:
URL: https://github.com/apache/lucene/issues/15612#issuecomment-3800886217


   @benwtrent, thanks for the feedback. Lots of insights to unpack here!
   
   > One thing that I think is critical, the vectors in the postings lists must 
be quantized (I would think significantly so, e.g. 1 or 2 bit). Otherwise we 
are doubling the index size as to gain disk locality, vectors must be inline 
with the postings lists.
   
   This gets at a critical detail. We certainly want disk locality in 
postings. A simple first step could be to store vectors twice: full precision 
in the flat format and quantized in the postings lists. We can then rerank with 
full-precision vectors after collecting hits from the postings.
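   
   A rough sketch of that two-phase flow, to make the idea concrete. 
Everything here is hypothetical and for illustration only: the class, the 
method names, the sign-bit quantizer, and the `oversample` parameter are 
assumptions, not Lucene APIs or a proposed format.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch only; none of these names are Lucene APIs. */
class TwoPhaseSearchSketch {

  /** Assumed 1-bit quantizer: one sign bit per dimension, packed into a long (dims <= 64). */
  static long quantize(float[] v) {
    long bits = 0L;
    for (int i = 0; i < v.length; i++) {
      if (v[i] > 0f) {
        bits |= 1L << i;
      }
    }
    return bits;
  }

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * postingDocs[i] is the doc id of the i-th posting entry and postingCodes[i] is its
   * quantized vector stored inline. flat[doc] is the full-precision copy in the flat store.
   */
  static List<Integer> search(float[] query, int[] postingDocs, long[] postingCodes,
      float[][] flat, int oversample, int k) {
    final long q = quantize(query);

    // Phase 1: scan the posting list, ranking entries by Hamming distance on the 1-bit codes.
    List<Integer> slots = new ArrayList<>();
    for (int i = 0; i < postingDocs.length; i++) {
      slots.add(i);
    }
    slots.sort(Comparator.comparingInt((Integer i) -> Long.bitCount(q ^ postingCodes[i])));

    // Keep an oversampled candidate set of doc ids.
    List<Integer> candidates = new ArrayList<>();
    for (int i = 0; i < Math.min(oversample, slots.size()); i++) {
      candidates.add(postingDocs[slots.get(i)]);
    }

    // Phase 2: rerank the candidates with full-precision vectors from the flat store.
    candidates.sort(
        Comparator.comparingDouble((Integer doc) -> squaredDistance(query, flat[doc])));
    return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
  }
}
```

   Only phase 2 touches the flat store, and only for the oversampled 
candidates, so the random reads are bounded by `oversample` per posting list.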
   
   However, it makes me wonder if we really need the flat format in its current 
form. What if we assign ordinals only to the centroids and keep raw vectors 
only in their respective posting lists, optimized for disk locality? Once we 
identify a posting list, we do a full scan of it. There's no jumping around 
neighbors to randomly access vectors as in graphs, so maybe we don't need 
ordinals for these vectors?
   
   This implies we'll have a new format for the raw vectors where centroids are 
assigned ordinals (required by their HNSW graph), and we keep a mapping from 
centroid ordinals to the address of each posting list. Raw vectors are stored 
only in their respective posting lists. This also means we don't get random 
access on raw vectors, only on the centroids, which is probably okay?
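   
   As a minimal in-memory sketch of that layout (hypothetical names 
throughout, not a Lucene codec): centroids are addressed by ordinal, a 
`postingStart` array maps each centroid ordinal to where its posting list 
begins, and raw vectors live only inside the posting data. The brute-force 
`nearestCentroid` is an assumption standing in for the centroid HNSW graph.

```java
/** Hypothetical layout sketch only; not a Lucene codec or API. */
class CentroidPostingsSketch {
  final int dims;
  final float[][] centroids;    // indexed by centroid ordinal
  final int[] postingStart;     // centroid ordinal -> first entry of its posting list
  final int[] postingDocs;      // doc ids, grouped by centroid
  final float[] postingVectors; // raw vectors inline, in the same order as postingDocs

  CentroidPostingsSketch(float[][] centroids, float[][] docVectors) {
    this.centroids = centroids;
    this.dims = centroids[0].length;

    // Assign each doc to its nearest centroid and count posting sizes.
    int[] assignment = new int[docVectors.length];
    postingStart = new int[centroids.length + 1];
    for (int doc = 0; doc < docVectors.length; doc++) {
      assignment[doc] = nearestCentroid(docVectors[doc]);
      postingStart[assignment[doc] + 1]++;
    }
    // Prefix sums turn the counts into start addresses.
    for (int c = 0; c < centroids.length; c++) {
      postingStart[c + 1] += postingStart[c];
    }

    // Write doc ids and raw vectors contiguously per posting list.
    postingDocs = new int[docVectors.length];
    postingVectors = new float[docVectors.length * dims];
    int[] next = postingStart.clone();
    for (int doc = 0; doc < docVectors.length; doc++) {
      int slot = next[assignment[doc]]++;
      postingDocs[slot] = doc;
      System.arraycopy(docVectors[doc], 0, postingVectors, slot * dims, dims);
    }
  }

  static float squaredDistance(float[] a, int aOffset, float[] b, int dims) {
    float sum = 0f;
    for (int i = 0; i < dims; i++) {
      float d = a[aOffset + i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Brute force here; an assumption standing in for the HNSW graph over centroids. */
  int nearestCentroid(float[] v) {
    int best = 0;
    float bestDist = Float.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      float d = squaredDistance(centroids[c], 0, v, dims);
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }

  /** Sequential full scan of one posting list; returns the closest doc id, or -1 if empty. */
  int scanPosting(int centroidOrd, float[] query) {
    int bestDoc = -1;
    float bestDist = Float.MAX_VALUE;
    for (int slot = postingStart[centroidOrd]; slot < postingStart[centroidOrd + 1]; slot++) {
      float d = squaredDistance(postingVectors, slot * dims, query, dims);
      if (d < bestDist) {
        bestDist = d;
        bestDoc = postingDocs[slot];
      }
    }
    return bestDoc;
  }
}
```

   Note that nothing in this sketch can look up a raw vector by doc id; the 
only random access is by centroid ordinal, which matches the trade-off 
described above.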
   
   > As for using HNSW as the centroid indexing structure, this feels natural. 
However, one other thing to consider is doing a "hybrid-hnsw", where the higher 
levels are coarser centroids, and the bottom layer is the finest centroids. But 
I think that is probably after moving on from regular kmeans (to hierarchical 
clustering or something).
   
   Agreed, we'll learn from experiments.
   
   > For postings overspilling, we need to be careful there. Doing it to the 
naively next nearest has shown to be pretty bad 
(https://arxiv.org/pdf/2601.07183).
   
   Very interesting, thanks for sharing!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


