vigyasharma commented on issue #15612: URL: https://github.com/apache/lucene/issues/15612#issuecomment-3800886217
@benwtrent, thanks for the feedback. Lots of insights to unpack here!

> One thing that I think is critical, the vectors in the postings lists must be quantized (I would think significantly so, e.g. 1 or 2 bit). Otherwise we are doubling the index size as to gain disk locality, vectors must be inline with the postings lists.

This peels back a critical layer of detail. We certainly want disk locality in postings. A simple first step could be to double-store vectors: full precision in the flat format and quantized in the posting lists. We can then reorder hits using the full-precision vectors after collecting them from the postings.

However, it makes me wonder if we really need the flat format in its current form. What if we assign ordinals only to the centroids, and keep raw vectors only in their respective posting lists, optimized for disk locality? Once we identify a posting, we do a full scan. There's no jumping around between neighbors to randomly access vectors, as there is with graphs, so maybe we don't need ordinals for these vectors?

This implies a new format for the raw vectors, where centroids are assigned ordinals (required by their HNSW graph) and we keep a mapping from each centroid ordinal to the address of its posting list. Raw vectors are stored only in their respective posting lists. This also means we don't get random access to raw vectors, only to the centroids, which is probably okay?

> As for using HNSW as the centroid indexing structure, this feels natural. However, one other thing to consider is doing a "hybrid-hnsw", where the higher levels are coarser centroids, and the bottom layer is the finest centroids. But I think that is probably after moving on from regular kmeans (to hierarchical clustering or something).

Agreed, we'll learn from experiments.

> For postings overspilling, we need to be careful there. Doing it to the naively next nearest has shown to be pretty bad (https://arxiv.org/pdf/2601.07183).

Very interesting, thanks for sharing!
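
To make the "ordinals only for centroids" idea concrete, here is a rough Python sketch, not Lucene code. Everything in it is an assumption for illustration: the names (`PostingAddress`, `write_postings`, `scan_posting`, `search`) are hypothetical, the on-disk layout is a simple packed `(doc_id, float32 vector)` record, centroid selection is brute force standing in for the HNSW graph over centroids, and vectors are stored at full precision rather than quantized.

```python
import struct
from dataclasses import dataclass


@dataclass
class PostingAddress:
    offset: int  # byte offset of the posting list inside the blob
    count: int   # number of (doc_id, vector) entries in the posting list


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def write_postings(postings, dim):
    """Serialize posting lists back to back so each one is contiguous.

    postings[c] is the list of (doc_id, vector) pairs assigned to centroid
    ordinal c. Returns (blob, addresses) with addresses[c] -> PostingAddress.
    Only centroids get ordinals; raw vectors live solely inside the blob.
    """
    entry = struct.Struct(f"<i{dim}f")
    blob = bytearray()
    addresses = []
    for entries in postings:
        addresses.append(PostingAddress(len(blob), len(entries)))
        for doc_id, vec in entries:
            blob += entry.pack(doc_id, *vec)
    return bytes(blob), addresses


def scan_posting(blob, address, dim, query):
    """Sequentially scan one posting list; no per-vector random access."""
    entry = struct.Struct(f"<i{dim}f")
    hits = []
    for i in range(address.count):
        fields = entry.unpack_from(blob, address.offset + i * entry.size)
        doc_id, vec = fields[0], fields[1:]
        hits.append((sq_dist(vec, query), doc_id))
    return hits


def search(centroids, blob, addresses, dim, query, n_probe, k):
    """Pick the n_probe nearest centroids, then fully scan their postings."""
    # Brute-force centroid ranking; a real index would use a graph here.
    nearest = sorted(
        range(len(centroids)), key=lambda c: sq_dist(centroids[c], query)
    )[:n_probe]
    hits = []
    for c in nearest:
        hits.extend(scan_posting(blob, addresses[c], dim, query))
    hits.sort()
    return [doc_id for _, doc_id in hits[:k]]
```

The only random access in this sketch is `addresses[c]`, keyed by centroid ordinal; everything inside a posting list is read front to back, which is the disk-locality property discussed above.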
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
