benwtrent commented on issue #15612: URL: https://github.com/apache/lucene/issues/15612#issuecomment-3800009611
I like the idea and agree with the incremental steps for getting it developed. Doing something like this will take time.

One thing I think is critical: the vectors in the postings lists must be quantized, and I would think significantly so (e.g. 1 or 2 bit). To gain disk locality the vectors must be inline with the postings lists, and without aggressive quantization we would be doubling the index size.

As for SPFresh, the algorithm is nice for incremental updates, but for merges in Apache Lucene we will need to do something that isn't incremental. Likely bulk updates. But I suspect rebuilding from scratch will still end up being way faster than HNSW is now :)

As for using HNSW as the centroid indexing structure, this feels natural. However, one other thing to consider is a "hybrid HNSW", where the higher levels are coarser centroids and the bottom layer is the finest centroids. But that is probably for after moving on from regular k-means (to hierarchical clustering or something).

For postings overspilling, we need to be careful there. Naively spilling to the next-nearest centroid has been shown to perform pretty badly (https://arxiv.org/pdf/2601.07183).
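To make the 1-bit case concrete, here is a minimal sketch of what sign-based 1-bit quantization could look like; the class and method names are hypothetical and illustrative, not Lucene APIs. Each dimension's sign is packed into one bit (32x smaller than float32), and vectors are compared with a Hamming-style score:

```java
// Illustrative sketch only, not Lucene code: 1-bit (sign) quantization of a
// float vector, one way postings-resident vectors could be kept small.
public class OneBitQuantizer {

  /** Pack each dimension's sign into a bit: 8 floats become 1 byte. */
  public static byte[] quantize(float[] vector) {
    byte[] packed = new byte[(vector.length + 7) / 8];
    for (int i = 0; i < vector.length; i++) {
      if (vector[i] > 0f) {
        packed[i >> 3] |= (byte) (1 << (i & 7));
      }
    }
    return packed;
  }

  /** Number of agreeing bits between two packed vectors (higher = more similar). */
  public static int similarity(byte[] a, byte[] b) {
    int differing = 0;
    for (int i = 0; i < a.length; i++) {
      differing += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return a.length * 8 - differing;
  }
}
```

At 1 bit per dimension, inlining the quantized vectors into the postings lists adds only a small fraction of the original float32 storage, which is what makes the "vectors next to postings" layout affordable.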
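For contrast with the naive next-nearest spill, one alternative is a distance-ratio rule in the spirit of SPANN's closure assignment: replicate a vector into every centroid whose distance is within a (1 + epsilon) factor of its nearest centroid, so spilling only happens when the assignment is genuinely ambiguous. A hedged sketch (class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not Lucene code: SPANN-style closure assignment.
// A vector is assigned to all centroids within (1 + epsilon) of its nearest
// distance, instead of blindly overspilling to the next-nearest centroid.
public class ClosureAssignment {

  static float squaredDistance(float[] a, float[] b) {
    float d = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      d += diff * diff;
    }
    return d;
  }

  /** Indices of all centroids whose distance is within (1 + epsilon) of the closest. */
  public static List<Integer> assign(float[] vector, float[][] centroids, float epsilon) {
    float[] dists = new float[centroids.length];
    float best = Float.MAX_VALUE;
    for (int i = 0; i < centroids.length; i++) {
      dists[i] = squaredDistance(vector, centroids[i]);
      best = Math.min(best, dists[i]);
    }
    // We work in squared distances, so square the (1 + epsilon) factor too.
    float limit = best * (1f + epsilon) * (1f + epsilon);
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < centroids.length; i++) {
      if (dists[i] <= limit) {
        result.add(i);
      }
    }
    return result;
  }
}
```

A vector sitting halfway between two centroids would land in both postings lists, while a vector clearly inside one cluster stays unreplicated, which bounds the storage overhead of the spill.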
