atris commented on issue #15612: URL: https://github.com/apache/lucene/issues/15612#issuecomment-3824848517
> The trade-off is that once we quantize, we pretty much force a re-ranking phase unless we accept a lower recall ceiling.

> I would argue this isn't strictly the case. int7/8/4 can provide 95%+ recall depending on the data.
>
> Single and double bit, possibly for smaller vectors.
>
> Also, reranking, if done well, isn't that bad. You are talking maybe 100 floating point vectors scored per Lucene segment vs. many thousands of quantized vector ops.

> Also, regarding the "Largest Centroid" merge idea: it's smart for speed, but we need to watch out for centroid drift, where the HNSW node stops effectively representing the new combined cluster. We might need a lightweight re-centering step during merge even if we don't fully re-cluster.

> I think shifting the centroids around will likely be required over time. But I would assume that batch building IVF/Spann/whatever is way faster and easier to do than HNSW over the individual vectors. ;)

Agreed on int8. The random seek overhead of the graph traversal dwarfs the linear scan for re-ranking anyway.

On the merge/rebuild: it definitely solves the drift, but the catch is the shuffle. We can't stream the merge if we want contiguous posting lists. We have to buffer/group the whole segment by centroid before writing. For large segments, that sort pressure might actually be heavier than the graph build itself.

I'll get the baseline float impl in first to settle the format, then we can tackle that buffer logic.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
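For illustration, the buffer/group step discussed above could look roughly like the sketch below: assign every vector in the merging segment to its nearest centroid and buffer doc ids per centroid, so each posting list can then be written out contiguously. This is a minimal, hypothetical sketch, not actual Lucene code; `CentroidGrouper`, its method names, and the in-heap `List` buffers are all illustrative (a real implementation would likely spill to disk for large segments).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of grouping a segment's vectors by nearest centroid
// before writing contiguous posting lists. Not a Lucene API.
class CentroidGrouper {

  // Brute-force nearest centroid by squared Euclidean distance.
  static int nearestCentroid(float[] v, float[][] centroids) {
    int best = 0;
    float bestDist = Float.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      float d = 0f;
      for (int i = 0; i < v.length; i++) {
        float diff = v[i] - centroids[c][i];
        d += diff * diff;
      }
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }

  // Buffers every doc id under its nearest centroid. Note the whole segment
  // must be grouped before any posting list can be flushed -- this is the
  // "can't stream the merge" pressure point from the discussion above.
  static List<List<Integer>> group(float[][] vectors, float[][] centroids) {
    List<List<Integer>> postings = new ArrayList<>();
    for (int c = 0; c < centroids.length; c++) {
      postings.add(new ArrayList<>());
    }
    for (int doc = 0; doc < vectors.length; doc++) {
      postings.get(nearestCentroid(vectors[doc], centroids)).add(doc);
    }
    return postings;
  }
}
```

The doc-id buffers here are what creates the sort/memory pressure mentioned above: nothing can be written until the last vector has been assigned.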
