atris commented on issue #15612: URL: https://github.com/apache/lucene/issues/15612#issuecomment-3824970598
> > > we need to watch out for centroid drift where the HNSW node stops effectively representing the new combined cluster.

> > Right, I believe we'll have to adjust based on participating postings? Merging to the largest ensures we only need to reassign vectors from the smaller postings. With a new centroid, we'll have to run reassignment for vectors across all postings in the cluster, to maintain the NPA property. It's a tradeoff b/w re-using structures from existing segments v/s rebuilding them entirely. If postings in the cluster are far away and similarly sized, it'll likely be more optimal to create a new centroid.

Yeah, reusing the centroid saves compute but risks graph quality if the cluster shape shifts significantly. Given Ben's point about batch building, we probably just eat the cost of full reassignment to keep the graph healthy.

I'll start with the full rebuild for simplicity. If merge latency kills us, we can optimize with the "largest centroid" heuristic later.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
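For anyone following along, here is a rough sketch of the tradeoff discussed above. It is plain Python, not Lucene code: `merge_to_largest`, `merge_with_new_centroid`, and `centroid_drift` are hypothetical names made up for illustration, a posting is modeled as a `(centroid, vectors)` pair, and Euclidean distance is assumed. It only shows which vectors each strategy would have to reassign and how far the reused centroid sits from a freshly computed one.

```python
import math


def mean(vectors):
    # Component-wise mean of a non-empty list of equal-length vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def dist(a, b):
    # Euclidean distance (an assumption; the real metric may differ).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_to_largest(postings):
    # Reuse the centroid of the largest posting. Only vectors from the
    # smaller postings need to be reassigned.
    largest = max(postings, key=lambda p: len(p[1]))
    to_reassign = [v for p in postings if p is not largest for v in p[1]]
    return largest[0], to_reassign


def merge_with_new_centroid(postings):
    # Compute a fresh centroid over every vector in the cluster. All
    # vectors are candidates for reassignment (the "full rebuild" path).
    all_vectors = [v for _, vecs in postings for v in vecs]
    return mean(all_vectors), all_vectors


def centroid_drift(postings):
    # Distance between the reused centroid and the fresh one: a crude
    # signal for how badly the old node represents the merged cluster.
    reused, _ = merge_to_largest(postings)
    fresh, _ = merge_with_new_centroid(postings)
    return dist(reused, fresh)


# One large posting near the origin plus one small, distant posting.
skewed = [
    ([0.0, 0.0], [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]),
    ([10.0, 0.0], [[10.0, 0.0]]),
]

# Two similarly sized postings that are far apart.
balanced = [
    ([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]]),
    ([8.0, 0.0], [[8.0, 0.0], [8.0, 0.0]]),
]

print(centroid_drift(skewed), centroid_drift(balanced))
```

In the balanced case the reused centroid ends up farther from the merged cluster's mean, which matches the intuition in the quoted comment that far-apart, similarly sized postings favor a new centroid. A drift threshold like this could be one way to decide when the "largest centroid" shortcut is safe, though that is only a guess at how the heuristic might be gated.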
