atris commented on issue #15612: URL: https://github.com/apache/lucene/issues/15612#issuecomment-3824686268
@benwtrent @vigyasharma Running the baselines on the single-segment flush now (using the PR branch).

One thought while those churn: Ben is spot on about the quantization. We definitely need int8 (or 4-bit) for the posting lists to keep I/O in check. The trade-off is that once we quantize, we pretty much force a re-ranking phase unless we accept a lower recall ceiling. I'm keeping raw floats in the PR for now just to establish the graph structure, but we can drop a decoder into the SpannVectorsReader pretty easily later.

Also, regarding the "Largest Centroid" merge idea: it’s smart for speed, but we need to watch out for centroid drift, where the HNSW node stops effectively representing the new combined cluster. We might need a lightweight re-centering step during merge even if we don't fully re-cluster.

Anyway, let's see what the baseline numbers say first.
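To make the quantize-then-re-rank trade-off concrete, here is a minimal sketch, not the PR's code. It assumes symmetric int8 scalar quantization with one global scale and dot-product scoring; the class `Int8PostingSketch` and all of its methods are hypothetical names used only for illustration, and a real posting-list decoder would look different.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not part of Lucene or the PR. */
final class Int8PostingSketch {

  /** One global scale so that the largest absolute component maps to 127. */
  static float scaleFor(float[][] vectors) {
    float maxAbs = 0f;
    for (float[] v : vectors) {
      for (float x : v) {
        maxAbs = Math.max(maxAbs, Math.abs(x));
      }
    }
    return maxAbs == 0f ? 1f : maxAbs / 127f;
  }

  /** Symmetric scalar quantization of one vector into [-127, 127]. */
  static byte[] quantize(float[] v, float scale) {
    byte[] q = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      int r = Math.round(v[i] / scale);
      q[i] = (byte) Math.max(-127, Math.min(127, r));
    }
    return q;
  }

  /** Approximate dot product computed against the int8 form. */
  static float approxDot(float[] query, byte[] q, float scale) {
    float sum = 0f;
    for (int i = 0; i < query.length; i++) {
      sum += query[i] * q[i];
    }
    return sum * scale;
  }

  /** Exact dot product against the raw float vector. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /**
   * Phase 1: rank everything by the cheap int8 score and keep a shortlist.
   * Phase 2: re-rank only the shortlist with exact float scores.
   * Returns the ordinals of the top k vectors after re-ranking.
   */
  static int[] search(float[] query, float[][] raw, byte[][] quantized,
                      float scale, int candidates, int k) {
    int n = raw.length;
    float[] approx = new float[n];
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      approx[i] = approxDot(query, quantized[i], scale);
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> Float.compare(approx[b], approx[a]));

    int shortlist = Math.min(candidates, n);
    Integer[] top = Arrays.copyOf(order, shortlist);
    float[] exact = new float[n];
    for (int id : top) {
      exact[id] = dot(query, raw[id]);
    }
    Arrays.sort(top, (a, b) -> Float.compare(exact[b], exact[a]));

    int[] result = new int[Math.min(k, shortlist)];
    for (int i = 0; i < result.length; i++) {
      result[i] = top[i];
    }
    return result;
  }
}
```

In this sketch the shortlist size (`candidates`) is the knob that trades extra exact-scoring work for recall: setting it equal to `k` amounts to skipping the re-rank and accepting the quantized ordering.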

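As a rough illustration of what a lightweight re-centering step could look like, the sketch below combines two centroids as a size-weighted mean and measures how far the kept ("largest") centroid sits from that result. This is an assumption about one possible approach, not what the PR does: `CentroidMergeSketch`, `weightedMean`, and `drift` are hypothetical names, and the weighted mean equals the true mean of the combined cluster only if each input centroid is the exact mean of its own members.

```java
/** Hypothetical illustration only; not part of Lucene or the PR. */
final class CentroidMergeSketch {

  /** Size-weighted mean of two centroids; no pass over the member vectors is needed. */
  static float[] weightedMean(float[] a, int countA, float[] b, int countB) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    if (countA < 0 || countB < 0 || countA + countB == 0) {
      throw new IllegalArgumentException("cluster sizes must be non-negative and not both zero");
    }
    float total = (float) countA + countB;
    float[] out = new float[a.length];
    for (int i = 0; i < a.length; i++) {
      out[i] = (a[i] * countA + b[i] * countB) / total;
    }
    return out;
  }

  /** Euclidean distance between the kept centroid and the re-centered one. */
  static float drift(float[] kept, float[] recentered) {
    double sum = 0;
    for (int i = 0; i < kept.length; i++) {
      double d = kept[i] - recentered[i];
      sum += d * d;
    }
    return (float) Math.sqrt(sum);
  }
}
```

A merge could compare `drift` against some threshold (to be chosen empirically) and move the centroid only when the kept one has wandered too far from the combined cluster.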