atris commented on issue #14997: URL: https://github.com/apache/lucene/issues/14997#issuecomment-3800120450
Thanks @mccullocht, and sorry for the delay. I've re-evaluated the architecture based on your feedback regarding centroid density constraints.

You are right: 10k vectors/partition is too coarse and risks turning search into a brute-force scan. However, moving to the paper's target density (1-in-8, or even 1-in-100) creates a massive metadata problem. For a 1B-vector index, a flat centroid array would consume ~10-100 GB of heap (for example, at 1-in-8 density, ~125M centroids of 96-dim float32 vectors alone come to 125M × 384 B ≈ 48 GB), which violates the core "Low RAM" requirement of this RFC.

To resolve this density/memory conflict, I propose pivoting to a Disk-Resident HNSW-IVF composite architecture:

- **Navigation (centroids):** Instead of a custom on-heap structure, we delegate centroid indexing to `Lucene99HnswVectorsFormat`. This gives us O(log N) navigation, allowing the format to support a high number of centroids efficiently.
- **Storage (data):** `SpannVectorsFormat` is implemented purely as the clustered storage layer. Vectors are physically reordered to be contiguous by partition ID, ensuring sequential I/O for the candidate partitions identified by Tier 1. (See the sketches below.)

This approach, wrapping the off-heap HNSW writer for centroids while handling sequential data storage ourselves, seems to be the only path that satisfies the memory, density (recall), and adoption constraints simultaneously.

I have a V1 implementation ready for review: https://github.com/apache/lucene/pull/15613

cc @jpountz @vigyasharma @benwtrent
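To make the composition concrete, here is a minimal sketch of the two-tier wiring. `SpannVectorsWriter` and `SpannVectorsReader` are hypothetical placeholders for the Tier 2 storage layer, and the sketch passes the segment state straight through where real code would likely need a distinct segment suffix for the delegate's files; it illustrates the shape of the design, not the actual PR #15613 code.

```java
// Sketch only: two-tier SPANN composition. SpannVectorsWriter/SpannVectorsReader
// are hypothetical placeholders for the clustered-storage layer described above.
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

public final class SpannVectorsFormat extends KnnVectorsFormat {

  // Tier 1: centroid navigation, delegated to the existing off-heap HNSW format.
  private final KnnVectorsFormat centroidFormat = new Lucene99HnswVectorsFormat();

  public SpannVectorsFormat() {
    super("SpannVectorsFormat");
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    // Tier 2: the SPANN writer clusters the raw vectors, rewrites them
    // contiguously by partition ID, and hands only the centroids to the
    // wrapped HNSW writer. (Real code would likely rewrite the segment
    // suffix before delegating, so the two tiers get separate files.)
    return new SpannVectorsWriter(state, centroidFormat.fieldsWriter(state));
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    // Search: HNSW identifies candidate partitions; the SPANN reader then
    // scans each matching partition sequentially.
    return new SpannVectorsReader(state, centroidFormat.fieldsReader(state));
  }
}
```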

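And here is why the partition-contiguous layout matters at search time: each candidate partition costs one seek followed by purely sequential reads. The file layout, method, and parameters below are my own illustration under the assumption of back-to-back float32 vectors, not the PR's actual reader.

```java
// Sketch only: sequential scan of one candidate partition. The layout
// (float32 vectors stored back-to-back from startByte) is a hypothetical
// illustration of the Tier 2 storage described above.
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

final class PartitionScanSketch {
  /**
   * Reads all vectors of one candidate partition: a single seek, then
   * purely sequential reads, which is the I/O pattern the physical
   * reordering by partition ID is meant to guarantee.
   */
  static float[][] readPartition(IndexInput data, long startByte, int numVectors, int dim)
      throws IOException {
    data.seek(startByte); // one seek per candidate partition...
    float[][] vectors = new float[numVectors][dim];
    for (float[] v : vectors) {
      data.readFloats(v, 0, dim); // ...then contiguous reads to the end
    }
    return vectors;
  }

  private PartitionScanSketch() {}
}
```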