atris commented on issue #14997:
URL: https://github.com/apache/lucene/issues/14997#issuecomment-3800120450

   Thanks @mccullocht and sorry for the delay.
   
   I've re-evaluated the architecture based on your feedback regarding centroid 
density constraints. You are right: 10k vectors/partition is too coarse and 
risks turning search into a brute-force scan.
   
   However, moving to the paper's target density (1-in-8, or even a relaxed 1-in-100) creates a massive metadata problem. For a 1B-vector index, a flat centroid array would consume roughly 10-100 GB of heap, which violates the core "Low RAM" requirement of this RFC.
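   To make the scale concrete (my own back-of-envelope, assuming 768-dimensional float32 centroids and ignoring any per-centroid metadata):

   $$\frac{10^9}{100}\ \text{centroids} \times 768 \times 4\,\mathrm{B} \approx 30\ \mathrm{GB},$$

   and the paper's 1-in-8 density means 12.5x as many centroids again.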
   
   To solve this density/memory conflict, I propose pivoting to a disk-resident HNSW-IVF composite architecture:
   
   **Navigation (Centroids):** Instead of a custom on-heap structure, we will delegate centroid indexing to `Lucene99HnswVectorsFormat`. This gives us O(log N)-style navigation over the centroid set, allowing the format to support a large number of centroids efficiently.
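   Roughly, the glue looks like the sketch below. This is a simplification, not the PR code: `SpannVectorsWriter`/`SpannVectorsReader` are placeholders for the actual Tier-2 classes, and details such as index-file suffixing and merge handling are omitted.

```java
// Minimal sketch: a composite KnnVectorsFormat that delegates Tier-1
// centroid indexing to the stock off-heap HNSW format.
import java.io.IOException;

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

public final class SpannVectorsFormat extends KnnVectorsFormat {

  // Reuse the existing HNSW format for the centroid graph (maxConn/beamWidth TBD).
  private final KnnVectorsFormat centroidFormat = new Lucene99HnswVectorsFormat(16, 100);

  public SpannVectorsFormat() {
    super("SpannVectorsFormat");
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    // Hypothetical Tier-2 writer: clusters vectors, stores partitions contiguously,
    // and feeds the centroid vectors into the wrapped HNSW writer.
    return new SpannVectorsWriter(state, centroidFormat.fieldsWriter(state));
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    // Hypothetical Tier-2 reader: HNSW search over centroids, then sequential
    // reads of the selected partitions.
    return new SpannVectorsReader(state, centroidFormat.fieldsReader(state));
  }
}
```

   The point is that Tier 1 inherits the existing off-heap graph search and its `maxConn`/`beamWidth` tuning, rather than a bespoke parallel navigation structure.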
   
   **Storage (Data):** The proposal is to implement `SpannVectorsFormat` purely as the clustered storage layer: vectors are physically reordered to be contiguous by partition ID, so the candidate partitions identified by Tier 1 can be read with sequential I/O.
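   For the reordering itself, the core step is just a counting sort of vector ordinals by their assigned partition. An illustrative sketch (names are mine, not the PR's):

```java
// Illustrative only: group vector ordinals by partition id so each partition
// can be written contiguously and later read with one sequential scan.
final class PartitionLayout {

  final int[] ordsByPartition; // vector ordinals, grouped by partition
  final int[] partitionStart;  // slice [partitionStart[p], partitionStart[p+1]) is partition p

  PartitionLayout(int[] partitionOfOrd, int numPartitions) {
    int numVectors = partitionOfOrd.length;
    partitionStart = new int[numPartitions + 1];

    // Counting sort, pass 1: partition sizes.
    for (int p : partitionOfOrd) {
      partitionStart[p + 1]++;
    }
    // Pass 2: prefix sums turn sizes into start offsets.
    for (int p = 0; p < numPartitions; p++) {
      partitionStart[p + 1] += partitionStart[p];
    }
    // Pass 3: scatter each ordinal into its partition's slice.
    ordsByPartition = new int[numVectors];
    int[] cursor = partitionStart.clone();
    for (int ord = 0; ord < numVectors; ord++) {
      ordsByPartition[cursor[partitionOfOrd[ord]]++] = ord;
    }
  }
}
```

   At flush time the raw vectors would then be written in `ordsByPartition` order, so each candidate partition returned by Tier 1 maps to a single contiguous byte range on disk.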
   
   This approach—wrapping the off-heap HNSW writer for centroids while handling 
sequential data storage ourselves—seems to be the only path that satisfies the 
memory, density (recall), and adoption constraints simultaneously.
   
   I have a V1 implementation ready for review: 
https://github.com/apache/lucene/pull/15613
   
   cc @jpountz @vigyasharma @benwtrent 



