Adrien Grand created LUCENE-10194:
-------------------------------------

             Summary: Should IndexWriter buffer KNN vectors on disk?
                 Key: LUCENE-10194
                 URL: https://issues.apache.org/jira/browse/LUCENE-10194
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


VectorValuesWriter buffers data in memory, like we do for all data structures 
that are computed on flush. But I wonder if this is the right trade-off.

The use-case I have in mind is someone trying to load a dataset of vectors into 
Lucene. Given that HNSW graphs are super expensive to create, we'd ideally load 
that dataset into a single segment rather than many small segments that then 
need to be merged together, which in turn re-creates the HNSW graph.

Yet buffering vectors in memory is expensive. For instance, assuming 256 
dimensions stored as 4-byte floats, each vector consumes 1 kB of memory. Should 
we consider buffering vectors on disk to reduce the chances of having to create 
new segments only because the RAM buffer is full?
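
To make the memory cost concrete, here is a rough back-of-the-envelope sketch. 
It assumes each dimension is a 4-byte float and ignores all other per-document 
and per-vector overhead, so real numbers will be lower. The class and method 
names are hypothetical and not part of Lucene; the 16 MB figure is 
IndexWriterConfig's default RAM buffer size.

```java
public final class VectorBufferEstimate {

    // Assumption: every dimension is stored as a 4-byte float.
    static long bytesPerVector(int dimensions) {
        return (long) dimensions * Float.BYTES;
    }

    // Upper bound on how many raw vectors fit in a RAM buffer of the given
    // size, ignoring any other data that competes for the same buffer.
    static long vectorsPerBuffer(double ramBufferMB, int dimensions) {
        long bufferBytes = (long) (ramBufferMB * 1024 * 1024);
        return bufferBytes / bytesPerVector(dimensions);
    }

    public static void main(String[] args) {
        // 256 dimensions -> 1024 bytes per vector
        System.out.println(bytesPerVector(256));
        // 16 MB buffer -> at most 16384 such vectors before a flush
        System.out.println(vectorsPerBuffer(16.0, 256));
    }
}
```

Under these assumptions, a 16 MB buffer holds at most about 16k vectors of 256 
dimensions, so a dataset of one million vectors would be flushed into dozens 
of segments unless the buffer is raised to roughly 1 GB.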



