Julie Tibshirani created LUCENE-10375:
-----------------------------------------

             Summary: Speed up HNSW merge by writing combined vector data
                 Key: LUCENE-10375
                 URL: https://issues.apache.org/jira/browse/LUCENE-10375
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Julie Tibshirani


When merging segments together, the HNSW writer creates a VectorValues instance 
that gives a merged view of all the segments' VectorValues. This merged 
instance is used when constructing the new HNSW graph. Graph building needs 
random access, and the merged VectorValues support this by mapping each merged 
ordinal to a segment and an ordinal within that segment.
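
To make the mapping concrete, here is a minimal sketch of how a merged ordinal could be resolved with a binary search over per-segment start offsets. The class and method names (MergedOrdinalMap, segmentOf, segmentOrdOf) are hypothetical and this is not the actual Lucene code. It assumes every segment is non-empty and ignores deleted documents.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not Lucene's merged VectorValues. */
final class MergedOrdinalMap {
  // starts[i] is the first merged ordinal that belongs to segment i.
  private final int[] starts;

  // Assumes every entry of segmentSizes is > 0, so starts has no duplicates.
  MergedOrdinalMap(int[] segmentSizes) {
    starts = new int[segmentSizes.length];
    int next = 0;
    for (int i = 0; i < segmentSizes.length; i++) {
      starts[i] = next;
      next += segmentSizes[i];
    }
  }

  /** Returns the index of the segment that owns the given merged ordinal. */
  int segmentOf(int mergedOrd) {
    int idx = Arrays.binarySearch(starts, mergedOrd);
    // On a miss, binarySearch returns -(insertionPoint) - 1, and the owning
    // segment is the one just before the insertion point.
    return idx >= 0 ? idx : -idx - 2;
  }

  /** Returns the ordinal of the vector within its own segment. */
  int segmentOrdOf(int mergedOrd) {
    return mergedOrd - starts[segmentOf(mergedOrd)];
  }
}
```

With segment sizes {3, 2, 4}, merged ordinal 4 resolves to segment 1, segment ordinal 1. Each random-access read during graph construction pays for one such search.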

This mapping seems to add overhead. The nightly indexing benchmarks sometimes 
show substantial time in Arrays.binarySearch (used to map an ordinal to a 
segment): 
https://blunders.io/jfr-demo/indexing-1kb-vectors-2022.01.09.18.03.19/top_down_cpu_samples.

Instead of using a merged VectorValues to create the graph, maybe we could 
first write all the segment vectors to a file, and use that file to build the 
graph.
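
As a rough sketch of the idea, assuming fixed-dimension float vectors and using plain java.nio rather than Lucene's Directory/IndexOutput APIs: once all segment vectors are written back to back, a merged ordinal maps to a file offset by arithmetic alone, with no search. The class name CombinedVectorFile and its methods are hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration only; not how Lucene stores vectors. */
final class CombinedVectorFile implements AutoCloseable {
  private final FileChannel channel;
  private final int dim;

  private CombinedVectorFile(FileChannel channel, int dim) {
    this.channel = channel;
    this.dim = dim;
  }

  /** Writes every segment's vectors back to back, in merged-ordinal order. */
  static CombinedVectorFile write(Path path, float[][][] segments, int dim)
      throws IOException {
    FileChannel ch = FileChannel.open(path,
        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
        StandardOpenOption.READ, StandardOpenOption.WRITE);
    ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
    for (float[][] segment : segments) {
      for (float[] vector : segment) { // each vector is assumed to have length dim
        buf.clear();
        buf.asFloatBuffer().put(vector); // fills buf; buf's own position stays 0
        while (buf.hasRemaining()) {
          ch.write(buf);
        }
      }
    }
    return new CombinedVectorFile(ch, dim);
  }

  /** Random access: the offset is computed directly from the merged ordinal. */
  float[] vectorValue(int mergedOrd) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
    long offset = (long) mergedOrd * dim * Float.BYTES;
    while (buf.hasRemaining()) {
      if (channel.read(buf, offset + buf.position()) < 0) {
        throw new EOFException("merged ordinal out of range: " + mergedOrd);
      }
    }
    buf.flip();
    float[] vector = new float[dim];
    buf.asFloatBuffer().get(vector);
    return vector;
  }

  @Override
  public void close() throws IOException {
    channel.close();
  }
}
```

The trade-off is an extra sequential write of all vectors up front in exchange for constant-time lookups while the graph is built.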



