Julie Tibshirani created LUCENE-10375:
-----------------------------------------
Summary: Speed up HNSW merge by writing combined vector data
Key: LUCENE-10375
URL: https://issues.apache.org/jira/browse/LUCENE-10375
Project: Lucene - Core
Issue Type: Improvement
Reporter: Julie Tibshirani
When merging segments together, the HNSW writer creates a VectorValues instance
that gives a merged view of all the segments' VectorValues. This merged
instance is used when constructing the new HNSW graph. Graph building needs
random access, and the merged VectorValues support this by mapping from merged
ordinals -> segments and segment ordinals.
This mapping seems to add overhead. The nightly indexing benchmarks sometimes
show substantial time in Arrays.binarySearch (used to map an ordinal to a
segment):
https://blunders.io/jfr-demo/indexing-1kb-vectors-2022.01.09.18.03.19/top_down_cpu_samples.
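To make the cost concrete, here is a minimal sketch of the kind of ordinal mapping described above. It is not Lucene's actual code: the class name MergedOrdinalMap and its methods are hypothetical, and it assumes every segment holds at least one vector. Each random-access lookup pays for a binary search over the segment start offsets.

```java
import java.util.Arrays;

// Hypothetical illustration only; not a Lucene class.
final class MergedOrdinalMap {
    // starts[i] is the merged ordinal of the first vector in segment i.
    private final int[] starts;

    // Assumes every entry of segmentSizes is > 0.
    MergedOrdinalMap(int[] segmentSizes) {
        starts = new int[segmentSizes.length];
        int sum = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            starts[i] = sum;
            sum += segmentSizes[i];
        }
    }

    // Returns the index of the segment that contains mergedOrd.
    int segmentOf(int mergedOrd) {
        int idx = Arrays.binarySearch(starts, mergedOrd);
        // A non-negative result means mergedOrd is the first vector of a segment.
        // Otherwise idx == -(insertionPoint) - 1, and the owning segment is the
        // one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    // Returns the ordinal of the vector within its own segment.
    int segmentOrd(int mergedOrd) {
        return mergedOrd - starts[segmentOf(mergedOrd)];
    }
}
```

With segment sizes 3, 2 and 4, merged ordinal 4 resolves to segment 1, segment ordinal 1. Graph construction performs lookups like this for every vector comparison, which is how the search can show up in a CPU profile.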
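Instead of using a merged VectorValues to create the graph, maybe we could
first write all the segment vectors to a single file, and then build the graph
from that file. Random access by merged ordinal would then be a direct offset
into the file, with no per-lookup mapping back to a segment.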
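A rough sketch of that idea, using only java.nio rather than Lucene's own I/O classes. Everything here is an assumption made for illustration: the class name CombinedVectorFile, the fixed-dimension float layout, and the little-endian encoding are all hypothetical. Vectors are appended segment by segment, so the merged ordinal becomes a plain file offset.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical illustration only; not a Lucene class.
final class CombinedVectorFile {
    private CombinedVectorFile() {}

    // Appends every vector of every segment, in segment order, to one flat file.
    // Returns the total number of vectors written.
    static int writeAll(Path file, List<float[][]> segments, int dim) throws IOException {
        int count = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
            for (float[][] segment : segments) {
                for (float[] vector : segment) {
                    if (vector.length != dim) {
                        throw new IllegalArgumentException("expected dimension " + dim);
                    }
                    buf.clear();
                    // The float view writes into buf without moving buf's own position.
                    buf.asFloatBuffer().put(vector);
                    while (buf.hasRemaining()) {
                        ch.write(buf);
                    }
                    count++;
                }
            }
        }
        return count;
    }

    // Reads the vector at mergedOrd directly: no segment lookup is needed.
    static float[] read(FileChannel ch, int mergedOrd, int dim) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        long start = (long) mergedOrd * dim * Float.BYTES;
        while (buf.hasRemaining()) {
            if (ch.read(buf, start + buf.position()) < 0) {
                throw new EOFException("ordinal " + mergedOrd + " is past the end of the file");
            }
        }
        buf.flip();
        float[] vector = new float[dim];
        buf.asFloatBuffer().get(vector);
        return vector;
    }
}
```

The trade-off in this sketch is one extra sequential pass that copies all vector data before graph building starts, in exchange for constant-time ordinal lookups during construction.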
--
This message was sent by Atlassian Jira
(v8.20.1#820001)