[
https://issues.apache.org/jira/browse/LUCENE-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julie Tibshirani resolved LUCENE-9941.
--------------------------------------
Resolution: Done
> ann-benchmarks results for HNSW indexing
> ----------------------------------------
>
> Key: LUCENE-9941
> URL: https://issues.apache.org/jira/browse/LUCENE-9941
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Julie Tibshirani
> Priority: Minor
>
> This is a continuation of LUCENE-9937, but for HNSW index performance.
> Approaches
> * LuceneVectorsOnly: a baseline that only indexes vectors
> * LuceneHnsw: our HNSW implementation, with a force merge to one segment
> * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
> * hnswlib: a C++ HNSW implementation from the author of the paper
> Datasets
> * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128,
> comparing euclidean distance
> * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100,
> comparing cosine similarity
> *Results on sift-128-euclidean*
> Parameters: M=16, efConstruction=500
> {code:java}
> Approach Index time (sec)
> LuceneVectorsOnly 14.93
> LuceneHnsw 3191.16
> LuceneHnswNoForceMerge 1194.31
> hnswlib 311.09
> {code}
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=500
> {code:java}
> Approach Index time (sec)
> LuceneVectorsOnly 14.17
> LuceneHnsw 8940.41
> LuceneHnswNoForceMerge 3623.68
> hnswlib 587.23
> {code}
> We force merge to one segment to emulate a case where vectors aren't
> continually being indexed. In these situations, it seems likely users would
> force merge to optimize search speed: searching a single large graph is
> expected to be faster than searching several small ones serially. To see how
> long the force merge takes, we can subtract LuceneHnswNoForceMerge from
> LuceneHnsw.
> The construction parameters match those in LUCENE-9937 and are optimized for
> search recall + QPS instead of index speed, as I figured this would be a
> common set-up.
> Some observations:
> * In cases when segments are eventually force merged, we do a lot of extra
> work building intermediate graphs that are eventually merged away. This is a
> difficult problem, and one that's been raised in the past. As a simple step,
> I wonder if we should not build graphs for segments that are below a certain
> size. For sufficiently small segments, it could be a better trade-off to
> avoid building a graph and support nearest-neighbor search through a
> brute-force scan?
> * Indexing is slow compared to what we're used to for other formats, even if
> we disregard the extra work mentioned above. For sift-128-euclidean, building
> only the final graph takes ~33 min, whereas for glove-100-angular it's ~88
> min.
> * As a note, graph indexing uses ANN searches in order to add each new
> vector to the graph. So the slower search speed between Lucene and hnswlib
> may contribute to slower indexing.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]