[ 
https://issues.apache.org/jira/browse/LUCENE-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-9941.
--------------------------------------
    Resolution: Done

> ann-benchmarks results for HNSW indexing
> ----------------------------------------
>
>                 Key: LUCENE-9941
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9941
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Julie Tibshirani
>            Priority: Minor
>
> This is a continuation of LUCENE-9937, but for HNSW index performance.
> Approaches
>  * LuceneVectorsOnly: a baseline that only indexes vectors
>  * LuceneHnsw: our HNSW implementation, with a force merge to one segment
>  * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
>  * hnswlib: a C++ HNSW implementation from the author of the paper
> Datasets
>  * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, 
> comparing euclidean distance
>  * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, 
> comparing cosine similarity
> *Results on sift-128-euclidean*
>  Parameters: M=16, efConstruction=500
> {code:java}
> Approach                 Index time (sec)
> LuceneVectorsOnly              14.93
> LuceneHnsw                   3191.16
> LuceneHnswNoForceMerge       1194.31
> hnswlib                       311.09
> {code}
> *Results on glove-100-angular*
>  Parameters: M=32, efConstruction=500
> {code:java}
> Approach                  Index time (sec)
> LuceneVectorsOnly              14.17
> LuceneHnsw                   8940.41
> LuceneHnswNoForceMerge       3623.68 
> hnswlib                       587.23
> {code}
> We force merge to one segment to emulate a case where vectors aren't 
> continually being indexed. In these situations, it seems likely users would 
> force merge to optimize search speed: searching a single large graph is 
> expected to be faster than searching several small ones serially. To see how 
> long the force merge takes, we can subtract LuceneHnswNoForceMerge from 
> LuceneHnsw.
> The construction parameters match those in LUCENE-9937 and are optimized for 
> search recall + QPS instead of index speed, as I figured this would be a 
> common set-up.
> Some observations:
>  * In cases when segments are eventually force merged, we do a lot of extra 
> work building intermediate graphs that are eventually merged away. This is a 
> difficult problem, and one that's been raised in the past. As a simple step, 
> I wonder if we should not build graphs for segments that are below a certain 
> size. For sufficiently small segments, it could be a better trade-off to 
> avoid building a graph and support nearest-neighbor search through a 
> brute-force scan?
>  * Indexing is slow compared to what we're used to for other formats, even if 
> we disregard the extra work mentioned above. For sift-128-euclidean, building 
> only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 
> min.
>  * As a note, graph indexing uses ANN searches in order to add each new 
> vector to the graph. So the slower search speed between Lucene and hnswlib 
> may contribute to slower indexing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to