I don't know the exact cause of the HNSW performance issues being discussed, but I can share a relevant experience from my own C++ implementation that might be helpful.
I recently built an HNSW implementation that was nearly identical to hnswlib, yet it achieved only about half of hnswlib's performance. After digging into the problem, I found that the primary factor was the memory access pattern. Specifically, my implementation differed in two key ways:

1. *Visited Node Tracking:* During the search_layer phase, hnswlib uses a simple array of size N to mark visited nodes. In contrast, I was using a std::unordered_set (hash set).

2. *Vector Storage:* hnswlib stores all vectors in a single, large, flat array, allowing for contiguous memory access. I, on the other hand, was using a std::map from ID to vector.

hnswlib's memory access patterns are significantly more efficient because they benefit from cache locality. When I refactored my code to adopt hnswlib's approach (a flat array for the vectors, a simple array for tracking visited nodes), my implementation's performance became nearly identical to hnswlib's. A rough sketch of both changes is below the quoted message.

Given this experience, I wonder whether the performance gap in Lucene's HNSW implementation might also be related to its memory access patterns. It could be a valuable area to investigate.

On Thu, Jun 19, 2025 at 11:40 PM Adrien Grand <jpou...@gmail.com> wrote:

> Hello all,
>
> I have been looking at this benchmark against Vespa recently:
> https://blog.vespa.ai/elasticsearch-vs-vespa-performance-comparison/.
> (The report is behind an annoying email wall, but I'm copying relevant data
> below, so hopefully you don't need to download the report.) Even though it
> uses Elasticsearch to run the benchmark, it really benchmarks Lucene
> functionality, I don't believe that Elasticsearch does anything that
> meaningfully alters the results that you would get if you were to run
> Lucene directly.
>
> The benchmark seems designed to highlight the benefits of Vespa's realtime
> design, that's fair game I guess. But it also runs some queries in
> read-only scenarios when I was expecting Lucene to perform better.
>
> One thing that got me curious is that it reports about 2x worse latency
> and throughput for pure unfiltered vector search on a force-merged index
> (so single segment/graph). Does anybody know why Lucene's HNSW may perform
> slower than Vespa's HNSW? This seems consistent with results from
> https://ann-benchmarks.com/index.html though I don't know if the cause of
> the performance difference is the same or not.
>
> For reference, here are details that apply to both Lucene and Vespa's
> vector search:
> - HNSW,
> - float32 vectors, no quantization,
> - embeddings generated using Snowflake's Arctic-embed-xs model
> - 1M docs
> - 384 dimensions,
> - dot product,
> - m = 16,
> - max connections = 200,
> - search for top 10 hits,
> - no filter,
> - single client, so no search concurrency,
> - purple column is force-merged, so single segment/graph like Vespa.
>
> I never seriously looked at Lucene's vector search performance, so I'm
> very happy to be educated if I'm making naive assumptions!
>
> Somewhat related, is this the reason why I'm seeing many threads around
> bringing 3rd party implementations into Lucene, including ones that are
> very similar to Lucene on paper? To speed up vector search?
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
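For concreteness, here is the rough shape of the two layouts after my refactor. This is my own simplified illustration, not hnswlib's actual code: the class names (FlatVectorStore, VisitedTags) are made up, and it assumes node IDs are dense integers in [0, N) and fixed-dimension vectors.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Contiguous vector storage: one flat float array, with id -> offset
// computed by multiplication. Replaces std::map<int, std::vector<float>>,
// whose nodes live scattered across the heap.
class FlatVectorStore {
 public:
  FlatVectorStore(std::size_t n, std::size_t dim) : dim_(dim), data_(n * dim) {}

  void set(std::size_t id, const float* v) {
    std::memcpy(&data_[id * dim_], v, dim_ * sizeof(float));
  }

  // Distance loops scan dim_ contiguous floats, which is friendly to
  // the cache and to auto-vectorization.
  const float* get(std::size_t id) const { return &data_[id * dim_]; }

 private:
  std::size_t dim_;
  std::vector<float> data_;
};

// Visited tracking with a flat array of epoch tags instead of a hash set:
// marking and testing are single indexed accesses, and "clearing" between
// queries is one counter bump rather than rehashing or an O(N) memset.
class VisitedTags {
 public:
  explicit VisitedTags(std::size_t n) : tags_(n, 0), epoch_(0) {}

  // Call once per query; a full O(N) reset only happens when the
  // 16-bit epoch counter wraps around.
  void next_query() {
    if (++epoch_ == 0) {
      std::fill(tags_.begin(), tags_.end(), 0);
      epoch_ = 1;
    }
  }

  bool visited(std::size_t id) const { return tags_[id] == epoch_; }
  void mark(std::size_t id) { tags_[id] = epoch_; }

 private:
  std::vector<uint16_t> tags_;
  uint16_t epoch_;
};

With this layout, the hot loop of search_layer reduces to indexed loads and stores into two flat arrays, so expanding a node's neighbors touches predictable, often adjacent memory. std::unordered_set and std::map instead chase pointers to heap-allocated nodes spread across memory, which is what was costing me the factor of two.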