While I don't know the exact cause of the HNSW performance gap being
discussed, I can share a relevant experience from my own C++ implementation
that might be helpful.

I recently built an HNSW implementation that was algorithmically nearly
identical to hnswlib, yet it achieved only about half of hnswlib's
performance. After digging into the problem, I found that the primary
factor was the memory access pattern.

Specifically, my implementation differed in two key ways (both patterns are
sketched in code below):

   1. *Visited Node Tracking:* During the search_layer phase, hnswlib uses
   a simple array of size N to mark visited nodes. In contrast, I was using a
   std::unordered_set (hash set).
   2. *Vector Storage:* hnswlib stores all vectors in a single, large, flat
   array, allowing for contiguous memory access. I, on the other hand, was
   using a std::map that mapped each node ID to its own vector.
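
To make this concrete, the hot loop in my original, slower version looked
roughly like this (a simplified sketch with made-up names, not my actual
code):

    // Inner loop of search_layer in the original (slower) version.
    #include <map>
    #include <unordered_set>
    #include <vector>

    static float dot(const float* a, const float* b, int dim) {
        float s = 0.0f;
        for (int i = 0; i < dim; ++i) s += a[i] * b[i];
        return s;
    }

    // Called for each candidate's neighbor list during search_layer.
    void scoreNeighbors(const std::vector<int>& neighbors,
                        const std::map<int, std::vector<float>>& vectors,
                        std::unordered_set<int>& visited,
                        const std::vector<float>& query) {
        for (int id : neighbors) {
            // Hash + bucket probe on every visited check.
            if (!visited.insert(id).second) continue;
            // O(log N) pointer chase through tree nodes, and the vector
            // itself lives in a separate heap allocation.
            const std::vector<float>& v = vectors.at(id);
            float score = dot(query.data(), v.data(), (int)query.size());
            (void)score;  // ...push onto the candidate/result heaps here...
        }
    }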

hnswlib's memory access patterns are significantly more cache-friendly. A
std::unordered_set pays for hashing plus a probe into heap-allocated
buckets on every visited check, and a std::map lookup chases pointers
through tree nodes to reach a vector stored in yet another separate
allocation, so the hot loop can take several cache misses per neighbor
before any distance math happens. When I refactored my code to adopt
hnswlib's approach (a flat array for all vectors and a simple array for
visited tracking), my implementation's performance became nearly identical
to hnswlib's.
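
The refactored layout, for comparison (again just a sketch; as far as I can
tell, hnswlib's VisitedList uses the same epoch trick, which avoids an O(N)
clear of the visited array before every search):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct FlatIndex {
        int dim = 0;
        // All vectors back to back: vector i starts at data[i * dim],
        // so neighbor lookups walk contiguous memory.
        std::vector<float> data;
        std::vector<uint16_t> visited;  // one slot per node, size N
        uint16_t epoch = 0;

        const float* vec(size_t id) const { return data.data() + id * dim; }

        // Bump a tag instead of clearing N entries before each search.
        void beginSearch() {
            if (++epoch == 0) {  // tag wrapped around: clear once, reuse
                std::fill(visited.begin(), visited.end(), 0);
                epoch = 1;
            }
        }

        // One indexed load/store: no hashing, no allocation.
        bool markVisited(size_t id) {
            if (visited[id] == epoch) return false;
            visited[id] = epoch;
            return true;
        }
    };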

Given this experience, I wonder if the performance challenges in Lucene's
HNSW implementation might also be related to its memory access patterns. It
could be a valuable area to investigate.


On Thu, Jun 19, 2025 at 11:40 PM Adrien Grand <jpou...@gmail.com> wrote:

> Hello all,
>
> I have been looking at this benchmark against Vespa recently:
> https://blog.vespa.ai/elasticsearch-vs-vespa-performance-comparison/.
> (The report is behind an annoying email wall, but I'm copying relevant data
> below, so hopefully you don't need to download the report.) Even though it
> uses Elasticsearch to run the benchmark, it really benchmarks Lucene
> functionality, I don't believe that Elasticsearch does anything that
> meaningfully alters the results that you would get if you were to run
> Lucene directly.
>
> The benchmark seems designed to highlight the benefits of Vespa's realtime
> design, that's fair game I guess. But it also runs some queries in
> read-only scenarios when I was expecting Lucene to perform better.
>
> One thing that got me curious is that it reports about 2x worse latency
> and throughput for pure unfiltered vector search on a force-merged index
> (so single segment/graph). Does anybody know why Lucene's HNSW may perform
> slower than Vespa's HNSW? This seems consistent with results from
> https://ann-benchmarks.com/index.html though I don't know if the cause of
> the performance difference is the same or not.
>
> For reference, here are details that apply to both Lucene and Vespa's
> vector search:
>  - HNSW,
>  - float32 vectors, no quantization,
>  - embeddings generated using Snowflake's Arctic-embed-xs model
>  - 1M docs
>  - 384 dimensions,
>  - dot product,
>  - m = 16,
>  - max connections = 200,
>  - search for top 10 hits,
>  - no filter,
>  - single client, so no search concurrency,
>  - purple column is force-merged, so single segment/graph like Vespa.
>
> I never seriously looked at Lucene's vector search performance, so I'm
> very happy to be educated if I'm making naive assumptions!
>
> Somewhat related, is this the reason why I'm seeing many threads around
> bringing 3rd party implementations into Lucene, including ones that are
> very similar to Lucene on paper? To speed up vector search?
>
> --
> Adrien
>
