For comparison, I put together a codec that is a copy of Lucene99 but with
most of KnnVectorsReader.search() implemented in native code and called via
FFM, as a way of examining overhead from the JVM. I didn't have the same
data set, but on 1M 1536d float vectors with default Lucene99 settings it
was about 10% faster in native code -- not nothing, but not very much
considering the additional complexity of using native code. I was able to
avoid copying data out of the mmap backing store when performing distance
comparisons, which was probably a significant chunk of the win. The vast
majority of CPU time (~80%) is spent doing vector comparisons, and
something like 95-96% of that time is spent scoring in L0. Decoding edges
from the graph is ~1.5%. I think that as long as the vector scoring code is
competitive and Lucene is doing a similar number of comparisons, the margin
should be pretty close.
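
To make the zero-copy point concrete, here is a minimal sketch of scoring
against the mmapped file through a MemorySegment instead of materializing a
heap float[] first. This is not the codec I described (that one downcalls
into native code via FFM); the file name, the flat float32 layout, and the
scalar dot-product loop are all illustrative assumptions.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SegmentDotProduct {

  // Dot product between a heap query vector and vector `ordinal` in a flat
  // float32 file (native byte order assumed) mapped as a MemorySegment.
  // The indexed vector is read in place, never copied into a float[].
  static float dotProduct(float[] query, MemorySegment vectors, long ordinal) {
    long base = ordinal * (long) query.length * Float.BYTES;
    float sum = 0f;
    for (int i = 0; i < query.length; i++) {
      sum += query[i]
          * vectors.get(ValueLayout.JAVA_FLOAT_UNALIGNED, base + (long) i * Float.BYTES);
    }
    return sum;
  }

  public static void main(String[] args) throws Exception {
    int dims = 1536;
    // "vectors.bin" is a hypothetical raw vector file, dims floats per doc.
    try (FileChannel ch = FileChannel.open(Path.of("vectors.bin"), StandardOpenOption.READ);
        Arena arena = Arena.ofConfined()) {
      MemorySegment mmap = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
      float[] query = new float[dims]; // would come from the search request
      System.out.println(dotProduct(query, mmap, 0));
    }
  }
}

In the real codec the per-element loop would of course be replaced by the
Panama/SIMD scorer (or the native downcall); the point is only that the
stored vector never has to be copied onto the heap to be scored.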

I looked through their open-source implementation and did not see anything
that led me to believe their HNSW implementation is substantially different
algorithmically. They do have some logic around choosing different
representations of the visited set depending on the expected number of
nodes to visit (choosing between something like a FixedBitSet and a hash
set).
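
In Lucene terms that choice would look roughly like the sketch below. This
is purely illustrative, not Vespa's (or Lucene's) actual code:
java.util.BitSet stands in for FixedBitSet, and the crossover threshold is
a made-up number.

import java.util.BitSet;
import java.util.HashSet;

// Illustrative visited-set that picks a dense or sparse representation
// based on how many nodes we expect to touch. A dense bitset costs
// O(graph size) bits up front but marks nodes very cheaply; a hash set
// stays small when only a tiny fraction of the graph will be explored.
final class VisitedSet {
  private final BitSet dense;           // stand-in for Lucene's FixedBitSet
  private final HashSet<Integer> sparse;

  VisitedSet(int graphSize, int expectedVisits) {
    // Crossover threshold is a guess; the real break-even point depends on
    // graph size, cache behavior, and allocation costs.
    if ((long) expectedVisits * 32 > graphSize) {
      dense = new BitSet(graphSize);
      sparse = null;
    } else {
      dense = null;
      sparse = new HashSet<>(expectedVisits * 2);
    }
  }

  // Returns true the first time a node is seen, false on revisits.
  boolean visit(int node) {
    if (dense != null) {
      if (dense.get(node)) {
        return false;
      }
      dense.set(node);
      return true;
    }
    return sparse.add(node);
  }
}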

On Mon, Jun 23, 2025 at 6:10 AM Benjamin Trent <ben.w.tr...@gmail.com>
wrote:

> To my knowledge, FAISS isn't utilizing hand-rolled SIMD calculations. Do
> we know if it was compiled with `-ffast-math`?
>
> Vespa does utilize SIMD optimizations for vector comparisons.
>
> Some more ways I think Lucene is slower (though I am not sure the 2x is
> fully explained):
>
>  - Reading floats onto heap float[] instead of accessing MemorySegments
> directly when scoring
>  - We store the graph in a unique way that requires a decoding step when
> exploring a new candidate, reading vints and doing a binary search. I
> think all other HNSW implementations use flat arrays of int/long values.
>  - We always use SparseBitSet, which for smaller indices (<1M docs) can
> have a noticeable impact on performance. I have seen this in my own
> benchmarking (switching to FixedBitSet measurably improved query times on
> smaller data sets)
>
> These are fairly "cheap" costs, which might explain the ~10% difference
> with FAISS. However, I am not sure they fully explain the 2x difference
> with Vespa.
>
> On Thu, Jun 19, 2025 at 3:37 PM Adrien Grand <jpou...@gmail.com> wrote:
>
>> Thanks Mike, this is useful information. Then I'll try to reproduce this
>> benchmark to better understand what is happening.
>>
>> On Thu, Jun 19, 2025 at 8:16 PM Michael Sokolov <msoko...@gmail.com>
>> wrote:
>>
>>> We've recently been comparing Lucene's HNSW with FAISS's, and there is
>>> not a 2x difference there. FAISS does seem to be around 10-15% faster, I
>>> think. The 2x difference is roughly what I was seeing in comparisons
>>> with hnswlib prior to the dot-product improvements we made in Lucene.
>>>
>>> On Thu, Jun 19, 2025 at 2:12 PM Adrien Grand <jpou...@gmail.com> wrote:
>>> >
>>> > Chris,
>>> >
>>> > FWIW I was looking at luceneknn (
>>> https://github.com/erikbern/ann-benchmarks/blob/f402b2cc17b980d7cd45241ab5a7a4cc0f965e55/ann_benchmarks/algorithms/luceneknn/Dockerfile#L15)
>>> which is on 9.7, though I don't know if it enabled the incubating vector
>>> API at runtime?
>>> >
>>> > I hope that mentioning ANN benchmarks did not add noise to this
>>> thread; I was mostly looking at whether I could find another benchmark
>>> that suggests Lucene is significantly slower under similar conditions.
>>> Does it align with other people's experience that Lucene is 2x or more
>>> slower compared with other good HNSW implementations?
>>> >
>>> > Adrien
>>> >
>>> > On Thu, Jun 19, 2025 at 6:44 PM Chris Hegarty
>>> <christopher.hega...@elastic.co.invalid> wrote:
>>> >>
>>> >> Hi Adrien,
>>> >>
>>> >> > Even though it uses Elasticsearch to run the benchmark, it really
>>> benchmarks Lucene functionality,
>>> >>
>>> >> Agreed.
>>> >>
>>> >> > This seems consistent with results from
>>> https://ann-benchmarks.com/index.html though I don't know if the cause
>>> of the performance difference is the same or not.
>>> >>
>>> >> On ann-benchmarks specifically: unless I’m looking in the wrong
>>> place, they’re using Elasticsearch 8.7.0 [1], which predates our use of
>>> the Panama Vector API for vector search. We added support for that in
>>> Lucene 9.7.0 -> Elasticsearch 8.9.0. So those benchmarks are wildly out
>>> of date, no?
>>> >>
>>> >> -Chris.
>>> >>
>>> >> [1]
>>> https://github.com/erikbern/ann-benchmarks/blob/f402b2cc17b980d7cd45241ab5a7a4cc0f965e55/ann_benchmarks/algorithms/elasticsearch/Dockerfile#L2
>>> >>
>>> >>
>>> >> > On 19 Jun 2025, at 16:39, Adrien Grand <jpou...@gmail.com> wrote:
>>> >> >
>>> >> > Hello all,
>>> >> >
>>> >> > I have been looking at this benchmark against Vespa recently:
>>> https://blog.vespa.ai/elasticsearch-vs-vespa-performance-comparison/.
>>> (The report is behind an annoying email wall, but I'm copying relevant data
>>> below, so hopefully you don't need to download the report.) Even though it
>>> uses Elasticsearch to run the benchmark, it really benchmarks Lucene
>>> functionality, I don't believe that Elasticsearch does anything that
>>> meaningfully alters the results that you would get if you were to run
>>> Lucene directly.
>>> >> >
>>> >> > The benchmark seems designed to highlight the benefits of Vespa's
>>> realtime design; that's fair game, I guess. But it also runs some
>>> queries in read-only scenarios, where I was expecting Lucene to perform
>>> better.
>>> >> >
>>> >> > One thing that got me curious is that it reports about 2x worse
>>> latency and throughput for pure unfiltered vector search on a force-merged
>>> index (so single segment/graph). Does anybody know why Lucene's HNSW may
>>> perform slower than Vespa's HNSW? This seems consistent with results from
>>> https://ann-benchmarks.com/index.html though I don't know if the cause
>>> of the performance difference is the same or not.
>>> >> >
>>> >> > For reference, here are details that apply to both Lucene and
>>> Vespa's vector search:
>>> >> >  - HNSW,
>>> >> >  - float32 vectors, no quantization,
>>> >> >  - embeddings generated using Snowflake's Arctic-embed-xs model,
>>> >> >  - 1M docs,
>>> >> >  - 384 dimensions,
>>> >> >  - dot product,
>>> >> >  - m = 16,
>>> >> >  - max connections = 200,
>>> >> >  - search for top 10 hits,
>>> >> >  - no filter,
>>> >> >  - single client, so no search concurrency,
>>> >> >  - purple column is force-merged, so single segment/graph like
>>> Vespa.
>>> >> >
>>> >> > I never seriously looked at Lucene's vector search performance, so
>>> I'm very happy to be educated if I'm making naive assumptions!
>>> >> >
>>> >> > Somewhat related, is this the reason why I'm seeing many threads
>>> around bringing 3rd party implementations into Lucene, including ones that
>>> are very similar to Lucene on paper? To speed up vector search?
>>> >> >
>>> >> > --
>>> >> > Adrien
>>> >> > <vespa-vs-es-screenshot.png>
>>> >> >
>>
>> --
>> Adrien
>>
>
