Draft PR is posted here: https://github.com/apache/lucene/pull/12254

This depends on my other PR, which switches the non-concurrent
OnHeapHnswGraph to a HashMap, because that PR updates the tests so they no
longer assume that the nodes in a given level are in sorted order:
https://github.com/apache/lucene/pull/12248
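
As a rough illustration of why the ordering assumption matters (a sketch
only, not code from either PR): a HashMap makes no guarantee about iteration
order, so a test that compares node lists should sort them first or compare
them as sets.  The class and method names below are hypothetical stand-ins,
not Lucene's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one graph level: node id -> neighbor ids.
// This is NOT Lucene's OnHeapHnswGraph; every name here is illustrative.
class LevelNodesSketch {
  private final Map<Integer, int[]> neighborsByNode = new HashMap<>();

  void addNode(int node, int... neighbors) {
    neighborsByNode.put(node, neighbors);
  }

  // Returns node ids in whatever order the HashMap yields them.
  // Callers must not rely on this order being sorted or stable.
  List<Integer> nodesUnordered() {
    return new ArrayList<>(neighborsByNode.keySet());
  }

  // Sorted copy, so that a test can compare node lists order-insensitively.
  static List<Integer> sorted(List<Integer> nodes) {
    List<Integer> copy = new ArrayList<>(nodes);
    Collections.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    LevelNodesSketch level = new LevelNodesSketch();
    level.addNode(42, 7);
    level.addNode(7, 42, 3);
    level.addNode(3, 7);
    // Compare against the sorted view rather than the raw iteration order.
    System.out.println(sorted(level.nodesUnordered()));
  }
}
```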

On Fri, Apr 28, 2023 at 8:14 AM Jonathan Ellis <jbel...@gmail.com> wrote:

> Great, I will work on squashing to get a clean PR.
>
> One thing I am struggling with is the RamUsageTester.  Here is the
> stacktrace:
> https://gist.github.com/jbellis/20676b0e23f43751cbe8834a8def0d12
>
> Apparently RamUsageTester tries to make private fields accessible so it can
> introspect them, but JVM modularization locks this down for internal
> classes like ThreadLocal.  It is unclear to me why this is the first time
> this problem has come up, or how to fix it.
>
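
For reference, a minimal sketch of the kind of reflective access described
above.  This is not RamUsageTester's actual code: the class and helper names
are made up, and `threadLocalHashCode` is an OpenJDK implementation detail
assumed here purely for illustration.  On recent JDKs with default settings,
the ThreadLocal call would be expected to fail unless the package is opened,
for example with `--add-opens java.base/java.lang=ALL-UNNAMED`.

```java
import java.lang.reflect.Field;

// Illustrative only: probes whether a private field can be opened reflectively.
class ReflectionAccessSketch {

  // A class in our own code, whose private fields we are allowed to open.
  static class Plain {
    private int value = 1;
  }

  // Returns true if the named private field could be made accessible.
  // On Java 9+, setAccessible throws InaccessibleObjectException (a
  // RuntimeException) when the declaring package is not open to the caller.
  static boolean canOpen(Class<?> owner, String fieldName) {
    try {
      Field f = owner.getDeclaredField(fieldName);
      f.setAccessible(true);
      return true;
    } catch (NoSuchFieldException | RuntimeException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("own class:   " + canOpen(Plain.class, "value"));
    // Assumed field name; the result depends on JDK version and JVM flags.
    System.out.println("ThreadLocal: " + canOpen(ThreadLocal.class, "threadLocalHashCode"));
  }
}
```
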
> On Fri, Apr 28, 2023 at 2:18 AM Alessandro Benedetti <a.benede...@sease.io>
> wrote:
>
>> That's great! We were talking about exactly this here:
>> https://github.com/apache/lucene/pull/12169
>>
>> It would also help with the new token filter :)
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benede...@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>> On Thu, 27 Apr 2023 at 19:29, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I've created an HNSW index implementation that allows for concurrent
>>> build and querying.  On my i9-12900 (8 performance cores and 8 efficiency
>>> cores) I get a bit less than a 10x speedup in wall-clock time for building
>>> and querying the "siftsmall" and "sift" datasets from
>>> http://corpus-texmex.irisa.fr/.  The small dataset is 10k vectors while
>>> the large is 1M.  This speedup feels pretty good for a data structure that
>>> isn't completely parallelizable, and it's good to see that it's consistent
>>> as the dataset gets larger.
>>>
>>> The concurrent classes achieve identical recall compared to the
>>> non-concurrent versions within my ability to test it, and are drop-in
>>> replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
>>> to work around the places where the existing API assumes no concurrency.
>>>
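
A minimal sketch of the ThreadLocal approach, assuming a hypothetical
single-cursor API (seek to a node, then iterate its neighbors) of the sort
that breaks when two threads share one graph.  The names are illustrative
and are not the actual concurrent classes.

```java
// Illustrative only: a graph whose iteration cursor is kept per thread,
// so a stateful seek/next API can be shared by concurrent readers.
class PerThreadCursorSketch {
  private final int[][] neighbors;

  // Each thread gets its own cursor: {current node, position in its list}.
  private final ThreadLocal<int[]> cursor = ThreadLocal.withInitial(() -> new int[2]);

  PerThreadCursorSketch(int[][] neighbors) {
    this.neighbors = neighbors;
  }

  // Positions the calling thread's cursor at the start of node's neighbors.
  void seek(int node) {
    int[] c = cursor.get();
    c[0] = node;
    c[1] = 0;
  }

  // Returns the next neighbor for the calling thread, or -1 when exhausted.
  int nextNeighbor() {
    int[] c = cursor.get();
    int[] list = neighbors[c[0]];
    return c[1] < list.length ? list[c[1]++] : -1;
  }

  // Reads one neighbor, lets another thread seek elsewhere, then reads again.
  // The second read continues where this thread left off.
  static int readAfterOtherThreadSeeks(PerThreadCursorSketch g) {
    g.seek(0);
    g.nextNeighbor();
    Thread other = new Thread(() -> {
      g.seek(1);
      g.nextNeighbor();
    });
    other.start();
    try {
      other.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return g.nextNeighbor();
  }

  public static void main(String[] args) {
    PerThreadCursorSketch g = new PerThreadCursorSketch(new int[][] {{1, 2}, {0}, {0}});
    System.out.println(readAfterOtherThreadSeeks(g));
  }
}
```
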
>>> The concurrent classes also pass the existing test suite with the
>>> exception of the ram usage ones; the estimator doesn't know about
>>> AtomicReference etc.  (Big thanks to Michael Sokolov for testAknnDiverse
>>> which made it much easier to track down subtle problems!)
>>>
>>> My motivation is
>>>
>>> 1. It is faster to query a single on-heap HNSW index than to query
>>> multiple such indexes and combine the results.
>>> 2. Even with the contention that necessarily occurs while building the
>>> index, we still come out well ahead in total efficiency compared with
>>> creating per-thread indexes and combining them.  Combining such indexes
>>> boils down to "pick the largest and then add all the other nodes
>>> normally," so you don't really benefit from having computed the others
>>> previously.
>>>
>>> I am currently adding this to Cassandra as code in our repo, but my
>>> preference would be to upstream it.  Is Lucene open to a pull request?
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>>
>>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
