One last follow-up: Robert's comments got me interested in better
quantifying the performance against other approaches. I hooked up Lucene
HNSW to ann-benchmarks, a commonly used repo for benchmarking nearest
neighbor search libraries against large datasets. These two issues describe
the results:
* Search recall + QPS (https://issues.apache.org/jira/browse/LUCENE-9937)
* Index speed (https://issues.apache.org/jira/browse/LUCENE-9941)

Thanks Mike for your insights so far on the search ticket.

Julie

On Tue, Apr 6, 2021 at 12:37 PM Julie Tibshirani <juliet...@gmail.com>
wrote:

> I filed one more JIRA about the approach to specifying the NN algorithm:
> https://issues.apache.org/jira/browse/LUCENE-9905.
>
> As a summary, here's the current list of vector API issues we're tracking:
> * Reconsider the format name (
> https://issues.apache.org/jira/browse/LUCENE-9855)
> * Revise approach to specifying NN algorithm (
> https://issues.apache.org/jira/browse/LUCENE-9905)
> * Move VectorValues#search to VectorReader (
> https://issues.apache.org/jira/browse/LUCENE-9908)
> * Should VectorValues expose both iteration and random access? (
> https://issues.apache.org/jira/browse/LUCENE-9583)
>
> Julie
>
> On Tue, Apr 6, 2021 at 5:31 AM Adrien Grand <jpou...@gmail.com> wrote:
>
>> I created a JIRA about moving VectorValues#search to VectorReader:
>> https://issues.apache.org/jira/browse/LUCENE-9908.
>>
>> On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> Hello Mike,
>>>
>>> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com>
>>> wrote:
>>>
>>>> I think the reason we have search() on VectorValues is that we have
>>>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>>>> but no way to access the VectorReader. Do you think we should also
>>>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>>>
>>>
>>> I was more thinking of moving VectorValues#search to
>>> LeafReader#searchNearestVectors or something along those lines. I agree
>>> that VectorReader should only be exposed on CodecReader.
>>>
>>>
>>>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>>>> floating point values. Using BinaryDocValues for this will always
>>>> require an additional decoding step. I can see that the naming is
>>>> confusing there. The intent is that you index the vector values, but
>>>> no additional indexing data structure.
>>>
>>>
>>> I wonder if things would be simpler if we were more opinionated and made
>>> vectors specifically about nearest-neighbor search. Then we have a
>>> clearer message, use vectors for NN search and doc values otherwise. As far
>>> as I know, reinterpreting bytes as floats shouldn't add much overhead. The
>>> main problem I know of is that the JVM won't auto-vectorize if you read
>>> floats dynamically from a byte[], but this is something that should be
>>> alleviated by the JDK vector API?
>>>
>>> Also: the reason HNSW is
>>>> mentioned in these SearchStrategy enums is to make room for other
>>>> vector indexing approaches, like LSH. There was a lot of discussion
>>>> that we wanted an API that allowed for experimenting with other
>>>> techniques for indexing and searching vector values.
>>>>
>>>
>>> Actually this is the thing that feels odd to me: if we end up with
>>> constants for both LSH and HNSW, then we are adding the requirement that
>>> all vector formats must implement both LSH and HNSW as they will need to
>>> support all SearchStrategy constants? Would it be possible to have a single
>>> API and then two implementations of VectorsFormat, LSHVectorsFormat on the
>>> one hand and HNSWVectorsFormat on the other hand?
>>>
>>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>>>> but I think the situation is more akin to Points, where we have the
>>>> options on IndexableField. The metadata we store there (dimension and
>>>> score function) don't really result in different formats, ie code
>>>> paths for indexing and storage; they are more like parameters to the
>>>> format, in my mind. Perhaps the situation will look different when we
>>>> get our second vector indexing strategy (like LSH).
>>>
>>>
>>> Having the dimension count and the score function on the FieldType
>>> actually makes sense to me. I was more wondering whether maxConn
>>> and beamWidth actually belong to the FieldType, or if they should be made
>>> constructor arguments of Lucene90VectorFormat.
>>>
>>> --
>>> Adrien
>>>
>>
>>
>> --
>> Adrien
>>
>

Reply via email to