Re: Questions about the new vector API

Julie Tibshirani Tue, 06 Apr 2021 12:38:15 -0700

I filed one more JIRA about the approach to specifying the NN algorithm:
https://issues.apache.org/jira/browse/LUCENE-9905.


As a summary, here's the current list of vector API issues we're tracking:
* Reconsider the format name
  (https://issues.apache.org/jira/browse/LUCENE-9855)
* Revise approach to specifying NN algorithm
  (https://issues.apache.org/jira/browse/LUCENE-9905)
* Move VectorValues#search to VectorReader
  (https://issues.apache.org/jira/browse/LUCENE-9908)
* Should VectorValues expose both iteration and random access?
  (https://issues.apache.org/jira/browse/LUCENE-9583)

Julie

On Tue, Apr 6, 2021 at 5:31 AM Adrien Grand <jpou...@gmail.com> wrote:

> I created a JIRA about moving VectorValues#search to VectorReader:
> https://issues.apache.org/jira/browse/LUCENE-9908.
>
> On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpou...@gmail.com> wrote:
>
>> Hello Mike,
>>
>> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com>
>> wrote:
>>
>>> I think the reason we have search() on VectorValues is that we have
>>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>>> but no way to access the VectorReader. Do you think we should also
>>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>>
>>
>> I was more thinking of moving VectorValues#search to
>> LeafReader#searchNearestVectors or something along those lines. I agree
>> that VectorReader should only be exposed on CodecReader.
>>
>>
>>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>>> floating point values. Using BinaryDocValues for this will always
>>> require an additional decoding step. I can see that the naming is
>>> confusing there. The intent is that you index the vector values, but
>>> no additional indexing data structure.
>>
>>
>> I wonder if things would be simpler if we were more opinionated and made
>> vectors specifically about nearest-neighbor search. Then we have a
>> clearer message: use vectors for NN search and doc values otherwise. As far
>> as I know, reinterpreting bytes as floats shouldn't add much overhead. The
>> main problem I know of is that the JVM won't auto-vectorize if you read
>> floats dynamically from a byte[], but this is something that should be
>> alleviated by the JDK vector API?
>>
>>> Also: the reason HNSW is
>>> mentioned in these SearchStrategy enums is to make room for other
>>> vector indexing approaches, like LSH. There was a lot of discussion
>>> that we wanted an API that allowed for experimenting with other
>>> techniques for indexing and searching vector values.
>>>
>>
>> Actually this is the thing that feels odd to me: if we end up with
>> constants for both LSH and HNSW, then we are adding the requirement that
>> all vector formats must implement both LSH and HNSW as they will need to
>> support all SearchStrategy constants? Would it be possible to have a single
>> API and then two implementations of VectorsFormat, LSHVectorsFormat on the
>> one hand and HNSWVectorsFormat on the other hand?
>>
>>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>>> but I think the situation is more akin to Points, where we have the
>>> options on IndexableField. The metadata we store there (dimension and
>>> score function) don't really result in different formats, i.e. code
>>> paths for indexing and storage; they are more like parameters to the
>>> format, in my mind. Perhaps the situation will look different when we
>>> get our second vector indexing strategy (like LSH).
>>
>>
>> Having the dimension count and the score function on the FieldType
>> actually makes sense to me. I was more wondering whether maxConn
>> and beamWidth actually belong to the FieldType, or if they should be made
>> constructor arguments of Lucene90VectorFormat.
>>
>> --
>> Adrien
>>
>
>
> --
> Adrien
>
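
For readers following the BinaryDocValues point in the quoted thread, the
"additional decoding step" can be sketched roughly as below. This is only an
illustration under assumptions: the class and method names are hypothetical,
little-endian byte order is an arbitrary choice, and none of this is Lucene
code. It uses only java.nio from the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of Lucene: stores a float vector as bytes
// and reinterprets the bytes as floats again before scoring.
final class FloatDecodeSketch {

  // Encode a vector as 4 bytes per float (byte order is an assumption).
  static byte[] encode(float[] vector) {
    ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (float v : vector) {
      buf.putFloat(v);
    }
    return buf.array();
  }

  // The extra decoding step: bulk-copy the bytes into a float[].
  static float[] decode(byte[] bytes) {
    float[] out = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer()
        .get(out);
    return out;
  }

  // Plain dot product over already-decoded float arrays.
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

Decoding into a float[] up front keeps the scoring loop over plain arrays,
rather than reading each float out of the byte[] inside the loop.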

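The alternative raised in the quoted thread (a single format API with one
implementation per algorithm, and algorithm tuning passed to the format's
constructor rather than stored on the FieldType) might look something like
the sketch below. Every type name here is hypothetical and chosen only for
illustration; the LSH parameter in particular is invented, and this is not
the actual Lucene API.

```java
// Hypothetical single API that every vector format would implement.
interface VectorsFormatSketch {
  String describe();
}

// HNSW-specific tuning lives on the format, as constructor arguments.
final class HnswVectorsFormatSketch implements VectorsFormatSketch {
  private final int maxConn;
  private final int beamWidth;

  HnswVectorsFormatSketch(int maxConn, int beamWidth) {
    if (maxConn <= 0 || beamWidth <= 0) {
      throw new IllegalArgumentException("maxConn and beamWidth must be positive");
    }
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }

  @Override
  public String describe() {
    return "HNSW(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
  }
}

// A second implementation does not have to know anything about HNSW.
final class LshVectorsFormatSketch implements VectorsFormatSketch {
  private final int numHashTables; // invented parameter, for illustration only

  LshVectorsFormatSketch(int numHashTables) {
    this.numHashTables = numHashTables;
  }

  @Override
  public String describe() {
    return "LSH(numHashTables=" + numHashTables + ")";
  }
}

// The field type keeps only what is independent of the algorithm.
final class VectorFieldTypeSketch {
  final int dimension;
  final String scoreFunction;

  VectorFieldTypeSketch(int dimension, String scoreFunction) {
    this.dimension = dimension;
    this.scoreFunction = scoreFunction;
  }
}
```

With this shape, adding a new algorithm means adding a new format class, and
no existing format has to support every search strategy constant.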