I filed one more JIRA about the approach to specifying the NN algorithm: https://issues.apache.org/jira/browse/LUCENE-9905.
As a summary, here's the current list of vector API issues we're tracking:

* Reconsider the format name (https://issues.apache.org/jira/browse/LUCENE-9855)
* Revise approach to specifying NN algorithm (https://issues.apache.org/jira/browse/LUCENE-9905)
* Move VectorValues#search to VectorReader (https://issues.apache.org/jira/browse/LUCENE-9908)
* Should VectorValues expose both iteration and random access? (https://issues.apache.org/jira/browse/LUCENE-9583)

Julie

On Tue, Apr 6, 2021 at 5:31 AM Adrien Grand <jpou...@gmail.com> wrote:

> I created a JIRA about moving VectorValues#search to VectorReader:
> https://issues.apache.org/jira/browse/LUCENE-9908.
>
> On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpou...@gmail.com> wrote:
>
>> Hello Mike,
>>
>> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com>
>> wrote:
>>
>>> I think the reason we have search() on VectorValues is that we have
>>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>>> but no way to access the VectorReader. Do you think we should also
>>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>>
>>
>> I was more thinking of moving VectorValues#search to
>> LeafReader#searchNearestVectors or something along those lines. I agree
>> that VectorReader should only be exposed on CodecReader.
>>
>>
>>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>>> floating point values. Using BinaryDocValues for this will always
>>> require an additional decoding step. I can see that the naming is
>>> confusing there. The intent is that you index the vector values, but
>>> no additional indexing data structure.
>>
>>
>> I wonder if things would be simpler if we were more opinionated and made
>> vectors specifically about nearest-neighbor search. Then we have a
>> clearer message, use vectors for NN search and doc values otherwise. As far
>> as I know, reinterpreting bytes as floats shouldn't add much overhead. The
>> main problem I know of is that the JVM won't auto-vectorize if you read
>> floats dynamically from a byte[], but this is something that should be
>> alleviated by the JDK vector API?
>>
>>> Also: the reason HNSW is
>>> mentioned in these SearchStrategy enums is to make room for other
>>> vector indexing approaches, like LSH. There was a lot of discussion
>>> that we wanted an API that allowed for experimenting with other
>>> techniques for indexing and searching vector values.
>>>
>>
>> Actually this is the thing that feels odd to me: if we end up with
>> constants for both LSH and HNSW, then we are adding the requirement that
>> all vector formats must implement both LSH and HNSW as they will need to
>> support all SearchStrategy constants? Would it be possible to have a single
>> API and then two implementations of VectorsFormat, LSHVectorsFormat on the
>> one hand and HNSWVectorsFormat on the other hand?
>>
>>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>>> but I think the situation is more akin to Points, where we have the
>>> options on IndexableField. The metadata we store there (dimension and
>>> score function) don't really result in different formats, ie code
>>> paths for indexing and storage; they are more like parameters to the
>>> format, in my mind. Perhaps the situation will look different when we
>>> get our second vector indexing strategy (like LSH).
>>
>>
>> Having the dimension count and the score function on the FieldType
>> actually makes sense to me. I was more wondering whether maxConn
>> and beamWidth actually belong to the FieldType, or if they should be made
>> constructor arguments of Lucene90VectorFormat.
>>
>> --
>> Adrien
>>
>
>
> --
> Adrien
>
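
P.S. On the "additional decoding step" and "reinterpreting bytes as floats" points quoted above, here is a rough sketch of what that step could look like in plain Java. It uses only java.nio and is not Lucene code: the class and method names are made up for illustration, and the little-endian byte order is an assumption rather than anything a Lucene format specifies.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; none of these names exist in Lucene.
public class FloatDecodeSketch {

    // Pack a float vector into bytes, as one might before storing it in a
    // binary doc value. Little-endian order is an assumption of this sketch.
    public static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        // The float view shares storage with buf, so this fills buf's array.
        buf.asFloatBuffer().put(vector);
        return buf.array();
    }

    // The extra decoding step: view the bytes as floats and bulk-copy them out.
    public static float[] decode(byte[] bytes) {
        float[] out = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asFloatBuffer()
                .get(out);
        return out;
    }

    // Plain scalar dot product over already-decoded float[] arrays.
    public static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] v = {1.0f, -2.5f, 3.25f};
        float[] roundTripped = decode(encode(v));
        System.out.println(dotProduct(v, roundTripped));
    }
}
```

Decoding into a float[] first, as above, costs one bulk copy per vector; the scoring loop then runs over a plain float[] rather than reading floats out of the byte[] one at a time.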