One last follow-up: Robert's comments got me interested in better quantifying the performance against other approaches. I hooked up Lucene HNSW to ann-benchmarks, a commonly used repo for benchmarking nearest neighbor search libraries against large datasets. These two issues describe the results:

* Search recall + QPS (https://issues.apache.org/jira/browse/LUCENE-9937)
* Index speed (https://issues.apache.org/jira/browse/LUCENE-9941)

Thanks Mike for your insights so far on the search ticket.

Julie

On Tue, Apr 6, 2021 at 12:37 PM Julie Tibshirani <juliet...@gmail.com> wrote:

> I filed one more JIRA about the approach to specifying the NN algorithm:
> https://issues.apache.org/jira/browse/LUCENE-9905.
>
> As a summary, here's the current list of vector API issues we're tracking:
> * Reconsider the format name (https://issues.apache.org/jira/browse/LUCENE-9855)
> * Revise approach to specifying NN algorithm (https://issues.apache.org/jira/browse/LUCENE-9905)
> * Move VectorValues#search to VectorReader (https://issues.apache.org/jira/browse/LUCENE-9908)
> * Should VectorValues expose both iteration and random access? (https://issues.apache.org/jira/browse/LUCENE-9583)
>
> Julie
>
> On Tue, Apr 6, 2021 at 5:31 AM Adrien Grand <jpou...@gmail.com> wrote:
>
>> I created a JIRA about moving VectorValues#search to VectorReader:
>> https://issues.apache.org/jira/browse/LUCENE-9908.
>>
>> On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> Hello Mike,
>>>
>>> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com>
>>> wrote:
>>>
>>>> I think the reason we have search() on VectorValues is that we have
>>>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>>>> but no way to access the VectorReader. Do you think we should also
>>>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>>>
>>>
>>> I was more thinking of moving VectorValues#search to
>>> LeafReader#searchNearestVectors or something along those lines. I agree
>>> that VectorReader should only be exposed on CodecReader.
>>>
>>>
>>>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>>>> floating point values. Using BinaryDocValues for this will always
>>>> require an additional decoding step. I can see that the naming is
>>>> confusing there. The intent is that you index the vector values, but
>>>> no additional indexing data structure.
>>>
>>>
>>> I wonder if things would be simpler if we were more opinionated and made
>>> vectors specifically about nearest-neighbor search. Then we have a
>>> clearer message, use vectors for NN search and doc values otherwise. As far
>>> as I know, reinterpreting bytes as floats shouldn't add much overhead. The
>>> main problem I know of is that the JVM won't auto-vectorize if you read
>>> floats dynamically from a byte[], but this is something that should be
>>> alleviated by the JDK vector API?
>>>
>>>> Also: the reason HNSW is
>>>> mentioned in these SearchStrategy enums is to make room for other
>>>> vector indexing approaches, like LSH. There was a lot of discussion
>>>> that we wanted an API that allowed for experimenting with other
>>>> techniques for indexing and searching vector values.
>>>>
>>>
>>> Actually this is the thing that feels odd to me: if we end up with
>>> constants for both LSH and HNSW, then we are adding the requirement that
>>> all vector formats must implement both LSH and HNSW as they will need to
>>> support all SearchStrategy constants? Would it be possible to have a single
>>> API and then two implementations of VectorsFormat, LSHVectorsFormat on the
>>> one hand and HNSWVectorsFormat on the other hand?
>>>
>>>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>>>> but I think the situation is more akin to Points, where we have the
>>>> options on IndexableField. The metadata we store there (dimension and
>>>> score function) don't really result in different formats, ie code
>>>> paths for indexing and storage; they are more like parameters to the
>>>> format, in my mind. Perhaps the situation will look different when we
>>>> get our second vector indexing strategy (like LSH).
>>>
>>>
>>> Having the dimension count and the score function on the FieldType
>>> actually makes sense to me.
>>> I was more wondering whether maxConn
>>> and beamWidth actually belong to the FieldType, or if they should be made
>>> constructor arguments of Lucene90VectorFormat.
>>>
>>> --
>>> Adrien
>>>
>>
>>
>> --
>> Adrien
>>
>
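
For readers following the thread's point about the "additional decoding step" when vectors are stored as raw bytes, here is a minimal sketch of what reinterpreting a byte[] as floats could look like using only java.nio. It is not Lucene code: the class and method names (BytesToFloats, encode, decode, dot) are hypothetical, and the little-endian layout is an assumption made for illustration. It only shows that the decode is a single bulk copy, after which scoring works on a plain float[].

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; none of these names come from Lucene.
final class BytesToFloats {

    // Pack floats into bytes. Little-endian order is an assumption for this sketch.
    static byte[] encode(float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : values) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    // The "additional decoding step": one bulk copy from a float view of the bytes.
    static float[] decode(byte[] bytes) {
        if (bytes.length % Float.BYTES != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4 bytes");
        }
        float[] out = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asFloatBuffer()
                .get(out);
        return out;
    }

    // Dot product over decoded arrays: a simple loop over float[] operands.
    static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Decoding into a float[] once up front, rather than reading floats one at a time from the byte[] inside the scoring loop, keeps the hot loop over plain arrays; whether that matters in practice would need measuring.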