Configuring the codec based on the schema is something that Solr does via SchemaCodecFactory. https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
Would a similar approach work in your case?

On Tue, Mar 16, 2021 at 10:21 PM, Michael Sokolov <msoko...@gmail.com> wrote:

> > I was more thinking of moving VectorValues#search to
> > LeafReader#searchNearestVectors or something along those lines. I agree
> > that VectorReader should only be exposed on CodecReader.
>
> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to
> add such visible API changes early on in the project.
>
> > I wonder if things would be simpler if we were more opinionated and made
> > vectors specifically about nearest-neighbor search. Then we have a clearer
> > message, use vectors for NN search and doc values otherwise. As far as I
> > know, reinterpreting bytes as floats shouldn't add much overhead. The main
> > problem I know of is that the JVM won't auto-vectorize if you read floats
> > dynamically from a byte[], but this is something that should be alleviated
> > by the JDK vector API?
>
> > Actually this is the thing that feels odd to me: if we end up with
> > constants for both LSH and HNSW, then we are adding the requirement that
> > all vector formats must implement both LSH and HNSW as they will need to
> > support all SearchStrategy constants?
>
> Hmm I see I didn't think this all the way through ... I guess I had it
> in mind that there would probably only ever be a single format with
> internal variants for different vector index types, but as I have
> worked more with Lucene's index formats I see that is awkward, and I'm
> certainly open to restructuring it in a more natural way. Similarly
> for the NONE format - BinaryDocValues can be used for such
> (non-searchable) vectors. Indeed we had such an implementation and
> although we recently switched it to use the NONE format for
> uniformity, it could easily be switched back.
>
> Regarding the graph construction parameters (maxConn and beamWidth)
> I'm not sure what the right approach is exactly. We struggled to find
> the best API for this. I guess my concern about the PerField* approach
> is (at least as I think I understand it) it needs to be configured in
> code when creating a Codec. But we would like to be able to read such
> parameters from a schema configuration. I think of them as in the same
> spirit as an Analyzer. However I may not have fully appreciated the
> intention of, or how to make the best use of PerField formats. It is
> true we don't really need to write these parameters to the index;
> we're free to use different values when merging for example.
>
> -Mike
>
> On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > Hello Mike,
> >
> > On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >>
> >> I think the reason we have search() on VectorValues is that we have
> >> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> >> but no way to access the VectorReader. Do you think we should also
> >> have LeafReader.getVectorReader()? Today it's only on CodecReader.
> >
> > I was more thinking of moving VectorValues#search to
> > LeafReader#searchNearestVectors or something along those lines. I agree
> > that VectorReader should only be exposed on CodecReader.
> >
> >>
> >> Re: SearchStrategy.NONE; the idea is we support efficient access to
> >> floating point values. Using BinaryDocValues for this will always
> >> require an additional decoding step. I can see that the naming is
> >> confusing there. The intent is that you index the vector values, but
> >> no additional indexing data structure.
> >
> > I wonder if things would be simpler if we were more opinionated and made
> > vectors specifically about nearest-neighbor search. Then we have a clearer
> > message, use vectors for NN search and doc values otherwise. As far as I
> > know, reinterpreting bytes as floats shouldn't add much overhead. The main
> > problem I know of is that the JVM won't auto-vectorize if you read floats
> > dynamically from a byte[], but this is something that should be alleviated
> > by the JDK vector API?
> >
> >> Also: the reason HNSW is
> >> mentioned in these SearchStrategy enums is to make room for other
> >> vector indexing approaches, like LSH. There was a lot of discussion
> >> that we wanted an API that allowed for experimenting with other
> >> techniques for indexing and searching vector values.
> >
> > Actually this is the thing that feels odd to me: if we end up with
> > constants for both LSH and HNSW, then we are adding the requirement that
> > all vector formats must implement both LSH and HNSW as they will need to
> > support all SearchStrategy constants? Would it be possible to have a single
> > API and then two implementations of VectorsFormat, LSHVectorsFormat on the
> > one hand and HNSWVectorsFormat on the other hand?
> >
> >> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
> >> but I think the situation is more akin to Points, where we have the
> >> options on IndexableField. The metadata we store there (dimension and
> >> score function) don't really result in different formats, ie code
> >> paths for indexing and storage; they are more like parameters to the
> >> format, in my mind. Perhaps the situation will look different when we
> >> get our second vector indexing strategy (like LSH).
> >
> > Having the dimension count and the score function on the FieldType
> > actually makes sense to me. I was more wondering whether maxConn and
> > beamWidth actually belong to the FieldType, or if they should be made
> > constructor arguments of Lucene90VectorFormat.
> >
> > --
> > Adrien
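
For readers following the BinaryDocValues point in the thread: a minimal sketch, using only the JDK, of what the "additional decoding step" from a byte[] to a float[] could look like. The class name VectorBytes, its method names, and the little-endian 4-bytes-per-float layout are assumptions made for illustration; none of this is Lucene's actual encoding or API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of Lucene: stores a float vector in a byte[]
// and reinterprets it as floats again before scoring.
final class VectorBytes {

    // Assumed layout: little-endian, Float.BYTES (4) bytes per component.
    static byte[] encode(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        buffer.asFloatBuffer().put(vector); // the view writes into the backing bytes
        return buffer.array();
    }

    // The decoding step: bulk-copy the bytes into a float[] through a FloatBuffer view.
    static float[] decode(byte[] bytes) {
        float[] vector = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector);
        return vector;
    }

    // Plain scalar loop over float[]; scoring runs on the decoded array.
    static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] original = {1.0f, 2.0f, 0.5f};
        float[] decoded = decode(encode(original));
        System.out.println(dotProduct(decoded, original)); // prints 5.25
    }
}
```

Decoding the whole vector once and then scoring on the float[] keeps the scoring loop free of per-element byte reads, which is the pattern the auto-vectorization remark in the thread is concerned with. Whether that copy is cheap enough in practice is not something this sketch measures.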