Re: Questions about the new vector API

Adrien Grand Wed, 17 Mar 2021 00:10:37 -0700

Configuring the codec based on the schema is something that Solr does via
SchemaCodecFactory.
https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java


Would a similar approach work in your case?
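
To make the question concrete, here is a rough sketch of the idea in plain Java. It is not actual Lucene or Solr code: SketchVectorFormat and SchemaVectorFormats are hypothetical names made up for illustration. The sketch assumes that maxConn and beamWidth become constructor arguments of the format, and that a schema maps field names to those parameters. The fallback values are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a vector format that takes graph parameters in its constructor. */
final class SketchVectorFormat {
  final int maxConn;
  final int beamWidth;

  SketchVectorFormat(int maxConn, int beamWidth) {
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }
}

/** Hypothetical per-field resolver that is built from a schema instead of being hard-coded. */
final class SchemaVectorFormats {
  private final Map<String, SketchVectorFormat> formats = new HashMap<>();
  private final SketchVectorFormat fallback;

  /** schema maps a field name to its parameters, e.g. {"maxConn": 32, "beamWidth": 200}. */
  SchemaVectorFormats(Map<String, Map<String, Integer>> schema, SketchVectorFormat fallback) {
    this.fallback = fallback;
    for (Map.Entry<String, Map<String, Integer>> entry : schema.entrySet()) {
      Map<String, Integer> params = entry.getValue();
      // Parameters missing from the schema fall back to the default format's values.
      int maxConn = params.getOrDefault("maxConn", fallback.maxConn);
      int beamWidth = params.getOrDefault("beamWidth", fallback.beamWidth);
      formats.put(entry.getKey(), new SketchVectorFormat(maxConn, beamWidth));
    }
  }

  /** Same role as a PerField* lookup: choose the format for one field. */
  SketchVectorFormat getVectorFormatForField(String field) {
    return formats.getOrDefault(field, fallback);
  }

  public static void main(String[] args) {
    Map<String, Integer> params = new HashMap<>();
    params.put("maxConn", 32);
    Map<String, Map<String, Integer>> schema = new HashMap<>();
    schema.put("title_vector", params);

    SchemaVectorFormats resolver =
        new SchemaVectorFormats(schema, new SketchVectorFormat(16, 16));
    SketchVectorFormat format = resolver.getVectorFormatForField("title_vector");
    System.out.println(format.maxConn + " " + format.beamWidth);
  }
}
```

The point of the sketch is that the parameters live with the codec configuration and are resolved per field at write time, so nothing about them needs to be recorded on the FieldType or in the index.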

On Tue, Mar 16, 2021 at 22:21, Michael Sokolov <msoko...@gmail.com> wrote:

> > I was more thinking of moving VectorValues#search to
> > LeafReader#searchNearestVectors or something along those lines. I agree
> > that VectorReader should only be exposed on CodecReader.
>
> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to
> add such visible API changes early on in the project.
>
> > I wonder if things would be simpler if we were more opinionated and made
> > vectors specifically about nearest-neighbor search. Then we have a clearer
> > message, use vectors for NN search and doc values otherwise. As far as I
> > know, reinterpreting bytes as floats shouldn't add much overhead. The main
> > problem I know of is that the JVM won't auto-vectorize if you read floats
> > dynamically from a byte[], but this is something that should be alleviated
> > by the JDK vector API?
>
> > Actually this is the thing that feels odd to me: if we end up with
> > constants for both LSH and HNSW, then we are adding the requirement that
> > all vector formats must implement both LSH and HNSW as they will need to
> > support all SearchStrategy constants?
>
> Hmm I see I didn't think this all the way through ... I guess I had it
> in mind that there would probably only ever be a single format with
> internal variants for different vector index types, but as I have
> worked more with Lucene's index formats I see that is awkward, and I'm
> certainly open to restructuring it in a more natural way. Similarly
> for the NONE format - BinaryDocValues can be used for such
> (non-searchable) vectors. Indeed we had such an implementation and
> although we recently switched it to use the NONE format for
> uniformity, it could easily be switched back.
>
> Regarding the graph construction parameters (maxConn and beamWidth)
> I'm not sure what the right approach is exactly. We struggled to find
> the best API for this. I guess my concern about the PerField* approach
> is (at least as I think I understand it) it needs to be configured in
> code when creating a Codec. But we would like to be able to read such
> parameters from a schema configuration. I think of them as in the same
> spirit as an Analyzer. However I may not have fully appreciated the
> intention of, or how to make the best use of PerField formats. It is
> true we don't really need to write these parameters to the index;
> we're free to use different values when merging for example.
>
> -Mike
>
> On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > Hello Mike,
> >
> > On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msoko...@gmail.com>
> > wrote:
> >>
> >> I think the reason we have search() on VectorValues is that we have
> >> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> >> but no way to access the VectorReader. Do you think we should also
> >> have LeafReader.getVectorReader()? Today it's only on CodecReader.
> >
> >
> > I was more thinking of moving VectorValues#search to
> > LeafReader#searchNearestVectors or something along those lines. I agree
> > that VectorReader should only be exposed on CodecReader.
> >
> >>
> >> Re: SearchStrategy.NONE; the idea is we support efficient access to
> >> floating point values. Using BinaryDocValues for this will always
> >> require an additional decoding step. I can see that the naming is
> >> confusing there. The intent is that you index the vector values, but
> >> no additional indexing data structure.
> >
> >
> > I wonder if things would be simpler if we were more opinionated and made
> > vectors specifically about nearest-neighbor search. Then we have a clearer
> > message, use vectors for NN search and doc values otherwise. As far as I
> > know, reinterpreting bytes as floats shouldn't add much overhead. The main
> > problem I know of is that the JVM won't auto-vectorize if you read floats
> > dynamically from a byte[], but this is something that should be alleviated
> > by the JDK vector API?
> >
> >> Also: the reason HNSW is
> >> mentioned in these SearchStrategy enums is to make room for other
> >> vector indexing approaches, like LSH. There was a lot of discussion
> >> that we wanted an API that allowed for experimenting with other
> >> techniques for indexing and searching vector values.
> >
> >
> > Actually this is the thing that feels odd to me: if we end up with
> > constants for both LSH and HNSW, then we are adding the requirement that
> > all vector formats must implement both LSH and HNSW as they will need to
> > support all SearchStrategy constants? Would it be possible to have a single
> > API and then two implementations of VectorsFormat, LSHVectorsFormat on the
> > one hand and HNSWVectorsFormat on the other hand?
> >
> >> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
> >> but I think the situation is more akin to Points, where we have the
> >> options on IndexableField. The metadata we store there (dimension and
> >> score function) don't really result in different formats, i.e. code
> >> paths for indexing and storage; they are more like parameters to the
> >> format, in my mind. Perhaps the situation will look different when we
> >> get our second vector indexing strategy (like LSH).
> >
> >
> > Having the dimension count and the score function on the FieldType
> > actually makes sense to me. I was more wondering whether maxConn and
> > beamWidth actually belong to the FieldType, or if they should be made
> > constructor arguments of Lucene90VectorFormat.
> >
> > --
> > Adrien
>

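On the point quoted above about reinterpreting bytes as floats, here is a minimal sketch of the extra decoding step needed when a vector is stored as a byte[] (as it would be in a BinaryDocValues field). FloatBytes is a hypothetical helper, not a Lucene class, and little-endian byte order is an arbitrary assumption that the thread does not specify.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: stores a float vector as raw bytes and decodes it again. */
final class FloatBytes {

  /** Encodes each float as 4 bytes (little-endian is an assumption of this sketch). */
  static byte[] encode(float[] vector) {
    ByteBuffer buffer =
        ByteBuffer.allocate(vector.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    buffer.asFloatBuffer().put(vector);
    return buffer.array();
  }

  /** The additional decoding step: copy the bytes back out into a float[]. */
  static float[] decode(byte[] bytes) {
    float[] vector = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector);
    return vector;
  }

  /** Plain scalar dot product over already-decoded floats. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] stored = decode(encode(new float[] {1.5f, -2f, 0.25f}));
    System.out.println(dot(stored, stored));
  }
}
```

Decoding into a float[] once and then running the arithmetic loop over that array keeps the loop in a shape the JIT can work with. Reading each float straight out of the byte[] inside the loop is the case described above as not being auto-vectorized.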