Hi all,

I have a follow-up question on this. Would it make sense to expose the quantized vector values as well? Currently, even when the vectors are quantized, calling vectorValue() returns the full-precision vectors, while the quantized vectors are only used by scorer(). Do we consider the quantized vectors private information that should not be exposed?
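To make the question concrete, here is a minimal, self-contained sketch in plain Java (not Lucene code) of what a caller could do if the quantized bytes were readable: score a full-precision query against them directly, without quantizing the query first. Everything in it is an assumption for illustration: the class and method names are hypothetical, and the per-vector min/max 7-bit scheme is a simplification rather than the scheme Lucene's scalar quantizer actually uses.

```java
final class AsymmetricRescoreSketch {

  /** Hypothetical holder for a 7-bit quantized vector; not a Lucene class. */
  static final class QuantizedVector {
    final byte[] values; // each entry is in [0, 127]
    final float min;     // float value that maps to 0
    final float step;    // float distance between adjacent quantized levels

    QuantizedVector(byte[] values, float min, float step) {
      this.values = values;
      this.min = min;
      this.step = step;
    }
  }

  /** Simplified 7-bit quantization using this vector's own min and max (an assumption). */
  static QuantizedVector quantize7Bit(float[] v) {
    float min = Float.POSITIVE_INFINITY;
    float max = Float.NEGATIVE_INFINITY;
    for (float x : v) {
      min = Math.min(min, x);
      max = Math.max(max, x);
    }
    // Guard against a constant vector, where max == min.
    float step = (max > min) ? (max - min) / 127f : 1f;
    byte[] q = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      q[i] = (byte) Math.round((v[i] - min) / step);
    }
    return new QuantizedVector(q, min, step);
  }

  /**
   * Dot product of a float query with a quantized document vector.
   * Each document component is approximated as min + step * q[i], so
   * sum(query[i] * (min + step * q[i])) = min * sum(query) + step * sum(query[i] * q[i]).
   * The query itself is never quantized.
   */
  static float asymmetricDot(float[] query, QuantizedVector doc) {
    float querySum = 0f;
    float weighted = 0f;
    for (int i = 0; i < query.length; i++) {
      querySum += query[i];
      weighted += query[i] * doc.values[i];
    }
    return doc.min * querySum + doc.step * weighted;
  }

  /** Full-precision dot product, for comparison. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] doc = {0.12f, -0.40f, 0.33f, 0.91f};
    float[] query = {0.50f, 0.10f, -0.20f, 0.70f};
    QuantizedVector quantized = quantize7Bit(doc);
    System.out.println("exact      = " + dot(query, doc));
    System.out.println("asymmetric = " + asymmetricDot(query, quantized));
  }
}
```

In this sketch the only per-query work is one pass over the query to compute its sum, which could be done once and reused for every candidate document.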

For context, I'm thinking about a way to run two-phase rescoring using the 32-bit query vector and 7-bit or 4-bit document vectors (the matching phase would use a more aggressive quantization). During the rescoring phase, if we use the quantized scorer(), the main cost is actually quantizing the query, not computing the dot product (since we only score a small number of docs). By doing asymmetric quantization (inspired by BBQ) in the rescoring phase, we would improve not only the recall but also the latency.

On Tue, Feb 11, 2025 at 11:50 PM Michael Sokolov <msoko...@gmail.com> wrote:

> Stored fields is a separate format that stores data in a row-wise
> fashion: all the stored data for a single document is written
> together. Vectors aren't *also* copied into stored fields storage, so
> the stored fields API can't be used to retrieve them. If we did allow
> that it would result in massive duplication for no purpose aside from
> making things look simpler. But do you think that it would be more
> convenient to use the stored fields API to retrieve the vectors? Does
> it hide the details of the leaf structure? Maybe there's an
> opportunity to create some convenience API for vectors, not sure.
>
> On Tue, Feb 11, 2025 at 8:45 AM Viliam Ďurina <viliam.dur...@gmail.com>
> wrote:
> >
> > Thanks Adrien!
> >
> > The code has one issue:
> >     if (iterator.advance(leafDocID) == docID)
> > should have been:
> >     if (iterator.advance(leafDocID) == leafDocID)
> >
> > After fixing this, it works (for reference, I'm using Lucene 10.1). But I
> > still wonder why can't we retrieve vectors just as we retrieve any other
> > field. I was unable to figure the code out myself, this way it's pretty
> > complicated. Is there any reason the vectors are not available through
> > `storedFields()`?
> >
> > Viliam
> >
> > On Mon, Feb 10, 2025 at 9:21 PM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > > Hi Viliam,
> > >
> > > Your logic is mostly correct, here is a version that should be a bit
> > > simpler and correct (but beware, untested):
> > >
> > > IndexReader reader; // your multi-reader
> > > int docID; // top-level doc ID
> > > int readerID = ReaderUtil.subIndex(docID, reader.leaves());
> > > LeafReaderContext leafContext = reader.leaves().get(readerID);
> > > int leafDocID = docID - leafContext.docBase;
> > > FloatVectorValues values =
> > >     leafContext.reader().getFloatVectorValues("my_vector_field");
> > > DocIndexIterator iterator = values.iterator();
> > > float[] vector;
> > > if (iterator.advance(leafDocID) == docID) { // this doc ID has a vector
> > >   vector = values.vectorValue(iterator.index());
> > > } else {
> > >   vector = null;
> > > }
> > >
> > > On Mon, Feb 10, 2025 at 5:01 PM Viliam Ďurina <viliam.dur...@gmail.com>
> > > wrote:
> > >
> > > > Dear all,
> > > >
> > > > when indexing vector fields, Lucene doesn't allow specifying the vector
> > > > field as stored (it throws `IllegalStateException: Cannot store value of
> > > > type class [F`). When trying to retrieve the value using
> > > > `IndexReader.storedFields()`, the vector field isn't stored.
> > > >
> > > > However, Lucene 10 stores the vectors in `.vec` files. I was able to
> > > > retrieve them using this complicated code, for which I had to make the
> > > > `readerIndex` and `readerBase` methods in `BaseCompositeReader` public
> > > > (they are protected):
> > > >
> > > > int docId = ...; // the docId to retrieve, e.g. coming out of a search
> > > > IndexReader node = reader.getContext().reader();
> > > > while (node instanceof BaseCompositeReader) {
> > > >   int index = ((BaseCompositeReader) node).readerIndex(docId);
> > > >   int base = ((BaseCompositeReader) node).readerBase(index);
> > > >   docId -= base;
> > > >   node = ((BaseCompositeReader)
> > > >       node).getContext().children().get(index).reader();
> > > > }
> > > > assert node instanceof LeafReader;
> > > > assert node.leaves().size() == 1;
> > > > FloatVectorValues vectorValues =
> > > >     node.leaves().getFirst().reader().getFloatVectorValues("myVectorField");
> > > > float[] vector = vectorValues.vectorValue(docId);
> > > >
> > > > My reader is a `MultiReader`, composed of multiple `DirectoryReader`s.
> > > >
> > > > Is there any public API to retrieve the vector values? If not, is there any
> > > > particular reason to not make the vectors available, if Lucene stores them
> > > > anyway? Even if the vectors are quantized, original raw vectors are stored,
> > > > though they are never used.
> > > >
> > > > Thanks,
> > > > Viliam
> > > >
> > >
> > >
> > > --
> > > Adrien