Hi,

Thanks for the answer!
I think this is similar to my initial implementation, where I built the
query as follows (PyLucene):

import torch
from org.apache.lucene.search import BooleanClause, BooleanQuery

def build_query(query):
    # One SHOULD clause per non-zero dimension of the sparse query vector.
    builder = BooleanQuery.Builder()
    for term in torch.nonzero(query):
        field_name = to_field_name(term.item())
        value = query[term].item()
        # FieldValueAsScoreQuery (custom) scores each matching doc by its
        # stored value for this field, scaled by the query-side value.
        builder.add(FieldValueAsScoreQuery(field_name, value),
                    BooleanClause.Occur.SHOULD)
    return builder.build()

And as the score, I used the value stored in the FloatDocValuesField, as
follows:

@Override
public Scorer get(long leadCost) throws IOException {
    // May be null if no document in this segment has a value for the field.
    final NumericDocValues values = context.reader().getNumericDocValues(field);
    final DocIdSetIterator disi =
            values == null ? DocIdSetIterator.empty() : values;
    return new Scorer() {

        @Override
        public float score() throws IOException {
            final int docId = docID();
            assert docId != DocIdSetIterator.NO_MORE_DOCS;
            // advanceExact must not be called inside an assert: with
            // assertions disabled the call would be skipped entirely.
            final boolean hasValue = values.advanceExact(docId);
            assert hasValue;
            // FloatDocValuesField stores the float's raw bits as a long.
            return Float.intBitsToFloat((int) values.longValue())
                    * queryTermValue * boost;
        }

        @Override
        public int docID() {
            return disi.docID();
        }

        @Override
        public DocIdSetIterator iterator() {
            return disi;
        }

        @Override
        public float getMaxScore(int upTo) {
            // No cheap upper bound, so opt out of max-score optimizations.
            return Float.MAX_VALUE;
        }
    };
}
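
For context, the indexing side pairs each non-zero dimension with a
FloatDocValuesField under the same field names that build_query targets. A
minimal sketch (toFieldName is a stand-in for the to_field_name helper above):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FloatDocValuesField;
import org.apache.lucene.index.IndexWriter;

// One FloatDocValuesField per non-zero dimension. FloatDocValuesField
// stores the float's raw bits in a NumericDocValues field, which is what
// the scorer above decodes with Float.intBitsToFloat.
static void addSparseVector(IndexWriter writer, float[] vector) throws IOException {
    Document doc = new Document();
    for (int dim = 0; dim < vector.length; dim++) {
        if (vector[dim] != 0f) {
            // toFieldName mirrors the Python to_field_name helper above.
            doc.add(new FloatDocValuesField(toFieldName(dim), vector[dim]));
        }
    }
    writer.addDocument(doc);
}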

Overall it worked pretty well, thanks for confirming the idea.

On Mon, Dec 2, 2024 at 22:42, Michael Sokolov <msoko...@gmail.com> wrote:

> Another way is using postings - you can represent each dimension as a
> term (`dim0`, `dim1`, etc) and index those that occur in a document.
> To encode a value for a dimension you can either provide a custom term
> frequency, or index the term multiple times. Then when searching you
> can form a BooleanQuery from the terms in the sparse search vector and
> use a simple similarity that sums the term frequencies for ranking. As
> long as the number of non-zero dimensions in the query is low, this
> should be efficient.
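
For the archives, here is a rough, untested sketch of that postings-based
variant (written against the Lucene 9.x Similarity API), assuming the
per-dimension values are quantized to integer term frequencies >= 1; the
quantization itself and the query-side weights map are left to the reader:

import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.Similarity;

final class SparsePostingsSketch {

    // Single-use stream: one token per non-zero dimension ("dim0", "dim1",
    // ...), with the quantized value attached as a custom term frequency.
    static final class SparseVectorTokenStream extends TokenStream {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final TermFrequencyAttribute tfAtt = addAttribute(TermFrequencyAttribute.class);
        private final Iterator<Map.Entry<Integer, Integer>> entries;

        SparseVectorTokenStream(Map<Integer, Integer> quantizedDims) {
            this.entries = quantizedDims.entrySet().iterator();
        }

        @Override
        public boolean incrementToken() {
            if (!entries.hasNext()) {
                return false;
            }
            clearAttributes();
            Map.Entry<Integer, Integer> e = entries.next();
            termAtt.append("dim" + e.getKey());
            tfAtt.setTermFrequency(e.getValue()); // must be >= 1
            return true;
        }
    }

    // Custom term frequencies require frequencies to be indexed and norms
    // to be omitted.
    static Field sparseField(Map<Integer, Integer> quantizedDims) {
        FieldType ft = new FieldType();
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        ft.setOmitNorms(true);
        ft.freeze();
        return new Field("sparse", new SparseVectorTokenStream(quantizedDims), ft);
    }

    // Disjunction over the query's non-zero dims; each clause carries the
    // query-side weight as a boost. Assumes non-negative weights, since
    // BoostQuery rejects negative boosts.
    static Query sparseQuery(Map<Integer, Float> queryDims) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (Map.Entry<Integer, Float> e : queryDims.entrySet()) {
            b.add(new BoostQuery(new TermQuery(new Term("sparse", "dim" + e.getKey())),
                            e.getValue()),
                    BooleanClause.Occur.SHOULD);
        }
        return b.build();
    }

    // A similarity that just returns the (boosted) term frequency; the
    // BooleanQuery then sums freq * queryWeight over matching dimensions.
    static void configure(IndexSearcher searcher) {
        searcher.setSimilarity(new Similarity() {
            @Override
            public long computeNorm(FieldInvertState state) {
                return 1; // unused: norms are omitted
            }

            @Override
            public SimScorer scorer(float boost, CollectionStatistics cs, TermStatistics... ts) {
                return new SimScorer() {
                    @Override
                    public float score(float freq, long norm) {
                        return boost * freq;
                    }
                };
            }
        });
    }
}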
>
> On Mon, Dec 2, 2024 at 1:17 PM Viacheslav Dobrynin <w.v.d...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > Thanks for the reply.
> > I haven't tried to do that.
> > However, I do not fully understand how, in this case, an inverted index
> > will be constructed for efficient search by terms (O(1) for each term as
> > a key)?
> >
> >
> > On Mon, Dec 2, 2024 at 21:55, Patrick Zhai <zhai7...@gmail.com> wrote:
> >
> > > Hi, have you tried to encode the sparse vector yourself using a
> > > BinaryDocValuesField? One way I can think of is to encode it as (size,
> > > index_array, value_array) per doc.
> > > Intuitively I feel like this should be more efficient than one
> > > dimension per field if your dimensionality is high enough.
> > >
> > > Patrick
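
For later readers, a minimal sketch of that encoding as I read the (size,
index_array, value_array) layout; the search side would decode with the
mirror-image ByteBuffer calls:

import java.nio.ByteBuffer;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.util.BytesRef;

// Pack a sparse vector as (size, index_array, value_array) into a single
// BinaryDocValuesField, per Patrick's suggestion.
static BinaryDocValuesField encodeSparse(String field, int[] dims, float[] values) {
    ByteBuffer buf = ByteBuffer.allocate(
            Integer.BYTES * (1 + dims.length) + Float.BYTES * values.length);
    buf.putInt(dims.length);                 // size
    for (int dim : dims) buf.putInt(dim);    // index_array (dimension indices)
    for (float v : values) buf.putFloat(v);  // value_array (parallel to dims)
    return new BinaryDocValuesField(field, new BytesRef(buf.array()));
}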
> > >
> > > On Mon, Dec 2, 2024, 09:03 Viacheslav Dobrynin <w.v.d...@gmail.com>
> > > wrote:
> > >
> > > > Hi!
> > > >
> > > > I need to index sparse vectors, whereas, as I understand it,
> > > > KnnFloatVectorField is designed for dense vectors.
> > > > Therefore, it seems that this approach will not work.
> > > >
> > > > On Sun, Dec 1, 2024 at 18:36, Mikhail Khludnev <m...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > > Could it look like KnnFloatVectorField(... DOT_PRODUCT)
> > > > > and KnnFloatVectorQuery?
> > > > >
> > > >
> > >
>
