Hi,
Thanks for the answer!
I think this is similar to my initial implementation, where I built the
query as follows (PyLucene):
def build_query(query):
    builder = BooleanQuery.Builder()
    for term in torch.nonzero(query):
        field_name = to_field_name(term.item())
        value = query[term].item()
        builder.add(FieldValueAsScoreQuery(field_name, value),
                    BooleanClause.Occur.SHOULD)
    return builder.build()
For the score, I used the value stored in the FloatDocValuesField, as
follows:
@Override
public Scorer get(long leadCost) throws IOException {
    return new Scorer() {
        private final NumericDocValues iterator =
            context.reader().getNumericDocValues(field);

        @Override
        public float score() throws IOException {
            final int docId = docID();
            assert docId != DocIdSetIterator.NO_MORE_DOCS;
            // Note: advanceExact must not be called inside an assert,
            // because asserts (and their side effects) are stripped when
            // assertions are disabled.
            boolean advanced = iterator.advanceExact(docId);
            assert advanced;
            return Float.intBitsToFloat((int) iterator.longValue()) *
                queryTermValue * boost;
        }

        @Override
        public int docID() {
            return iterator.docID();
        }

        @Override
        public DocIdSetIterator iterator() {
            return iterator == null ? DocIdSetIterator.empty() : iterator;
        }

        @Override
        public float getMaxScore(int upTo) {
            return Float.MAX_VALUE;
        }
    };
}
Overall it worked pretty well, thanks for confirming the idea.
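For readers following along, here is a Lucene-free toy model of the two snippets above: each document stores a float per non-zero dimension (standing in for FloatDocValuesField), and a query sums stored_value * query_value * boost over its non-zero dimensions, which is the dot product the BooleanQuery of FieldValueAsScoreQuery clauses computes. The data and the `to_field_name` stand-in are illustrative assumptions, not the author's actual code.

```python
# Toy model (no Lucene): per-dimension fields holding a float doc value,
# scored as the sum of stored_value * query_value over matching clauses.
docs = {
    0: {"f1": 0.5, "f3": 0.2},  # doc 0: dimensions 1 and 3 are non-zero
    1: {"f1": 0.1, "f2": 0.9},  # doc 1: dimensions 1 and 2 are non-zero
}

def to_field_name(dim):
    # Illustrative stand-in for the to_field_name helper in the snippet.
    return f"f{dim}"

def score(doc_id, query, boost=1.0):
    # query: {dimension_index: value}, the non-zero entries of the
    # sparse query vector (what torch.nonzero iterates over).
    doc = docs[doc_id]
    total = 0.0
    for dim, q_value in query.items():
        field = to_field_name(dim)
        if field in doc:  # a matching SHOULD clause contributes its score
            total += doc[field] * q_value * boost
    return total

print(score(0, {1: 1.0, 2: 1.0}))  # 0.5 (only dim 1 matches)
```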
On Mon, Dec 2, 2024 at 22:42 Michael Sokolov <[email protected]> wrote:
> Another way is using postings - you can represent each dimension as a
> term (`dim0`, `dim1`, etc) and index those that occur in a document.
> To encode a value for a dimension you can either provide a custom term
> frequency, or index the term multiple times. Then when searching you
> can form a BooleanQuery from the terms in the sparse search vector and
> use a simple similarity that sums the term frequencies for ranking. As
> long as the number of non-zero dimensions in the query is low, this
> should be efficient.
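[Editorial note: a toy, Lucene-free sketch of the postings idea quoted above. Each non-zero dimension i becomes a term "dim<i>" whose term frequency encodes a quantized value, and scoring sums tf * query weight over matching terms, like a BooleanQuery of SHOULD clauses under a similarity that sums term frequencies. The quantization factor and names are assumptions for illustration.]

```python
# Toy postings-based scheme: dimension -> term "dim<i>", value -> term
# frequency (quantized to a positive int), score -> sum of tf * q_value.
from collections import defaultdict

postings = defaultdict(dict)  # term -> {doc_id: term_frequency}

def index_doc(doc_id, sparse_vec):
    # sparse_vec: {dimension_index: value}; quantize value to an int tf,
    # mimicking a custom term frequency (factor 10 is an arbitrary choice).
    for dim, value in sparse_vec.items():
        postings[f"dim{dim}"][doc_id] = max(1, round(value * 10))

def search(sparse_query):
    # Analogue of a BooleanQuery of SHOULD term clauses with a similarity
    # that sums term frequencies weighted by the query-side value.
    scores = defaultdict(float)
    for dim, q_value in sparse_query.items():
        for doc_id, tf in postings.get(f"dim{dim}", {}).items():
            scores[doc_id] += tf * q_value
    return sorted(scores.items(), key=lambda kv: -kv[1])

index_doc(0, {1: 0.5, 3: 0.2})
index_doc(1, {1: 0.1, 2: 0.9})
print(search({1: 1.0, 2: 1.0}))  # [(1, 10.0), (0, 5.0)]
```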
>
> On Mon, Dec 2, 2024 at 1:17 PM Viacheslav Dobrynin <[email protected]>
> wrote:
> >
> > Hi,
> >
> > Thanks for the reply.
> > I haven't tried to do that.
> > However, I do not fully understand how, in this case, an inverted index
> > would be constructed to allow efficient search by term (O(1) lookup per
> > term used as a key)?
> >
> >
> > On Mon, Dec 2, 2024 at 21:55 Patrick Zhai <[email protected]> wrote:
> >
> > > Hi, have you tried to encode the sparse vector yourself using a
> > > BinaryDocValuesField? One way I can think of is to encode it as (size,
> > > index_array, value_array) per doc.
> > > Intuitively, I feel like this should be more efficient than one
> > > dimension per field if your dimensionality is high enough.
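[Editorial note: a minimal sketch of the (size, index_array, value_array) encoding quoted above, in plain Python with struct. The byte layout — a little-endian int32 count, then int32 indices, then float32 values — is an illustrative assumption for what one might store per document in a binary doc values field.]

```python
# Encode a sparse vector as bytes: a 4-byte count, then the non-zero
# dimension indices as int32s, then their values as float32s.
import struct

def encode_sparse(indices, values):
    assert len(indices) == len(values)
    n = len(indices)
    return struct.pack(f"<i{n}i{n}f", n, *indices, *values)

def decode_sparse(data):
    (n,) = struct.unpack_from("<i", data, 0)
    indices = struct.unpack_from(f"<{n}i", data, 4)
    values = struct.unpack_from(f"<{n}f", data, 4 + 4 * n)
    return list(indices), list(values)

blob = encode_sparse([1, 5, 42], [0.5, 1.25, 3.0])
print(decode_sparse(blob))  # ([1, 5, 42], [0.5, 1.25, 3.0])
```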
> > >
> > > Patrick
> > >
> > > On Mon, Dec 2, 2024, 09:03 Viacheslav Dobrynin <[email protected]>
> wrote:
> > >
> > > > Hi!
> > > >
> > > > I need to index sparse vectors, whereas as I understand it,
> > > > KnnFloatVectorField is designed for dense vectors.
> > > > Therefore, it seems that this approach will not work.
> > > >
> > > > On Sun, Dec 1, 2024 at 18:36 Mikhail Khludnev <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > > Could it look like KnnFloatVectorField(... DOT_PRODUCT)
> > > > > combined with KnnFloatVectorQuery?
> > > > >
> > > >
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>