Hi,

Couldn't this be done better with FunctionQuery and the appropriate ValueSources? Please also take a look at Lucene Expressions.
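To make the Expressions idea a bit more concrete, here is a minimal sketch, not something tested against your index. It only builds the expression text for the sparse dot product from the non-zero query weights. The class name `SparseDotExpression` and the `term_<position>` field naming are assumptions for illustration, based on the field names in your example.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical helper: turns a sparse query vector into an expression string
// such as "(0.5 * term_3) + (1.25 * term_17)".
public class SparseDotExpression {

    public static String build(Map<Integer, Float> queryWeights) {
        StringJoiner sum = new StringJoiner(" + ");
        // TreeMap gives a stable, position-ordered expression.
        for (Map.Entry<Integer, Float> e : new TreeMap<>(queryWeights).entrySet()) {
            float w = e.getValue();
            if (Float.isNaN(w) || Float.isInfinite(w)) {
                throw new IllegalArgumentException("Query weight must be finite: " + w);
            }
            if (w == 0f) {
                continue; // zero weights contribute nothing to the dot product
            }
            // Plain decimal notation avoids exponent forms like 1.0E-5.
            String literal = new BigDecimal(Float.toString(w)).toPlainString();
            sum.add("(" + literal + " * term_" + e.getKey() + ")");
        }
        // An empty query vector scores every document as 0.
        return sum.length() == 0 ? "0" : sum.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(Map.of(3, 0.5f, 17, 1.25f)));
    }
}
```

The resulting string could then be compiled with `JavascriptCompiler.compile(...)`, bound to the doc-values fields through `SimpleBindings` and `DoubleValuesSource.fromFloatField(...)`, and wrapped in a `FunctionScoreQuery`. Please check the exact signatures against the Lucene version you use. Note also that wrapping a match-all query would score every document, whereas your BooleanQuery of SHOULD clauses only matches documents that have at least one of the fields.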
On Sat, Nov 30, 2024 at 9:00 PM Viacheslav Dobrynin <w.v.d...@gmail.com> wrote:

> Hello!
>
> I have implemented a custom scoring mechanism. It looks like a dot product.
> I would like to ask you how accurate and effective my implementation is,
> could you give me recommendations on how to improve it?
>
> Here are a couple of examples that I want to use this mechanism with.
> Example 1:
> A document is encoded into a sparse vector, where the terms are the
> positions in this vector. A score between a query and a document is located
> as a dot product between their vectors.
> To do this, I am building the following documents using PyLucene:
> doc = Document()
> doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
> doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, term].item()))
> doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, term].item()))
> doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, term].item()))
>
> To implement the described search mechanism, I implemented the following
> Query:
>
> public class FieldValueAsScoreQuery extends Query {
>
>     private final String field;
>     private final float queryTermValue;
>
>     public FieldValueAsScoreQuery(String field, float queryTermValue) {
>         this.field = Objects.requireNonNull(field);
>         if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
>             throw new IllegalArgumentException("Query term value must be finite and non-NaN");
>         }
>         this.queryTermValue = queryTermValue;
>     }
>
>     @Override
>     public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
>         return new Weight(this) {
>             @Override
>             public boolean isCacheable(LeafReaderContext ctx) {
>                 return DocValues.isCacheable(ctx, field);
>             }
>
>             @Override
>             public Explanation explain(LeafReaderContext context, int doc) {
>                 throw new UnsupportedOperationException();
>             }
>
>             @Override
>             public Scorer scorer(LeafReaderContext context) throws IOException {
>                 return new Scorer(this) {
>
>                     private final NumericDocValues iterator =
>                             context.reader().getNumericDocValues(field);
>
>                     @Override
>                     public float score() throws IOException {
>                         final int docId = docID();
>                         assert docId != DocIdSetIterator.NO_MORE_DOCS;
>                         assert iterator.advanceExact(docId);
>                         return Float.intBitsToFloat((int) iterator.longValue())
>                                 * queryTermValue * boost;
>                     }
>
>                     @Override
>                     public int docID() {
>                         return iterator.docID();
>                     }
>
>                     @Override
>                     public DocIdSetIterator iterator() {
>                         return iterator == null ? DocIdSetIterator.empty() : iterator;
>                     }
>
>                     @Override
>                     public float getMaxScore(int upTo) {
>                         return Float.MAX_VALUE;
>                     }
>                 };
>             }
>         };
>     }
>
>     @Override
>     public String toString(String field) {
>         StringBuilder builder = new StringBuilder();
>         builder.append("FieldValueAsScoreQuery [field=");
>         builder.append(this.field);
>         builder.append(", queryTermValue=");
>         builder.append(this.queryTermValue);
>         builder.append("]");
>         return builder.toString();
>     }
>
>     @Override
>     public void visit(QueryVisitor visitor) {
>         if (visitor.acceptField(field)) {
>             visitor.visitLeaf(this);
>         }
>     }
>
>     @Override
>     public boolean equals(Object other) {
>         return sameClassAs(other) && equalsTo(getClass().cast(other));
>     }
>
>     private boolean equalsTo(FieldValueAsScoreQuery other) {
>         return field.equals(other.field)
>                 && Float.floatToIntBits(queryTermValue) == Float.floatToIntBits(other.queryTermValue);
>     }
>
>     @Override
>     public int hashCode() {
>         final int prime = 31;
>         int hash = classHash();
>         hash = prime * hash + field.hashCode();
>         hash = prime * hash + Float.floatToIntBits(queryTermValue);
>         return hash;
>     }
> }
>
> And then I build boolean query as follows (using PyLucene):
>
> def build_query(query):
>     builder = BooleanQuery.Builder()
>     for term in torch.nonzero(query):
>         field_name = to_field_name(term.item())
>         value = query[term].item()
>         builder.add(FieldValueAsScoreQuery(field_name, value),
>                     BooleanClause.Occur.SHOULD)
>     return builder.build()
>
> it seems to work, but
> I'm not sure if it's a good way to implement it.
> Example 2:
> I would also like to use this mechanism for the following index:
> term1 -> (doc_id1, score), (doc_idN, score), ...
> termN -> (doc_id1, score), (doc_idN, score), ...
> Where resulting score will be calculated as:
> sum(scores) by doc_id for terms in some query
>
> Thank you in advance!
>
> Best Regards,
> Viacheslav Dobrynin!
>

--
Sincerely yours
Mikhail Khludnev