Hello! I have implemented a custom scoring mechanism that, in effect, computes a dot product. Could you tell me how correct and efficient my implementation is, and give me recommendations on how to improve it?
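To be concrete about the score itself, here is a minimal plain-Python sketch of the dot product I have in mind (the function and variable names here are mine, not from my index code):

```python
def dot_score(query_vec: dict, doc_vec: dict) -> float:
    """Dot product of two sparse vectors given as {term_id: weight} dicts."""
    # Only terms present in both vectors contribute to the score.
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# Example: term ids map to weights.
q = {0: 0.5, 3: 1.0}
d = {0: 2.0, 1: 4.0, 3: 0.5}
dot_score(q, d)  # 0.5*2.0 + 1.0*0.5 = 1.5
```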
Here are a couple of examples that I want to use this mechanism with.

Example 1: A document is encoded into a sparse vector, where the terms are the positions in this vector. The score between a query and a document is computed as the dot product of their vectors. To do this, I am building the following documents using PyLucene (one doc-values field per non-zero term):

    doc = Document()
    doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
    doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, 0].item()))
    doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, 1].item()))
    ...
    doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, N].item()))

To implement the described search mechanism, I implemented the following Query:

    import java.io.IOException;
    import java.util.Objects;

    import org.apache.lucene.index.DocValues;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.search.*;

    public class FieldValueAsScoreQuery extends Query {

        private final String field;
        private final float queryTermValue;

        public FieldValueAsScoreQuery(String field, float queryTermValue) {
            this.field = Objects.requireNonNull(field);
            if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
                throw new IllegalArgumentException("Query term value must be finite and non-NaN");
            }
            this.queryTermValue = queryTermValue;
        }

        @Override
        public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
            return new Weight(this) {

                @Override
                public boolean isCacheable(LeafReaderContext ctx) {
                    return DocValues.isCacheable(ctx, field);
                }

                @Override
                public Explanation explain(LeafReaderContext context, int doc) {
                    throw new UnsupportedOperationException();
                }

                @Override
                public Scorer scorer(LeafReaderContext context) throws IOException {
                    final NumericDocValues values = context.reader().getNumericDocValues(field);
                    if (values == null) {
                        // No document in this segment has a value for this field.
                        return null;
                    }
                    return new Scorer(this) {

                        @Override
                        public float score() throws IOException {
                            final int docId = docID();
                            assert docId != DocIdSetIterator.NO_MORE_DOCS;
                            // The doc-values iterator doubles as the scorer's iterator,
                            // so it is already positioned on docId here.
                            return Float.intBitsToFloat((int) values.longValue()) * queryTermValue * boost;
                        }

                        @Override
                        public int docID() {
                            return values.docID();
                        }

                        @Override
                        public DocIdSetIterator iterator() {
                            return values;
                        }

                        @Override
                        public float getMaxScore(int upTo) {
                            return Float.MAX_VALUE;
                        }
                    };
                }
            };
        }

        @Override
        public String toString(String field) {
            return "FieldValueAsScoreQuery [field=" + this.field
                    + ", queryTermValue=" + this.queryTermValue + "]";
        }

        @Override
        public void visit(QueryVisitor visitor) {
            if (visitor.acceptField(field)) {
                visitor.visitLeaf(this);
            }
        }

        @Override
        public boolean equals(Object other) {
            return sameClassAs(other) && equalsTo(getClass().cast(other));
        }

        private boolean equalsTo(FieldValueAsScoreQuery other) {
            return field.equals(other.field)
                    && Float.floatToIntBits(queryTermValue) == Float.floatToIntBits(other.queryTermValue);
        }

        @Override
        public int hashCode() {
            final int prime = 31;
            int hash = classHash();
            hash = prime * hash + field.hashCode();
            hash = prime * hash + Float.floatToIntBits(queryTermValue);
            return hash;
        }
    }

And then I build a boolean query as follows (again using PyLucene):

    def build_query(query):
        builder = BooleanQuery.Builder()
        for term in torch.nonzero(query):
            field_name = to_field_name(term.item())
            value = query[term].item()
            builder.add(FieldValueAsScoreQuery(field_name, value), BooleanClause.Occur.SHOULD)
        return builder.build()

It seems to work, but I am not sure whether it is a good way to implement this.

Example 2: I would also like to use this mechanism with the following index:

    term1 -> (doc_id1, score), ..., (doc_idN, score)
    ...
    termN -> (doc_id1, score), ..., (doc_idN, score)

where the resulting score of a document is calculated as the sum of its stored scores over the terms of the query:

    score(doc_id) = sum of score(term, doc_id) for each term in the query

Thank you in advance!

Best regards,
Viacheslav Dobrynin
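P.S. To make Example 2 concrete, here is a small plain-Python sketch (not PyLucene; the names are mine, not an existing API) of the score accumulation I have in mind:

```python
from collections import defaultdict

def score_query(index, query_terms):
    """Term-at-a-time scoring over an inverted index.

    index: {term: [(doc_id, score), ...]} postings lists
    query_terms: the terms of the query
    Returns {doc_id: summed score}.
    """
    acc = defaultdict(float)
    for term in query_terms:
        for doc_id, score in index.get(term, []):
            # Summing the stored scores per doc_id; multiplying in a query
            # weight here would recover the dot product of Example 1.
            acc[doc_id] += score
    return dict(acc)

index = {
    "term1": [("doc1", 0.5), ("doc2", 1.0)],
    "term2": [("doc1", 2.0)],
}
score_query(index, ["term1", "term2"])  # {"doc1": 2.5, "doc2": 1.0}
```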