Hello!

I have implemented a custom scoring mechanism; it is essentially a dot product.
Could you tell me how correct and efficient my implementation is, and give me
recommendations on how to improve it?

Here are a couple of examples that I want to use this mechanism with.
Example 1:
A document is encoded into a sparse vector whose positions correspond to terms.
The score between a query and a document is computed as the dot product of
their vectors.
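
For illustration, this is the score I have in mind (a toy example with dense
tensors just to keep it short; only positions that are non-zero in both vectors
contribute):

import torch

# Toy query/document vectors over a 5-term vocabulary (illustration only).
query_vec = torch.tensor([0.0, 1.5, 0.0, 0.5, 0.0])
doc_vec   = torch.tensor([0.25, 0.5, 0.0, 1.0, 0.0])

score = torch.dot(query_vec, doc_vec)  # = 1.5*0.5 + 0.5*1.0 = 1.25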

To do this, I am building documents like the following using PyLucene (one
FloatDocValuesField per non-zero term; to_field_name maps a term index to a
field name such as "term_0", ..., "term_N"):

doc = Document()
doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
for term in torch.nonzero(emb_batch[batch_idx]):
    doc.add(FloatDocValuesField(to_field_name(term.item()),
                                emb_batch[batch_idx, term].item()))
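
Each such document is then added with a regular IndexWriter (sketch only; the
writer construction with FSDirectory and IndexWriterConfig is omitted):

writer.addDocument(doc)
writer.commit()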

To implement the described scoring, I wrote the following custom Lucene Query:

public class FieldValueAsScoreQuery extends Query {

    private final String field;
    private final float queryTermValue;

    public FieldValueAsScoreQuery(String field, float queryTermValue) {
        this.field = Objects.requireNonNull(field);
        if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
            throw new IllegalArgumentException("Query term value must be finite and non-NaN");
        }
        this.queryTermValue = queryTermValue;
    }

    @Override
    public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
        return new Weight(this) {
            @Override
            public boolean isCacheable(LeafReaderContext ctx) {
                return DocValues.isCacheable(ctx, field);
            }

            @Override
            public Explanation explain(LeafReaderContext context, int doc) {
                throw new UnsupportedOperationException();
            }

            @Override
            public Scorer scorer(LeafReaderContext context) throws IOException {
                return new Scorer(this) {

                    // Per-segment doc values holding this field's weight for each document.
                    private final NumericDocValues iterator =
                            context.reader().getNumericDocValues(field);

                    @Override
                    public float score() throws IOException {
                        final int docId = docID();
                        assert docId != DocIdSetIterator.NO_MORE_DOCS;
                        assert iterator.advanceExact(docId);
                        // The doc value holds the raw float bits of the stored term weight.
                        return Float.intBitsToFloat((int) iterator.longValue())
                                * queryTermValue * boost;
                    }

                    @Override
                    public int docID() {
                        return iterator.docID();
                    }

                    @Override
                    public DocIdSetIterator iterator() {
                        return iterator == null ? DocIdSetIterator.empty() : iterator;
                    }

                    @Override
                    public float getMaxScore(int upTo) {
                        return Float.MAX_VALUE;
                    }
                };
            }
        };
    }

    @Override
    public String toString(String field) {
        StringBuilder builder = new StringBuilder();
        builder.append("FieldValueAsScoreQuery [field=");
        builder.append(this.field);
        builder.append(", queryTermValue=");
        builder.append(this.queryTermValue);
        builder.append("]");
        return builder.toString();
    }

    @Override
    public void visit(QueryVisitor visitor) {
        if (visitor.acceptField(field)) {
            visitor.visitLeaf(this);
        }
    }

    @Override
    public boolean equals(Object other) {
        return sameClassAs(other) && equalsTo(getClass().cast(other));
    }

    private boolean equalsTo(FieldValueAsScoreQuery other) {
        return field.equals(other.field)
                && Float.floatToIntBits(queryTermValue)
                        == Float.floatToIntBits(other.queryTermValue);
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int hash = classHash();
        hash = prime * hash + field.hashCode();
        hash = prime * hash + Float.floatToIntBits(queryTermValue);
        return hash;
    }
}

And then I build a BooleanQuery as follows (using PyLucene):

def build_query(query):
    builder = BooleanQuery.Builder()
    for term in torch.nonzero(query):
        field_name = to_field_name(term.item())
        value = query[term].item()
        builder.add(FieldValueAsScoreQuery(field_name, value),
                    BooleanClause.Occur.SHOULD)
    return builder.build()
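
For reference, I run it roughly like this (simplified; the IndexSearcher setup
is omitted and the variable names are illustrative):

query = build_query(query_vector)      # query_vector: sparse torch tensor
top_docs = searcher.search(query, 10)  # searcher: IndexSearcher over the index above
for score_doc in top_docs.scoreDocs:
    stored = searcher.doc(score_doc.doc)
    print(stored.get("doc_id"), score_doc.score)  # score = dot product over shared terms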

It seems to work, but I'm not sure whether it's a good way to implement it.
Example 2:
I would also like to use this mechanism for the following index:
term1 -> (doc_id1, score), (doc_idN, score), ...
termN -> (doc_id1, score), (doc_idN, score), ...
where the resulting score of a document is calculated as the sum of its scores
over the terms of the query (i.e. sum(scores) grouped by doc_id).
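
In other words (toy numbers, plain Python just to show the aggregation I mean):

inverted_index = {
    "term_1": [("doc_1", 0.5), ("doc_2", 1.0)],
    "term_3": [("doc_1", 0.25), ("doc_5", 0.75)],
}

def score_query(terms, index):
    # sum(scores) by doc_id over the terms of the query
    scores = {}
    for term in terms:
        for doc_id, score in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + score
    return scores

print(score_query(["term_1", "term_3"], inverted_index))
# -> {'doc_1': 0.75, 'doc_2': 1.0, 'doc_5': 0.75}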

Thank you in advance!

Best Regards,
Viacheslav Dobrynin!
