Hello!
I have implemented a custom scoring mechanism that behaves like a dot product.
I would like to ask how accurate and effective my implementation is, and
whether you could recommend ways to improve it.
Here are two examples that I want to use this mechanism with.
Example 1:
A document is encoded as a sparse vector whose positions correspond to
terms. The score between a query and a document is computed as the dot
product of their vectors.
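To make the intended scoring rule concrete, here is a minimal sketch in plain Python, independent of Lucene. The function name sparse_dot and the dict-based representation ({term_index: weight}) are assumptions made for illustration only; they are not part of my actual implementation.

```python
def sparse_dot(query: dict[int, float], doc: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {term_index: weight}."""
    # Iterate over the smaller vector; terms missing from the other one
    # contribute nothing to the sum.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)


query = {0: 0.5, 3: 2.0}
doc = {0: 4.0, 1: 1.0, 3: 0.25}
print(sparse_dot(query, doc))  # 0.5 * 4.0 + 2.0 * 0.25 = 2.5
```

Only terms that are non-zero in both the query and the document contribute, which is the behaviour I expect from the Lucene query below.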
To do this, I am building the following documents using PyLucene:
doc = Document()
doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
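# One FloatDocValuesField per term; in each line, `term` is the index of
# that field's term in the embedding.
doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, term].item()))
doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, term].item()))
# ...
doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, term].item()))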
To implement this search mechanism, I wrote the following Query:
import java.io.IOException;
import java.util.Objects;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

public class FieldValueAsScoreQuery extends Query {

    private final String field;
    private final float queryTermValue;

    public FieldValueAsScoreQuery(String field, float queryTermValue) {
        this.field = Objects.requireNonNull(field);
        if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
            throw new IllegalArgumentException(
                "Query term value must be finite and non-NaN");
        }
        this.queryTermValue = queryTermValue;
    }

    @Override
    public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
        return new Weight(this) {
            @Override
            public boolean isCacheable(LeafReaderContext ctx) {
                return DocValues.isCacheable(ctx, field);
            }

            @Override
            public Explanation explain(LeafReaderContext context, int doc) {
                throw new UnsupportedOperationException();
            }

            @Override
            public Scorer scorer(LeafReaderContext context) throws IOException {
                final NumericDocValues values = context.reader().getNumericDocValues(field);
                if (values == null) {
                    // The field does not exist in this segment: nothing to score.
                    return null;
                }
                return new Scorer(this) {
                    @Override
                    public float score() throws IOException {
                        // The doc values are also the iterator, so they are already
                        // positioned on docID(). No advanceExact() call is needed, and
                        // an assert must not carry side effects anyway.
                        assert values.docID() != DocIdSetIterator.NO_MORE_DOCS;
                        // Decode the float stored by FloatDocValuesField.
                        return Float.intBitsToFloat((int) values.longValue())
                            * queryTermValue * boost;
                    }

                    @Override
                    public int docID() {
                        return values.docID();
                    }

                    @Override
                    public DocIdSetIterator iterator() {
                        return values;
                    }

                    @Override
                    public float getMaxScore(int upTo) {
                        return Float.MAX_VALUE;
                    }
                };
            }
        };
    }

    @Override
    public String toString(String field) {
        StringBuilder builder = new StringBuilder();
        builder.append("FieldValueAsScoreQuery [field=");
        builder.append(this.field);
        builder.append(", queryTermValue=");
        builder.append(this.queryTermValue);
        builder.append("]");
        return builder.toString();
    }

    @Override
    public void visit(QueryVisitor visitor) {
        if (visitor.acceptField(field)) {
            visitor.visitLeaf(this);
        }
    }

    @Override
    public boolean equals(Object other) {
        return sameClassAs(other) && equalsTo(getClass().cast(other));
    }

    private boolean equalsTo(FieldValueAsScoreQuery other) {
        return field.equals(other.field)
            && Float.floatToIntBits(queryTermValue)
                == Float.floatToIntBits(other.queryTermValue);
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int hash = classHash();
        hash = prime * hash + field.hashCode();
        hash = prime * hash + Float.floatToIntBits(queryTermValue);
        return hash;
    }
}
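As far as I understand, FloatDocValuesField stores the raw int bits of the float (as Float.floatToRawIntBits does) in a numeric doc value, which is why score() decodes the value with Float.intBitsToFloat. The following sketch reproduces that round trip in Python with the struct module; the helper names float_to_int_bits and int_bits_to_float are hypothetical and only mirror the Java methods for illustration.

```python
import struct


def float_to_int_bits(value: float) -> int:
    """Reinterpret a 32-bit float as a signed 32-bit integer (like Float.floatToRawIntBits)."""
    return struct.unpack(">i", struct.pack(">f", value))[0]


def int_bits_to_float(bits: int) -> float:
    """Reinterpret a signed 32-bit integer as a 32-bit float (like Float.intBitsToFloat)."""
    return struct.unpack(">f", struct.pack(">i", bits))[0]


stored = float_to_int_bits(0.5)    # what ends up in the doc value
decoded = int_bits_to_float(stored)  # what score() reads back
print(hex(stored), decoded)
```

Reading the stored long directly as a number, without this decoding step, would give meaningless scores.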
I then build a Boolean query as follows (using PyLucene):
def build_query(query):
    builder = BooleanQuery.Builder()
    # Add one SHOULD clause per non-zero query term.
    for term in torch.nonzero(query):
        field_name = to_field_name(term.item())
        value = query[term].item()
        builder.add(FieldValueAsScoreQuery(field_name, value),
                    BooleanClause.Occur.SHOULD)
    return builder.build()
It seems to work, but I am not sure whether this is a good way to implement it.
Example 2:
I would also like to use this mechanism for the following index:
term1 -> (doc_id1, score), (doc_idN, score), ...
termN -> (doc_id1, score), (doc_idN, score), ...
The resulting score for each doc_id will be calculated as:
sum(scores) by doc_id for terms in some query
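To illustrate what I mean, here is a minimal sketch of that accumulation in plain Python, independent of Lucene. The in-memory index layout, the function name score_by_doc, and the sample numbers are all assumptions made up for this example.

```python
from collections import defaultdict


def score_by_doc(index, query_terms):
    """Sum the per-term scores of every document matched by the query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        # Terms that are absent from the index contribute nothing.
        for doc_id, score in index.get(term, ()):
            scores[doc_id] += score
    # Highest total score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical index: term -> list of (doc_id, score) pairs.
index = {
    "term1": [(1, 0.5), (7, 1.5)],
    "termN": [(1, 2.0), (3, 0.25)],
}
print(score_by_doc(index, ["term1", "termN"]))
# [(1, 2.5), (7, 1.5), (3, 0.25)]
```

A document that appears under several query terms (doc 1 here) receives the sum of its scores, while the others keep their single score.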
Thank you in advance!
Best Regards,
Viacheslav Dobrynin!