Hi,
Couldn't this be done better with FunctionQuery and suitable ValueSources?
Please also take a look at Lucene Expressions.
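
For illustration only, here is a minimal plain-Python sketch of the value that such a function or expression would have to produce for each document: the sum, over the non-zero query terms, of the query weight times the document's stored value. It makes no Lucene calls. The helper names `expression_for` and `dot_product`, the dict layout, and the `term_<i>` field naming (taken from the field names in the quoted example) are assumptions, and whether the generated string compiles as-is in the Lucene expressions module has not been verified here.

```python
def expression_for(query_weights):
    """Build an arithmetic expression string over per-term fields.

    query_weights maps a term index to its query weight (hypothetical layout).
    Zero weights are skipped because they cannot change the sum.
    """
    parts = [
        f"{weight!r} * term_{term}"
        for term, weight in sorted(query_weights.items())
        if weight != 0.0
    ]
    return " + ".join(parts) if parts else "0"


def dot_product(query_weights, doc_values):
    """Reference score: sparse dot product of the query and one document.

    doc_values maps a field name such as 'term_3' to the stored float;
    a field missing from the document contributes 0.
    """
    return sum(
        weight * doc_values.get(f"term_{term}", 0.0)
        for term, weight in query_weights.items()
    )


query = {3: 0.5, 17: 2.0}
document = {"term_3": 4.0, "term_17": 0.25, "term_9": 7.0}
print(expression_for(query))
print(dot_product(query, document))
```

A single function of this shape evaluates the whole sum in one place per document, instead of relying on one SHOULD clause per term.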

On Sat, Nov 30, 2024 at 9:00 PM Viacheslav Dobrynin <w.v.d...@gmail.com>
wrote:

> Hello!
>
> I have implemented a custom scoring mechanism that works like a dot product.
> I would like to ask how accurate and effective my implementation is. Could
> you give me recommendations on how to improve it?
>
> Here are a couple of examples that I want to use this mechanism with.
> Example 1:
> A document is encoded into a sparse vector, where the terms are the
> positions in this vector. The score between a query and a document is
> computed as the dot product of their vectors.
> To do this, I am building the following documents using PyLucene:
> doc = Document()
> doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
> doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, term].item()))
> doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, term].item()))
> doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, term].item()))
>
> To implement the described search mechanism, I implemented the following
> Query:
>
> public class FieldValueAsScoreQuery extends Query {
>
>     private final String field;
>     private final float queryTermValue;
>
>     public FieldValueAsScoreQuery(String field, float queryTermValue) {
>         this.field = Objects.requireNonNull(field);
>         if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
>             throw new IllegalArgumentException("Query term value must be finite and non-NaN");
>         }
>         this.queryTermValue = queryTermValue;
>     }
>
>     @Override
>     public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
>         return new Weight(this) {
>             @Override
>             public boolean isCacheable(LeafReaderContext ctx) {
>                 return DocValues.isCacheable(ctx, field);
>             }
>
>             @Override
>             public Explanation explain(LeafReaderContext context, int doc) {
>                 throw new UnsupportedOperationException();
>             }
>
>             @Override
>             public Scorer scorer(LeafReaderContext context) throws IOException {
>                 return new Scorer(this) {
>
>                     private final NumericDocValues iterator = context.reader().getNumericDocValues(field);
>
>                     @Override
>                     public float score() throws IOException {
>                         final int docId = docID();
>                         assert docId != DocIdSetIterator.NO_MORE_DOCS;
>                         // iterator() returns this same NumericDocValues, so it is already
>                         // positioned on docId; a side-effecting assert is not needed here.
>                         return Float.intBitsToFloat((int) iterator.longValue()) * queryTermValue * boost;
>                     }
>
>                     @Override
>                     public int docID() {
>                         return iterator.docID();
>                     }
>
>                     @Override
>                     public DocIdSetIterator iterator() {
>                         return iterator == null ? DocIdSetIterator.empty() : iterator;
>                     }
>
>                     @Override
>                     public float getMaxScore(int upTo) {
>                         return Float.MAX_VALUE;
>                     }
>                 };
>             }
>         };
>     }
>
>     @Override
>     public String toString(String field) {
>         StringBuilder builder = new StringBuilder();
>         builder.append("FieldValueAsScoreQuery [field=");
>         builder.append(this.field);
>         builder.append(", queryTermValue=");
>         builder.append(this.queryTermValue);
>         builder.append("]");
>         return builder.toString();
>     }
>
>     @Override
>     public void visit(QueryVisitor visitor) {
>         if (visitor.acceptField(field)) {
>             visitor.visitLeaf(this);
>         }
>     }
>
>     @Override
>     public boolean equals(Object other) {
>         return sameClassAs(other) && equalsTo(getClass().cast(other));
>     }
>
>     private boolean equalsTo(FieldValueAsScoreQuery other) {
>         return field.equals(other.field)
>                 && Float.floatToIntBits(queryTermValue) == Float.floatToIntBits(other.queryTermValue);
>     }
>
>     @Override
>     public int hashCode() {
>         final int prime = 31;
>         int hash = classHash();
>         hash = prime * hash + field.hashCode();
>         hash = prime * hash + Float.floatToIntBits(queryTermValue);
>         return hash;
>     }
> }
>
> And then I build boolean query as follows (using PyLucene):
>
> def build_query(query):
>     builder = BooleanQuery.Builder()
>     for term in torch.nonzero(query):
>         field_name = to_field_name(term.item())
>         value = query[term].item()
>         builder.add(FieldValueAsScoreQuery(field_name, value), BooleanClause.Occur.SHOULD)
>     return builder.build()
>
> It seems to work, but I'm not sure whether this is a good way to implement it.
> Example 2:
> I would also like to use this mechanism for the following index:
> term1 -> (doc_id1, score), (doc_idN, score), ...
> termN -> (doc_id1, score), (doc_idN, score), ...
> Where resulting score will be calculated as:
> sum(scores) by doc_id for terms in some query
>
> Thank you in advance!
>
> Best Regards,
> Viacheslav Dobrynin!
>


-- 
Sincerely yours
Mikhail Khludnev
