Hi!
Thank you for your reply!
I tried your recommendations; example code implementing both queries is
below. The query with the expression runs slightly slower, which I
suspect is due to compilation overhead.
I have one more question: which field type is best suited for my terms?
I am currently using the following field: doc.add(FloatDocValuesField(j.
toFieldName(), value)). However, there may be a more suitable option.
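For context on what that field stores: FloatDocValuesField keeps the float's raw bits in a numeric doc-values long, which is why a reader has to decode with Float.intBitsToFloat (as the custom Scorer in my earlier message does). A minimal plain-Java round-trip sketch, no Lucene needed (the class name is mine):

```java
public class FloatBitsRoundTrip {
    // Encode the way FloatDocValuesField does: float -> raw int bits, widened to long.
    static long encode(float value) {
        return Float.floatToRawIntBits(value);
    }

    // Decode the way the Scorer does: narrow the long back to int bits -> float.
    static float decode(long storedBits) {
        return Float.intBitsToFloat((int) storedBits);
    }

    public static void main(String[] args) {
        float original = 0.375f;
        long stored = encode(original);
        System.out.println(decode(stored)); // prints 0.375
    }
}
```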
For example, there is the following description in the changelog for Lucene
10:
>
> - Lucene now supports sparse indexing on doc values via
> FieldType#setDocValuesSkipIndexType. The sparse index will record the
> minimum and maximum values per block of doc IDs. Used in conjunction with
> index sorting to cluster similar documents together, this allows for very
> space-efficient and CPU-efficient filtering.
>
However, I find it difficult to judge how applicable this is to my case.
Code examples:
fun searchWithFunctionQuery(queryEmb: FloatArray) {
    val reader = DirectoryReader.open(FSDirectory.open(indexPath))
    val searcher = IndexSearcher(reader)
    val productFunctions = mutableListOf<ValueSource>()
    for ((i, qi) in queryEmb.withIndex()) if (qi != 0f) {
        productFunctions += ProductFloatFunction(
            arrayOf(ConstValueSource(qi), FloatFieldSource(i.toFieldName()))
        )
    }
    val dotProductQuery = FunctionQuery(SumFloatFunction(productFunctions.toTypedArray()))
    val hits = searcher.search(dotProductQuery, 10).scoreDocs
    println("Hits: ${hits.contentToString()}")
    // Iterate through the results:
    val storedFields = searcher.storedFields()
    for (i in hits.indices) {
        val hitDoc = storedFields.document(hits[i].doc)
        println("Found doc: $hitDoc. Score: ${hits[i].score}")
    }
    reader.close()
}
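To make the intended semantics concrete, here is a plain-Java sketch (no Lucene; the class name and the term_i field-name convention are assumptions from my earlier message) of the score the FunctionQuery above should produce: a sparse dot product over the nonzero query components.

```java
import java.util.Map;

public class SparseDotProduct {
    // For each nonzero query component q[i], multiply by the document's
    // doc-value for field "term_i" (FloatFieldSource) and sum the products
    // (ProductFloatFunction terms combined by SumFloatFunction).
    static float score(float[] queryEmb, Map<String, Float> docValues) {
        float sum = 0f;
        for (int i = 0; i < queryEmb.length; i++) {
            if (queryEmb[i] == 0f) continue;      // zero components are skipped
            Float v = docValues.get("term_" + i); // missing field contributes nothing
            if (v != null) sum += queryEmb[i] * v;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] query = {0f, 2f, 0f, 0.5f};
        Map<String, Float> doc = Map.of("term_1", 3f, "term_3", 4f);
        System.out.println(score(query, doc)); // 2*3 + 0.5*4 -> prints 8.0
    }
}
```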
fun searchWithExpression(queryEmb: FloatArray) {
    val reader = DirectoryReader.open(FSDirectory.open(indexPath))
    val searcher = IndexSearcher(reader)
    searcher.similarity = TodoSimilarity()
    val expressionBuilder = StringBuilder()
    val bindings = SimpleBindings()
    for ((i, qi) in queryEmb.withIndex()) if (qi != 0f) {
        if (expressionBuilder.isNotBlank()) expressionBuilder.append(" + ")
        expressionBuilder.append(qi).append(" * ").append(i.toFieldName())
        bindings.add(i.toFieldName(), DoubleValuesSource.fromFloatField(i.toFieldName()))
    }
    val dotProductExpression = JavascriptCompiler.compile(expressionBuilder.toString())
    val dotProductQuery = FunctionScoreQuery(
        MatchAllDocsQuery(),
        dotProductExpression.getDoubleValuesSource(bindings)
    )
    val hits = searcher.search(dotProductQuery, 10).scoreDocs
    println("Hits: ${hits.contentToString()}")
    // Iterate through the results:
    val storedFields = searcher.storedFields()
    for (i in hits.indices) {
        val hitDoc = storedFields.document(hits[i].doc)
        println("Found doc: $hitDoc. Score: ${hits[i].score}")
    }
    reader.close()
}
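As a sanity check on what that loop hands to JavascriptCompiler, here is a standalone Java sketch of the same string construction (the class name is mine; field names again assume the term_i convention):

```java
public class DotProductExpression {
    // Builds one "qi * term_i" product per nonzero query component,
    // joined with " + ", mirroring the StringBuilder loop above.
    static String build(float[] queryEmb) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < queryEmb.length; i++) {
            if (queryEmb[i] == 0f) continue;
            if (sb.length() > 0) sb.append(" + ");
            sb.append(queryEmb[i]).append(" * ").append("term_").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(new float[]{0f, 2f, 0f, 0.5f}));
        // prints: 2.0 * term_1 + 0.5 * term_3
    }
}
```

One thing this makes visible: the expression length grows with the number of nonzero components, which is presumably where the compilation cost I mentioned comes from.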
Best Regards,
Viacheslav Dobrynin!
On Sat, Nov 30, 2024 at 22:11, Mikhail Khludnev <[email protected]> wrote:
> Hi,
> Can't it be better done with FunctionQuery and proper ValueSources? Please
> also check Lucene Expressions.
>
> On Sat, Nov 30, 2024 at 9:00 PM Viacheslav Dobrynin <[email protected]>
> wrote:
>
> > Hello!
> >
> > I have implemented a custom scoring mechanism. It works like a dot
> > product. I would like to ask how accurate and efficient my
> > implementation is; could you give me recommendations on how to
> > improve it?
> >
> > Here are a couple of examples that I want to use this mechanism with.
> > Example 1:
> > A document is encoded into a sparse vector, where the terms are the
> > positions in this vector. The score between a query and a document is
> > computed as the dot product of their vectors.
> > To do this, I build the following documents using PyLucene:
> > doc = Document()
> > doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
> > doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, term].item()))
> > doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, term].item()))
> > doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, term].item()))
> >
> > To implement the described search mechanism, I implemented the following
> > Query:
> >
> > public class FieldValueAsScoreQuery extends Query {
> >
> >     private final String field;
> >     private final float queryTermValue;
> >
> >     public FieldValueAsScoreQuery(String field, float queryTermValue) {
> >         this.field = Objects.requireNonNull(field);
> >         if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
> >             throw new IllegalArgumentException("Query term value must be finite and non-NaN");
> >         }
> >         this.queryTermValue = queryTermValue;
> >     }
> >
> >     @Override
> >     public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
> >         return new Weight(this) {
> >             @Override
> >             public boolean isCacheable(LeafReaderContext ctx) {
> >                 return DocValues.isCacheable(ctx, field);
> >             }
> >
> >             @Override
> >             public Explanation explain(LeafReaderContext context, int doc) {
> >                 throw new UnsupportedOperationException();
> >             }
> >
> >             @Override
> >             public Scorer scorer(LeafReaderContext context) throws IOException {
> >                 return new Scorer(this) {
> >
> >                     private final NumericDocValues iterator =
> >                             context.reader().getNumericDocValues(field);
> >
> >                     @Override
> >                     public float score() throws IOException {
> >                         final int docId = docID();
> >                         assert docId != DocIdSetIterator.NO_MORE_DOCS;
> >                         assert iterator.advanceExact(docId);
> >                         return Float.intBitsToFloat((int) iterator.longValue())
> >                                 * queryTermValue * boost;
> >                     }
> >
> >                     @Override
> >                     public int docID() {
> >                         return iterator.docID();
> >                     }
> >
> >                     @Override
> >                     public DocIdSetIterator iterator() {
> >                         return iterator == null ? DocIdSetIterator.empty() : iterator;
> >                     }
> >
> >                     @Override
> >                     public float getMaxScore(int upTo) {
> >                         return Float.MAX_VALUE;
> >                     }
> >                 };
> >             }
> >         };
> >     }
> >
> >     @Override
> >     public String toString(String field) {
> >         StringBuilder builder = new StringBuilder();
> >         builder.append("FieldValueAsScoreQuery [field=");
> >         builder.append(this.field);
> >         builder.append(", queryTermValue=");
> >         builder.append(this.queryTermValue);
> >         builder.append("]");
> >         return builder.toString();
> >     }
> >
> >     @Override
> >     public void visit(QueryVisitor visitor) {
> >         if (visitor.acceptField(field)) {
> >             visitor.visitLeaf(this);
> >         }
> >     }
> >
> >     @Override
> >     public boolean equals(Object other) {
> >         return sameClassAs(other) && equalsTo(getClass().cast(other));
> >     }
> >
> >     private boolean equalsTo(FieldValueAsScoreQuery other) {
> >         return field.equals(other.field)
> >                 && Float.floatToIntBits(queryTermValue) == Float.floatToIntBits(other.queryTermValue);
> >     }
> >
> >     @Override
> >     public int hashCode() {
> >         final int prime = 31;
> >         int hash = classHash();
> >         hash = prime * hash + field.hashCode();
> >         hash = prime * hash + Float.floatToIntBits(queryTermValue);
> >         return hash;
> >     }
> > }
> >
> > And then I build a boolean query as follows (using PyLucene):
> >
> > def build_query(query):
> >     builder = BooleanQuery.Builder()
> >     for term in torch.nonzero(query):
> >         field_name = to_field_name(term.item())
> >         value = query[term].item()
> >         builder.add(FieldValueAsScoreQuery(field_name, value), BooleanClause.Occur.SHOULD)
> >     return builder.build()
> >
> > It seems to work, but I'm not sure whether this is a good way to implement it.
> > Example 2:
> > I would also like to use this mechanism for the following index:
> > term1 -> (doc_id1, score), (doc_idN, score), ...
> > termN -> (doc_id1, score), (doc_idN, score), ...
> > where the resulting score is calculated as the sum of the scores by
> > doc_id over the terms in the query.
> >
> > Thank you in advance!
> >
> > Best Regards,
> > Viacheslav Dobrynin!
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>