Hi,
the expressions query should not be slower. Of course, if you also include
the compilation in the query time measurement, it may be a little slower
due to compilation and optimization. In general, queries should be warmed
up before measuring them, and expressions should only be compiled once and
reused many times for querying. As your expression query is constant,
you can make a static instance out of it.
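A minimal sketch of that static-instance pattern in plain Java; `compileExpression` is a hypothetical stand-in for the expensive `JavascriptCompiler.compile(...)` call, so the snippet stays self-contained:

```java
import java.util.function.DoubleUnaryOperator;

public class StaticExpressionHolder {
    // Stand-in for JavascriptCompiler.compile(...): in real code this would
    // return a compiled Lucene Expression. Compiled expressions are
    // thread-safe, so a single instance can serve all searches.
    private static DoubleUnaryOperator compileExpression(String expr) {
        return x -> 2.0 * x + 1.0; // stands in for the compiled "2 * x + 1"
    }

    // Compiled exactly once at class-load time, then reused for every query.
    public static final DoubleUnaryOperator DOT_PRODUCT =
        compileExpression("2 * x + 1");
}
```

The point is only that compilation happens once, not per search.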
Be sure to use Lucene 10.x with expressions, this has some optimizations
that make spin-up time shorter due to use of Java 15+ features for
anonymous classes and dynamic constants, which are not available in
Lucene 9.x.
Uwe
On 01.12.2024 at 10:57, Viacheslav Dobrynin wrote:
Hi!
Thank you for your reply!
I tried the recommendations, and below is example code for both query
implementations. The query with the expression works a little slower;
I think this is due to the need for compilation.
I have one more question: please tell me which type of field is best suited
for my terms?
I am currently using the following field:
doc.add(FloatDocValuesField(j.toFieldName(), value))
However, there may be a more suitable option.
For example, there is the following description in the changelog for Lucene
10:
- Lucene now supports sparse indexing on doc values via
FieldType#setDocValuesSkipIndexType. The sparse index will record the
minimum and maximum values per block of doc IDs. Used in conjunction with
index sorting to cluster similar documents together, this allows for very
space-efficient and CPU-efficient filtering.
However, I find it difficult to answer how applicable this is to my case.
Code examples:
fun searchWithFunctionQuery(queryEmb: FloatArray) {
    val reader = DirectoryReader.open(FSDirectory.open(indexPath))
    val searcher = IndexSearcher(reader)
    val productFunctions = mutableListOf<ValueSource>()
    for ((i, qi) in queryEmb.withIndex()) {
        if (qi == 0f) continue
        productFunctions += ProductFloatFunction(
            arrayOf(ConstValueSource(qi), FloatFieldSource(i.toFieldName()))
        )
    }
    val dotProductQuery = FunctionQuery(SumFloatFunction(productFunctions.toTypedArray()))
    val hits = searcher.search(dotProductQuery, 10).scoreDocs
    println("Hits: ${hits.contentToString()}")
    // Iterate through the results:
    val storedFields = searcher.storedFields()
    for (i in hits.indices) {
        val hitDoc = storedFields.document(hits[i].doc)
        println("Found doc: $hitDoc. Score: ${hits[i].score}")
    }
    reader.close()
}
fun searchWithExpression(queryEmb: FloatArray) {
    val reader = DirectoryReader.open(FSDirectory.open(indexPath))
    val searcher = IndexSearcher(reader)
    searcher.similarity = TodoSimilarity()
    val expressionBuilder = StringBuilder()
    val bindings = SimpleBindings()
    for ((i, qi) in queryEmb.withIndex()) {
        if (qi == 0f) continue
        if (expressionBuilder.isNotBlank()) expressionBuilder.append(" + ")
        expressionBuilder.append(qi).append(" * ").append(i.toFieldName())
        bindings.add(i.toFieldName(), DoubleValuesSource.fromFloatField(i.toFieldName()))
    }
    val dotProductExpression = JavascriptCompiler.compile(expressionBuilder.toString())
    val dotProductQuery = FunctionScoreQuery(
        MatchAllDocsQuery(),
        dotProductExpression.getDoubleValuesSource(bindings)
    )
    val hits = searcher.search(dotProductQuery, 10).scoreDocs
    println("Hits: ${hits.contentToString()}")
    // Iterate through the results:
    val storedFields = searcher.storedFields()
    for (i in hits.indices) {
        val hitDoc = storedFields.document(hits[i].doc)
        println("Found doc: $hitDoc. Score: ${hits[i].score}")
    }
    reader.close()
}
Best Regards,
Viacheslav Dobrynin!
On Sat, Nov 30, 2024 at 22:11, Mikhail Khludnev <m...@apache.org> wrote:
Hi,
Couldn't this be done better with FunctionQuery and proper ValueSources? Please
also check Lucene Expressions.
On Sat, Nov 30, 2024 at 9:00 PM Viacheslav Dobrynin <w.v.d...@gmail.com>
wrote:
Hello!
I have implemented a custom scoring mechanism. It works like a dot
product.
I would like to ask how accurate and efficient my implementation is;
could you give me recommendations on how to improve it?
Here are a couple of examples that I want to use this mechanism with.
Example 1:
A document is encoded into a sparse vector, where the terms are the
positions in this vector. The score between a query and a document is
computed as the dot product of their vectors.
To do this, I am building the following documents using PyLucene:
doc = Document()
doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, term].item()))
doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, term].item()))
doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, term].item()))
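The scoring described above amounts to a sparse dot product. As a plain-Java sketch outside Lucene (representing each sparse vector as a map from term position to weight is an assumption for illustration):

```java
import java.util.Map;

public class SparseDotProduct {
    // Computes the dot product of two sparse vectors, each represented as a
    // map from term position to weight. Only positions present in both
    // vectors contribute, which is why only the query's non-zero terms need
    // to become clauses.
    public static float dot(Map<Integer, Float> query, Map<Integer, Float> doc) {
        float sum = 0f;
        for (Map.Entry<Integer, Float> e : query.entrySet()) {
            Float docValue = doc.get(e.getKey());
            if (docValue != null) {
                sum += e.getValue() * docValue;
            }
        }
        return sum;
    }
}
```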
To support this search mechanism, I implemented the following Query:
public class FieldValueAsScoreQuery extends Query {

    private final String field;
    private final float queryTermValue;

    public FieldValueAsScoreQuery(String field, float queryTermValue) {
        this.field = Objects.requireNonNull(field);
        if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
            throw new IllegalArgumentException("Query term value must be finite and non-NaN");
        }
        this.queryTermValue = queryTermValue;
    }

    @Override
    public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
        return new Weight(this) {
            @Override
            public boolean isCacheable(LeafReaderContext ctx) {
                return DocValues.isCacheable(ctx, field);
            }

            @Override
            public Explanation explain(LeafReaderContext context, int doc) {
                throw new UnsupportedOperationException();
            }

            @Override
            public Scorer scorer(LeafReaderContext context) throws IOException {
                // Fetch the doc values up front; a segment without this field
                // produces no matches instead of an NPE in score().
                final NumericDocValues values = context.reader().getNumericDocValues(field);
                if (values == null) {
                    return null;
                }
                return new Scorer(this) {
                    @Override
                    public float score() throws IOException {
                        assert docID() != DocIdSetIterator.NO_MORE_DOCS;
                        // The doc-values iterator itself drives this scorer,
                        // so it is already positioned on the current document.
                        // (Do not wrap a positioning call like advanceExact in
                        // an assert: with assertions disabled it is skipped.)
                        return Float.intBitsToFloat((int) values.longValue())
                            * queryTermValue * boost;
                    }

                    @Override
                    public int docID() {
                        return values.docID();
                    }

                    @Override
                    public DocIdSetIterator iterator() {
                        return values;
                    }

                    @Override
                    public float getMaxScore(int upTo) {
                        return Float.MAX_VALUE;
                    }
                };
            }
        };
    }

    @Override
    public String toString(String field) {
        return "FieldValueAsScoreQuery [field=" + this.field
            + ", queryTermValue=" + this.queryTermValue + "]";
    }

    @Override
    public void visit(QueryVisitor visitor) {
        if (visitor.acceptField(field)) {
            visitor.visitLeaf(this);
        }
    }

    @Override
    public boolean equals(Object other) {
        return sameClassAs(other) && equalsTo(getClass().cast(other));
    }

    private boolean equalsTo(FieldValueAsScoreQuery other) {
        return field.equals(other.field)
            && Float.floatToIntBits(queryTermValue) == Float.floatToIntBits(other.queryTermValue);
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int hash = classHash();
        hash = prime * hash + field.hashCode();
        hash = prime * hash + Float.floatToIntBits(queryTermValue);
        return hash;
    }
}
And then I build a Boolean query as follows (using PyLucene):
def build_query(query):
    builder = BooleanQuery.Builder()
    for term in torch.nonzero(query):
        field_name = to_field_name(term.item())
        value = query[term].item()
        builder.add(FieldValueAsScoreQuery(field_name, value),
                    BooleanClause.Occur.SHOULD)
    return builder.build()
It seems to work, but I'm not sure if it's a good way to implement it.
Example 2:
I would also like to use this mechanism for the following index:
term1 -> (doc_id1, score), (doc_idN, score), ...
termN -> (doc_id1, score), (doc_idN, score), ...
where the resulting score is calculated as:
sum(scores) by doc_id for the terms in a query
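That aggregation can be sketched in plain Java (the postings layout below is an assumption for illustration, mirroring "term -> (doc_id, score), ..."):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScoreAccumulator {
    // One posting: a (docId, score) pair from a term's postings list.
    public record Posting(int docId, float score) {}

    // Sums the per-term scores by docId over the postings lists of the
    // query's terms, i.e. "sum(scores) by doc_id".
    public static Map<Integer, Float> accumulate(List<List<Posting>> postingsPerTerm) {
        Map<Integer, Float> totals = new HashMap<>();
        for (List<Posting> postings : postingsPerTerm) {
            for (Posting p : postings) {
                totals.merge(p.docId(), p.score(), Float::sum);
            }
        }
        return totals;
    }
}
```

A BooleanQuery of SHOULD clauses does essentially this, one clause per term, with the scorers merged doc-at-a-time instead of via a hash map.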
Thank you in advance!
Best Regards,
Viacheslav Dobrynin!
--
Sincerely yours
Mikhail Khludnev
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org