Hi, thank you!
Fri, Jan 3, 2025 at 14:15, Uwe Schindler <u...@thetaphi.de>:

> Hi,
>
> the expressions query should not be slower. Of course, if you also take
> the compilation into the query time measurement, it may be a little slower
> due to compilation and optimization. In general, queries should be warmed up
> before measuring them, and expressions should only be compiled once and
> reused many times for querying. As your expression query is constant,
> you can make a static instance out of it.
>
> Be sure to use Lucene 10.x with expressions; it has some optimizations
> that make spin-up time shorter due to the use of Java 15+ features for
> anonymous classes and dynamic constants, which are not available in
> Lucene 9.x.
>
> Uwe
>
> On 01.12.2024 at 10:57, Viacheslav Dobrynin wrote:
> > Hi!
> >
> > Thank you for your reply!
> > I tried the recommendations, and below I give example code for both
> > query implementations. The query with the expression works a little
> > slower; I think this is due to the need for compilation.
> >
> > I have one more question: please tell me which type of field is best
> > suited for my terms?
> > I am currently using the following field:
> > doc.add(FloatDocValuesField(j.toFieldName(), value))
> > However, there may be a more suitable option.
> > For example, there is the following description in the changelog for
> > Lucene 10:
> >
> >> - Lucene now supports sparse indexing on doc values via
> >>   FieldType#setDocValuesSkipIndexType. The sparse index will record the
> >>   minimum and maximum values per block of doc IDs. Used in conjunction
> >>   with index sorting to cluster similar documents together, this allows
> >>   for very space-efficient and CPU-efficient filtering.
> >
> > However, I find it difficult to say how applicable this is to my case.
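The "compile once and reuse" advice above can be sketched outside Lucene. The cache below simulates an expensive compile step; the `compile` method and its summing "program" are placeholders for illustration, not Lucene API. With Lucene, the result of `JavascriptCompiler.compile(...)` would be cached the same way, e.g. in a static final field for a constant expression.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of "compile once, reuse many times": keep compiled
// expressions in a static cache keyed by their source text, so repeated
// queries never pay the compilation cost again.
public class ExpressionCache {
    private static final Map<String, ToDoubleFunction<double[]>> CACHE =
            new ConcurrentHashMap<>();

    // Stand-in for an expensive compile step; returns a "program" that
    // just sums its inputs, purely for illustration.
    private static ToDoubleFunction<double[]> compile(String source) {
        return vals -> {
            double sum = 0;
            for (double v : vals) sum += v;
            return sum;
        };
    }

    public static ToDoubleFunction<double[]> get(String source) {
        // computeIfAbsent compiles at most once per distinct source text.
        return CACHE.computeIfAbsent(source, ExpressionCache::compile);
    }
}
```

With this pattern, only the first search for a given expression pays the compile cost; subsequent searches reuse the cached instance, which is what keeps compilation out of per-query latency measurements.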
> > Code examples:
> >
> > fun searchWithFunctionQuery(queryEmb: FloatArray) {
> >     val reader = DirectoryReader.open(FSDirectory.open(indexPath))
> >     val searcher = IndexSearcher(reader)
> >
> >     val productFunctions = mutableListOf<ValueSource>()
> >     for ((i, qi) in queryEmb.withIndex()) if (qi != 0f) {
> >         productFunctions += ProductFloatFunction(arrayOf(
> >             ConstValueSource(qi),
> >             FloatFieldSource(i.toFieldName())))
> >     }
> >     val dotProductQuery = FunctionQuery(SumFloatFunction(productFunctions.toTypedArray()))
> >
> >     val hits = searcher.search(dotProductQuery, 10).scoreDocs
> >     println("Hits: ${hits.contentToString()}")
> >
> >     // Iterate through the results:
> >     val storedFields = searcher.storedFields()
> >     for (i in hits.indices) {
> >         val hitDoc = storedFields.document(hits[i].doc)
> >         println("Found doc: $hitDoc. Score: ${hits[i].score}")
> >     }
> >     reader.close()
> > }
> >
> > fun searchWithExpression(queryEmb: FloatArray) {
> >     val reader = DirectoryReader.open(FSDirectory.open(indexPath))
> >     val searcher = IndexSearcher(reader)
> >     searcher.similarity = TodoSimilarity()
> >
> >     val expressionBuilder = StringBuilder()
> >     val bindings = SimpleBindings()
> >     for ((i, qi) in queryEmb.withIndex()) if (qi != 0f) {
> >         if (expressionBuilder.isNotBlank()) expressionBuilder.append(" + ")
> >         expressionBuilder.append(qi).append(" * ").append(i.toFieldName())
> >         bindings.add(i.toFieldName(), DoubleValuesSource.fromFloatField(i.toFieldName()))
> >     }
> >     val dotProductExpression = JavascriptCompiler.compile(expressionBuilder.toString())
> >     val dotProductQuery = FunctionScoreQuery(MatchAllDocsQuery(),
> >         dotProductExpression.getDoubleValuesSource(bindings))
> >     val hits = searcher.search(dotProductQuery, 10).scoreDocs
> >     println("Hits: ${hits.contentToString()}")
> >
> >     // Iterate through the results:
> >     val storedFields = searcher.storedFields()
> >     for (i in hits.indices) {
> >         val hitDoc = storedFields.document(hits[i].doc)
> >         println("Found doc: $hitDoc. Score: ${hits[i].score}")
> >     }
> >     reader.close()
> > }
> >
> >
> > Best Regards,
> > Viacheslav Dobrynin
> >
> >
> > Sat, Nov 30, 2024 at 22:11, Mikhail Khludnev <m...@apache.org>:
> >
> >> Hi,
> >> Can't it be done better with FunctionQuery and proper ValueSources?
> >> Please also check Lucene Expressions.
> >>
> >> On Sat, Nov 30, 2024 at 9:00 PM Viacheslav Dobrynin <w.v.d...@gmail.com>
> >> wrote:
> >>
> >>> Hello!
> >>>
> >>> I have implemented a custom scoring mechanism. It works like a dot
> >>> product. I would like to ask how accurate and efficient my
> >>> implementation is; could you give me recommendations on how to improve it?
> >>>
> >>> Here are a couple of examples that I want to use this mechanism with.
> >>>
> >>> Example 1:
> >>> A document is encoded into a sparse vector, where the terms are the
> >>> positions in this vector. The score between a query and a document is
> >>> computed as the dot product of their vectors.
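The scoring described in Example 1 is a plain sparse dot product, which can be sketched without Lucene. The map-based vector representation (term index to weight) is an illustrative assumption; in the index itself each term position becomes a doc-values field.

```java
import java.util.HashMap;
import java.util.Map;

// Lucene-free sketch of Example 1's scoring: documents and queries are
// sparse vectors (term index -> weight), and the score is their dot
// product. Only terms present in both vectors contribute, which is why
// only the nonzero query components need to become query clauses.
public class SparseDotProduct {
    public static float score(Map<Integer, Float> query, Map<Integer, Float> doc) {
        // Iterate the smaller map and probe the larger one.
        Map<Integer, Float> small = query.size() <= doc.size() ? query : doc;
        Map<Integer, Float> large = (small == query) ? doc : query;
        float sum = 0f;
        for (Map.Entry<Integer, Float> e : small.entrySet()) {
            Float other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }
}
```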
> >>> To do this, I am building the following documents using PyLucene:
> >>>
> >>> doc = Document()
> >>> doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
> >>> doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, term].item()))
> >>> doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, term].item()))
> >>> doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, term].item()))
> >>>
> >>> To implement the described search mechanism, I implemented the
> >>> following Query:
> >>>
> >>> public class FieldValueAsScoreQuery extends Query {
> >>>
> >>>     private final String field;
> >>>     private final float queryTermValue;
> >>>
> >>>     public FieldValueAsScoreQuery(String field, float queryTermValue) {
> >>>         this.field = Objects.requireNonNull(field);
> >>>         if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
> >>>             throw new IllegalArgumentException(
> >>>                     "Query term value must be finite and non-NaN");
> >>>         }
> >>>         this.queryTermValue = queryTermValue;
> >>>     }
> >>>
> >>>     @Override
> >>>     public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
> >>>         return new Weight(this) {
> >>>             @Override
> >>>             public boolean isCacheable(LeafReaderContext ctx) {
> >>>                 return DocValues.isCacheable(ctx, field);
> >>>             }
> >>>
> >>>             @Override
> >>>             public Explanation explain(LeafReaderContext context, int doc) {
> >>>                 throw new UnsupportedOperationException();
> >>>             }
> >>>
> >>>             @Override
> >>>             public Scorer scorer(LeafReaderContext context) throws IOException {
> >>>                 return new Scorer(this) {
> >>>
> >>>                     private final NumericDocValues iterator =
> >>>                             context.reader().getNumericDocValues(field);
> >>>
> >>>                     @Override
> >>>                     public float score() throws IOException {
> >>>                         final int docId = docID();
> >>>                         assert docId != DocIdSetIterator.NO_MORE_DOCS;
> >>>                         // The doc values are already positioned on docId by
> >>>                         // iterator(); do not call advanceExact() inside an
> >>>                         // assert, since asserts (and their side effects)
> >>>                         // can be disabled at runtime.
> >>>                         return Float.intBitsToFloat((int) iterator.longValue())
> >>>                                 * queryTermValue * boost;
> >>>                     }
> >>>
> >>>                     @Override
> >>>                     public int docID() {
> >>>                         return iterator.docID();
> >>>                     }
> >>>
> >>>                     @Override
> >>>                     public DocIdSetIterator iterator() {
> >>>                         return iterator == null ? DocIdSetIterator.empty() : iterator;
> >>>                     }
> >>>
> >>>                     @Override
> >>>                     public float getMaxScore(int upTo) {
> >>>                         return Float.MAX_VALUE;
> >>>                     }
> >>>                 };
> >>>             }
> >>>         };
> >>>     }
> >>>
> >>>     @Override
> >>>     public String toString(String field) {
> >>>         return "FieldValueAsScoreQuery [field=" + this.field
> >>>                 + ", queryTermValue=" + this.queryTermValue + "]";
> >>>     }
> >>>
> >>>     @Override
> >>>     public void visit(QueryVisitor visitor) {
> >>>         if (visitor.acceptField(field)) {
> >>>             visitor.visitLeaf(this);
> >>>         }
> >>>     }
> >>>
> >>>     @Override
> >>>     public boolean equals(Object other) {
> >>>         return sameClassAs(other) && equalsTo(getClass().cast(other));
> >>>     }
> >>>
> >>>     private boolean equalsTo(FieldValueAsScoreQuery other) {
> >>>         return field.equals(other.field)
> >>>                 && Float.floatToIntBits(queryTermValue)
> >>>                         == Float.floatToIntBits(other.queryTermValue);
> >>>     }
> >>>
> >>>     @Override
> >>>     public int hashCode() {
> >>>         final int prime = 31;
> >>>         int hash = classHash();
> >>>         hash = prime * hash + field.hashCode();
> >>>         hash = prime * hash + Float.floatToIntBits(queryTermValue);
> >>>         return hash;
> >>>     }
> >>> }
> >>>
> >>> And then I build the boolean query as follows (using PyLucene):
> >>>
> >>> def build_query(query):
> >>>     builder = BooleanQuery.Builder()
> >>>     for term in torch.nonzero(query):
> >>>         field_name = to_field_name(term.item())
> >>>         value = query[term].item()
> >>>         builder.add(FieldValueAsScoreQuery(field_name, value),
> >>>                     BooleanClause.Occur.SHOULD)
> >>>     return builder.build()
> >>>
> >>> It seems to work, but I'm not sure if it's a good way to implement it.
> >>> Example 2:
> >>> I would also like to use this mechanism with the following index:
> >>>
> >>> term1 -> (doc_id1, score), (doc_idN, score), ...
> >>> termN -> (doc_id1, score), (doc_idN, score), ...
> >>>
> >>> where the resulting score is calculated as the sum of scores per
> >>> doc_id over the terms of a query.
> >>>
> >>> Thank you in advance!
> >>>
> >>> Best Regards,
> >>> Viacheslav Dobrynin
> >>>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org