Hi!
Thank you for your reply!
I tried your recommendations; example code implementing both queries is
below. The query with the expression runs slightly slower, which I
suspect is due to compilation overhead.
I have one more question: which field type is best suited for my terms?
I am currently using the following field: doc.add(FloatDocValuesField(j.
toFieldName(), value)). However, there may be a more suitable option.
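For context on what that field stores: FloatDocValuesField keeps the float's raw bits in a numeric doc-values long, which is why a reader has to decode with Float.intBitsToFloat (as the custom Scorer in my earlier message does). A minimal plain-Java round-trip sketch, no Lucene needed (the class name is mine):

```java
public class FloatBitsRoundTrip {
    // Encode the way FloatDocValuesField does: float -> raw int bits, widened to long.
    static long encode(float value) {
        return Float.floatToRawIntBits(value);
    }

    // Decode the way the Scorer does: narrow the long back to int bits -> float.
    static float decode(long storedBits) {
        return Float.intBitsToFloat((int) storedBits);
    }

    public static void main(String[] args) {
        float original = 0.375f;
        long stored = encode(original);
        System.out.println(decode(stored)); // prints 0.375
    }
}
```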
For example, there is the following description in the changelog for Lucene
10:
>
> - Lucene now supports sparse indexing on doc values via
> FieldType#setDocValuesSkipIndexType. The sparse index will record the
> minimum and maximum values per block of doc IDs. Used in conjunction with
> index sorting to cluster similar documents together, this allows for very
> space-efficient and CPU-efficient filtering.
>
However, I find it difficult to judge how applicable this is to my case.
Code examples:
fun searchWithFunctionQuery(queryEmb: FloatArray) {
    val reader = DirectoryReader.open(FSDirectory.open(indexPath))
    val searcher = IndexSearcher(reader)
    val productFunctions = mutableListOf<ValueSource>()
    for ((i, qi) in queryEmb.withIndex()) if (qi != 0f) {
        productFunctions += ProductFloatFunction(
            arrayOf(ConstValueSource(qi), FloatFieldSource(i.toFieldName()))
        )
    }
    val dotProductQuery = FunctionQuery(SumFloatFunction(productFunctions.toTypedArray()))
    val hits = searcher.search(dotProductQuery, 10).scoreDocs
    println("Hits: ${hits.contentToString()}")
    // Iterate through the results:
    val storedFields = searcher.storedFields()
    for (i in hits.indices) {
        val hitDoc = storedFields.document(hits[i].doc)
        println("Found doc: $hitDoc. Score: ${hits[i].score}")
    }
    reader.close()
}
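To make the intended semantics concrete, here is a plain-Java sketch (no Lucene; the class name and the term_i field-name convention are assumptions from my earlier message) of the score the FunctionQuery above should produce: a sparse dot product over the nonzero query components.

```java
import java.util.Map;

public class SparseDotProduct {
    // For each nonzero query component q[i], multiply by the document's
    // doc-value for field "term_i" (FloatFieldSource) and sum the products
    // (ProductFloatFunction terms combined by SumFloatFunction).
    static float score(float[] queryEmb, Map<String, Float> docValues) {
        float sum = 0f;
        for (int i = 0; i < queryEmb.length; i++) {
            if (queryEmb[i] == 0f) continue;      // zero components are skipped
            Float v = docValues.get("term_" + i); // missing field contributes nothing
            if (v != null) sum += queryEmb[i] * v;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] query = {0f, 2f, 0f, 0.5f};
        Map<String, Float> doc = Map.of("term_1", 3f, "term_3", 4f);
        System.out.println(score(query, doc)); // 2*3 + 0.5*4 -> prints 8.0
    }
}
```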
fun searchWithExpression(queryEmb: FloatArray) {
    val reader = DirectoryReader.open(FSDirectory.open(indexPath))
    val searcher = IndexSearcher(reader)
    searcher.similarity = TodoSimilarity()
    val expressionBuilder = StringBuilder()
    val bindings = SimpleBindings()
    for ((i, qi) in queryEmb.withIndex()) if (qi != 0f) {
        if (expressionBuilder.isNotBlank()) expressionBuilder.append(" + ")
        expressionBuilder.append(qi).append(" * ").append(i.toFieldName())
        bindings.add(i.toFieldName(), DoubleValuesSource.fromFloatField(i.toFieldName()))
    }
    val dotProductExpression = JavascriptCompiler.compile(expressionBuilder.toString())
    val dotProductQuery = FunctionScoreQuery(
        MatchAllDocsQuery(),
        dotProductExpression.getDoubleValuesSource(bindings)
    )
    val hits = searcher.search(dotProductQuery, 10).scoreDocs
    println("Hits: ${hits.contentToString()}")
    // Iterate through the results:
    val storedFields = searcher.storedFields()
    for (i in hits.indices) {
        val hitDoc = storedFields.document(hits[i].doc)
        println("Found doc: $hitDoc. Score: ${hits[i].score}")
    }
    reader.close()
}
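As a sanity check on what that loop hands to JavascriptCompiler, here is a standalone Java sketch of the same string construction (the class name is mine; field names again assume the term_i convention):

```java
public class DotProductExpression {
    // Builds one "qi * term_i" product per nonzero query component,
    // joined with " + ", mirroring the StringBuilder loop above.
    static String build(float[] queryEmb) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < queryEmb.length; i++) {
            if (queryEmb[i] == 0f) continue;
            if (sb.length() > 0) sb.append(" + ");
            sb.append(queryEmb[i]).append(" * ").append("term_").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(new float[]{0f, 2f, 0f, 0.5f}));
        // prints: 2.0 * term_1 + 0.5 * term_3
    }
}
```

One thing this makes visible: the expression length grows with the number of nonzero components, which is presumably where the compilation cost I mentioned comes from.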
Best Regards,
Viacheslav Dobrynin!
On Sat, Nov 30, 2024 at 22:11, Mikhail Khludnev <[email protected]> wrote:
> Hi,
> Can't it be better done with FunctionQuery and proper ValueSources? Please
> also check Lucene Expressions.
>
> On Sat, Nov 30, 2024 at 9:00 PM Viacheslav Dobrynin <[email protected]>
> wrote:
>
> > Hello!
> >
> > I have implemented a custom scoring mechanism. It works like a dot
> > product. I would like to ask how accurate and efficient my
> > implementation is; could you give me recommendations on how to
> > improve it?
> >
> > Here are a couple of examples that I want to use this mechanism with.
> > Example 1:
> > A document is encoded into a sparse vector, where the terms are the
> > positions in this vector. The score between a query and a document is
> > computed as the dot product of their vectors.
> > To do this, I build the following documents using PyLucene:
> > doc = Document()
> > doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
> > doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, term].item()))
> > doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, term].item()))
> > doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, term].item()))
> >
> > To implement the described search mechanism, I implemented the following
> > Query:
> >
> > public class FieldValueAsScoreQuery extends Query {
> >
> >     private final String field;
> >     private final float queryTermValue;
> >
> >     public FieldValueAsScoreQuery(String field, float queryTermValue) {
> >         this.field = Objects.requireNonNull(field);
> >         if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
> >             throw new IllegalArgumentException("Query term value must be finite and non-NaN");
> >         }
> >         this.queryTermValue = queryTermValue;
> >     }
> >
> >     @Override
> >     public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
> >         return new Weight(this) {
> >             @Override
> >             public boolean isCacheable(LeafReaderContext ctx) {
> >                 return DocValues.isCacheable(ctx, field);
> >             }
> >
> >             @Override
> >             public Explanation explain(LeafReaderContext context, int doc) {
> >                 throw new UnsupportedOperationException();
> >             }
> >
> >             @Override
> >             public Scorer scorer(LeafReaderContext context) throws IOException {
> >                 return new Scorer(this) {
> >
> >                     private final NumericDocValues iterator =
> >                             context.reader().getNumericDocValues(field);
> >
> >                     @Override
> >                     public float score() throws IOException {
> >                         final int docId = docID();
> >                         assert docId != DocIdSetIterator.NO_MORE_DOCS;
> >                         assert iterator.advanceExact(docId);
> >                         return Float.intBitsToFloat((int) iterator.longValue())
> >                                 * queryTermValue * boost;
> >                     }
> >
> >                     @Override
> >                     public int docID() {
> >                         return iterator.docID();
> >                     }
> >
> >                     @Override
> >                     public DocIdSetIterator iterator() {
> >                         return iterator == null ? DocIdSetIterator.empty() : iterator;
> >                     }
> >
> >                     @Override
> >                     public float getMaxScore(int upTo) {
> >                         return Float.MAX_VALUE;
> >                     }
> >                 };
> >             }
> >         };
> >     }
> >
> >     @Override
> >     public String toString(String field) {
> >         StringBuilder builder = new StringBuilder();
> >         builder.append("FieldValueAsScoreQuery [field=");
> >         builder.append(this.field);
> >         builder.append(", queryTermValue=");
> >         builder.append(this.queryTermValue);
> >         builder.append("]");
> >         return builder.toString();
> >     }
> >
> >     @Override
> >     public void visit(QueryVisitor visitor) {
> >         if (visitor.acceptField(field)) {
> >             visitor.visitLeaf(this);
> >         }
> >     }
> >
> >     @Override
> >     public boolean equals(Object other) {
> >         return sameClassAs(other) && equalsTo(getClass().cast(other));
> >     }
> >
> >     private boolean equalsTo(FieldValueAsScoreQuery other) {
> >         return field.equals(other.field)
> >                 && Float.floatToIntBits(queryTermValue) == Float.floatToIntBits(other.queryTermValue);
> >     }
> >
> >     @Override
> >     public int hashCode() {
> >         final int prime = 31;
> >         int hash = classHash();
> >         hash = prime * hash + field.hashCode();
> >         hash = prime * hash + Float.floatToIntBits(queryTermValue);
> >         return hash;
> >     }
> > }
> >
> > And then I build a boolean query as follows (using PyLucene):
> >
> > def build_query(query):
> >     builder = BooleanQuery.Builder()
> >     for term in torch.nonzero(query):
> >         field_name = to_field_name(term.item())
> >         value = query[term].item()
> >         builder.add(FieldValueAsScoreQuery(field_name, value), BooleanClause.Occur.SHOULD)
> >     return builder.build()
> >
> > It seems to work, but I'm not sure whether this is a good way to implement it.
> > Example 2:
> > I would also like to use this mechanism for the following index:
> > term1 -> (doc_id1, score), (doc_idN, score), ...
> > termN -> (doc_id1, score), (doc_idN, score), ...
> > where the resulting score is calculated as the sum of the scores by
> > doc_id over the terms in the query.
> >
> > Thank you in advance!
> >
> > Best Regards,
> > Viacheslav Dobrynin!
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>