Hi! Thank you for your reply! I tried both recommendations; example code for each query is given below. The expression-based query runs a little slower, which I think is due to the need to compile the expression.
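To make the per-query compilation cost concrete, here is a minimal sketch in plain Kotlin (no Lucene dependency) of the expression source that has to be built and compiled for every query, together with a brute-force dot product that could be used to sanity-check returned scores. The `toFieldName` helper below is an assumption: it guesses the `term_<i>` field naming. Treating an absent position as 0 is also an assumption of this sketch.

```kotlin
// Hypothetical helper: assumes doc values fields are named term_0, term_1, ...
fun Int.toFieldName(): String = "term_$this"

// Builds the expression source for one query embedding.
// Only non-zero query weights contribute a term, so the string
// (and therefore the compiled expression) differs from query to query.
fun buildDotProductExpression(queryEmb: FloatArray): String =
    queryEmb.withIndex()
        .filter { (_, qi) -> qi != 0f }
        .joinToString(" + ") { (i, qi) -> "$qi * ${i.toFieldName()}" }

// Brute-force reference score for one document stored as a sparse map
// (position -> value). Absent positions count as 0 (assumption).
fun referenceScore(queryEmb: FloatArray, doc: Map<Int, Float>): Float {
    var sum = 0f
    for ((i, qi) in queryEmb.withIndex()) {
        if (qi != 0f) sum += qi * (doc[i] ?: 0f)
    }
    return sum
}

fun main() {
    val query = floatArrayOf(0.5f, 0f, 2f)
    println(buildDotProductExpression(query)) // 0.5 * term_0 + 2.0 * term_2
    println(referenceScore(query, mapOf(0 to 4f, 2 to 1.5f))) // 5.0
}
```

Because the expression text depends on which query weights are non-zero, it cannot simply be compiled once and reused as-is, which is consistent with the extra cost showing up on every search.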
I have one more question: which field type is best suited for my terms? I am currently using the following field:

doc.add(FloatDocValuesField(j.toFieldName(), value))

However, there may be a more suitable option. For example, the Lucene 10 changelog contains the following entry:

> - Lucene now supports sparse indexing on doc values via
>   FieldType#setDocValuesSkipIndexType. The sparse index will record the
>   minimum and maximum values per block of doc IDs. Used in conjunction with
>   index sorting to cluster similar documents together, this allows for very
>   space-efficient and CPU-efficient filtering.

However, I find it difficult to judge how applicable this is to my case.

Code examples:

fun searchWithFunctionQuery(queryEmb: FloatArray) {
    val reader = DirectoryReader.open(FSDirectory.open(indexPath))
    val searcher = IndexSearcher(reader)
    val productFunctions = mutableListOf<ValueSource>()
    for ((i, qi) in queryEmb.withIndex()) if (qi != 0f) {
        productFunctions += ProductFloatFunction(arrayOf(ConstValueSource(qi), FloatFieldSource(i.toFieldName())))
    }
    val dotProductQuery = FunctionQuery(SumFloatFunction(productFunctions.toTypedArray()))
    val hits = searcher.search(dotProductQuery, 10).scoreDocs
    println("Hits: ${hits.contentToString()}")
    // Iterate through the results:
    val storedFields = searcher.storedFields()
    for (i in hits.indices) {
        val hitDoc = storedFields.document(hits[i].doc)
        println("Found doc: $hitDoc. Score: ${hits[i].score}")
    }
    reader.close()
}

fun searchWithExpression(queryEmb: FloatArray) {
    val reader = DirectoryReader.open(FSDirectory.open(indexPath))
    val searcher = IndexSearcher(reader)
    searcher.similarity = TodoSimilarity()
    val expressionBuilder = StringBuilder()
    val bindings = SimpleBindings()
    for ((i, qi) in queryEmb.withIndex()) if (qi != 0f) {
        if (expressionBuilder.isNotBlank()) expressionBuilder.append(" + ")
        expressionBuilder.append(qi).append(" * ").append(i.toFieldName())
        bindings.add(i.toFieldName(), DoubleValuesSource.fromFloatField(i.toFieldName()))
    }
    val dotProductExpression = JavascriptCompiler.compile(expressionBuilder.toString())
    val dotProductQuery = FunctionScoreQuery(MatchAllDocsQuery(), dotProductExpression.getDoubleValuesSource(bindings))
    val hits = searcher.search(dotProductQuery, 10).scoreDocs
    println("Hits: ${hits.contentToString()}")
    // Iterate through the results:
    val storedFields = searcher.storedFields()
    for (i in hits.indices) {
        val hitDoc = storedFields.document(hits[i].doc)
        println("Found doc: $hitDoc. Score: ${hits[i].score}")
    }
    reader.close()
}

Best Regards,
Viacheslav Dobrynin!

On Sat, Nov 30, 2024 at 22:11, Mikhail Khludnev <m...@apache.org> wrote:

> Hi,
> Can't it be better done with FunctionQuery and proper ValueSources? Please
> also check Lucene Expressions?
>
> On Sat, Nov 30, 2024 at 9:00 PM Viacheslav Dobrynin <w.v.d...@gmail.com>
> wrote:
>
> > Hello!
> >
> > I have implemented a custom scoring mechanism. It looks like a dot product.
> > I would like to ask you how accurate and effective my implementation is,
> > could you give me recommendations on how to improve it?
> >
> > Here are a couple of examples that I want to use this mechanism with.
> > Example 1:
> > A document is encoded into a sparse vector, where the terms are the
> > positions in this vector. A score between a query and a document is located
> > as a dot product between their vectors.
> > To do this, I am building the following documents using PyLucene:
> > doc = Document()
> > doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
> > doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, term].item()))
> > doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, term].item()))
> > doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, term].item()))
> >
> > To implement the described search mechanism, I implemented the following
> > Query:
> >
> > public class FieldValueAsScoreQuery extends Query {
> >
> >     private final String field;
> >     private final float queryTermValue;
> >
> >     public FieldValueAsScoreQuery(String field, float queryTermValue) {
> >         this.field = Objects.requireNonNull(field);
> >         if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
> >             throw new IllegalArgumentException("Query term value must be finite and non-NaN");
> >         }
> >         this.queryTermValue = queryTermValue;
> >     }
> >
> >     @Override
> >     public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
> >         return new Weight(this) {
> >             @Override
> >             public boolean isCacheable(LeafReaderContext ctx) {
> >                 return DocValues.isCacheable(ctx, field);
> >             }
> >
> >             @Override
> >             public Explanation explain(LeafReaderContext context, int doc) {
> >                 throw new UnsupportedOperationException();
> >             }
> >
> >             @Override
> >             public Scorer scorer(LeafReaderContext context) throws IOException {
> >                 return new Scorer(this) {
> >
> >                     private final NumericDocValues iterator = context.reader().getNumericDocValues(field);
> >
> >                     @Override
> >                     public float score() throws IOException {
> >                         final int docId = docID();
> >                         assert docId != DocIdSetIterator.NO_MORE_DOCS;
> >                         assert iterator.advanceExact(docId);
> >                         return Float.intBitsToFloat((int) iterator.longValue()) * queryTermValue * boost;
> >                     }
> >
> >                     @Override
> >                     public int docID() {
> >                         return iterator.docID();
> >                     }
> >
> >                     @Override
> >                     public DocIdSetIterator iterator() {
> >                         return iterator == null ? DocIdSetIterator.empty() : iterator;
> >                     }
> >
> >                     @Override
> >                     public float getMaxScore(int upTo) {
> >                         return Float.MAX_VALUE;
> >                     }
> >                 };
> >             }
> >         };
> >     }
> >
> >     @Override
> >     public String toString(String field) {
> >         StringBuilder builder = new StringBuilder();
> >         builder.append("FieldValueAsScoreQuery [field=");
> >         builder.append(this.field);
> >         builder.append(", queryTermValue=");
> >         builder.append(this.queryTermValue);
> >         builder.append("]");
> >         return builder.toString();
> >     }
> >
> >     @Override
> >     public void visit(QueryVisitor visitor) {
> >         if (visitor.acceptField(field)) {
> >             visitor.visitLeaf(this);
> >         }
> >     }
> >
> >     @Override
> >     public boolean equals(Object other) {
> >         return sameClassAs(other) && equalsTo(getClass().cast(other));
> >     }
> >
> >     private boolean equalsTo(FieldValueAsScoreQuery other) {
> >         return field.equals(other.field)
> >             && Float.floatToIntBits(queryTermValue) == Float.floatToIntBits(other.queryTermValue);
> >     }
> >
> >     @Override
> >     public int hashCode() {
> >         final int prime = 31;
> >         int hash = classHash();
> >         hash = prime * hash + field.hashCode();
> >         hash = prime * hash + Float.floatToIntBits(queryTermValue);
> >         return hash;
> >     }
> > }
> >
> > And then I build boolean query as follows (using PyLucene):
> >
> > def build_query(query):
> >     builder = BooleanQuery.Builder()
> >     for term in torch.nonzero(query):
> >         field_name = to_field_name(term.item())
> >         value = query[term].item()
> >         builder.add(FieldValueAsScoreQuery(field_name, value), BooleanClause.Occur.SHOULD)
> >     return builder.build()
> >
> > it seems to work, but I'm not sure if it's a good way to implement it.
> > Example 2:
> > I would also like to use this mechanism for the following index:
> > term1 -> (doc_id1, score), (doc_idN, score), ...
> > termN -> (doc_id1, score), (doc_idN, score), ...
> > Where resulting score will be calculated as:
> > sum(scores) by doc_id for terms in some query
> >
> > Thank you in advance!
> >
> > Best Regards,
> > Viacheslav Dobrynin!
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
>