Hi, thank you!
Fri, Jan 3, 2025 at 14:15, Uwe Schindler <u...@thetaphi.de>:

> Hi,
>
> the expressions query should not be slower. Of course, if you also take
> the compilation into the query time measurement, it may be a little slower
> due to compilation and optimization. In general, queries should be warmed up
> before measuring them, and expressions should only be compiled once and
> reused many times for querying. As your expression query is constant,
> you can make a static instance out of it.
>
> Be sure to use Lucene 10.x with expressions; it has some optimizations
> that make spin-up time shorter due to the use of Java 15+ features for
> anonymous classes and dynamic constants, which are not available in
> Lucene 9.x.
>
> Uwe
>
> On 01.12.2024 at 10:57, Viacheslav Dobrynin wrote:
> > Hi!
> >
> > Thank you for your reply!
> > I tried the recommendations, and below I give example code for both
> > query implementations. The query with the expression works a little
> > slower; I think this is due to the need for compilation.
> >
> > I have one more question: please tell me which type of field is best
> > suited for my terms?
> > I am currently using the following field:
> > doc.add(FloatDocValuesField(j.toFieldName(), value))
> > However, there may be a more suitable option.
> > For example, there is the following description in the changelog for
> > Lucene 10:
> >
> >> - Lucene now supports sparse indexing on doc values via
> >>   FieldType#setDocValuesSkipIndexType. The sparse index will record the
> >>   minimum and maximum values per block of doc IDs. Used in conjunction
> >>   with index sorting to cluster similar documents together, this allows
> >>   for very space-efficient and CPU-efficient filtering.
> >
> > However, I find it difficult to say how applicable this is to my case.
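The "compile once and reuse" advice above can be sketched outside Lucene. The cache below simulates an expensive compile step; the `compile` method and its summing "program" are placeholders for illustration, not Lucene API. With Lucene, the result of `JavascriptCompiler.compile(...)` would be cached the same way, e.g. in a static final field for a constant expression.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of "compile once, reuse many times": keep compiled
// expressions in a static cache keyed by their source text, so repeated
// queries never pay the compilation cost again.
public class ExpressionCache {
    private static final Map<String, ToDoubleFunction<double[]>> CACHE =
            new ConcurrentHashMap<>();

    // Stand-in for an expensive compile step; returns a "program" that
    // just sums its inputs, purely for illustration.
    private static ToDoubleFunction<double[]> compile(String source) {
        return vals -> {
            double sum = 0;
            for (double v : vals) sum += v;
            return sum;
        };
    }

    public static ToDoubleFunction<double[]> get(String source) {
        // computeIfAbsent compiles at most once per distinct source text.
        return CACHE.computeIfAbsent(source, ExpressionCache::compile);
    }
}
```

With this pattern, only the first search for a given expression pays the compile cost; subsequent searches reuse the cached instance, which is what keeps compilation out of per-query latency measurements.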
> > Code examples:
> >
> > fun searchWithFunctionQuery(queryEmb: FloatArray) {
> >     val reader = DirectoryReader.open(FSDirectory.open(indexPath))
> >     val searcher = IndexSearcher(reader)
> >
> >     val productFunctions = mutableListOf<ValueSource>()
> >     for ((i, qi) in queryEmb.withIndex()) if (qi != 0f) {
> >         productFunctions += ProductFloatFunction(arrayOf(
> >             ConstValueSource(qi),
> >             FloatFieldSource(i.toFieldName())))
> >     }
> >     val dotProductQuery = FunctionQuery(SumFloatFunction(productFunctions.toTypedArray()))
> >
> >     val hits = searcher.search(dotProductQuery, 10).scoreDocs
> >     println("Hits: ${hits.contentToString()}")
> >
> >     // Iterate through the results:
> >     val storedFields = searcher.storedFields()
> >     for (i in hits.indices) {
> >         val hitDoc = storedFields.document(hits[i].doc)
> >         println("Found doc: $hitDoc. Score: ${hits[i].score}")
> >     }
> >     reader.close()
> > }
> >
> > fun searchWithExpression(queryEmb: FloatArray) {
> >     val reader = DirectoryReader.open(FSDirectory.open(indexPath))
> >     val searcher = IndexSearcher(reader)
> >     searcher.similarity = TodoSimilarity()
> >
> >     val expressionBuilder = StringBuilder()
> >     val bindings = SimpleBindings()
> >     for ((i, qi) in queryEmb.withIndex()) if (qi != 0f) {
> >         if (expressionBuilder.isNotBlank()) expressionBuilder.append(" + ")
> >         expressionBuilder.append(qi).append(" * ").append(i.toFieldName())
> >         bindings.add(i.toFieldName(), DoubleValuesSource.fromFloatField(i.toFieldName()))
> >     }
> >     val dotProductExpression = JavascriptCompiler.compile(expressionBuilder.toString())
> >     val dotProductQuery = FunctionScoreQuery(MatchAllDocsQuery(),
> >         dotProductExpression.getDoubleValuesSource(bindings))
> >     val hits = searcher.search(dotProductQuery, 10).scoreDocs
> >     println("Hits: ${hits.contentToString()}")
> >
> >     // Iterate through the results:
> >     val storedFields = searcher.storedFields()
> >     for (i in hits.indices) {
> >         val hitDoc = storedFields.document(hits[i].doc)
> >         println("Found doc: $hitDoc. Score: ${hits[i].score}")
> >     }
> >     reader.close()
> > }
> >
> >
> > Best Regards,
> > Viacheslav Dobrynin
> >
> >
> > Sat, Nov 30, 2024 at 22:11, Mikhail Khludnev <m...@apache.org>:
> >
> >> Hi,
> >> Can't it be done better with FunctionQuery and proper ValueSources?
> >> Please also check Lucene Expressions.
> >>
> >> On Sat, Nov 30, 2024 at 9:00 PM Viacheslav Dobrynin <w.v.d...@gmail.com>
> >> wrote:
> >>
> >>> Hello!
> >>>
> >>> I have implemented a custom scoring mechanism. It works like a dot
> >>> product. I would like to ask how accurate and efficient my
> >>> implementation is; could you give me recommendations on how to improve it?
> >>>
> >>> Here are a couple of examples that I want to use this mechanism with.
> >>>
> >>> Example 1:
> >>> A document is encoded into a sparse vector, where the terms are the
> >>> positions in this vector. The score between a query and a document is
> >>> computed as the dot product of their vectors.
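The scoring described in Example 1 is a plain sparse dot product, which can be sketched without Lucene. The map-based vector representation (term index to weight) is an illustrative assumption; in the index itself each term position becomes a doc-values field.

```java
import java.util.HashMap;
import java.util.Map;

// Lucene-free sketch of Example 1's scoring: documents and queries are
// sparse vectors (term index -> weight), and the score is their dot
// product. Only terms present in both vectors contribute, which is why
// only the nonzero query components need to become query clauses.
public class SparseDotProduct {
    public static float score(Map<Integer, Float> query, Map<Integer, Float> doc) {
        // Iterate the smaller map and probe the larger one.
        Map<Integer, Float> small = query.size() <= doc.size() ? query : doc;
        Map<Integer, Float> large = (small == query) ? doc : query;
        float sum = 0f;
        for (Map.Entry<Integer, Float> e : small.entrySet()) {
            Float other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }
}
```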
> >>> To do this, I am building the following documents using PyLucene:
> >>>
> >>> doc = Document()
> >>> doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
> >>> doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, term].item()))
> >>> doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, term].item()))
> >>> doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, term].item()))
> >>>
> >>> To implement the described search mechanism, I implemented the
> >>> following Query:
> >>>
> >>> public class FieldValueAsScoreQuery extends Query {
> >>>
> >>>     private final String field;
> >>>     private final float queryTermValue;
> >>>
> >>>     public FieldValueAsScoreQuery(String field, float queryTermValue) {
> >>>         this.field = Objects.requireNonNull(field);
> >>>         if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
> >>>             throw new IllegalArgumentException(
> >>>                     "Query term value must be finite and non-NaN");
> >>>         }
> >>>         this.queryTermValue = queryTermValue;
> >>>     }
> >>>
> >>>     @Override
> >>>     public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
> >>>         return new Weight(this) {
> >>>             @Override
> >>>             public boolean isCacheable(LeafReaderContext ctx) {
> >>>                 return DocValues.isCacheable(ctx, field);
> >>>             }
> >>>
> >>>             @Override
> >>>             public Explanation explain(LeafReaderContext context, int doc) {
> >>>                 throw new UnsupportedOperationException();
> >>>             }
> >>>
> >>>             @Override
> >>>             public Scorer scorer(LeafReaderContext context) throws IOException {
> >>>                 return new Scorer(this) {
> >>>
> >>>                     private final NumericDocValues iterator =
> >>>                             context.reader().getNumericDocValues(field);
> >>>
> >>>                     @Override
> >>>                     public float score() throws IOException {
> >>>                         final int docId = docID();
> >>>                         assert docId != DocIdSetIterator.NO_MORE_DOCS;
> >>>                         // The doc values are already positioned on docId by
> >>>                         // iterator(); do not call advanceExact() inside an
> >>>                         // assert, since asserts (and their side effects)
> >>>                         // can be disabled at runtime.
> >>>                         return Float.intBitsToFloat((int) iterator.longValue())
> >>>                                 * queryTermValue * boost;
> >>>                     }
> >>>
> >>>                     @Override
> >>>                     public int docID() {
> >>>                         return iterator.docID();
> >>>                     }
> >>>
> >>>                     @Override
> >>>                     public DocIdSetIterator iterator() {
> >>>                         return iterator == null ? DocIdSetIterator.empty() : iterator;
> >>>                     }
> >>>
> >>>                     @Override
> >>>                     public float getMaxScore(int upTo) {
> >>>                         return Float.MAX_VALUE;
> >>>                     }
> >>>                 };
> >>>             }
> >>>         };
> >>>     }
> >>>
> >>>     @Override
> >>>     public String toString(String field) {
> >>>         return "FieldValueAsScoreQuery [field=" + this.field
> >>>                 + ", queryTermValue=" + this.queryTermValue + "]";
> >>>     }
> >>>
> >>>     @Override
> >>>     public void visit(QueryVisitor visitor) {
> >>>         if (visitor.acceptField(field)) {
> >>>             visitor.visitLeaf(this);
> >>>         }
> >>>     }
> >>>
> >>>     @Override
> >>>     public boolean equals(Object other) {
> >>>         return sameClassAs(other) && equalsTo(getClass().cast(other));
> >>>     }
> >>>
> >>>     private boolean equalsTo(FieldValueAsScoreQuery other) {
> >>>         return field.equals(other.field)
> >>>                 && Float.floatToIntBits(queryTermValue)
> >>>                         == Float.floatToIntBits(other.queryTermValue);
> >>>     }
> >>>
> >>>     @Override
> >>>     public int hashCode() {
> >>>         final int prime = 31;
> >>>         int hash = classHash();
> >>>         hash = prime * hash + field.hashCode();
> >>>         hash = prime * hash + Float.floatToIntBits(queryTermValue);
> >>>         return hash;
> >>>     }
> >>> }
> >>>
> >>> And then I build the boolean query as follows (using PyLucene):
> >>>
> >>> def build_query(query):
> >>>     builder = BooleanQuery.Builder()
> >>>     for term in torch.nonzero(query):
> >>>         field_name = to_field_name(term.item())
> >>>         value = query[term].item()
> >>>         builder.add(FieldValueAsScoreQuery(field_name, value),
> >>>                     BooleanClause.Occur.SHOULD)
> >>>     return builder.build()
> >>>
> >>> It seems to work, but I'm not sure if it's a good way to implement it.
> >>> Example 2:
> >>> I would also like to use this mechanism with the following index:
> >>>
> >>> term1 -> (doc_id1, score), (doc_idN, score), ...
> >>> termN -> (doc_id1, score), (doc_idN, score), ...
> >>>
> >>> where the resulting score is calculated as the sum of scores per
> >>> doc_id over the terms of a query.
> >>>
> >>> Thank you in advance!
> >>>
> >>> Best Regards,
> >>> Viacheslav Dobrynin
> >>>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org