Hi,
the expressions query should not be slower. Of course, if you also include
the compilation in the query time measurement, it may be a little slower
due to compilation and optimization. In general, queries should be warmed
up before measuring them, and expressions should only be compiled once and
reused many times for querying. As your expression query is constant,
you can make a static instance out of it.
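A minimal sketch of that static-instance pattern in plain Java; `compileExpression` is a hypothetical stand-in for the expensive `JavascriptCompiler.compile(...)` call, so the snippet stays self-contained:

```java
import java.util.function.DoubleUnaryOperator;

public class StaticExpressionHolder {
    // Stand-in for JavascriptCompiler.compile(...): in real code this would
    // return a compiled Lucene Expression. Compiled expressions are
    // thread-safe, so a single instance can serve all searches.
    private static DoubleUnaryOperator compileExpression(String expr) {
        return x -> 2.0 * x + 1.0; // stands in for the compiled "2 * x + 1"
    }

    // Compiled exactly once at class-load time, then reused for every query.
    public static final DoubleUnaryOperator DOT_PRODUCT =
        compileExpression("2 * x + 1");
}
```

The point is only that compilation happens once, not per search.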
Be sure to use Lucene 10.x with expressions, this has some optimizations
that make spin-up time shorter due to use of Java 15+ features for
anonymous classes and dynamic constants, which are not available in
Lucene 9.x.
Uwe
On 01.12.2024 at 10:57, Viacheslav Dobrynin wrote:
Hi!
Thank you for your reply!
I tried the recommendations, and below is example code for both query
implementations. The query with the expression works a little slower;
I think this is due to the need for compilation.
I have one more question: please tell me which type of field is best suited
for my terms?
I am currently using the following field:
doc.add(FloatDocValuesField(j.toFieldName(), value))
However, there may be a more suitable option.
For example, there is the following description in the changelog for Lucene
10:
- Lucene now supports sparse indexing on doc values via
FieldType#setDocValuesSkipIndexType. The sparse index will record the
minimum and maximum values per block of doc IDs. Used in conjunction with
index sorting to cluster similar documents together, this allows for very
space-efficient and CPU-efficient filtering.
However, I find it difficult to answer how applicable this is to my case.
Code examples:
fun searchWithFunctionQuery(queryEmb: FloatArray) {
    val reader = DirectoryReader.open(FSDirectory.open(indexPath))
    val searcher = IndexSearcher(reader)
    val productFunctions = mutableListOf<ValueSource>()
    for ((i, qi) in queryEmb.withIndex()) {
        if (qi == 0f) continue
        productFunctions += ProductFloatFunction(
            arrayOf(ConstValueSource(qi), FloatFieldSource(i.toFieldName()))
        )
    }
    val dotProductQuery = FunctionQuery(SumFloatFunction(productFunctions.toTypedArray()))
    val hits = searcher.search(dotProductQuery, 10).scoreDocs
    println("Hits: ${hits.contentToString()}")
    // Iterate through the results:
    val storedFields = searcher.storedFields()
    for (i in hits.indices) {
        val hitDoc = storedFields.document(hits[i].doc)
        println("Found doc: $hitDoc. Score: ${hits[i].score}")
    }
    reader.close()
}
fun searchWithExpression(queryEmb: FloatArray) {
    val reader = DirectoryReader.open(FSDirectory.open(indexPath))
    val searcher = IndexSearcher(reader)
    searcher.similarity = TodoSimilarity()
    val expressionBuilder = StringBuilder()
    val bindings = SimpleBindings()
    for ((i, qi) in queryEmb.withIndex()) {
        if (qi == 0f) continue
        if (expressionBuilder.isNotBlank()) expressionBuilder.append(" + ")
        expressionBuilder.append(qi).append(" * ").append(i.toFieldName())
        bindings.add(i.toFieldName(), DoubleValuesSource.fromFloatField(i.toFieldName()))
    }
    val dotProductExpression = JavascriptCompiler.compile(expressionBuilder.toString())
    val dotProductQuery = FunctionScoreQuery(
        MatchAllDocsQuery(),
        dotProductExpression.getDoubleValuesSource(bindings)
    )
    val hits = searcher.search(dotProductQuery, 10).scoreDocs
    println("Hits: ${hits.contentToString()}")
    // Iterate through the results:
    val storedFields = searcher.storedFields()
    for (i in hits.indices) {
        val hitDoc = storedFields.document(hits[i].doc)
        println("Found doc: $hitDoc. Score: ${hits[i].score}")
    }
    reader.close()
}
Best Regards,
Viacheslav Dobrynin!
On Sat, Nov 30, 2024 at 22:11, Mikhail Khludnev <m...@apache.org> wrote:
Hi,
Couldn't this be done better with FunctionQuery and proper ValueSources? Please
also check Lucene Expressions.
On Sat, Nov 30, 2024 at 9:00 PM Viacheslav Dobrynin <w.v.d...@gmail.com>
wrote:
Hello!
I have implemented a custom scoring mechanism. It works like a dot
product.
I would like to ask how accurate and efficient my implementation is;
could you give me recommendations on how to improve it?
Here are a couple of examples that I want to use this mechanism with.
Example 1:
A document is encoded into a sparse vector, where the terms are the
positions in this vector. The score between a query and a document is
computed as the dot product of their vectors.
To do this, I am building the following documents using PyLucene:
doc = Document()
doc.add(StringField("doc_id", str(doc_id), Field.Store.YES))
doc.add(FloatDocValuesField("term_0", emb_batch[batch_idx, term].item()))
doc.add(FloatDocValuesField("term_1", emb_batch[batch_idx, term].item()))
doc.add(FloatDocValuesField("term_N", emb_batch[batch_idx, term].item()))
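The scoring described above amounts to a sparse dot product. As a plain-Java sketch outside Lucene (representing each sparse vector as a map from term position to weight is an assumption for illustration):

```java
import java.util.Map;

public class SparseDotProduct {
    // Computes the dot product of two sparse vectors, each represented as a
    // map from term position to weight. Only positions present in both
    // vectors contribute, which is why only the query's non-zero terms need
    // to become clauses.
    public static float dot(Map<Integer, Float> query, Map<Integer, Float> doc) {
        float sum = 0f;
        for (Map.Entry<Integer, Float> e : query.entrySet()) {
            Float docValue = doc.get(e.getKey());
            if (docValue != null) {
                sum += e.getValue() * docValue;
            }
        }
        return sum;
    }
}
```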
To support this search mechanism, I implemented the following Query:
public class FieldValueAsScoreQuery extends Query {

    private final String field;
    private final float queryTermValue;

    public FieldValueAsScoreQuery(String field, float queryTermValue) {
        this.field = Objects.requireNonNull(field);
        if (Float.isInfinite(queryTermValue) || Float.isNaN(queryTermValue)) {
            throw new IllegalArgumentException("Query term value must be finite and non-NaN");
        }
        this.queryTermValue = queryTermValue;
    }

    @Override
    public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
        return new Weight(this) {
            @Override
            public boolean isCacheable(LeafReaderContext ctx) {
                return DocValues.isCacheable(ctx, field);
            }

            @Override
            public Explanation explain(LeafReaderContext context, int doc) {
                throw new UnsupportedOperationException();
            }

            @Override
            public Scorer scorer(LeafReaderContext context) throws IOException {
                // Fetch the doc values up front; a segment without this field
                // produces no matches instead of an NPE in score().
                final NumericDocValues values = context.reader().getNumericDocValues(field);
                if (values == null) {
                    return null;
                }
                return new Scorer(this) {
                    @Override
                    public float score() throws IOException {
                        assert docID() != DocIdSetIterator.NO_MORE_DOCS;
                        // The doc-values iterator itself drives this scorer,
                        // so it is already positioned on the current document.
                        // (Do not wrap a positioning call like advanceExact in
                        // an assert: with assertions disabled it is skipped.)
                        return Float.intBitsToFloat((int) values.longValue())
                            * queryTermValue * boost;
                    }

                    @Override
                    public int docID() {
                        return values.docID();
                    }

                    @Override
                    public DocIdSetIterator iterator() {
                        return values;
                    }

                    @Override
                    public float getMaxScore(int upTo) {
                        return Float.MAX_VALUE;
                    }
                };
            }
        };
    }

    @Override
    public String toString(String field) {
        return "FieldValueAsScoreQuery [field=" + this.field
            + ", queryTermValue=" + this.queryTermValue + "]";
    }

    @Override
    public void visit(QueryVisitor visitor) {
        if (visitor.acceptField(field)) {
            visitor.visitLeaf(this);
        }
    }

    @Override
    public boolean equals(Object other) {
        return sameClassAs(other) && equalsTo(getClass().cast(other));
    }

    private boolean equalsTo(FieldValueAsScoreQuery other) {
        return field.equals(other.field)
            && Float.floatToIntBits(queryTermValue) == Float.floatToIntBits(other.queryTermValue);
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int hash = classHash();
        hash = prime * hash + field.hashCode();
        hash = prime * hash + Float.floatToIntBits(queryTermValue);
        return hash;
    }
}
And then I build a Boolean query as follows (using PyLucene):
def build_query(query):
    builder = BooleanQuery.Builder()
    for term in torch.nonzero(query):
        field_name = to_field_name(term.item())
        value = query[term].item()
        builder.add(FieldValueAsScoreQuery(field_name, value),
                    BooleanClause.Occur.SHOULD)
    return builder.build()
It seems to work, but I'm not sure if it's a good way to implement it.
Example 2:
I would also like to use this mechanism for the following index:
term1 -> (doc_id1, score), (doc_idN, score), ...
termN -> (doc_id1, score), (doc_idN, score), ...
where the resulting score is calculated as:
sum(scores) by doc_id for the terms in a query
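That aggregation can be sketched in plain Java (the postings layout below is an assumption for illustration, mirroring "term -> (doc_id, score), ..."):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScoreAccumulator {
    // One posting: a (docId, score) pair from a term's postings list.
    public record Posting(int docId, float score) {}

    // Sums the per-term scores by docId over the postings lists of the
    // query's terms, i.e. "sum(scores) by doc_id".
    public static Map<Integer, Float> accumulate(List<List<Posting>> postingsPerTerm) {
        Map<Integer, Float> totals = new HashMap<>();
        for (List<Posting> postings : postingsPerTerm) {
            for (Posting p : postings) {
                totals.merge(p.docId(), p.score(), Float::sum);
            }
        }
        return totals;
    }
}
```

A BooleanQuery of SHOULD clauses does essentially this, one clause per term, with the scorers merged doc-at-a-time instead of via a hash map.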
Thank you in advance!
Best Regards,
Viacheslav Dobrynin!
--
Sincerely yours
Mikhail Khludnev
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org