Re: DisMaxQuery calculating too high sumOfSquaredWeights?

Jan Kurella Fri, 26 Nov 2010 05:51:22 -0800

On 26.11.2010 14:39, ext Jan Kurella wrote:

Hi there,
I have been composing a query on my own, the way Solr.DisMaxQueryHandler would, because I need a different tokenizing strategy for languages that are not whitespace-separated, and more. I took the concept from
http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

Assume now the following:
Documents have two fields, "title" and "tag". User input can match any field, but must be found almost in full.
Document: <title:blue star> <tag:have fun>

Query: "blue star fun"

And my Query from my query parser looks like the following:

BooleanQuery (
    DisjunctionMaxQuery (
        SpanTermQuery(title:blue),
        SpanTermQuery(tag:blue)
    ),
    DisjunctionMaxQuery (
        SpanTermQuery(title:star),
        SpanTermQuery(tag:star)
    ),
    DisjunctionMaxQuery (
        SpanTermQuery(title:fun),
        SpanTermQuery(tag:fun)
    ),
    minShouldMatch = 2
)
Obviously this is a "full match", meaning all three terms are found, and from a subjective user perspective its score should not differ much from that of a pure OR query "blue star fun" with all tokens in the same field. But surprisingly, the score from the DMQuery is extremely low!
Looking into it, it turns out that the queryNorm multiplied into the queryWeight of each SpanTermQuery is very small (0.16). The BooleanQuery calculates it from the sum of the sumOfSquaredWeights() of each DMQuery. And here is the problem: the idf of the STQuery (or a TermQuery) used to compute the weight is very high for a term that is not present (that is on purpose). Unfortunately, the DMQuery takes the highest idf (assuming tie=0.0) from all of its clauses.
By design, for the whole dismax query the chance that some term will not be found in a given DMQuery is near 100%, especially if you search across many fields. Thus, the idf of a DMQuery is almost always equal to that of a TermQuery whose term is not found. But for scoring, only the clause of the DMQuery that hit is taken into account. This leads to scores that are too small!
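The arithmetic behind this can be sketched in plain Java without any Lucene dependency. The sketch assumes the DefaultSimilarity-style formulas idf = 1 + ln(numDocs / (docFreq + 1)) and queryNorm = 1 / sqrt(sumOfSquaredWeights), a tie of 0.0, and boosts of 1.0. The class name, the index size, and the document frequencies are made up for illustration and are not taken from the index discussed here.

```java
final class DisMaxNormSketch {
    // Assumed DefaultSimilarity-style idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Contribution of one term clause with boost 1.0: idf squared.
    static double termContribution(int docFreq, int numDocs) {
        double weight = idf(docFreq, numDocs);
        return weight * weight;
    }

    // DMQuery with tie = 0.0 and boost 1.0: only the largest clause counts.
    static double disMaxContribution(double... clauses) {
        double max = 0.0;
        for (double clause : clauses) {
            max = Math.max(max, clause);
        }
        return max;
    }

    // Assumed queryNorm: 1 / sqrt(sumOfSquaredWeights).
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        int numDocs = 1000;  // hypothetical index size
        int dfPresent = 50;  // hypothetical df of a term in the field where it occurs
        int dfAbsent = 0;    // the same term does not occur in the other field

        double present = termContribution(dfPresent, numDocs);
        double absent = termContribution(dfAbsent, numDocs);

        // Three DMQueries, each with one present and one absent term clause.
        double withAbsent = 3 * disMaxContribution(present, absent);
        // The same query if only the clauses that can actually hit were counted.
        double onlyPresent = 3 * disMaxContribution(present);

        System.out.println("queryNorm, absent terms dominate: " + queryNorm(withAbsent));
        System.out.println("queryNorm, present terms only:    " + queryNorm(onlyPresent));
    }
}
```

Under these assumptions the absent term always has the larger idf, so it decides the contribution of every DMQuery and pushes the queryNorm down, even though it can never contribute to a hit.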
What I think would be the correct idf for a DMQuery with pure TermQueries is rather something like
if any term matches
    take the highest (plus tiestuff) idf from these clauses,
else
    take the highest idf
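The rule above could be written out roughly as follows. This is only a sketch: it interprets "matches" as "the term has a document frequency greater than 0", ignores the tie handling (tie=0.0), and uses a hypothetical helper chooseIdf that does not exist in Lucene.

```java
final class DisMaxIdfRuleSketch {
    // docFreqs[i] and idfs[i] describe the term of clause i (hypothetical inputs).
    static double chooseIdf(int[] docFreqs, double[] idfs) {
        if (docFreqs.length != idfs.length) {
            throw new IllegalArgumentException("docFreqs and idfs must have the same length");
        }
        double highestOverall = 0.0;
        double highestPresent = 0.0;
        boolean anyPresent = false;
        for (int i = 0; i < idfs.length; i++) {
            highestOverall = Math.max(highestOverall, idfs[i]);
            if (docFreqs[i] > 0) {
                // Assumption: a term "matches" if it exists in the index at all.
                anyPresent = true;
                highestPresent = Math.max(highestPresent, idfs[i]);
            }
        }
        // Prefer the highest idf among present terms; otherwise fall back to the overall highest.
        return anyPresent ? highestPresent : highestOverall;
    }

    public static void main(String[] args) {
        // One present term (df 50) and one absent term (df 0); idf values are made up.
        System.out.println(chooseIdf(new int[] {50, 0}, new double[] {3.9, 7.9}));
    }
}
```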
Unfortunately, by the time sumOfSquaredWeights() is calculated, the idf has already been calculated in a generally correct way, and I do not see a way to know in DisjunctionMaxQuery.DisjunctionMaxWeight.sumOfSquaredWeights() whether a returned currentWeight.sumOfSquaredWeights() comes from a TermQuery whose only term has a df of 0.
How can this problem be solved to get a "better" sumOfSquaredWeights() from DisMaxQuery? The current value does not reflect the intention of this query.
Jan


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

What about this?

public float sumOfSquaredWeights() throws IOException {
    // Track the smallest clause contribution.
    float min = Float.MAX_VALUE, sum = 0.0f;
    for (Weight currentWeight : weights) {
        float sub = currentWeight.sumOfSquaredWeights();
        sum += sub;
        min = Math.min(min, sub);
    }
    // No clauses at all: fall back to 0.
    if (min == Float.MAX_VALUE) min = 0.0f;
    float boost = getBoost();
    return (((sum - min) * tieBreakerMultiplier * tieBreakerMultiplier) + min) * boost * boost;
}
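To see what this change does to the numbers, the following stand-alone sketch compares a max-based combination (the behaviour described earlier in this thread, where the highest clause value wins) with the min-based proposal. It works on plain float arrays instead of Lucene Weight objects; the class name, method names, and input values are hypothetical.

```java
final class SumOfSquaredWeightsComparison {
    // Max-based combination: the largest clause contribution dominates.
    static float maxBased(float[] subs, float tie, float boost) {
        float max = 0.0f, sum = 0.0f;
        for (float sub : subs) {
            sum += sub;
            max = Math.max(max, sub);
        }
        return (((sum - max) * tie * tie) + max) * boost * boost;
    }

    // Min-based combination: the smallest clause contribution dominates.
    static float minBased(float[] subs, float tie, float boost) {
        float min = Float.MAX_VALUE, sum = 0.0f;
        for (float sub : subs) {
            sum += sub;
            min = Math.min(min, sub);
        }
        if (min == Float.MAX_VALUE) min = 0.0f; // no clauses
        return (((sum - min) * tie * tie) + min) * boost * boost;
    }

    public static void main(String[] args) {
        // Made-up clause values: a present term and an absent (df = 0) term.
        float[] subs = {15.8f, 62.5f};
        System.out.println("max-based: " + maxBased(subs, 0.0f, 1.0f));
        System.out.println("min-based: " + minBased(subs, 0.0f, 1.0f));
    }
}
```

Note that when several clauses all have terms that exist in the index, the min-based variant uses the smallest of their contributions, which is not quite the "highest idf among matching clauses" rule described earlier in the thread.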




