Re: discountOverlaps option for QueryParser

Ahmet Arslan Sun, 20 Sep 2015 17:16:29 -0700

Hi Doug,

Boosting exact matches is not my primary concern.
By the way, the ideal way to aggregate scores coming from different fields 
remains unclear.
Maybe a geometric mean is better than summing the field scores?
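
To make the comparison concrete, here is a minimal sketch in plain Java (no 
Lucene involved). The class name, method names, and score values are invented 
for illustration; it only contrasts the two aggregation functions on a pair of 
per-field scores.

```java
/** Toy comparison of two ways to combine per-field scores; not Lucene code. */
final class FieldScoreAggregation {

    /** Sum of the two field scores. */
    static double sum(double a, double b) {
        return a + b;
    }

    /** Geometric mean of the two field scores (both assumed non-negative). */
    static double geometricMean(double a, double b) {
        return Math.sqrt(a * b);
    }

    public static void main(String[] args) {
        // Hypothetical scores: both fields match.
        System.out.println(sum(4.0, 1.0));           // 5.0
        System.out.println(geometricMean(4.0, 1.0)); // 2.0
        // Hypothetical scores: only the first field matches.
        System.out.println(sum(4.0, 0.0));           // 4.0
        System.out.println(geometricMean(4.0, 0.0)); // 0.0
    }
}
```

Note that the geometric mean drops to zero as soon as one field does not match, 
so it behaves more like a conjunction across fields than the sum does.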

I just want to warn people: if filters that produce multiple tokens at the same 
position are used carelessly, they can cause non-obvious boosts in a 
query. 
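
The effect can be sketched with a toy model in plain Java (this is not Lucene 
code). It uses the "foo bör" query from the analyzer example quoted below, where 
bör expands into bor and bör at the same position. The term weights are made-up 
numbers, and oncePerPosition is a hypothetical stand-in for the requested 
switch; taking the best term per position is only an assumption and is not how 
SynonymQuery combines statistics.

```java
import java.util.List;
import java.util.Map;

/** Toy model: each query position holds the terms emitted at that position. */
final class OverlapScoringSketch {

    /** Plain disjunction: every term at every position adds its weight. */
    static double sumAllTerms(List<List<String>> positions, Map<String, Double> weights) {
        double score = 0.0;
        for (List<String> termsAtPosition : positions) {
            for (String term : termsAtPosition) {
                score += weights.getOrDefault(term, 0.0);
            }
        }
        return score;
    }

    /** Hypothetical discounting: each position counts once (best term wins). */
    static double oncePerPosition(List<List<String>> positions, Map<String, Double> weights) {
        double score = 0.0;
        for (List<String> termsAtPosition : positions) {
            double best = 0.0;
            for (String term : termsAtPosition) {
                best = Math.max(best, weights.getOrDefault(term, 0.0));
            }
            score += best;
        }
        return score;
    }

    public static void main(String[] args) {
        // "b\u00f6r" is bör; position 2 holds both the folded and the original token.
        List<List<String>> query = List.of(List.of("foo"), List.of("bor", "b\u00f6r"));
        Map<String, Double> weights = Map.of("foo", 1.0, "bor", 1.0, "b\u00f6r", 1.0);
        System.out.println(sumAllTerms(query, weights));     // 3.0
        System.out.println(oncePerPosition(query, weights)); // 2.0
    }
}
```

With equal weights, the expanded position contributes twice as much as foo in 
the plain disjunction, which is the hidden boost in question.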

Thanks,
Ahmet

On Monday, September 21, 2015 2:38 AM, Doug Turnbull 
<[email protected]> wrote:

Another option, Ahmet, would be to create two fields: one that does ASCII 
folding *without* preserving the original, and another that does no folding at 
all. The ASCII folded version is a less exacting representation of the text, 
and the version without ASCII folding is more exacting.

My first pass at a solution to your problem would be summing the two fields' 
scores. Scoring the ASCII folded field provides a higher recall signal. I'll 
call this the "base score." Scoring the non-ASCII folded field provides a more 
precise ranking signal. It kicks in only when the searcher types the exact 
non-ASCII folded term. In a sense it acts like how most people think of a boost: 
bonus points for harder to meet but valuable criteria. 

In other words, if you match on just bor, you just get the base score. If you 
match on bör you'd gain the benefit of the base and the additional boost 
scores. The more exacting, non ASCII folded version of the field acts as a 
boost.
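
A rough sketch of that two-field idea in plain Java, under stated assumptions: 
the fold method uses java.text.Normalizer as a simplified stand-in for 
ASCIIFoldingFilter (it only strips combining marks), and the BASE_SCORE and 
EXACT_BONUS constants are made-up values rather than real Lucene scores.

```java
import java.text.Normalizer;

/** Toy two-field scoring: folded field gives a base score, exact field a bonus. */
final class TwoFieldScoringSketch {
    static final double BASE_SCORE = 1.0;  // assumed score for a folded-field match
    static final double EXACT_BONUS = 0.5; // assumed score for an exact-field match

    /** Simplified folding: decompose, then drop combining marks (bör becomes bor). */
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    /** Sums the scores of the folded field and the non-folded field. */
    static double score(String queryTerm, String indexedTerm) {
        double score = 0.0;
        if (fold(queryTerm).equals(fold(indexedTerm))) {
            score += BASE_SCORE;  // high-recall signal
        }
        if (queryTerm.equals(indexedTerm)) {
            score += EXACT_BONUS; // more precise signal, acts like a boost
        }
        return score;
    }

    public static void main(String[] args) {
        // "b\u00f6r" is bör; the document contains bör in both cases.
        System.out.println(score("bor", "b\u00f6r"));      // 1.0
        System.out.println(score("b\u00f6r", "b\u00f6r")); // 1.5
    }
}
```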

On the other hand, if you don't care to differentiate between a match on an 
ASCII folded or non-folded version, then simply create the base ASCII folded 
field and score against that.

Shameless plug, this is exactly the sort of thing we talk quite a bit about in 
John Berryman's and my book, Relevant Search (http://manning.com/turnbull). You 
might find it useful.

Cheers
-Doug

On Sunday, September 20, 2015, Ahmet Arslan <[email protected]> wrote:

>Hi Robert,
>
>As I understand it, with SynonymQuery, all expansion is recommended to be 
>performed at query time only,
>and SynonymQuery will take care of the problem below:
>
>"A query for text:TV will expand into (text:TV text:Television) and the lower 
>docFreq for text:Television will give the documents that match "Television" a 
>much higher score then docs that match "TV" comparably -- which may be 
>somewhat counter intuitive to the client. Index time expansion (or reduction) 
>will result in the same idf for all documents regardless of which term the 
>original text contained."
>
>
>At the end of the query analysis, if there are tokens at the same position, I 
>need to create my SynonymQuery programmatically, right?
>
>
>Let me explain my concern with another example:
>
><analyzer>
><tokenizer class="solr.StandardTokenizerFactory"/>
><filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
></analyzer>
>
>
>With the above analyzer, the query "foo bör" will boost the term "bör" for no 
>reason,
>just because bör will be expanded into two terms: bor and bör.
>Its contribution to the total score is counted twice. I think this is very 
>trappy.
>
>With the SynonymQuery solution, I will index with StandardTokenizer only.
>No expansion at index time.
>I will construct the query : new TermQuery('foo') + new SynonymQuery('bor', 
>'bör');
>
>Thanks,
>Ahmet
>
>
>
>
>On Monday, September 21, 2015 12:33 AM, Robert Muir <[email protected]> wrote:
>Hi Ahmet, maybe have a look at the SynonymQuery added in
>https://issues.apache.org/jira/browse/LUCENE-6789
>
>For query-time synonyms, it just tries to approximate what happens if
>you instead do this work at index-time, by creating a "pseudo-term"
>(disjunction of all terms at that same position) summing up the term
>frequency across all matching terms before passing to score(). For the
>statistics side it takes the maximum DF as the representative DF, and
>the sum of the TTF as the representative TTF.
>
>I did relevance experiments with this and the results were positive
>over the existing query generated (BooleanQuery with coord disabled),
>especially for scoring systems that don't do anything with coord.
>
>
>On Sun, Sep 20, 2015 at 1:56 PM, Ahmet Arslan <[email protected]> 
>wrote:
>> Hello,
>>
>> Assume that term t1 is expanded into multiple terms (at the same position) 
>> during both indexing and query time.
>> This is possible with KeywordRepeat, SynonymFilter, or filters that have a 
>> preserveOriginal option, for instance.
>>
>> When a two-term query (t1 t2) is executed, term t1 is boosted artificially.
>> The score contribution of term t1 is counted multiple times.
>> It is as if the query were issued with boosts: t1^3 t2
>> This behaviour boosts expanded terms and may not always be desired
>> (e.g. when t2 is a content-bearing word).
>>
>> I think there should be a flag/switch which is analogous to the relationship 
>> between discountOverlaps & document length.
>> With this control, the scores of overlapping query terms (tokens with a 
>> position increment of zero) are counted once.
>> The remaining terms (the non-overlapping ones) are not affected.
>>
>> Bruno asked for this functionality in the past : 
>> http://find.searchhub.org/document/bb99e435ba35f2b1
>>
>> What do you think about this? How difficult would it be to implement?
>> Would this be a Lucene or Solr issue?
>>
>> Thanks,
>> Ahmet
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]

-- 

Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC | 
240.476.9983 

Author: Relevant Search
