Re: discountOverlaps option for QueryParser

Ahmet Arslan Sun, 20 Sep 2015 15:53:32 -0700

Hi Robert,

As I understand it, with SynonymQuery the recommendation is to perform all 
expansion at query time only,
and SynonymQuery will take care of the problem below:

"A query for text:TV will expand into (text:TV text:Television) and the lower 
docFreq for text:Television will give the documents that match "Television" a 
much higher score then docs that match "TV" comparably -- which may be somewhat 
counter intuitive to the client. Index time expansion (or reduction) will 
result in the same idf for all documents regardless of which term the original 
text contained."

At the end of the query analysis, if there are tokens at the same position, I 
need to create my SynonymQuery programmatically, right?

Let me explain my concern with another example:

<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>

With the above analyzer, the query "foo bör" boosts the term "bör" for no 
reason, simply because bör is expanded into two terms: bor and bör.
Its contribution to the total score is counted twice. I think this is very 
trappy.
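
To make the double counting concrete, here is a minimal sketch in plain Java. It uses a toy tf * idf score, not Lucene's actual Similarity, and the collection size and document frequencies are made-up numbers. It also assumes the same analyzer runs at index time, so a matching document holds both bor and bör at the same position. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Toy model of how a plain disjunction adds up clause scores. Not Lucene's Similarity. */
final class OverlapBoostDemo {

    /** Toy idf, assumed for illustration only. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Sums tf * idf over the given query terms for a single document. */
    static double score(List<String> queryTerms, Map<String, Integer> docTermFreqs,
                        Map<String, Integer> docFreqs, int numDocs) {
        double total = 0.0;
        for (String term : queryTerms) {
            int tf = docTermFreqs.getOrDefault(term, 0);
            if (tf > 0) {
                total += tf * idf(docFreqs.get(term), numDocs);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical collection size
        // Hypothetical document frequencies; "b\u00f6r" is the accented original.
        Map<String, Integer> docFreqs = Map.of("foo", 100, "bor", 100, "b\u00f6r", 40);
        // One document whose second position holds both the folded and the original token.
        Map<String, Integer> doc = Map.of("foo", 1, "bor", 1, "b\u00f6r", 1);

        double fooPart = score(List.of("foo"), doc, docFreqs, numDocs);
        double overlapPart = score(List.of("bor", "b\u00f6r"), doc, docFreqs, numDocs);
        System.out.println("foo contributes:              " + fooPart);
        System.out.println("expanded position contributes: " + overlapPart);
    }
}
```

With these numbers the expanded position contributes more than twice what foo does, although the user typed each word once.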

With the SynonymQuery solution, I will index with StandardTokenizer only, 
with no expansion at index time.
I will construct the query as: new TermQuery('foo') + new SynonymQuery('bor', 
'bör');

Thanks,
Ahmet

On Monday, September 21, 2015 12:33 AM, Robert Muir <[email protected]> wrote:
Hi Ahmet, maybe have a look at the SynonymQuery added in
https://issues.apache.org/jira/browse/LUCENE-6789

For query-time synonyms, it just tries to approximate what happens if
you instead do this work at index-time, by creating a "pseudo-term"
(a disjunction of all terms at that same position) and summing up the term
frequency across all matching terms before passing it to score(). For the
statistics side it takes the maximum DF as the representative DF, and
the sum of the TTF as the representative TTF.
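
The merging rule described above can be sketched in plain Java as follows. This only illustrates the description (sum of term frequencies, maximum DF, sum of TTF); it is not the code from LUCENE-6789, and the class name, method name, and sample numbers are made up.

```java
/** Statistics of a "pseudo-term" built from all terms at one query position. */
final class PseudoTermStats {
    final long termFreq;       // within-document frequency, summed over the terms
    final long docFreq;        // representative DF: the maximum over the terms
    final long totalTermFreq;  // representative TTF: the sum over the terms

    PseudoTermStats(long termFreq, long docFreq, long totalTermFreq) {
        this.termFreq = termFreq;
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
    }

    /** Merges per-term statistics; index i in each array refers to the same term. */
    static PseudoTermStats merge(long[] termFreqs, long[] docFreqs, long[] totalTermFreqs) {
        if (termFreqs.length != docFreqs.length || termFreqs.length != totalTermFreqs.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        long tf = 0;
        long df = 0;
        long ttf = 0;
        for (int i = 0; i < termFreqs.length; i++) {
            tf += termFreqs[i];
            df = Math.max(df, docFreqs[i]);
            ttf += totalTermFreqs[i];
        }
        return new PseudoTermStats(tf, df, ttf);
    }

    public static void main(String[] args) {
        // Hypothetical statistics for two terms sharing one position.
        PseudoTermStats stats = merge(
                new long[] {1, 1},      // term frequencies in the current document
                new long[] {100, 40},   // document frequencies
                new long[] {250, 60});  // total term frequencies
        System.out.println("tf=" + stats.termFreq
                + " df=" + stats.docFreq
                + " ttf=" + stats.totalTermFreq);
    }
}
```

The merged values are then scored once, as if they belonged to a single term, so the position is no longer counted once per variant.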

I did relevance experiments with this and the results were positive
over the existing query generated (BooleanQuery with coord disabled),
especially for scoring systems that don't do anything with coord.

On Sun, Sep 20, 2015 at 1:56 PM, Ahmet Arslan <[email protected]> wrote:
> Hello,
>
> Assume that term t1 is expanded into multiple terms (at the same position) 
> during both indexing and query time.
> This is possible with, for instance, KeywordRepeat, SynonymFilter, or filters 
> that have a preserveOriginal option.
>
> When a two-term query (t1 t2) is executed, term t1 is boosted artificially.
> The score contribution of term t1 is counted multiple times.
> It is as if the query were issued with boosts: t1^3 t2
> This behaviour boosts expanded terms and may not always be desired
> (e.g. when t2 is a content-bearing word).
>
> I think there should be a flag/switch analogous to the relationship 
> between discountOverlaps and the document's length.
> With this control, the scores of overlapping query terms (tokens with a 
> position increment of zero) are counted once.
> The remaining (non-overlapping) terms are not affected.
>
> Bruno asked for this functionality in the past : 
> http://find.searchhub.org/document/bb99e435ba35f2b1
>
> What do you think about this? How difficult would it be to implement?
> Would this be a Lucene or Solr issue?
>
> Thanks,
> Ahmet
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

