[ 
https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Arslan updated LUCENE-6818:
---------------------------------
    Attachment: LUCENE-6818.patch

I tried to implement Robert's suggestion in 
{{TestSimilarityBase#testCrazyIndexTimeBoosts}}.
It iterates over all possible norm values and 10 different term frequency (_tf_) 
values, checking for NaN, Infinity, and negative scores. But I am not sure about 
the negative check: some models can return negative scores for certain terms. For 
example, BM25 can return negative scores for common terms.
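
To make the BM25 remark concrete, here is a minimal sketch of the textbook (Robertson/Sparck Jones) IDF component. The class and method names are hypothetical, and a given library may use a variant of the formula that avoids negative values, so this only illustrates the classic form:

```java
/** Hypothetical helper, not part of Lucene: the textbook BM25 IDF component. */
final class ClassicBm25Idf {
    private ClassicBm25Idf() {}

    /**
     * idf = ln((N - n + 0.5) / (n + 0.5)), where N is the number of documents
     * and n is the number of documents containing the term.
     * The value drops below zero once the term occurs in more than half of the documents.
     */
    static double idf(long docCount, long docFreq) {
        return Math.log((docCount - docFreq + 0.5) / (docFreq + 0.5));
    }
}
```

With 1000 documents, a term found in 900 of them gets a negative IDF, while a term found in 10 of them gets a positive one. Since the IDF multiplies a non-negative tf component, the sign carries over to the score.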

Currently only DFI is tested, because the other models fail the test in its 
current form.
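
The shape of such a check can be sketched without any Lucene dependency. Everything below is an illustrative assumption rather than the code in the patch: the {{Scorer}} interface, the class name, the invented collection statistics, and the DFI-style function, which is only loosely modeled on the standardized measure from the cited paper. Document lengths are passed in directly instead of being decoded from real norm bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sanity check: scan (tf, docLen) pairs for suspicious scores. */
final class ScoreSanityCheck {
    private ScoreSanityCheck() {}

    /** Hypothetical scoring hook: (term frequency, document length) -> score. */
    interface Scorer {
        double score(double tf, double docLen);
    }

    /** Returns a description of every pair whose score is NaN, infinite, or negative. */
    static List<String> findBadScores(Scorer scorer, double[] docLens, double[] tfs) {
        List<String> bad = new ArrayList<>();
        for (double len : docLens) {
            for (double tf : tfs) {
                double s = scorer.score(tf, len);
                if (Double.isNaN(s) || Double.isInfinite(s) || s < 0) {
                    bad.add("tf=" + tf + " len=" + len + " score=" + s);
                }
            }
        }
        return bad;
    }

    /** DFI-style stand-in; the collection statistics are made up for illustration. */
    static double dfiLike(double tf, double docLen) {
        final double totalTermFreq = 1_000;           // assumed
        final double numberOfFieldTokens = 1_000_000; // assumed
        double expected = (totalTermFreq + 1) * docLen / (numberOfFieldTokens + 1);
        if (tf <= expected) {
            return 0; // term is no more frequent than expected under independence
        }
        double measure = (tf - expected) / Math.sqrt(expected);
        return Math.log(measure + 1) / Math.log(2);
    }
}
```

Calling {{findBadScores(ScoreSanityCheck::dfiLike, lens, tfs)}} with 256 assumed lengths and tf values 1 to 10 returns an empty list when every score is finite and non-negative. A document length of 0 makes the expected frequency 0 and the stand-in score infinite, which is the kind of edge case the check is meant to surface.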

Some random questions:

What is the preferred course of action during scoring when term frequency is 
greater than document length?


I think we should simply recommend using index-time boosts only with 
ClassicSimilarity. I wonder how SweetSpotSimilarity works with index-time 
boosts, where artificially shortening the document length may decrease its rank.

> Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-6818
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6818
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/query/scoring
>    Affects Versions: 5.3
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>              Labels: similarity
>             Fix For: Trunk
>
>         Attachments: LUCENE-6818.patch, LUCENE-6818.patch, LUCENE-6818.patch, 
> LUCENE-6818.patch
>
>
> As explained in the 
> [write-up|http://lucidworks.com/blog/flexible-ranking-in-lucene-4], many 
> state-of-the-art ranking model implementations have been added to Apache Lucene. 
> This issue aims to include the DFI model, which is the non-parametric counterpart 
> of the Divergence from Randomness (DFR) framework.
> DFI is both parameter-free and non-parametric:
> * parameter-free: it does not require any parameter tuning or training.
> * non-parametric: it does not make any assumptions about word frequency 
> distributions on document collections.
> It is highly recommended *not* to remove stopwords (very common terms: the, 
> of, and, to, a, in, for, is, on, that, etc) with this similarity.
> For more information see: [A nonparametric term weighting method for 
> information retrieval based on measuring the divergence from 
> independence|http://dx.doi.org/10.1007/s10791-013-9225-4]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
