[ https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ahmet Arslan updated LUCENE-6818:
---------------------------------
    Attachment: LUCENE-6818.patch

I tried to implement Robert's suggestion in {{TestSimilarityBase#testCrazyIndexTimeBoosts}}. It iterates over all possible norm values and 10 different term frequency _tf_ values. NaN, Infinity, and negative values are checked, but I am not sure about the negative check: some models can return negative scores for certain terms. For example, BM25 returns negative scores for common terms.

Currently only DFI is tested, because the other models fail the test in its current form.

A random question: what is the preferred course of action during scoring when the term frequency is greater than the document length?

I think we should simply recommend using index-time boosts only with ClassicSimilarity. I wonder how SweetSpotSimilarity behaves with index-time boosts, since artificially shortening the document length may decrease a document's rank.

> Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-6818
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6818
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/query/scoring
>    Affects Versions: 5.3
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>              Labels: similarity
>             Fix For: Trunk
>
>         Attachments: LUCENE-6818.patch, LUCENE-6818.patch, LUCENE-6818.patch, LUCENE-6818.patch
>
>
> As explained in the [write-up|http://lucidworks.com/blog/flexible-ranking-in-lucene-4], many state-of-the-art ranking model implementations have been added to Apache Lucene. This issue aims to include the DFI model, which is the non-parametric counterpart of the Divergence from Randomness (DFR) framework.
> DFI is both parameter-free and non-parametric:
> * parameter-free: it does not require any parameter tuning or training.
> * non-parametric: it does not make any assumptions about word frequency distributions on document collections.
> It is highly recommended *not* to remove stopwords (very common terms: the, of, and, to, a, in, for, is, on, that, etc.) with this similarity.
> For more information see: [A nonparametric term weighting method for information retrieval based on measuring the divergence from independence|http://dx.doi.org/10.1007/s10791-013-9225-4]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
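For readers following along, the divergence-from-independence idea behind the patch can be sketched roughly as follows. This is an illustrative approximation only, not the attached patch's code: the chi-squared independence measure is one of several variants from the paper, and the method and statistic names here are assumptions.

```java
// Rough sketch of DFI-style term weighting: a term is scored by how far
// its observed in-document frequency diverges from the frequency expected
// if term occurrences were distributed independently across the collection.
public class DfiSketch {

    // Expected frequency under independence:
    // e = collectionFreq * docLength / totalTokensInCollection
    static double expected(long collectionFreq, long docLength, long totalTokens) {
        return (double) collectionFreq * docLength / totalTokens;
    }

    // Chi-squared style independence measure (one variant from the paper):
    // (tf - e)^2 / e
    static double independence(double freq, double expected) {
        double diff = freq - expected;
        return diff * diff / expected;
    }

    // DFI-style score: terms at or below their expected frequency
    // (stopword-like terms) score 0, which is why the issue recommends
    // keeping stopwords in the index; otherwise the divergence is
    // log-dampened.
    static double score(double freq, double exp) {
        if (freq <= exp) {
            return 0.0;
        }
        return Math.log(independence(freq, exp) + 1.0) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Hypothetical statistics: a term with collection frequency 100 in a
        // 100,000-token collection, scored against a 50-token document.
        double exp = expected(100, 50, 100_000);
        System.out.println(score(3.0, exp));  // rare term seen 3x: positive score
        System.out.println(score(0.0, exp));  // at/below expectation: zero
    }
}
```

Note that the score is never negative by construction, which is relevant to the negative-value check discussed above.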