[ 
https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14910174#comment-14910174
 ] 

Ahmet Arslan commented on LUCENE-6818:
--------------------------------------

bq. The typical solution is to do something like adjust expected:
Thanks, Robert, for the suggestion and explanation. I used the typical 
solution, and it's working now.

bq. I have not read the paper, but these are things to deal with when 
integrating into lucene.
For your information, in case you want to take a look, the Terrier 4.0 source 
tree has this model in DFIC.java.

bq.  index-time boosts work on the norm, by making the document appear shorter 
or longer, so docLen might have a "crazy" value if the user does this.
I was relying on {{o.a.l.search.similarities.SimilarityBase}} for this, but it 
looks like all of its subclasses (DFR, IB) have this problem. I included a 
{{TestSimilarityBase#testNorms}} method in the new patch to demonstrate the 
problem. Unless I am missing something obvious, this is a bug, no?
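
To make the effect concrete, here is a minimal sketch of how a boost folded into the norm distorts the length a model later reads back. It assumes the norm stores boost / sqrt(length) and that the length is recovered as 1 / norm^2, which is my understanding of {{SimilarityBase}} in 5.x. It ignores the lossy single-byte encoding, and the class and method names ({{NormSketch}}, {{encode}}, {{decodedLength}}) are hypothetical, not Lucene API.

```java
// Illustrative sketch only, not Lucene code.
// Assumption: the norm holds boost / sqrt(length) and the search side
// recovers the length as 1 / norm^2; byte quantization is ignored here.
public class NormSketch {

    /** Value folded into the norm at index time (before any byte encoding). */
    public static double encode(float boost, int fieldLength) {
        return boost / Math.sqrt(fieldLength);
    }

    /** Document length a model would see after decoding the norm. */
    public static double decodedLength(float boost, int fieldLength) {
        double norm = encode(boost, fieldLength);
        return 1.0 / (norm * norm); // equals fieldLength / boost^2
    }

    public static void main(String[] args) {
        int actualLength = 100; // real number of tokens in the field
        for (float boost : new float[] {1f, 2f, 10f, 0.1f}) {
            System.out.println("boost=" + boost
                + " -> docLen seen by the model ~ "
                + decodedLength(boost, actualLength));
        }
    }
}
```

Under these assumptions the decoded length is the real length divided by boost squared, so a boost of 2 makes a 100-token field look like roughly 25 tokens, and a boost below 1 makes it look much longer. Any model that derives an expected frequency from docLen inherits that distortion.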

> Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-6818
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6818
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/query/scoring
>    Affects Versions: 5.3
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>              Labels: similarity
>             Fix For: Trunk
>
>         Attachments: LUCENE-6818.patch, LUCENE-6818.patch
>
>
> As explained in the 
> [write-up|http://lucidworks.com/blog/flexible-ranking-in-lucene-4], many 
> state-of-the-art ranking model implementations have been added to Apache 
> Lucene. This issue aims to add the DFI model, which is the non-parametric 
> counterpart of the Divergence from Randomness (DFR) framework.
> DFI is both parameter-free and non-parametric:
> * parameter-free: it does not require any parameter tuning or training.
> * non-parametric: it does not make any assumptions about word frequency 
> distributions on document collections.
> It is highly recommended *not* to remove stopwords (very common terms: the, 
> of, and, to, a, in, for, is, on, that, etc.) with this similarity.
> For more information see: [A nonparametric term weighting method for 
> information retrieval based on measuring the divergence from 
> independence|http://dx.doi.org/10.1007/s10791-013-9225-4]



