[ https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933478#comment-14933478 ]

Robert Muir commented on LUCENE-6818:
-------------------------------------

It is not a bug; it is simply how index-time boosting in Lucene has always 
worked. Boosting a document at index time is just a way for a user to make it 
artificially longer or shorter.

I don't think we should change this: it makes it much easier for people to 
experiment, since all of our scoring models handle it the same way. It means 
you do not have to reindex to change the Similarity, for example.

It's easy to understand this as: at search time, the Similarity sees the 
"normalized" document length. All I am saying is that these scoring models 
just have to make sure they don't do something totally nuts (like return 
negative, Infinity, or NaN scores) if the user applies index-time boosts with 
extreme values: values that might not make sense relative to, e.g., the 
collection-level statistics for the field. So in my opinion all that is needed 
is to add a `testCrazyBoosts` that looks a lot like `testCrazySpans` and just 
asserts those things, ideally across all 256 possible norm values.
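To make the proposal concrete, here is a minimal sketch of the shape such a 
check could take. It is not Lucene's actual test: `scoreFor` is a toy stand-in 
for a real Similarity, and the norm decoding is illustrative, since Lucene 
decodes the norm byte through the Similarity itself.

```java
// A minimal sketch (not Lucene's actual test) of the proposed testCrazyBoosts:
// walk all 256 possible norm byte values and assert the score stays sane.
public class CrazyBoostCheck {

    // Toy stand-in for a Similarity's scoring function; any real model
    // (BM25, DFR, DFI, ...) would be plugged in here instead.
    static float scoreFor(float tf, float normalizedLength) {
        return tf / (tf + normalizedLength);
    }

    // Returns how many norm values were checked; throws if any score is crazy.
    static int checkAllNorms(float tf) {
        int checked = 0;
        for (int norm = 0; norm < 256; norm++) {
            // Illustrative decoding: treat the byte as a length in [1, 256].
            float normalizedLength = norm + 1;
            float score = scoreFor(tf, normalizedLength);
            // The requirement from the comment: never negative, Infinity, or NaN.
            if (Float.isNaN(score) || Float.isInfinite(score) || score < 0f) {
                throw new AssertionError("crazy score " + score + " at norm=" + norm);
            }
            checked++;
        }
        return checked;
    }

    public static void main(String[] args) {
        System.out.println("checked " + checkAllNorms(3f) + " norm values");
    }
}
```

The point is only the assertion discipline: whatever extreme index-time boost 
produced the norm byte, every one of the 256 values must map to a sane score.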

> Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-6818
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6818
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/query/scoring
>    Affects Versions: 5.3
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>              Labels: similarity
>             Fix For: Trunk
>
>         Attachments: LUCENE-6818.patch, LUCENE-6818.patch, LUCENE-6818.patch
>
>
> As explained in the 
> [write-up|http://lucidworks.com/blog/flexible-ranking-in-lucene-4], many 
> state-of-the-art ranking model implementations have been added to Apache 
> Lucene. This issue aims to include the DFI model, which is the 
> non-parametric counterpart of the Divergence from Randomness (DFR) framework.
> DFI is both parameter-free and non-parametric:
> * parameter-free: it does not require any parameter tuning or training.
> * non-parametric: it does not make any assumptions about word frequency 
> distributions in document collections.
> It is highly recommended *not* to remove stopwords (very common terms: the, 
> of, and, to, a, in, for, is, on, that, etc.) with this similarity.
> For more information see: [A nonparametric term weighting method for 
> information retrieval based on measuring the divergence from 
> independence|http://dx.doi.org/10.1007/s10791-013-9225-4]
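The core idea in the cited paper can be sketched in a few lines. This is a 
hedged illustration of the standardized divergence-from-independence measure, 
not Lucene's `DFISimilarity` implementation; the method and parameter names 
are invented for this example.

```java
// Sketch of the DFI idea: score a term by how far its observed frequency
// diverges from the frequency expected if term and document were independent.
public class DfiSketch {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Illustrative signature, not Lucene's API.
    static double dfiScore(double tf, double docLen,
                           double termFreqInCollection, double tokensInCollection) {
        // Expected frequency under independence of term and document.
        double expected = termFreqInCollection * docLen / tokensInCollection;
        // A term occurring no more often than chance predicts contributes
        // nothing; this is also why keeping stopwords is harmless here (their
        // observed tf barely exceeds their expectation).
        if (tf <= expected) return 0;
        // Standardized divergence from independence, squashed with log2.
        return log2((tf - expected) / Math.sqrt(expected) + 1);
    }

    public static void main(String[] args) {
        // A stopword-like term: huge collection frequency, tf near expectation.
        System.out.println(dfiScore(5, 100, 50_000, 1_000_000));
        // A topical term: rare in the collection, frequent in this document.
        System.out.println(dfiScore(5, 100, 1_000, 1_000_000));
    }
}
```

Under these toy numbers the stopword-like term scores zero while the topical 
term scores well above it, which is exactly the behavior that makes stopword 
removal unnecessary with this model.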



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
