[ 
https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060680#comment-13060680
 ] 

Robert Muir commented on LUCENE-3220:
-------------------------------------

Hi David: I had some ideas on stats that could simplify some of these sims:
# I think there is an easier way to compute average document length: 
sumTotalTermFreq() / maxDoc(). This way the average is 'exact' and not skewed 
by index-time boosts, SmallFloat quantization, or anything like that.
# To support pivoted unique normalization like lnu.ltc, I think we can solve 
this in a similar way: add sumDocFreq(), which is just a single long, and 
divide it by maxDoc(). This gives us the average number of unique terms per 
document. I think Terrier might have a similar stat (#postings or #pointers 
or something)?
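
As a rough illustration of the two averages above, here is a minimal Java sketch. It does not touch the Lucene API: the class and method names are hypothetical, the statistics are passed in as plain longs, and sumDocFreq is the stat proposed in this comment, not an existing one. The numbers come from a made-up three-document field.

```java
/**
 * Sketch of the two collection-level averages discussed above.
 * The class and method names are hypothetical; the inputs stand in for
 * sumTotalTermFreq(), the proposed sumDocFreq(), and maxDoc().
 */
final class CollectionAverages {

    private CollectionAverages() {}

    /** Average document length: total term occurrences / number of documents. */
    static double averageDocumentLength(long sumTotalTermFreq, long maxDoc) {
        requirePositive(maxDoc);
        return (double) sumTotalTermFreq / maxDoc;
    }

    /** Average number of unique terms per document: total postings / number of documents. */
    static double averageUniqueTerms(long sumDocFreq, long maxDoc) {
        requirePositive(maxDoc);
        return (double) sumDocFreq / maxDoc;
    }

    private static void requirePositive(long maxDoc) {
        if (maxDoc <= 0) {
            throw new IllegalArgumentException("maxDoc must be positive, got " + maxDoc);
        }
    }

    public static void main(String[] args) {
        // Made-up field with three documents: "a b a", "b c", "a".
        // Total term frequencies: a=3, b=2, c=1, so sumTotalTermFreq = 6.
        // Document frequencies:   a=2, b=2, c=1, so sumDocFreq = 5.
        long sumTotalTermFreq = 6;
        long sumDocFreq = 5;
        long maxDoc = 3;

        System.out.println("avg doc length:   " + averageDocumentLength(sumTotalTermFreq, maxDoc));
        System.out.println("avg unique terms: " + averageUniqueTerms(sumDocFreq, maxDoc));
    }
}
```

Both values are plain ratios of index-wide counters, so neither depends on the per-document norms.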

So I think this could make for nice simplifications, especially for switching 
norms completely over to docvalues. We should be able to do #1 right away: 
change the way we compute avgdoclen for e.g. BM25 and DFR.

Then, in a separate issue, I could revert this norm summation stuff to make the 
docvalues integration simpler, and open a new issue for sumDocFreq().


> Implement various ranking models as Similarities
> ------------------------------------------------
>
>                 Key: LUCENE-3220
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3220
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>              Labels: gsoc
>         Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
> LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
> LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we 
> can finally work on implementing the standard ranking models. Currently DFR, 
> BM25 and LM are on the menu.
> TODO:
>  * {{EasyStats}}: contains all statistics that might be relevant for a 
> ranking algorithm
>  * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the 
> DocScorers and as much implementation detail as possible
>  * _BM25_: the current "mock" implementation might be OK
>  * _LM_
>  * _DFR_
> Done:

