[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene

Joaquin Perez-Iglesias (JIRA) Fri, 04 Dec 2009 02:20:57 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785840#action_12785840 ]


Joaquin Perez-Iglesias edited comment on LUCENE-2091 at 12/4/09 10:19 AM:
--------------------------------------------------------------------------

Yes, sorry.

Basically, what we are trying to do is constrain the effect of the raw 
frequency (saturate the frequency). 
In Lucene this is done with the square root of the frequency; another 
classical approach is to use the log. With both approaches we avoid giving a 
linear 'importance' to the frequency.

BM25 is a bit trickier: it parametrises the 'saturation' of the frequency with 
a parameter k1, using the equation weight(t)/(weight(t)+k1). Usually k1 is 
fixed to 2, but it can be tuned per collection.
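
To make the three saturation schemes concrete, here is a minimal Java sketch 
that compares them side by side. The class and method names are made up for 
illustration and are not part of Lucene or of the patch; the log variant 
(1 + ln f) is just one common choice among several.

```java
public class TfSaturation {

    // Square-root damping, as described above for Lucene's default scoring.
    static double sqrtTf(double freq) {
        return Math.sqrt(freq);
    }

    // Logarithmic damping; 1 + ln(freq) is an assumed, commonly used form.
    static double logTf(double freq) {
        return freq > 0 ? 1.0 + Math.log(freq) : 0.0;
    }

    // BM25-style saturation: weight / (weight + k1).
    // Unlike sqrt and log, this is bounded above by 1 for any weight.
    static double bm25Saturation(double weight, double k1) {
        return weight / (weight + k1);
    }

    public static void main(String[] args) {
        double k1 = 2.0; // the usual value mentioned above
        int[] freqs = {1, 2, 4, 8, 16, 64};
        System.out.println("freq  sqrt    log     bm25");
        for (int f : freqs) {
            System.out.printf("%4d  %.3f  %.3f  %.3f%n",
                    f, sqrtTf(f), logTf(f), bm25Saturation(f, k1));
        }
    }
}
```

The sqrt and log columns keep growing (slowly) with the frequency, while the 
BM25 column approaches 1; a larger k1 makes it approach that limit more slowly.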

(Uwe) Regarding the IDF issue, I believe the more correct approach (in 
theoretical terms) would be to use the docFreq over the fields the user 
wants to search, but I don't think this can be done. 
For example, if we have indexed 3 fields, F1, F2 and F3, and the user wants to 
search on F1 and F2, there is no way to compute a docFreq covering both 
fields. With a catch-all field we have the docFreq over all fields.

So maybe the best available approach would be to use IDF per field. What do you 
think? 
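
A toy illustration of the docFreq problem described above, as a sketch only: 
the posting sets below are invented in-memory data, and the class and method 
names are hypothetical, not Lucene API. It shows that the number of documents 
containing a term in F1 or F2 cannot be derived from the two per-field 
docFreq values alone.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class FieldDocFreq {

    // Per-field docFreq: the number of documents containing the term in that field.
    static int docFreq(Set<Integer> docIds) {
        return docIds.size();
    }

    // docFreq over two fields: documents containing the term in either field.
    // Computing this needs the document ids themselves, not just the two counts.
    static int unionDocFreq(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new TreeSet<>(a);
        union.addAll(b);
        return union.size();
    }

    public static void main(String[] args) {
        // Assumed example data: ids of documents containing the term in each field.
        Set<Integer> f1 = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> f2 = new TreeSet<>(Arrays.asList(2, 3, 4));

        System.out.println("docFreq(F1)       = " + docFreq(f1));
        System.out.println("docFreq(F2)       = " + docFreq(f2));
        System.out.println("docFreq(F1 or F2) = " + unionDocFreq(f1, f2));
    }
}
```

Neither the sum nor the maximum of the per-field counts gives the combined 
value in general, because the overlap between the fields is unknown. That is 
why the options are a catch-all field or an IDF computed per field.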


  
> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 
