[ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785569#action_12785569
 ] 

Uwe Schindler edited comment on LUCENE-2091 at 12/3/09 10:17 PM:
-----------------------------------------------------------------

Thanks for the explanation!

About the IDF: The problem with a per-document IDF in Lucene is that most 
users add catch-all fields (which would supply the per-document IDF) and, in 
addition, special fields such as numeric fields (which would not produce a 
good IDF, though at the moment this IDF is ignored). Some users also add 
fields simply for sorting. So an IDF for whole documents is impossible with 
Lucene. You can only use catch-all fields, which are always a good idea for 
non-fielded searches because ORing all fields together is slower than just 
indexing the same terms a second time in a catch-all field. For emails, for 
example, "contents" would contain all terms from "title", "subject", and 
"mailtext". But the IDF for BM25F could be taken from the "contents" field 
even when searching only for a title.
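
To make the catch-all idea concrete, here is a minimal sketch in plain Python. 
It does not use the Lucene API: the toy emails, the helper names, and the 
particular IDF formula (a common non-negative BM25 variant) are all 
assumptions for illustration. It only shows that a term's document frequency, 
and therefore its IDF, can be read from "contents" even when the query is 
restricted to "title".

```python
import math

# Toy emails (made up); field names follow the email example above.
EMAILS = [
    {"title": "lucene scoring", "subject": "bm25 patch",
     "mailtext": "adds bm25 scoring to lucene"},
    {"title": "release plan", "subject": "lucene release",
     "mailtext": "the release is planned for next month"},
    {"title": "sorting question", "subject": "sort fields",
     "mailtext": "how do sort fields affect scoring"},
]

TEXT_FIELDS = ("title", "subject", "mailtext")


def add_catch_all(doc):
    """Return a copy of doc with a 'contents' field holding all text-field terms."""
    out = dict(doc)
    out["contents"] = " ".join(doc[f] for f in TEXT_FIELDS)
    return out


def doc_freq(docs, field, term):
    """Count the documents whose given field contains the term."""
    return sum(1 for d in docs if term in d[field].split())


def bm25_idf(n_docs, df):
    """Assumed BM25 IDF variant: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))


docs = [add_catch_all(d) for d in EMAILS]
n = len(docs)

# Document frequency of "scoring" in the title field versus the catch-all field.
df_title = doc_freq(docs, "title", "scoring")
df_contents = doc_freq(docs, "contents", "scoring")

# A title-only search could still weight the term with the catch-all IDF.
idf_title = bm25_idf(n, df_title)
idf_contents = bm25_idf(n, df_contents)
print(df_title, df_contents, idf_title, idf_contents)
```

In this toy data the term is rarer in "title" than in "contents", so the two 
IDF values differ; the catch-all value reflects how common the term is across 
whole documents rather than within a single field.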

      was (Author: thetaphi):
    Thanks for the explanation!

About the IDF: The problem with a per-document IDF in Lucene would be that most 
users also add fields that are e.g. catch-all fields (which would be the IDF you 
want to have) but in addition they add special fields like numeric fields (which 
would not produce a good IDF, at the moment this IDF is ignored). Some users 
also add fields simply for sorting. So an IDF for documents is impossible with 
Lucene. You can only use e.g. catch-all fields (which are always a good idea 
for non-fielded searches, because ORing all fields together is slower than just 
indexing the same terms a second time in a catch-all field), e.g. "contents" 
contains all terms from "title", "subject", "mailtext" as an example for 
emails. But the IDF for BM25F could be taken from the "contents" field even 
when searching only for a title.
  
> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 

