[
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834455#action_12834455
]
Joaquin Perez-Iglesias commented on LUCENE-2091:
------------------------------------------------
It is a consequence of the logarithm: you can get negative numbers, and a
negative score doesn't make much sense. As far as I know, this version of IDF
is fairly theoretical and is derived from the binary independence retrieval
(BIR) model, which transforms the product of probabilities into a sum of
logarithms. Anyway, it is quite common to add 1 to the value before taking the
logarithm, to avoid situations like the previous one.
In my opinion it should be added to the patch. It doesn't hurt but it helps :-)
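To make the effect concrete, here is a minimal sketch comparing the classic BM25 IDF with the variant that adds 1 before taking the logarithm. The class and method names are hypothetical and are not taken from the patch or from Lucene; the formula assumes N is the number of documents and n the document frequency of the term.

```java
// Illustrative sketch only: class and method names are hypothetical,
// not part of Lucene or of the LUCENE-2091 patch.
final class Bm25IdfSketch {

    // Classic BM25 IDF: log((N - n + 0.5) / (n + 0.5)).
    // Negative whenever the term occurs in more than half of the documents.
    static double classicIdf(long numDocs, long docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Variant with 1 added before taking the logarithm. Assuming
    // 0 <= docFreq <= numDocs, the argument is always greater than 1,
    // so the result is always positive.
    static double plusOneIdf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long numDocs = 100;
        // Rare term, term in exactly half of the documents, very common term.
        for (long docFreq : new long[] {10, 50, 80}) {
            System.out.printf("n=%d classic=%.4f plusOne=%.4f%n",
                    docFreq, classicIdf(numDocs, docFreq), plusOneIdf(numDocs, docFreq));
        }
    }
}
```

With 100 documents, a term found in 80 of them gets a negative classic IDF but a small positive value from the second method, so common terms still contribute a little instead of penalizing the documents that contain them.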
The negative IDF issue is clearly explained on Wikipedia:
http://en.wikipedia.org/wiki/Okapi_BM25
Here is a quote from that article:
{quote}
Please note that the above formula for IDF shows potentially major drawbacks
when using it for terms appearing in more than half of the corpus documents.
These terms' IDF is negative, so for any two almost-identical documents, one
which contains the term and one which does not contain it, the latter will
possibly get a larger score. This means that terms appearing in more than half
of the corpus will provide negative contributions to the final document score.
This is often an undesirable behavior, so many real-world applications would
deal with this IDF formula in a different way:
* Each summand can be given a floor of 0, to trim out common terms;
* The IDF function *can be given a floor of a constant ε,* to avoid common
terms being ignored at all;
* The IDF function can be replaced with a similarly shaped one which is
non-negative, or strictly positive to avoid terms being ignored at all.
{quote}
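The first two options in the quote can be sketched as a simple clamp. This is only an illustration under assumptions: the names below are hypothetical, and the floor is applied to the IDF value itself rather than to each summand, which gives the same result for a floor of 0 as long as the term-frequency factor is non-negative.

```java
// Illustrative sketch only: hypothetical names, not Lucene API.
final class IdfFloorSketch {

    // Classic BM25 IDF, which can be negative for very common terms.
    static double classicIdf(long numDocs, long docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Clamp the IDF from below. A floor of 0 trims out common terms;
    // a small positive floor (epsilon) keeps them from being ignored entirely.
    static double flooredIdf(long numDocs, long docFreq, double floor) {
        return Math.max(floor, classicIdf(numDocs, docFreq));
    }

    public static void main(String[] args) {
        long numDocs = 100;
        long docFreq = 80;      // term present in 80% of the documents
        double epsilon = 0.01;  // arbitrary value chosen for illustration
        System.out.println("raw IDF:          " + classicIdf(numDocs, docFreq));
        System.out.println("floor of 0:       " + flooredIdf(numDocs, docFreq, 0.0));
        System.out.println("floor of epsilon: " + flooredIdf(numDocs, docFreq, epsilon));
    }
}
```

Terms below the floor all receive the same weight, so this approach discards the relative ordering among very common terms; whether that matters depends on the collection.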
> Add BM25 Scoring to Lucene
> --------------------------
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Yuval Feinstein
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2091.patch, persianlucene.jpg
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime
> somewhat.
> I would like to contribute the code to Lucene under contrib.