One thing that may be causing problems is that "cooc" is not summed over
all the entries for which the ignore-case comparison holds. Since you are
comparing with equalsIgnoreCase() I assume the analyzer being used is not
a lower-casing one, so if a document has both terms f:a and f:A you would
get a count of 1 instead of 2. Also, "cooc" is not reset to 0 on each call
to score(), so if the term is not in the current doc (although I think
this case is not possible), the value left over from the previous call
would be used.
Try it like this:
cooc = 0;                                        // <------ (added)
for (int i = 0; i < terms.length; i++) {
    if (terms[i].equalsIgnoreCase(term.text())) {
        cooc += freqs[i];                        // <------ (was =)
    }
}
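
(With the reset and the +=, a document that does not contain the term
scores with cooc = 0 instead of reusing a stale value, and entries such
as "a" and "A" in the term vector are both counted.)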
But, thinking more about this: if a non-lower-casing analyzer was used,
why compare with ignore-case at all? It seems more proper to use a
lower-casing analyzer (like StandardAnalyzer) and compare with just
equals().
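
For example, here is a minimal sketch of how that loop could look under
this assumption - untested, and it assumes every indexed term (and
term.text()) is already lower-cased by the analyzer:

cooc = 0; // reset, as above
for (int i = 0; i < terms.length; i++) {
    if (terms[i].equals(term.text())) { // assumes lower-cased terms
        cooc = freqs[i];
        break; // terms in a TermFreqVector are distinct, so stop here
    }
}

With an exact equals() there is at most one matching entry, so the
summing issue goes away as well.

One more thing worth checking while you compare against MG4J: the
per-term BM25 contribution your code computes,

idf(t) * (k1 + 1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))

is the usual formula, but the common defaults are k1 around 1.2 to 2.0
and b around 0.75, so your DEFAULT_K1 = 0.75 and DEFAULT_B = 0.95 look
as if the two values might have been swapped. Also, averageLength is
hard-coded to 200, while MG4J presumably uses the real average document
length of the collection. Either of these alone would give numbers
different from MG4J's.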
Hope this helps,
Doron
"beatriz ramos" <[EMAIL PROTECTED]> wrote on 25/10/2006
08:00:31:
> Hello, this is the BM25 algorithm I implemented in Lucene.
>
> It doesn't work: I have compared my results with the results of MG4J
> (on the same document set) and they do not match.
>
> I don't know if I have a wrong formula or there is another mistake.
>
> Could you help me?
>
>
> ----------------------------------------------------------------------
>
> public class BM25Scorer extends Scorer {
>
>     private final static double EPSILON_SCORE = 1.000000082240371E-9;
>     private final static double DEFAULT_K1 = 0.75d;
>     private final static double DEFAULT_B = 0.95d;
>     private double b = DEFAULT_B;
>     private double k1 = DEFAULT_K1;
>
>     private IndexReader reader;
>     private Term term;
>     private Hits hits;
>     private int position; // document position in hits
>     private IndexSearcher searcher;
>
>     private int cooc = 0; // How many times a term appears in the document
>     private float idf;
>
>     public float score() throws IOException {
>         TermFreqVector tfv = reader.getTermFreqVector(hits.id(position),
>                 term.field());
>
>         String[] terms = tfv.getTerms();
>         int[] freqs = tfv.getTermFrequencies();
>         for (int i = 0; i < terms.length; i++) {
>             if (terms[i].equalsIgnoreCase(term.text())) {
>                 cooc = freqs[i];
>             }
>         }
>
>         idf = searcher.getSimilarity().idf(term, searcher);
>
>         Document document = (Document) hits.doc(position);
>         // document length is a field of my index
>         String[] values = document.getValues("DOCUMENT_LENGTH");
>
>         // document length (number of words)
>         long docLength = Long.valueOf(values[0]).longValue();
>         long averageLength = 200;
>
>         double loga = Math.max(EPSILON_SCORE,
>                 new Float(idf).doubleValue());
>         double score = (loga * (k1 + 1) * cooc)
>                 / (cooc + k1 * ((1 - b) + (b * docLength / averageLength)));
>
>         return new Float(score).floatValue();
>     }