Hi,

The University of Amsterdam has released a downloadable language-modelling version of Lucene. Their language model is not BM25, but it is quite similar in nature. It is available at: http://ilps.science.uva.nl/Resources/#lm-lucen
I have worked with their version a bit. They have created new classes: TermQueryLanguageModel, TermScorerLanguageModel, IndexSearcherLanguageModel, LanguageModelIndexReader, etc. I think their work could be useful to you. If you have a successful implementation of BM25, would you be happy to share it with us?

Jianhan

-----Original Message-----
From: beatriz ramos [mailto:[EMAIL PROTECTED]
Sent: 25 October 2006 16:01
To: java-dev
Subject: wrong BM25 implementation in Lucene

Hello,

This is the BM25 algorithm I implemented in Lucene. It doesn't work: I have compared my results with the results of MG4J (on the same document set) and they do not match. I don't know whether my formula is wrong or there is another mistake. Could you help me?

------------------------------------------------------------------------

public class BM25Scorer extends Scorer {

    private final static double EPSILON_SCORE = 1.000000082240371E-9;
    private final static double DEFAULT_K1 = 0.75d;
    private final static double DEFAULT_B = 0.95d;

    private double b = DEFAULT_B;
    private double k1 = DEFAULT_K1;
    private IndexReader reader;
    private Term term;
    private Hits hits;
    private int position; // document position in hits
    private IndexSearcher searcher;
    private int cooc = 0; // how many times the term appears in the document
    private float idf;

    public float score() throws IOException {
        // look up the term frequency of the query term in this document
        TermFreqVector tfv = reader.getTermFreqVector(hits.id(position), term.field());
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].equalsIgnoreCase(term.text())) {
                cooc = freqs[i];
            }
        }

        idf = searcher.getSimilarity().idf(term, searcher);

        Document document = (Document) hits.doc(position);
        // document length is a field of my index
        String[] values = document.getValues("DOCUMENT_LENGTH");
        long docLength = Long.valueOf(values[0]).longValue(); // document length (number of words)
        long averageLength = 200;

        double loga = Math.max(EPSILON_SCORE, new Float(idf).doubleValue());
        double score = (loga * (k1 + 1) * cooc)
                / (cooc + k1 * ((1 - b) + (b * docLength / averageLength)));
        return new Float(score).floatValue();
    }
}
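
For comparison, here is a minimal, library-free sketch of a single-term BM25 weight. The term-frequency part has the same shape as the formula in the posted score() method. The IDF is the Robertson/Sparck Jones form, which as far as I know is not what Lucene's default Similarity.idf() returns (that one is log(numDocs / (docFreq + 1)) + 1), so the IDF may be one source of the mismatch. All names below (Bm25Sketch, idf, termScore, IDF_FLOOR) are hypothetical and are not Lucene or MG4J API. The values k1 = 1.2 and b = 0.75 are commonly quoted defaults and are only an assumption here; which values MG4J uses, and whether its average document length matches the hard-coded 200, would need to be checked.

```java
/**
 * Sketch of a single-term BM25 weight. Class, method and parameter names are
 * hypothetical; nothing here is Lucene or MG4J API.
 */
final class Bm25Sketch {

    // Floor for the IDF so that very common terms do not get a negative weight
    // (assumed value, playing the same role as EPSILON_SCORE in the post).
    static final double IDF_FLOOR = 1e-9;

    /** Robertson/Sparck Jones IDF: ln((N - n + 0.5) / (n + 0.5)), floored. */
    static double idf(long numDocs, long docFreq) {
        double raw = Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
        return Math.max(IDF_FLOOR, raw);
    }

    /** BM25 weight of one term in one document. */
    static double termScore(int tf, long docFreq, long numDocs,
                            long docLength, double avgDocLength,
                            double k1, double b) {
        // Length normalisation: 1.0 for a document of exactly average length.
        double lengthNorm = (1.0 - b) + b * (docLength / avgDocLength);
        // Saturating term-frequency component, bounded above by (k1 + 1).
        double tfPart = (tf * (k1 + 1.0)) / (tf + k1 * lengthNorm);
        return idf(numDocs, docFreq) * tfPart;
    }

    public static void main(String[] args) {
        // Assumed collection statistics, for illustration only.
        long numDocs = 1000;
        long docFreq = 10;
        double avgDocLength = 200.0;
        double k1 = 1.2;  // commonly quoted default, an assumption here
        double b = 0.75;  // commonly quoted default, an assumption here

        // Same term frequency, shorter versus longer document.
        System.out.println(termScore(3, docFreq, numDocs, 150, avgDocLength, k1, b));
        System.out.println(termScore(3, docFreq, numDocs, 400, avgDocLength, k1, b));
    }
}
```

Feeding both implementations the same tf, docFreq, numDocs, document length and average length for one term and one document, and comparing the IDF and the term-frequency part separately, should show which factor diverges.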