OK, I'm not advocating the BM25 patch either; unfortunately BM25 was not my idea :-))), and I'm sure that the implementation can be improved.
When you use the BM25 implementation, are you optimising the parameters specifically per collection? (It is a key factor for improving BM25 performance.) Why do you think that BM25 works better for English than for other languages (apart from the experiments)? What are your intuitions? I don't have much experience with languages other than Spanish and English, and it sounds pretty interesting.

Kind Regards.

P.S.: Maybe this is not a topic for this list???

> Joaquin, I don't see this as a flame war? First of all I'd like to personally thank you for your excellent BM25 implementation!
>
> I think the selection of a retrieval model depends highly on the language/indexing approach, i.e. if we were talking East Asian languages I think we want a probabilistic model: no argument there!
>
> All I said was that it is a myth that BM25 is "always" better than Lucene's scoring model; it really depends on what you are trying to do, how you are indexing your text, properties of your corpus, how your queries are running.
>
> I don't even want to come across as advocating the lnb.ltc approach either; sure, I wrote the patch, but this means nothing. I only like it as it's currently a simple integration into Lucene, but long-term it's best if we can support other models also!
>
> Finally I think there is something to be said for Lucene's default retrieval model, which in my (non-English) findings across the board isn't terrible at all... then again I am working with languages where analysis is really the thing holding Lucene back, not scoring.
>
> On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN PEREZ IGLESIAS <joaquin.pe...@lsi.uned.es> wrote:
>
>> Just some final comments (as I said I'm not interested in flame wars),
>>
>> If I obtain better results there is no problem with pooling; otherwise it is biased.
>> The only important thing (in my opinion) is that it cannot be said that BM25 is a myth.
>> Yes, you are right, there is no single ranking model that beats the rest, but there are models that generally show better performance in more cases.
>>
>> About CLEF, I have had the same experience (VSM vs BM25) on Spanish and English (WebCLEF) and Q&A (ResPubliQA).
>>
>> Ivan, check the parameters (b and k1); you can probably improve your results (that's the bad part of BM25).
>>
>> Finally, we are just speaking of personal experience, so obviously you should use the best model for your data and your own experience; in IR there are neither myths nor best ranking models. If any of us is able to find the “best” ranking model, or is able to prove that any state-of-the-art model is a myth, he should send these results to the SIGIR conference.
>>
>> Ivan, Robert, good luck with your experiments; as I said, the good part of IR is that you can always run experiments on your own.
>>
>> > I don't think it's really a competition; I think preferably we should have the flexibility to change the scoring model in Lucene, actually?
>> >
>> > I have found lots of cases where VSM improves on BM25, but then again I don't work with TREC stuff, as I work with non-English collections.
>> >
>> > It doesn't contradict years of research to say that VSM isn't a state-of-the-art model; besides the TREC-4 results, there are CLEF results where VSM models perform competitively with or exceed (Finnish, Russian, etc.) BM25/DFR/etc.
>> >
>> > It depends on the collection; there isn't a 'best retrieval formula'.
>> >
>> > Note: I have no bias against BM-25, but it's definitely a myth to say there is a single retrieval formula that is the 'best' across the board.
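
The advice above to check b and k1 per collection amounts to a small grid sweep. What follows is only a minimal sketch, assuming you already have some way to compute MAP (or another metric) for a given (k1, b) pair. The class name, the `evaluate` callback and the candidate grids are hypothetical and are not part of the LUCENE-2091 patch; the toy function in `main` merely stands in for a real index/search/evaluate run.

```java
import java.util.function.DoubleBinaryOperator;

public class Bm25ParamSweep {

    /** Best (k1, b) pair found by the sweep, together with its score. */
    static final class Best {
        final double k1;
        final double b;
        final double score;

        Best(double k1, double b, double score) {
            this.k1 = k1;
            this.b = b;
            this.score = score;
        }
    }

    /**
     * Tries every (k1, b) combination and keeps the one with the highest score.
     * Returns null if either grid is empty.
     */
    static Best sweep(double[] k1Values, double[] bValues, DoubleBinaryOperator evaluate) {
        Best best = null;
        for (double k1 : k1Values) {
            for (double b : bValues) {
                double score = evaluate.applyAsDouble(k1, b);
                if (best == null || score > best.score) {
                    best = new Best(k1, b, score);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical candidate grids; pick ranges that suit your collection.
        double[] k1Values = {0.5, 1.0, 1.2, 1.5, 2.0};
        double[] bValues = {0.0, 0.25, 0.5, 0.75, 1.0};

        // Placeholder evaluator (assumption): a real one would run the topics
        // against the index with these parameters and return the MAP.
        DoubleBinaryOperator toy =
                (k1, b) -> -Math.pow(k1 - 1.2, 2) - Math.pow(b - 0.75, 2);

        Best best = sweep(k1Values, bValues, toy);
        System.out.printf("best k1=%.2f b=%.2f%n", best.k1, best.b);
    }
}
```

If the same topics are used both for tuning and for reporting, the reported numbers will be optimistic, so it is safer to tune on one set of topics and report on another.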
>> >
>> > On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <joaquin.pe...@lsi.uned.es> wrote:
>> >
>> >> By the way,
>> >>
>> >> I don't want to start a flame war over VSM vs BM25, but I really believe that I have to express my opinion as Robert has done. In my experience, I have never found a case where VSM improves significantly on BM25. Maybe you can find some cases under some very specific collection characteristics (such as an average length of 300 vs 3000) or a bad usage of BM25 (improper parameters) where it can happen.
>> >>
>> >> BM25 is not just a different way of doing length normalization; it is strongly based on the probabilistic framework, and it parametrises frequencies and length. It is probably the most successful ranking model of recent years in Information Retrieval.
>> >>
>> >> I have never read a paper where VSM improves on any of the state-of-the-art ranking models (Language Models, DFR, BM25, ...), although VSM with pivoted length normalisation can obtain nice results. This can be verified by checking the last years of the TREC competition.
>> >>
>> >> Honestly, to say that it is a myth that BM25 improves on VSM contradicts the last 10 or 15 years of research on Information Retrieval, and I really believe that is not accurate.
>> >>
>> >> The good thing about Information Retrieval is that you can always run your own experiments and you can use the experience of many years of research.
>> >>
>> >> PS: This opinion is based on experiments on TREC and CLEF collections. Obviously we can start a debate about the suitability of this type of experimentation (concept of relevance, pooling, relevance judgements), but this is a much more complex topic and I believe it is far from what we are dealing with here.
>> >>
>> >> PS2: Regarding TREC4, Cornell used pivoted length normalisation and they were applying pseudo-relevance feedback, which honestly makes the analysis of the results much more difficult. Obviously their results were part of the pool.
>> >>
>> >> Sorry for the huge mail :-))))
>> >>
>> >> > Hi Ivan,
>> >> >
>> >> > the problem is that unfortunately BM25 cannot be implemented by overriding the Similarity interface. Therefore BM25Similarity only computes the classic probabilistic IDF (which is of interest only at search time). If you set BM25Similarity at indexing time, some basic stats are not stored correctly in the segments (like document lengths).
>> >> >
>> >> > When you use BM25BooleanQuery, this class will automatically set the BM25Similarity for you, therefore you don't need to do this explicitly.
>> >> >
>> >> > I tried to write this implementation with a focus on not interfering with the typical use of Lucene (so no changes to DefaultSimilarity).
>> >> >
>> >> >> Joaquin, Robert,
>> >> >>
>> >> >> I followed Joaquin's recommendation and removed the call to set similarity to BM25 explicitly (indexer, searcher). The results showed a 55% improvement in the MAP score (0.141->0.219) over default similarity.
>> >> >>
>> >> >> Joaquin, how would setting the similarity to BM25 explicitly make the score worse?
>> >> >>
>> >> >> Thank you,
>> >> >>
>> >> >> Ivan
>> >> >>
>> >> >> --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:
>> >> >>
>> >> >>> From: Robert Muir <rcm...@gmail.com>
>> >> >>> Subject: Re: BM25 Scoring Patch
>> >> >>> To: java-user@lucene.apache.org
>> >> >>> Date: Tuesday, February 16, 2010, 11:36 AM
>> >> >>>
>> >> >>> yes Ivan, if possible please report back any findings you can on the experiments you are doing!
>> >> >>>
>> >> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias <joaquin.pe...@lsi.uned.es> wrote:
>> >> >>>
>> >> >>> > Hi Ivan,
>> >> >>> >
>> >> >>> > You shouldn't set the BM25Similarity for indexing or searching.
>> >> >>> > Please try removing the lines:
>> >> >>> >     writer.setSimilarity(new BM25Similarity());
>> >> >>> >     searcher.setSimilarity(sim);
>> >> >>> >
>> >> >>> > Please let us/me know if you improve your results with these changes.
>> >> >>> >
>> >> >>> > Robert Muir wrote:
>> >> >>> >
>> >> >>> >> Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's default Similarity. Perhaps this is just another one?
>> >> >>> >>
>> >> >>> >> Again, while I have not worked with this particular collection, I looked at the statistics and noted that it's composed of several 'sub-collections': for example the PAT documents on disk 3 have an average doc length of 3543, but the AP documents on disk 1 have an avg doc length of 353.
>> >> >>> >>
>> >> >>> >> I have found on other collections that any advantages of BM25's document length normalization fall apart when 'average document length' doesn't make a whole lot of sense (cases like this).
>> >> >>> >>
>> >> >>> >> For this same reason, I've only found a few collections where BM25's doc length normalization is really significantly better than Lucene's.
>> >> >>> >>
>> >> >>> >> In my opinion, the results on a particular test collection or 2 have perhaps been taken too far and created a myth that BM25 is always superior to Lucene's scoring... this is not true!
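
To make the point about average document length concrete, here is a minimal sketch of the textbook BM25 term-frequency component. It is not code from the LUCENE-2091 patch, and the class and method names are made up for illustration. The values k1 = 2, b = 0.5 and the average length 798.30 are the ones from Ivan's parser code, 353 and 3543 are the sub-collection averages quoted above, and tf = 3 is an arbitrary choice.

```java
public class Bm25LengthNorm {

    /**
     * Textbook BM25 term-frequency component:
     * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)).
     */
    static double tfComponent(double tf, double docLen, double avgDocLen,
                              double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 2.0;            // from Ivan's parser code
        double b = 0.5;             // from Ivan's parser code
        double tf = 3.0;            // arbitrary term frequency (assumption)
        double globalAvg = 798.30;  // single collection-wide average from Ivan's code
        double[] docLens = {353.0, 3543.0}; // AP-like and PAT-like lengths

        for (double docLen : docLens) {
            // Normalised against the one global average.
            double withGlobalAvg = tfComponent(tf, docLen, globalAvg, k1, b);
            // Normalised against the document's own sub-collection average.
            double withOwnAvg = tfComponent(tf, docLen, docLen, k1, b);
            System.out.printf(
                    "len=%.0f  global avg: %.3f  own sub-collection avg: %.3f%n",
                    docLen, withGlobalAvg, withOwnAvg);
        }
    }
}
```

With a single global average, a document of perfectly typical length for the long sub-collection is penalised as "long" and one from the short sub-collection is rewarded as "short", even though both would get the same neutral normalisation against their own sub-collection averages.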
>> >> >>> >>
>> >> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprov...@yahoo.com> wrote:
>> >> >>> >>
>> >> >>> >>> I applied the Lucene patch mentioned in https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers on the TREC-3 collection using topics 151-200. I am now getting worse results compared to Lucene's DefaultSimilarity. I suspect I am not using it correctly. I have single-field documents. This is the process I use:
>> >> >>> >>>
>> >> >>> >>> 1. During the indexing, I am setting the similarity to BM25 as such:
>> >> >>> >>>
>> >> >>> >>> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
>> >> >>> >>>         Version.LUCENE_CURRENT), true,
>> >> >>> >>>         IndexWriter.MaxFieldLength.UNLIMITED);
>> >> >>> >>> writer.setSimilarity(new BM25Similarity());
>> >> >>> >>>
>> >> >>> >>> 2. During the Precision/Recall measurements, I am using a SimpleBM25QQParser extension I added to the benchmark:
>> >> >>> >>>
>> >> >>> >>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
>> >> >>> >>>
>> >> >>> >>> 3. Here is the parser code (I set an avg doc length here):
>> >> >>> >>>
>> >> >>> >>> public Query parse(QualityQuery qq) throws ParseException {
>> >> >>> >>>     BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
>> >> >>> >>>     BM25Parameters.setB(0.5f); // tried default values
>> >> >>> >>>     BM25Parameters.setK1(2f);
>> >> >>> >>>     return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
>> >> >>> >>>             new StandardAnalyzer(Version.LUCENE_CURRENT));
>> >> >>> >>> }
>> >> >>> >>>
>> >> >>> >>> 4. The searcher is using BM25 similarity:
>> >> >>> >>>
>> >> >>> >>> Searcher searcher = new IndexSearcher(dir, true);
>> >> >>> >>> searcher.setSimilarity(sim);
>> >> >>> >>>
>> >> >>> >>> Am I missing some steps? Does anyone have experience with this code?
>> >> >>> >>>
>> >> >>> >>> Thanks,
>> >> >>> >>>
>> >> >>> >>> Ivan
>> >> >>> >>>
>> >> >>> >>> ---------------------------------------------------------------------
>> >> >>> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> >>> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >> >>> >
>> >> >>> > --
>> >> >>> > -----------------------------------------------------------
>> >> >>> > Joaquín Pérez Iglesias
>> >> >>> > Dpto. Lenguajes y Sistemas Informáticos
>> >> >>> > E.T.S.I. Informática (UNED)
>> >> >>> > Ciudad Universitaria
>> >> >>> > C/ Juan del Rosal nº 16
>> >> >>> > 28040 Madrid - Spain
>> >> >>> > Phone. +34 91 398 89 19
>> >> >>> > Fax +34 91 398 65 35
>> >> >>> > Office 2.11
>> >> >>> > Email: joaquin.pe...@lsi.uned.es
>> >> >>> > web: http://nlp.uned.es/~jperezi/
>> >> >>> > -----------------------------------------------------------
>> >> >>>
>> >> >>> --
>> >> >>> Robert Muir
>> >> >>> rcm...@gmail.com
>> >
>> > --
>> > Robert Muir
>> > rcm...@gmail.com
>
> --
> Robert Muir
> rcm...@gmail.com