Hi Mark, just trying to answer some of your questions. Related with comparisons, I have prepared a small experiment using an EuroGov 2005 Collection subset. Basically I have used the documents of the uk domain (around 60000) and 48 queries, the result are as next
=Lucene= MAP=> .4777 MRR=> .4852 =BM25= MAP=> .5147 MRR=> .5227 As you can see there is an improvement of a 7% in terms of MAP and MRR, using standard parameters for BM25, this is quite an improvement although the collection is not a big one. Indeed in the last years BM25 (and others models) have outperformed the more classical approaches with VSM in competitions like TREC. In terms of performance I have run 100 times the same 48 topic the results are as next Lucene=> 18.7 secs BM25=> 24.8 secs BM25 is approximately 25% slower than Lucene, sure this could be optimized, however 4800 queries in 25secs sounds good to me. About the implementation I have tried to separate the boolean model from the ranking function used, basically there are support for simple (not nested ones) boolean queries with AND, OR and NOT operators. I'm not sure if this is correct from a strict theoretical point of view. BM25 is a probabilistic ranking function which in some way means that just the OR operator should be supported. However the others operators have been added in the same way as Lucene does mixing VSM with boolean model, when different operators are used BM25BooleanScorer is used what is pretty much slower than BM25SingleBooleanScorer, which calculates the scorer when just one boolean operator type is used. The simplified boolean model is implemented with MatchAllBooleanScorer, MustBooleanScorer, ShouldBooleanScorer and NotBooleanScorer, sure this could be rewrited in order to looks more similar to Lucene, but currently I'm not sure how to do it. Currently a termdoc for each term is used in order to filter from the boolean query, and the result docs are ranked with BM25TermScorer. The ranking function is as http://nlp.uned.es/~jperezi/Lucene-BM25/ and is implemented in BM25TermScorer (for one term), I have tried to optimize it as much as I can, but sure this could be improved, any ideas? As you can see there are some stuff to work on it, but I hope that with some help this could be improved, I believe that to add support for BM25 into Lucene framework will be a great step forward for this, currently is quite common to find support for this ranking function in a lot of search engines (Terrier, Sphinx, MG4J, Lemur, Xapian,...) Thanks. ----------------------------------------------------------- Joaquín Pérez Iglesias Dpto. Lenguajes y Sistemas Informáticos E.T.S.I. Informática (UNED) Ciudad Universitaria C/ Juan del Rosal nº 16 28040 Madrid - Spain Phone. +34 91 398 87 25 Fax +34 91 398 65 35 Office 2.07 Email: [EMAIL PROTECTED] ----------------------------------------------------------- Mark Miller <[EMAIL PROTECTED]> escribe : > Cool, thanks. > > Have you done any comparisons with the current scoring system? Can you > claim strong improvements? Have you looked at the performance impact at > all yet? That score method in termscorer looks particularly slow. Could > you explain a little how your bm25 implementation differs from the > current imp in a bit more depth? It looks like your booleanquery is > pretty simplified compared to the old. > > [EMAIL PROTECTED] > wrote: > > Hi Mark, > > > > thank you for your advice, I don't know too much about licenses. > > I have just changed the license to Apache-2.0, hope that this will be ok, > and make things easier. > > > > If you need any help or have some comments about the implementation, > please let me know. I would be really happy if this implementation is finally > integrated into Lucene. > > > > Joaquín. > > > > ----------------------------------------------------------- > > Joaquín Pérez Iglesias > > Dpto. Lenguajes y Sistemas Informáticos > > E.T.S.I. Informática (UNED) > > Ciudad Universitaria > > C/ Juan del Rosal nº 16 > > 28040 Madrid - Spain > > Phone. +34 91 398 87 25 > > Fax +34 91 398 65 35 > > Office 2.07 > > Email: [EMAIL PROTECTED] > > ----------------------------------------------------------- > > > > Mark Miller <[EMAIL PROTECTED]> > escribe : > > > > > >> Hey Joaquin, > >> > >> Your work here looks very interesting. The Lucene community has shown > a > >> strong interest in this area before (see LUCENE-965). > >> > >> I see you went with an lgpl license though. This might be a bit of a > > >> barrier in getting feedback from a community based on apache license > > >> software. Obviously, there still might be interest,learning, and an > > >> exchange of ideas, but none of your code can be distributed with > Lucene, > >> and so what you have done loses some of its appeal in that sense. Is > > >> there any chance you would be willing to relax the license, possibly > > >> gaining more feedback, contributors, and possible inclusion in Lucene? > > >> Certainly not necessary to receive feedback, but I think it would help > > >> -- I'd certainly be looking closer anyway. > >> > >> - Mark > >> > >> Joaquin Perez Iglesias wrote: > >> > >>> Hi all, > >>> > >>> finally I got some time to finish the BM25/BM25F implementation > for > >>> Lucene you can find more details at > >>> http://nlp.uned.es/~jperezi/Lucene-BM25/, > >>> > >> it has been tested but I > >> > >>> cannot assure that is bugs free. > >>> It would be great to receive some feedback about it. > >>> > >>> There are some details about the implementation that I consider > will > >>> be of interest,as how to calculate the average_length or idf at > > >>> document level. > >>> Please if you find any bug or mistake in the supplied > implementation > >>> let me know and I will try to solve it, same for questions. > >>> > >>> Hope that some of you will find useful. > >>> > >>> Thanks in advance. > >>> > >>> > >>> > >>> [EMAIL PROTECTED] > >>> > >> escribió: > >> > >>>> Hi Otis, > >>>> > >>>> as my colleague said, we have a first implementation of BM25 > over > >>>> Lucene, this development is part of a (almost finished) thesis > > >>>> project that compares different IR models, over an standard > > >>>> collection. At the same time we are trying to extend this > first > >>>> implementation in order to support BM25F for multifield > queries, > >>>> unfortunately at this time we are too busy to prepare a final > version > >>>> > >>>> of this code, so we will have to finish this code over the > summer > >>>> (hopefully we will have more time :-))), and make it public at > this > >>>> > >>>> time. > >>>> > >>>> We will inform to this list when we will finish the > preparation of a > >>>> > >>>> final version. > >>>> > >>>> Thanks to everybody for the interest!!! > >>>> > >>>> Bye > >>>> Joaquin > >>>> > >>>> ----------------------------------------------------------- > >>>> Joaquín Pérez Iglesias > >>>> Dpto. Lenguajes y Sistemas Informáticos > >>>> E.T.S.I. Informática (UNED) > >>>> Ciudad Universitaria > >>>> C/ Juan del Rosal nº 16 > >>>> 28040 Madrid - Spain > >>>> Phone. +34 91 398 87 25 > >>>> Fax +34 91 398 65 35 > >>>> Office 2.07 > >>>> Email: [EMAIL PROTECTED] > >>>> ----------------------------------------------------------- > Otis > >>>> Gospodnetic <[EMAIL PROTECTED]> > >>>> > >> escribe : > >> > >>>> > >>>> > >>>>> Hi Jose, > >>>>> > >>>>> I was wondering if you ever got to this. I would love to > see and > >>>>> > >>>>> try BM25 for > >>>>> Lucene! > >>>>> > >>>>> > >>>>> I'm looking at http://code.google.com/soc/2008/asf/about.html > >>>>> and it looks like this didn't make it into GSoC, but this > would > >>>>> > >>>>> still be great > >>>>> to have. > >>>>> > >>>>> Thanks, > >>>>> Otis > >>>>> -- > >>>>> Sematext -- http://sematext.com/ -- > >>>>> Lucene - Solr - Nutch > >>>>> > >>>>> > >>>>> ----- Original Message ---- > >>>>> > >>>>> > >>>>>> From: José Ramón Pérez Agüera <[EMAIL PROTECTED]> > >>>>>> To: java-dev@lucene.apache.org; > >>>>>> > >>>>>> > >>>>> Joaquin Perez-Iglesias <[EMAIL PROTECTED]> > >>>>> > >>>>> > >>>>>> Sent: Saturday, March 15, 2008 4:54:08 AM > >>>>>> Subject: Re: Summer of Code idea for lucene > >>>>>> > >>>>>> we have almost implemented BM25 using lucene > structure, but we > >>>>>> > >> need > >> > >>>>>> help to finish query parser and other details. If you > o > >>>>>> > >> somebody want > >> > >>>>>> We can send you the code and you can help us to > implement the > >>>>>> > >> query > >> > >>>>>> parser and prepare the code to sandbox. > >>>>>> > >>>>>> If there are people interested I can made a web page > for the > >>>>>> > >> project > >> > >>>>>> and put our implementatio to download > >>>>>> > >>>>>> Somebody is interested? > >>>>>> > >>>>>> jose > >>>>>> > >>>>>> -- > >>>>>> José Ramón Pérez Agüera > >>>>>> > >>>>>> Dept. de Ingeniería del Software e Inteligencia > Artificial > >>>>>> Despacho 411 tlf. 913947599 > >>>>>> Facultad de Informática > >>>>>> Universidad Complutense de Madrid > >>>>>> > >>>>>> On Sat, Mar 15, 2008 at 5:32 AM, Ian Holsman wrote: > >>>>>> > >>>>>> > >>>>>>> If no one objects (I don't think it's too late) > >>>>>>> > >>>>>>> would you mind a GSOC project to implement > BM25 > >>>>>>> > >>>>>>> > >>>>> relevancy/scoring? > >>>>> > >>>>> > >>>>> > >> > --------------------------------------------------------------------- > >> > >>>>> To unsubscribe, e-mail: [EMAIL PROTECTED] > >>>>> For additional commands, e-mail: [EMAIL PROTECTED] > >>>>> > >>>>> > >>>> ________________________________________________ > >>>> Servicio WebMail de CiberUNED http://www.uned.es > >>>> > >>>> > >>>> > >>>> > >>>> > >> > --------------------------------------------------------------------- > >> > >>>> To unsubscribe, e-mail: [EMAIL PROTECTED] > >>>> For additional commands, e-mail: [EMAIL PROTECTED] > >>>> > >>>> > >>>> > >>>> > >> > --------------------------------------------------------------------- > >> To unsubscribe, e-mail: [EMAIL PROTECTED] > >> For additional commands, e-mail: [EMAIL PROTECTED] > >> > > > > ________________________________________________ > > Servicio WebMail de CiberUNED http://www.uned.es > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] ________________________________________________ Servicio WebMail de CiberUNED http://www.uned.es --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]