Re: Summer of Code idea for lucene

joaquin.perez Sun, 14 Sep 2008 02:54:59 -0700

Hi Mark, just trying to answer some of your questions.

Regarding comparisons, I have prepared a small experiment using a subset of 
the EuroGov 2005 collection. Basically, I used the documents from the uk 
domain (around 60000) and 48 queries. The results are as follows:


=Lucene=
MAP=> .4777     MRR=> .4852
=BM25=
MAP=> .5147     MRR=> .5227


As you can see, there is an improvement of roughly 7.7% in both MAP and MRR 
using standard parameters for BM25. This is quite an improvement, although 
the collection is not a big one. 
Indeed, in recent years BM25 (and other models) have outperformed the more 
classical VSM approaches in competitions like TREC.
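
For readers less familiar with the two metrics, here is a minimal sketch of how MAP and MRR are usually computed, and how the relative gain follows from the figures reported above. It is written in Python for brevity. The function names and the toy ranked lists are made up for illustration; this is not the evaluation script used for the experiment.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k over the ranks k where a relevant doc appears."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def relative_gain(baseline, candidate):
    """Relative improvement of candidate over baseline, as a fraction."""
    return (candidate - baseline) / baseline


# Toy run (made-up data): one (ranked list, relevant set) pair per query.
runs = [
    (["d3", "d1", "d7"], {"d1", "d7"}),
    (["d2", "d9", "d4"], {"d2"}),
]
map_score = mean(average_precision(r, rel) for r, rel in runs)
mrr_score = mean(reciprocal_rank(r, rel) for r, rel in runs)

# Relative gains for the MAP and MRR figures quoted in the message.
map_gain = relative_gain(0.4777, 0.5147)
mrr_gain = relative_gain(0.4852, 0.5227)
print(f"MAP gain {map_gain:.1%}, MRR gain {mrr_gain:.1%}")
```

Both gains come out just under 8%, so "7%" in the text is a slight understatement.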

In terms of performance, I ran the same 48 topics 100 times. The results are 
as follows:
Lucene=> 18.7 secs
BM25=>   24.8 secs

BM25 takes approximately 33% longer than Lucene (put the other way, Lucene 
needs about 25% less time). Sure, this could be optimized; however, 4800 
queries in 25 secs sounds good to me.
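
The timing figures can be turned into throughput and relative cost with a few lines of arithmetic. The sketch below uses only the numbers quoted in this message; the variable names are illustrative. It also shows why the same gap reads as about 25% or about 33% depending on which system is the baseline.

```python
runs, topics = 100, 48
lucene_secs, bm25_secs = 18.7, 24.8

total_queries = runs * topics                 # 4800 query executions per system

# Extra time BM25 needs, measured against Lucene's time.
extra_time = bm25_secs / lucene_secs - 1
# Time Lucene saves, measured against BM25's time.
time_saved = 1 - lucene_secs / bm25_secs

lucene_qps = total_queries / lucene_secs
bm25_qps = total_queries / bm25_secs
bm25_ms_per_query = 1000 * bm25_secs / total_queries

print(f"BM25 takes {extra_time:.0%} longer")      # 33%
print(f"Lucene takes {time_saved:.0%} less time") # 25%
print(f"Lucene: {lucene_qps:.0f} queries/s, BM25: {bm25_qps:.0f} queries/s")
```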

About the implementation: I have tried to separate the boolean model from the 
ranking function used. Basically, there is support for simple (non-nested) 
boolean queries with AND, OR and NOT operators. 

I'm not sure if this is correct from a strict theoretical point of view. BM25 
is a probabilistic ranking function, which in some way means that only the OR 
operator should be supported.
However, the other operators have been added in the same way that Lucene 
mixes VSM with the boolean model. When different operators are used, 
BM25BooleanScorer is used, which is considerably slower than 
BM25SingleBooleanScorer, which calculates the score when just one boolean 
operator type is used.

The simplified boolean model is implemented with MatchAllBooleanScorer, 
MustBooleanScorer, ShouldBooleanScorer and NotBooleanScorer. Sure, this could 
be rewritten to look more similar to Lucene, but currently I'm not sure how 
to do it. Currently a TermDocs for each term is used to filter the documents 
according to the boolean query, and the resulting docs are ranked with 
BM25TermScorer.
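
The "filter with the boolean model, then rank" split described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual scorer classes. The postings dictionary, `filter_docs`, `rank` and `toy_score` are hypothetical names. The clause semantics are an assumption modelled on how Lucene usually treats flat queries: with MUST terms present, SHOULD terms only affect the score.

```python
def filter_docs(postings, must=(), should=(), must_not=()):
    """Return the doc ids matching a flat (non-nested) boolean query.

    postings maps each term to the set of doc ids that contain it.
    """
    if must:
        # AND: a document has to contain every MUST term.
        candidates = set.intersection(*(set(postings.get(t, ())) for t in must))
    else:
        # OR: a document has to contain at least one SHOULD term.
        candidates = set().union(*(postings.get(t, ()) for t in should))
    for term in must_not:
        # NOT: drop any document containing an excluded term.
        candidates -= set(postings.get(term, ()))
    return candidates


def rank(candidates, query_terms, score_term):
    """Order candidates by the sum of per-term scores, best first."""
    scored = [(sum(score_term(t, d) for t in query_terms), d) for d in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by doc id
    return [doc_id for _, doc_id in scored]


# Toy index (made-up data); raw term frequency stands in for the real scorer.
postings = {"lucene": {1, 2, 3}, "bm25": {2, 3, 4}, "solr": {3}}
term_freq = {("lucene", 1): 2, ("lucene", 2): 1, ("lucene", 3): 1,
             ("bm25", 2): 3, ("bm25", 3): 1, ("bm25", 4): 1, ("solr", 3): 1}


def toy_score(term, doc_id):
    return term_freq.get((term, doc_id), 0)


and_not = filter_docs(postings, must=["lucene", "bm25"], must_not=["solr"])
either = rank(filter_docs(postings, should=["lucene", "bm25"]),
              ["lucene", "bm25"], toy_score)
print(and_not, either)
```

Keeping the filter independent of `score_term` is what allows the ranking function to be swapped without touching the boolean logic.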

The ranking function is as described at 
http://nlp.uned.es/~jperezi/Lucene-BM25/ and is implemented in BM25TermScorer 
(for one term). I have tried to optimize it as much as I can, but surely this 
could be improved. Any ideas?
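
For reference, a per-term BM25 score in its common textbook form can be sketched as below. This is an assumption: the exact formula and parameter values used in BM25TermScorer are the ones on the linked page and may differ. The defaults `k1=1.2` and `b=0.75` are commonly cited values, not necessarily the "standard parameters" of the experiment, and the function names are illustrative.

```python
import math


def idf(num_docs, doc_freq):
    """Robertson/Sparck Jones style idf; negative if the term is in over half the docs."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Contribution of one query term to a document's BM25 score."""
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Term-frequency saturation: the fraction approaches (k1 + 1) as tf grows.
    return idf(num_docs, doc_freq) * tf * (k1 + 1.0) / (tf + norm)


# Illustrative numbers only: a 60000-document collection, term found in 100 docs.
term_idf = idf(60000, 100)
once = bm25_term_score(1, 100, 100, 60000, 100)
five_times = bm25_term_score(5, 100, 100, 60000, 100)
long_doc = bm25_term_score(1, 400, 100, 60000, 100)
print(once, five_times, long_doc)
```

In this form the idf and the length factor depend only on the term and on the document respectively, so one common optimisation is to compute the idf once per query term and to cache the per-document length factor rather than recomputing both for every posting.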

As you can see, there is some work left to do, but I hope that with some help 
this can be improved. I believe that adding support for BM25 to the Lucene 
framework would be a great step forward; it is currently quite common to find 
support for this ranking function in many search engines (Terrier, Sphinx, 
MG4J, Lemur, Xapian, ...).

Thanks.

-----------------------------------------------------------
Joaquín Pérez Iglesias
Dpto. Lenguajes y Sistemas Informáticos
E.T.S.I. Informática (UNED)
Ciudad Universitaria
C/ Juan del Rosal nº 16
28040 Madrid - Spain
Phone. +34 91 398 87 25
Fax    +34 91 398 65 35
Office  2.07
Email: [EMAIL PROTECTED]
-----------------------------------------------------------

Mark Miller <[EMAIL PROTECTED]> wrote:

> Cool, thanks.
> 
> Have you done any comparisons with the current scoring system? Can you 
> claim strong improvements? Have you looked at the performance impact at 
> all yet? That score method in termscorer looks particularly slow. Could 
> you explain a little how your bm25 implementation differs from the 
> current imp in a bit more depth? It looks like your booleanquery is 
> pretty simplified compared to the old.
> 
> [EMAIL PROTECTED] wrote:
> > Hi Mark,
> >
> > thank you for your advice, I don't know too much about licenses.
> > I have just changed the license to Apache-2.0, hope that this will be ok,
> > and make things easier.
> >
> > If you need any help or have some comments about the implementation,
> > please let me know. I would be really happy if this implementation is
> > finally integrated into Lucene.
> >
> > Joaquín.
> >
> >
> > Mark Miller <[EMAIL PROTECTED]> wrote:
> >
> >> Hey Joaquin,
> >>
> >> Your work here looks very interesting. The Lucene community has shown a
> >> strong interest in this area before (see LUCENE-965).
> >>
> >> I see you went with an lgpl license though. This might be a bit of a
> >> barrier in getting feedback from a community based on apache license
> >> software. Obviously, there still might be interest, learning, and an
> >> exchange of ideas, but none of your code can be distributed with Lucene,
> >> and so what you have done loses some of its appeal in that sense. Is
> >> there any chance you would be willing to relax the license, possibly
> >> gaining more feedback, contributors, and possible inclusion in Lucene?
> >> Certainly not necessary to receive feedback, but I think it would help
> >> -- I'd certainly be looking closer anyway.
> >>
> >> - Mark
> >>
> >> Joaquin Perez Iglesias wrote:
> >>> Hi all,
> >>>
> >>> finally I got some time to finish the BM25/BM25F implementation for
> >>> Lucene. You can find more details at
> >>> http://nlp.uned.es/~jperezi/Lucene-BM25/. It has been tested, but I
> >>> cannot guarantee that it is bug-free.
> >>> It would be great to receive some feedback about it.
> >>>
> >>> There are some details about the implementation that I consider will
> >>> be of interest, such as how to calculate the average_length or idf at
> >>> document level.
> >>> Please, if you find any bug or mistake in the supplied implementation,
> >>> let me know and I will try to solve it; same for questions.
> >>>
> >>> Hope that some of you will find it useful.
> >>>
> >>> Thanks in advance.
> >>>
> >>> [EMAIL PROTECTED] wrote:
> >>>> Hi Otis,
> >>>>
> >>>> as my colleague said, we have a first implementation of BM25 over
> >>>> Lucene. This development is part of an (almost finished) thesis
> >>>> project that compares different IR models over a standard
> >>>> collection. At the same time we are trying to extend this first
> >>>> implementation in order to support BM25F for multifield queries.
> >>>> Unfortunately, at this time we are too busy to prepare a final version
> >>>> of this code, so we will have to finish it over the summer
> >>>> (hopefully we will have more time :-))), and make it public at that
> >>>> time.
> >>>>
> >>>> We will inform this list when we finish the preparation of a
> >>>> final version.
> >>>>
> >>>> Thanks to everybody for the interest!!!
> >>>>
> >>>> Bye
> >>>> Joaquin
> >>>>
> >>>> Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> Hi Jose,
> >>>>>
> >>>>> I was wondering if you ever got to this.  I would love to see and
> >>>>> try BM25 for Lucene!
> >>>>>
> >>>>> I'm looking at http://code.google.com/soc/2008/asf/about.html
> >>>>> and it looks like this didn't make it into GSoC, but this would
> >>>>> still be great to have.
> >>>>>
> >>>>> Thanks,
> >>>>> Otis
> >>>>> -- 
> >>>>> Sematext -- http://sematext.com/ --
> >>>>> Lucene - Solr - Nutch
> >>>>>
> >>>>>
> >>>>> ----- Original Message ----
> >>>>>> From: José Ramón Pérez Agüera <[EMAIL PROTECTED]>
> >>>>>> To: java-dev@lucene.apache.org;
> >>>>>>     Joaquin Perez-Iglesias <[EMAIL PROTECTED]>
> >>>>>> Sent: Saturday, March 15, 2008 4:54:08 AM
> >>>>>> Subject: Re: Summer of Code idea for lucene
> >>>>>>
> >>>>>> we have almost implemented BM25 using the Lucene structure, but we
> >>>>>> need help to finish the query parser and other details. If you or
> >>>>>> somebody else wants, we can send you the code and you can help us
> >>>>>> to implement the query parser and prepare the code for the sandbox.
> >>>>>>
> >>>>>> If there are people interested, I can make a web page for the
> >>>>>> project and put our implementation up for download.
> >>>>>>
> >>>>>> Is anybody interested?
> >>>>>>
> >>>>>> jose
> >>>>>>
> >>>>>> -- 
> >>>>>> José Ramón Pérez Agüera
> >>>>>>
> >>>>>> Dept. de Ingeniería del Software e Inteligencia Artificial
> >>>>>> Despacho 411 tlf. 913947599
> >>>>>> Facultad de Informática
> >>>>>> Universidad Complutense de Madrid
> >>>>>>
> >>>>>> On Sat, Mar 15, 2008 at 5:32 AM, Ian Holsman wrote:
> >>>>>>> If no one objects (I don't think it's too late)
> >>>>>>>
> >>>>>>>  would you mind a GSOC project to implement BM25
> >>>>>>> relevancy/scoring?
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>> For additional commands, e-mail: [EMAIL PROTECTED]
> >>>>>
> >>>> ________________________________________________
> >>>> Servicio WebMail de CiberUNED http://www.uned.es
> >

