Re: BM25 Scoring Patch

JOAQUIN PEREZ IGLESIAS Tue, 16 Feb 2010 12:24:33 -0800

Ok,

I'm not advocating the BM25 patch neither, unfortunately BM25 was not my
idea :-))), and I'm sure that the implementation can be improved.


When you use the BM25 implementation, are you optimising the parameters
specifically per collection? (It is a key factor for improving BM25
performance).

Why do you think that BM25 works better for English than in other
languages (apart of experiments). What are your intuitions?

I dont't have too much experience on languages moreover of Spanish and
English, and it sounds pretty interesting.

Kind Regards.

P.S: Maybe this is not a topic for this list???


> Joaquin, I don't see this as a flame war? First of all I'd like to
> personally thank you for your excellent BM25 implementation!
>
> I think the selection of a retrieval model depends highly on the
> language/indexing approach, i.e. if we were talking East Asian languages I
> think we want a probabilistic model: no argument there!
>
> All i said was that it is a myth that BM25 is "always" better than
> Lucene's
> scoring model, it really depends on what you are trying to do, how you are
> indexing your text, properties of your corpus, how your queries are
> running.
>
> I don't even want to come across as advocating the lnb.ltc approach
> either,
> sure I wrote the patch, but this means nothing. I only like it as its
> currently a simple integration into Lucene, but long-term its best if we
> can
> support other models also!
>
> Finally I think there is something to be said for Lucene's default
> retrieval
> model, which in my (non-english) findings across the board isn't terrible
> at
> all... then again I am working with languages where analysis is really the
> thing holding Lucene back, not scoring.
>
> On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN PEREZ IGLESIAS <
> [email protected]> wrote:
>
>> Just some final comments (as I said I'm not interested in flame wars),
>>
>> If I obtain better results there are not problem with pooling otherwise
>> it
>> is biased.
>> The only important thing (in my opinion) is that it cannot be said that
>> BM25 is a myth.
>> Yes, you are right there is not an only ranking model that beats the
>> rest,
>> but there are models that generally show a better performance in more
>> cases.
>>
>> About CLEF I have had the same experience (VSM vs BM25) on Spanish and
>> English (WebCLEF) and Q&A (ResPubliQA)
>>
>> Ivan checks the parameters (b and k1), probably you can improve your
>> results. (that's the bad part of BM25).
>>
>> Finally we are just speaking of personal experience, so obviously you
>> should use the best model for your data and your own experience, on IR
>> there are not myths neither best ranking models. If any of us is able to
>> find the &#8220;best&#8221;  ranking model, or is able to prove that any
>> state-of-the art is a myth he should send these results to the SIGIR
>> conference.
>>
>> Ivan, Robert good luck with your experiments, as I said the good part of
>> IR is that you can always make experiments on your own.
>>
>> > I don't think its really a competition, I think preferably we should
>> have
>> > the flexibility to change the scoring model in lucene actually?
>> >
>> > I have found lots of cases where VSM improves on BM25, but then again
>> I
>> > don't work with TREC stuff, as I work with non-english collections.
>> >
>> > It doesn't contradict years of research to say that VSM isn't a
>> > state-of-the-art model, besides the TREC-4 results, there are CLEF
>> results
>> > where VSM models perform competitively or exceed (Finnish, Russian,
>> etc)
>> > BM25/DFR/etc.
>> >
>> > It depends on the collection, there isn't a 'best retrieval formula'.
>> >
>> > Note: I have no bias against BM-25, but its definitely a myth to say
>> there
>> > is a single retrieval formula that is the 'best' across the board.
>> >
>> >
>> > On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
>> > [email protected]> wrote:
>> >
>> >> By the way,
>> >>
>> >> I don't want to start a flame war VSM vs BM25, but I really believe
>> that
>> >> I
>> >> have to express my opinion as Robert has done. In my experience, I
>> have
>> >> never found a case where VSM improves significantly BM25. Maybe you
>> can
>> >> find some cases under some very specific collection characteristics,
>> (as
>> >> average length of 300 vs 3000) or a bad usage of BM25 (not proper
>> >> parameters) where it can happen.
>> >>
>> >> BM25 is not just only a different way of length normalization, it is
>> >> based
>> >> strongly in the probabilistic framework, and parametrises frequencies
>> >> and
>> >> length. This is probably the most successful ranking model of the
>> last
>> >> years in Information Retrieval.
>> >>
>> >> I have never read a paper where VSM  improves any of the
>> >> state-of-the-art
>> >> ranking models (Language Models, DFR, BM25,...),  although the VSM
>> with
>> >> pivoted normalisation length can obtain nice results. This can be
>> proved
>> >> checking the last years of the TREC competition.
>> >>
>> >> Honestly to say that is a myth that BM25 improves VSM breaks the last
>> 10
>> >> or 15 years of research on Information Retrieval, and I really
>> believe
>> >> that is not accurate.
>> >>
>> >> The good thing of Information Retrieval is that you can always make
>> your
>> >> owns experiments and you can use the experience of a lot of years of
>> >> research.
>> >>
>> >> PS: This opinion is based on experiments on TREC and CLEF
>> collections,
>> >> obviously we can start a debate about the suitability of this type of
>> >> experimentation (concept of relevance, pooling, relevance
>> judgements),
>> >> but
>> >> this is a much more complex topic and I believe is far from what we
>> are
>> >> dealing here.
>> >>
>> >> PS2: In relation with TREC4 Cornell used a pivoted length
>> normalisation
>> >> and they were applying pseudo-relevance feedback, what honestly makes
>> >> much
>> >> more difficult the analysis of the results. Obviously their results
>> were
>> >> part of the pool.
>> >>
>> >> Sorry for the huge mail :-))))
>> >>
>> >> > Hi Ivan,
>> >> >
>> >> > the problem is that unfortunately BM25
>> >> > cannot be implemented overwriting
>> >> > the Similarity interface. Therefore BM25Similarity
>> >> > only computes the classic probabilistic IDF (what is
>> >> > interesting only at search time).
>> >> > If you set BM25Similarity at indexing time
>> >> > some basic stats are not stored
>> >> > correctly in the segments (like docs length).
>> >> >
>> >> > When you use BM25BooleanQuery this class
>> >> > will set automatically the BM25Similarity for you,
>> >> > therefore you don't need to do this explicitly.
>> >> >
>> >> > I tried to make this implementation with the focus on
>> >> > not interfering on the typical use of Lucene (so no changing
>> >> > DefaultSimilarity).
>> >> >
>> >> >> Joaquin, Robert,
>> >> >>
>> >> >> I followed Joaquin's recommendation and removed the call to set
>> >> >> similarity
>> >> >> to BM25 explicitly (indexer, searcher).  The results showed 55%
>> >> >> improvement for the MAP score (0.141->0.219) over default
>> similarity.
>> >> >>
>> >> >> Joaquin, how would setting the similarity to BM25 explicitly make
>> the
>> >> >> score worse?
>> >> >>
>> >> >> Thank you,
>> >> >>
>> >> >> Ivan
>> >> >>
>> >> >>
>> >> >>
>> >> >> --- On Tue, 2/16/10, Robert Muir <[email protected]> wrote:
>> >> >>
>> >> >>> From: Robert Muir <[email protected]>
>> >> >>> Subject: Re: BM25 Scoring Patch
>> >> >>> To: [email protected]
>> >> >>> Date: Tuesday, February 16, 2010, 11:36 AM
>> >> >>> yes Ivan, if possible please report
>> >> >>> back any findings you can on the
>> >> >>> experiments you are doing!
>> >> >>>
>> >> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
>> >> >>> <
>> >> >>> [email protected]>
>> >> >>> wrote:
>> >> >>>
>> >> >>> > Hi Ivan,
>> >> >>> >
>> >> >>> > You shouldn't set the BM25Similarity for indexing or
>> >> >>> searching.
>> >> >>> > Please try removing the lines:
>> >> >>> >   writer.setSimilarity(new
>> >> >>> BM25Similarity());
>> >> >>> >   searcher.setSimilarity(sim);
>> >> >>> >
>> >> >>> > Please let us/me know if you improve your results with
>> >> >>> these changes.
>> >> >>> >
>> >> >>> >
>> >> >>> > Robert Muir escribiÃ³:
>> >> >>> >
>> >> >>> >  Hi Ivan, I've seen many cases where BM25
>> >> >>> performs worse than Lucene's
>> >> >>> >> default Similarity. Perhaps this is just another
>> >> >>> one?
>> >> >>> >>
>> >> >>> >> Again while I have not worked with this particular
>> >> >>> collection, I looked at
>> >> >>> >> the statistics and noted that its composed of
>> >> >>> several 'sub-collections':
>> >> >>> >> for
>> >> >>> >> example the PAT documents on disk 3 have an
>> >> >>> average doc length of 3543,
>> >> >>> >> but
>> >> >>> >> the AP documents on disk 1 have an avg doc length
>> >> >>> of 353.
>> >> >>> >>
>> >> >>> >> I have found on other collections that any
>> >> >>> advantages of BM25's document
>> >> >>> >> length normalization fall apart when 'average
>> >> >>> document length' doesn't
>> >> >>> >> make
>> >> >>> >> a whole lot of sense (cases like this).
>> >> >>> >>
>> >> >>> >> For this same reason, I've only found a few
>> >> >>> collections where BM25's doc
>> >> >>> >> length normalization is really significantly
>> >> >>> better than Lucene's.
>> >> >>> >>
>> >> >>> >> In my opinion, the results on a particular test
>> >> >>> collection or 2 have
>> >> >>> >> perhaps
>> >> >>> >> been taken too far and created a myth that BM25 is
>> >> >>> always superior to
>> >> >>> >> Lucene's scoring... this is not true!
>> >> >>> >>
>> >> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
>> >> >>> <[email protected]>
>> >> >>> >> wrote:
>> >> >>> >>
>> >> >>> >>  I applied the Lucene patch mentioned in
>> >> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
>> >> >>> ran the MAP
>> >> >>> >>> numbers
>> >> >>> >>> on TREC-3 collection using topics
>> >> >>> 151-200.  I am not getting worse
>> >> >>> >>> results
>> >> >>> >>> comparing to Lucene DefaultSimilarity.  I
>> >> >>> suspect, I am not using it
>> >> >>> >>> correctly.  I have single field
>> >> >>> documents.  This is the process I use:
>> >> >>> >>>
>> >> >>> >>> 1. During the indexing, I am setting the
>> >> >>> similarity to BM25 as such:
>> >> >>> >>>
>> >> >>> >>> IndexWriter writer = new IndexWriter(dir, new
>> >> >>> StandardAnalyzer(
>> >> >>> >>>
>> >> >>>    Version.LUCENE_CURRENT), true,
>> >> >>> >>>
>> >> >>>    IndexWriter.MaxFieldLength.UNLIMITED);
>> >> >>> >>> writer.setSimilarity(new BM25Similarity());
>> >> >>> >>>
>> >> >>> >>> 2. During the Precision/Recall measurements, I
>> >> >>> am using a
>> >> >>> >>> SimpleBM25QQParser extension I added to the
>> >> >>> benchmark:
>> >> >>> >>>
>> >> >>> >>> QualityQueryParser qqParser = new
>> >> >>> SimpleBM25QQParser("title", "TEXT");
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>> 3. Here is the parser code (I set an avg doc
>> >> >>> length here):
>> >> >>> >>>
>> >> >>> >>> public Query parse(QualityQuery qq) throws
>> >> >>> ParseException {
>> >> >>> >>>   BM25Parameters.setAverageLength(indexField,
>> >> >>> 798.30f);//avg doc length
>> >> >>> >>>   BM25Parameters.setB(0.5f);//tried
>> >> >>> default values
>> >> >>> >>>   BM25Parameters.setK1(2f);
>> >> >>> >>>   return query = new
>> >> >>> BM25BooleanQuery(qq.getValue(qqName), indexField,
>> >> >>> >>> new
>> >> >>> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
>> >> >>> >>> }
>> >> >>> >>>
>> >> >>> >>> 4. The searcher is using BM25 similarity:
>> >> >>> >>>
>> >> >>> >>> Searcher searcher = new IndexSearcher(dir,
>> >> >>> true);
>> >> >>> >>> searcher.setSimilarity(sim);
>> >> >>> >>>
>> >> >>> >>> Am I missing some steps?  Does anyone
>> >> >>> have experience with this code?
>> >> >>> >>>
>> >> >>> >>> Thanks,
>> >> >>> >>>
>> >> >>> >>> Ivan
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>>
>> >> >>>
>> ---------------------------------------------------------------------
>> >> >>> >>> To unsubscribe, e-mail:
>> [email protected]
>> >> >>> >>> For additional commands, e-mail:
>> >> [email protected]
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>
>> >> >>> >>
>> >> >>> > --
>> >> >>> >
>> >> >>> -----------------------------------------------------------
>> >> >>> > JoaquÃn PÃ©rez Iglesias
>> >> >>> > Dpto. Lenguajes y Sistemas InformÃ¡ticos
>> >> >>> > E.T.S.I. InformÃ¡tica (UNED)
>> >> >>> > Ciudad Universitaria
>> >> >>> > C/ Juan del Rosal nÂº 16
>> >> >>> > 28040 Madrid - Spain
>> >> >>> > Phone. +34 91 398 89 19
>> >> >>> > Fax    +34 91 398 65 35
>> >> >>> > Office  2.11
>> >> >>> > Email: [email protected]
>> >> >>> > web:
>> http://nlp.uned.es/~jperezi/<http://nlp.uned.es/%7Ejperezi/>
>> >> <http://nlp.uned.es/%7Ejperezi/><
>> >> http://nlp.uned.es/%7Ejperezi/>
>> >> >>> >
>> >> >>> -----------------------------------------------------------
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>>
>> ---------------------------------------------------------------------
>> >> >>> > To unsubscribe, e-mail: [email protected]
>> >> >>> > For additional commands, e-mail:
>> [email protected]
>> >> >>> >
>> >> >>> >
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> Robert Muir
>> >> >>> [email protected]
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: [email protected]
>> >> >> For additional commands, e-mail: [email protected]
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: [email protected]
>> >> > For additional commands, e-mail: [email protected]
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>> >>
>> >
>> >
>> > --
>> > Robert Muir
>> > [email protected]
>> >
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
>
> --
> Robert Muir
> [email protected]
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: BM25 Scoring Patch

Reply via email to