Re: BM25 Scoring Patch

Robert Muir Tue, 16 Feb 2010 11:23:16 -0800

no i mean that gathering your previous emails you have supplied these MAP
improvements:


SweetSpot: 15%
lnb.ltc: 24%
bm25: 21%

these are close enough that given the bias from a pooled collection (
http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf) I
wouldn't want to say for sure that any is better than the other for this
collection, but its probably safe to say they are all an improvement for
this collection... this is assuming your SweetSpot/lnb.btc calculations were
correct?

On Tue, Feb 16, 2010 at 2:16 PM, Ivan Provalov <iprov...@yahoo.com> wrote:

> By the end of the week, I will publish the results once we run the
> experiments on a full collection.  Are you talking about the bias caused by
> using a sub-collection?
>
> Thanks,
>
> Ivan
>
> --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:
>
> > From: Robert Muir <rcm...@gmail.com>
> > Subject: Re: BM25 Scoring Patch
> > To: java-user@lucene.apache.org
> > Date: Tuesday, February 16, 2010, 2:11 PM
> > Ivan, ok. it would be cool if you can
> > list the map and bpref for the
> > different approaches you try (default lucene, lnb.ltc,
> > bm25), with or
> > without stemming.
> >
> > as you reported previously you got a 24% improvement with
> > lnb.btc (right?) I
> > am guessing that we won't be able to draw many conclusions
> > at all due to
> > bias.
> >
> > On Tue, Feb 16, 2010 at 2:01 PM, Ivan Provalov <iprov...@yahoo.com>
> > wrote:
> >
> > > Robert, Joaquin,
> > >
> > > Sorry, I made an error reporting the results.
> > The preliminary improvement
> > > is around 21% (it's a reduced collection).  I
> > will have to run another test
> > > to get the final numbers on the complete collection.
> > >
> > > We are planning to also apply the stemming.
> > Right now we are trying to
> > > isolate each improvement experiment.
> > >
> > > Thanks,
> > >
> > > Ivan
> > >
> > >
> > >
> > > --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com>
> > wrote:
> > >
> > > > From: Robert Muir <rcm...@gmail.com>
> > > > Subject: Re: BM25 Scoring Patch
> > > > To: java-user@lucene.apache.org
> > > > Date: Tuesday, February 16, 2010, 1:14 PM
> > > > Ivan just a little more food for
> > > > thought to help you with this:
> > > >
> > > > I'm glad you got improved results, yet I stand by
> > my
> > > > original statement of
> > > > 'be careful' interpreting too much from one
> > collection.
> > > >
> > > > eg. had you chosen TREC-4 instead of TREC-3, you
> > would see
> > > > different
> > > > results, as vector-space with non-cosine doc
> > length norm
> > > > (LUCENE-2187)
> > > > performed better than BM25 there:
> > > > http://trec.nist.gov/pubs/trec4/overview.ps.gz
> > > >
> > > > in truth its hard to 'reuse' a pooled test
> > collection to
> > > > compare methods
> > > > that were not part of the pool:
> > > > http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf
> > > >
> > > > This might help explain why you see such a
> > difference in
> > > > MAP score!
> > > >
> > > > On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov
> > <iprov...@yahoo.com>
> > > > wrote:
> > > >
> > > > > Joaquin, Robert,
> > > > >
> > > > > I followed Joaquin's recommendation and
> > removed the
> > > > call to set similarity
> > > > > to BM25 explicitly (indexer,
> > searcher).  The
> > > > results showed 55% improvement
> > > > > for the MAP score (0.141->0.219) over
> > default
> > > > similarity.
> > > > >
> > > > > Joaquin, how would setting the similarity to
> > BM25
> > > > explicitly make the score
> > > > > worse?
> > > > >
> > > > > Thank you,
> > > > >
> > > > > Ivan
> > > > >
> > > > >
> > > > >
> > > > > --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > From: Robert Muir <rcm...@gmail.com>
> > > > > > Subject: Re: BM25 Scoring Patch
> > > > > > To: java-user@lucene.apache.org
> > > > > > Date: Tuesday, February 16, 2010, 11:36
> > AM
> > > > > > yes Ivan, if possible please report
> > > > > > back any findings you can on the
> > > > > > experiments you are doing!
> > > > > >
> > > > > > On Tue, Feb 16, 2010 at 11:22 AM,
> > Joaquin Perez
> > > > Iglesias
> > > > > > <
> > > > > > joaquin.pe...@lsi.uned.es>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Ivan,
> > > > > > >
> > > > > > > You shouldn't set the
> > BM25Similarity for
> > > > indexing or
> > > > > > searching.
> > > > > > > Please try removing the lines:
> > > > > >
> > >   writer.setSimilarity(new
> > > > > > BM25Similarity());
> > > > > >
> > > >
> > >   searcher.setSimilarity(sim);
> > > > > > >
> > > > > > > Please let us/me know if you
> > improve your
> > > > results with
> > > > > > these changes.
> > > > > > >
> > > > > > >
> > > > > > > Robert Muir escribió:
> > > > > > >
> > > > > > >  Hi Ivan, I've seen many
> > cases where
> > > > BM25
> > > > > > performs worse than Lucene's
> > > > > > >> default Similarity. Perhaps
> > this is just
> > > > another
> > > > > > one?
> > > > > > >>
> > > > > > >> Again while I have not worked
> > with this
> > > > particular
> > > > > > collection, I looked at
> > > > > > >> the statistics and noted that
> > its
> > > > composed of
> > > > > > several 'sub-collections':
> > > > > > >> for
> > > > > > >> example the PAT documents on
> > disk 3 have
> > > > an
> > > > > > average doc length of 3543,
> > > > > > >> but
> > > > > > >> the AP documents on disk 1
> > have an avg
> > > > doc length
> > > > > > of 353.
> > > > > > >>
> > > > > > >> I have found on other
> > collections that
> > > > any
> > > > > > advantages of BM25's document
> > > > > > >> length normalization fall
> > apart when
> > > > 'average
> > > > > > document length' doesn't
> > > > > > >> make
> > > > > > >> a whole lot of sense (cases
> > like this).
> > > > > > >>
> > > > > > >> For this same reason, I've
> > only found a
> > > > few
> > > > > > collections where BM25's doc
> > > > > > >> length normalization is
> > really
> > > > significantly
> > > > > > better than Lucene's.
> > > > > > >>
> > > > > > >> In my opinion, the results on
> > a
> > > > particular test
> > > > > > collection or 2 have
> > > > > > >> perhaps
> > > > > > >> been taken too far and created
> > a myth
> > > > that BM25 is
> > > > > > always superior to
> > > > > > >> Lucene's scoring... this is
> > not true!
> > > > > > >>
> > > > > > >> On Tue, Feb 16, 2010 at 9:46
> > AM, Ivan
> > > > Provalov
> > > > > > <iprov...@yahoo.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>  I applied the Lucene
> > patch
> > > > mentioned in
> > > > > > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > > > > > ran the MAP
> > > > > > >>> numbers
> > > > > > >>> on TREC-3 collection using
> > topics
> > > > > > 151-200.  I am not getting worse
> > > > > > >>> results
> > > > > > >>> comparing to Lucene
> > > > DefaultSimilarity.  I
> > > > > > suspect, I am not using it
> > > > > > >>> correctly.  I have
> > single
> > > > field
> > > > > > documents.  This is the process I
> > use:
> > > > > > >>>
> > > > > > >>> 1. During the indexing, I
> > am setting
> > > > the
> > > > > > similarity to BM25 as such:
> > > > > > >>>
> > > > > > >>> IndexWriter writer = new
> > > > IndexWriter(dir, new
> > > > > > StandardAnalyzer(
> > > > > > >>>
> > > > > >    Version.LUCENE_CURRENT),
> > true,
> > > > > > >>>
> > > > > >
> > > > IndexWriter.MaxFieldLength.UNLIMITED);
> > > > > > >>> writer.setSimilarity(new
> > > > BM25Similarity());
> > > > > > >>>
> > > > > > >>> 2. During the
> > Precision/Recall
> > > > measurements, I
> > > > > > am using a
> > > > > > >>> SimpleBM25QQParser
> > extension I added
> > > > to the
> > > > > > benchmark:
> > > > > > >>>
> > > > > > >>> QualityQueryParser
> > qqParser = new
> > > > > > SimpleBM25QQParser("title", "TEXT");
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> 3. Here is the parser code
> > (I set an
> > > > avg doc
> > > > > > length here):
> > > > > > >>>
> > > > > > >>> public Query
> > parse(QualityQuery qq)
> > > > throws
> > > > > > ParseException {
> > > > > >
> > > >
> > >>>   BM25Parameters.setAverageLength(indexField,
> > > > > > 798.30f);//avg doc length
> > > > > >
> > > >
> > >>>   BM25Parameters.setB(0.5f);//tried
> > > > > > default values
> > > > > >
> > > >
> > >>>   BM25Parameters.setK1(2f);
> > > > > > >>>   return
> > query = new
> > > > > > BM25BooleanQuery(qq.getValue(qqName),
> > > > indexField,
> > > > > > >>> new
> > > > > > >>>
> > > > StandardAnalyzer(Version.LUCENE_CURRENT));
> > > > > > >>> }
> > > > > > >>>
> > > > > > >>> 4. The searcher is using
> > BM25
> > > > similarity:
> > > > > > >>>
> > > > > > >>> Searcher searcher = new
> > > > IndexSearcher(dir,
> > > > > > true);
> > > > > > >>>
> > searcher.setSimilarity(sim);
> > > > > > >>>
> > > > > > >>> Am I missing some
> > steps?  Does
> > > > anyone
> > > > > > have experience with this code?
> > > > > > >>>
> > > > > > >>> Thanks,
> > > > > > >>>
> > > > > > >>> Ivan
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > >
> > > >
> > ---------------------------------------------------------------------
> > > > > > >>> To unsubscribe, e-mail:
> java-user-unsubscr...@lucene.apache.org
> > > > > > >>> For additional commands,
> > e-mail:
> > > java-user-h...@lucene.apache.org
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > > --
> > > > > > >
> > > > > >
> > > >
> > -----------------------------------------------------------
> > > > > > > Joaquín Pérez Iglesias
> > > > > > > Dpto. Lenguajes y Sistemas
> > Informáticos
> > > > > > > E.T.S.I. Informática (UNED)
> > > > > > > Ciudad Universitaria
> > > > > > > C/ Juan del Rosal nº 16
> > > > > > > 28040 Madrid - Spain
> > > > > > > Phone. +34 91 398 89 19
> > > > > > > Fax    +34 91 398 65 35
> > > > > > > Office  2.11
> > > > > > > Email: joaquin.pe...@lsi.uned.es
> > > > > > > web:   
> > > > > > > http://nlp.uned.es/~jperezi/<http://nlp.uned.es/%7Ejperezi/>
> <http://nlp.uned.es/%7Ejperezi/><
> > > http://nlp.uned.es/%7Ejperezi/> <
> > > > > http://nlp.uned.es/%7Ejperezi/>
> > > > > > >
> > > > > >
> > > >
> > -----------------------------------------------------------
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail:
> java-user-unsubscr...@lucene.apache.org
> > > > > > > For additional commands, e-mail:
> > java-user-h...@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Robert Muir
> > > > > > rcm...@gmail.com
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcm...@gmail.com
> > > >
> > >
> > >
> > >
> > >
> > >
> > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Robert Muir
rcm...@gmail.com

Re: BM25 Scoring Patch

Reply via email to