OK (they already have configuration parameters). Leo, if you can submit a patch to extend the BM25 similarity, I would welcome it.
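For reference, the existing knobs are the k1 and b constructor arguments of BM25Similarity, and an alternative similarity is plugged in the same way on the index and search sides. A minimal wiring sketch, assuming the 6.x-era API (the analyzer, directory, and field below are just placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.search.similarities.Similarity;
    import org.apache.lucene.store.RAMDirectory;

    public class SimilarityWiring {
      public static void main(String[] args) throws Exception {
        // BM25Similarity already exposes its two tuning parameters, k1 and b.
        Similarity sim = new BM25Similarity(1.2f, 0.75f);

        // An alternative similarity (e.g., one that keeps the exact document
        // length) would be wired in exactly the same way, on both sides.
        RAMDirectory dir = new RAMDirectory();
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setSimilarity(sim);                    // used when norms are written
        try (IndexWriter writer = new IndexWriter(dir, iwc)) {
          Document doc = new Document();
          doc.add(new TextField("body", "hello bm25 world", Field.Store.NO));
          writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          IndexSearcher searcher = new IndexSearcher(reader);
          searcher.setSimilarity(sim);             // used when documents are scored
        }
      }
    }

An extended BM25 patch would slot in through those same two setSimilarity calls rather than through a new flag, which fits Robert's point below about preferring alternative implementations.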
On Sat, Jul 9, 2016 at 7:11 AM Robert Muir <[email protected]> wrote:

> Our similarities do not need a boolean flag. Instead we should focus
> on making them as simple as possible: there can always be alternative
> implementations.
>
> On Sat, Jul 9, 2016 at 1:08 AM, David Smiley <[email protected]> wrote:
>
>> I agree that using one byte by default is questionable on modern
>> machines and given common text field sizes as well. I think my
>> understanding of how norms are encoded/accessed may be wrong from what
>> I had said. Lucene53NormsFormat supports longs, I see, and it is clever
>> about observing the maximum bytes-per-value needed, so there is no need
>> for some new format. It is the Similarity implementations (BM25 is one,
>> but others do this too) that choose to encode a smaller value. It would
>> be nice to have this be toggleable! Maybe just a boolean flag?
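To make the "encode a smaller value" point concrete: in the 6.x line the BM25 similarity folds the field length into 1/sqrt(length) and stores it as a lossy one-byte float, so many distinct lengths collapse into the same bucket. A minimal round-trip sketch, assuming the SmallFloat.floatToByte315/byte315ToFloat utilities (the exact rounding inside BM25Similarity may differ slightly):

    import org.apache.lucene.util.SmallFloat;

    // Round-trips a few field lengths through the one-byte norm encoding and
    // prints the length the scorer effectively sees after decoding.
    public class NormRoundTrip {
      public static void main(String[] args) {
        for (int length : new int[] {10, 11, 12, 50, 500, 501, 550, 5000}) {
          byte norm = SmallFloat.floatToByte315(1f / (float) Math.sqrt(length));
          float decoded = SmallFloat.byte315ToFloat(norm);
          int effectiveLength = Math.round(1f / (decoded * decoded));
          System.out.printf("true length = %4d -> decoded length ~ %4d%n",
              length, effectiveLength);
        }
      }
    }

With only 256 representable values, long documents that differ by hundreds of words can end up with identical norms; that quantization is the approximation debated below.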
>>
>> On Thu, Jul 7, 2016 at 9:52 PM Leo Boytsov <[email protected]> wrote:
>>
>>> Hi David,
>>>
>>> thank you for picking it up. Now we are having a more meaningful
>>> discussion regarding the "waste".
>>>
>>>> Leo,
>>>> There may be confusion here as to where the space is wasted. 1 vs. 8
>>>> bytes per doc on disk is peanuts, sure, but in RAM it is not, and that
>>>> is the concern. AFAIK the norms are memory-mapped in, and we need to
>>>> ensure it's trivial to know which offset to go to on disk based on a
>>>> document id, which precludes compression, but maybe you have ideas to
>>>> improve that.
>>>
>>> First, my understanding is that all essential parts of the Lucene
>>> index are memory-mapped, in particular the inverted index (in the most
>>> common scenario, at least); otherwise search performance is miserable.
>>> That said, memory-mapping a few extra bytes per document shouldn't
>>> make a noticeable difference.
>>>
>>> Also, judging by the code in the class Lucene53NormsProducer and a
>>> debug session, Lucene only maps a compressed segment containing norm
>>> values. Norms are stored using 1, 2, 4, or 8 bytes and are decompressed
>>> into an 8-byte long, probably on a per-slice basis.
>>>
>>> Anyway, situations in which you get more than 65,536 words per
>>> document are quite rare, and documents with four billion words (or
>>> more) are exotic. If you have such enormous documents, saving on
>>> document normalization factors won't be your first priority; you would
>>> sooner think about splitting such a huge document, which contains
>>> every possible keyword, into something more manageable.
>>>
>>> To sum up, for 99.999% of users, squeezing normalization factors into
>>> a single byte has no benefit at all. Memoization does seem to speed
>>> things up a bit, but I suspect this may disappear with new generations
>>> of CPUs.
>>>
>>>> To use your own norms encoding, see Codec.normsFormat. (Disclaimer: I
>>>> haven't used this, but I know where to look.)
>>>
>>> Ok, thanks.
>>>
>>>> ~ David
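For completeness, the Codec.normsFormat route would look roughly like the sketch below: wrap the default codec and swap in a custom norms encoding (WiderNormsFormat is hypothetical, something one would still have to write). As the exchange above notes, though, the stock Lucene53NormsFormat already handles wide values; the truncation happens in the Similarity.

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.NormsFormat;

    // Wraps the default codec, delegating everything except the norms format.
    public final class WideNormsCodec extends FilterCodec {
      private final NormsFormat normsFormat = new WiderNormsFormat(); // hypothetical

      public WideNormsCodec() {
        super("WideNormsCodec", Codec.getDefault());
      }

      @Override
      public NormsFormat normsFormat() {
        return normsFormat;
      }
    }

    // Index side: new IndexWriterConfig(analyzer).setCodec(new WideNormsCodec());

A custom codec name also has to be registered through Lucene's SPI mechanism (META-INF/services) so that segments written with it can be opened later.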
>>>>
>>>> On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> for some reason I didn't get a reply from the mailing list directly,
>>>>> so I have to send a new message. I would appreciate it if something
>>>>> could be fixed so that I get a reply as well.
>>>>>
>>>>> First of all, I don't buy the claim that the issue is well known. I
>>>>> would actually argue that nobody except a few Lucene devs knows
>>>>> about it. There is also a bug in Lucene's tutorial example. This
>>>>> needs to be fixed as well.
>>>>>
>>>>> Neither do I find your arguments convincing. In particular, I don't
>>>>> think that there is any serious waste of space. Please see my
>>>>> detailed comments below, and note that I definitely don't know all
>>>>> the internals well, so I would appreciate it if you could explain
>>>>> them better.
>>>>>
>>>>>> The downsides are documented and known. But I don't think you are
>>>>>> fully documenting the tradeoffs here: by encoding up to a 64-bit
>>>>>> long, you can use up to *8x more memory and disk space* than what
>>>>>> Lucene does by default, and that is per field.
>>>>>
>>>>> This is not true. First of all, the increase applies only to textual
>>>>> fields; simple fields like integers don't use normalization factors,
>>>>> so there is no increase for them. In the worst case, you will have 7
>>>>> extra bytes for a *text* field, and that is not an 8x increase.
>>>>>
>>>>> If you do *compress* the length of the text field, then the overhead
>>>>> depends on the field's size: one extra byte is required for fields
>>>>> with more than 256 words, two extra bytes for fields with more than
>>>>> 65,536 words, and so on. Compared to the field sizes themselves, a
>>>>> several-byte increase is simply *laughable*.
>>>>>
>>>>> If Lucene saves the normalization factor *without* compression, it
>>>>> should already be using 8 bytes, so storing the full document length
>>>>> won't make a difference.
>>>>>
>>>>>> So that is a real trap. Maybe throw an exception there instead if
>>>>>> the boost != 1F (just don't support it), and add a guard for
>>>>>> "supermassive" documents, so that at most only 16 bits are ever
>>>>>> used instead of 64. The document need not really be massive; it can
>>>>>> happen just from a strange analysis chain (n-grams etc.) that you
>>>>>> get large values here.
>>>>>
>>>>> As mentioned above, storing a few extra bytes for supermassive
>>>>> documents doesn't affect the overall storage by more than a tiny
>>>>> fraction of a percent.
>>>>>
>>>>>> I have run comparisons in the past on standard collections to see
>>>>>> what happens with this "feature", and differences were very small.
>>>>>> I think in practice people do far more damage by sharding their
>>>>>> document collections but not using a distributed interchange of
>>>>>> IDF, causing results from different shards to be incomparable :)
>>>>>
>>>>> Ok, this is not what I see on my data. I see *more than* a 10%
>>>>> degradation, and that is not peanuts. Do we want to re-run
>>>>> experiments on standard collections? Don't forget that Lucene is now
>>>>> used as a baseline to compare against: people claim to beat BM25
>>>>> while they beat something inferior.
>>>>>
>>>>>> As far as the BM25 tableization, that is *not* the justification
>>>>>> for using an 8-byte encoding. The 8-byte encoding was already there
>>>>>> long ago (to save memory/space) when BM25 was added; that small
>>>>>> optimization just takes advantage of it. The optimization exists
>>>>>> just so that BM25 will have performance comparable to
>>>>>> ClassicSimilarity.
>>>>>
>>>>> Sorry, I don't understand this comment. What kind of 8-byte encoding
>>>>> are you talking about? Do you mean a single-byte encoding? That is
>>>>> what the current BM25 similarity seems to use. I also don't quite
>>>>> understand what is a justification for what; please clarify.
>>>>>
>>>>>> Either way, the document's length can be stored with more accuracy,
>>>>>> without wasting space, especially if you don't use index-time
>>>>>> boosting. But the default encoding supports features like that
>>>>>> because Lucene supports all these features.
>>>>>
>>>>> Sorry, I don't get this again. Which features should Lucene support?
>>>>> If you would like to use boosting in exactly the same way you used
>>>>> it before (though I wouldn't recommend doing so), you can do this;
>>>>> in fact, my implementation tries to mimic this as much as possible.
>>>>> If you mean something else, please clarify.
>>>>>
>>>>> Also, how does one save the document length with more accuracy? Is
>>>>> there a special API or something?
>>>>>
>>>>> Thank you!
>>>>>
>>>>>> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
>>>>>>
>>>>>>> Hi everybody,
>>>>>>>
>>>>>>> Some time ago, I had to re-implement some Lucene similarities (in
>>>>>>> particular BM25 and the older cosine). I noticed that the
>>>>>>> re-implemented version (despite using the same formula) performed
>>>>>>> better on my data set. The main difference was that my version did
>>>>>>> not approximate the document length.
>>>>>>>
>>>>>>> Recently, I have implemented a modification of the current Lucene
>>>>>>> BM25 that doesn't use this approximation either. I compared the
>>>>>>> existing and the modified similarities (again, on some of my
>>>>>>> quirky data sets). The results are as follows:
>>>>>>>
>>>>>>> 1) The modified Lucene BM25 similarity is, indeed, a tad slower
>>>>>>> (3-5% in my tests).
>>>>>>> 2) The modified Lucene BM25 is also more accurate.
>>>>>>>
>>>>>>> (I don't see a good reason why memoization and document-length
>>>>>>> approximation should result in any efficiency gain at all, but
>>>>>>> this is what seems to happen with the current hardware.)
>>>>>>>
>>>>>>> If this potential accuracy degradation concerns you, additional
>>>>>>> experiments using more standard collections can be done (e.g.,
>>>>>>> some TREC collections).
>>>>>>>
>>>>>>> In any case, the reproducible example (which also links to a more
>>>>>>> detailed explanation) is in my repo:
>>>>>>> https://github.com/searchivarius/AccurateLuceneBM25
>>>>>>>
>>>>>>> Many thanks!
>>>>>>>
>>>>>>> ---
>>>>>>> Leo
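To spell out the tableization point and the proposed alternative in plain arithmetic: because the stored norm is a single byte, the length-dependent part of BM25's denominator can be precomputed into a 256-entry table once per query, while an exact-length variant computes it per matching document. A standalone sketch of the two paths (ordinary BM25 arithmetic, not Lucene's actual classes; the decodedLengths array stands in for whatever decoding the similarity uses):

    // BM25: score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len / avgdl))
    public class Bm25NormPaths {
      static final float K1 = 1.2f, B = 0.75f;

      // Memoized path: with a one-byte norm there are only 256 possible
      // lengths, so the whole length factor is precomputed once per query.
      static float[] buildCache(float avgdl, float[] decodedLengths /* 256 */) {
        float[] cache = new float[256];
        for (int i = 0; i < 256; i++) {
          cache[i] = K1 * (1 - B + B * decodedLengths[i] / avgdl);
        }
        return cache;
      }

      static float scoreCached(float idf, float tf, int normByte, float[] cache) {
        return idf * tf * (K1 + 1) / (tf + cache[normByte & 0xFF]);
      }

      // Exact path: keep the true length and compute the factor per document.
      // Slightly more arithmetic per hit, but no quantization error.
      static float scoreExact(float idf, float tf, long exactLength, float avgdl) {
        return idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * exactLength / avgdl));
      }
    }

The cached path trades a small amount of per-hit arithmetic for the quantization error discussed throughout this thread.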
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
