Hi David, thank you for picking it up. Now we are having a more meaningful discussion regarding the "waste".
> Leo,
>
> There may be confusion here as to where the space is wasted. 1 vs 8 bytes per doc on disk is peanuts, sure, but in RAM it is not and that is the concern. AFAIK the norms are memory-mapped in, and we need to ensure it's trivial to know which offset to go to on disk based on a document id, which precludes compression but maybe you have ideas to improve that.

First, my understanding is that all essential parts of the Lucene index are memory-mapped, in particular the inverted index (in the most common scenario at least); otherwise, search performance is miserable. That said, memory-mapping a few extra bytes per document shouldn't make a noticeable difference.

Also, judging by the code in the class Lucene53NormsProducer and a debug session, Lucene only maps a compressed segment containing norm values. Norms are stored using 1, 2, 4, or 8 bytes and are decompressed into an 8-byte long, probably on a per-slice basis.

Anyway, situations in which you get more than 65536 words per document are quite rare, and documents with 4 billion words (or more) are exotic. If you have such enormous documents, saving on document normalization factors won't be your first priority: you would probably think about ways of splitting such a huge document, which contains every possible keyword, into something more manageable.

To sum up, for 99.999% of users, squeezing normalization factors into a single byte has absolutely no benefit. Memoization does seem to speed things up a bit, but I suspect this gain may disappear with new generations of CPUs. (See the first sketch at the end of this message for an illustration of how the one-byte encoding loses precision.)

> To use your own norms encoding, see Codec.normsFormat. (disclaimer: I haven't used this but I know where to look)

Ok, thanks. (I put a bare skeleton of what that could look like at the end of this message as well.)

> ~ David
>
> On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:
>
> > Hi,
> >
> > for some reason I didn't get a reply from the mailing list directly, so I have to send a new message. I would appreciate it if this could be fixed, so that I get a reply as well.
> >
> > First of all, I don't buy the claim about the issue being well known. I would actually argue that nobody except a few Lucene devs knows about it. There is also a bug in Lucene's tutorial example. This needs to be fixed as well.
> >
> > Neither do I find your arguments convincing. In particular, I don't think that there is any serious waste of space. Please see my detailed comments below, and please note that I definitely don't know all the internals well, so I would appreciate it if you could explain them better.
> >
> >> The downsides are documented and known. But I don't think you are fully documenting the tradeoffs here, by encoding up to a 64-bit long, you can use up to *8x more memory and disk space* than what lucene does by default, and that is per-field.
> >
> > This is not true. First of all, the increase applies only to textual fields. Simple fields like integers don't use normalization factors, so there is no increase for them.
> >
> > In the worst case, you will have 7 extra bytes for a *text* field. However, this is not an 8x increase.
> >
> > If you do *compress* the length of the text field, then its size will depend on the size of the text field. For example, one extra byte will be required for fields that contain more than 256 words, two extra bytes for fields having more than 65536 words, and so forth. *Compared to the field sizes, a several-byte increase is simply laughable.*
> > If Lucene saves the normalization factor *without compression*, it should already use 8 bytes. So storing the full document length won't make a difference.
> >
> >> So that is a real trap. Maybe throw an exception there instead if the boost != 1F (just don't support it), and add a guard for "supermassive" documents, so that at most only 16 bits are ever used instead of 64. The document need not really be massive, it can happen just from a strange analysis chain (n-grams etc) that you get large values here.
> >
> > As mentioned above, storing a few extra bytes for supermassive documents doesn't affect the overall storage by more than a tiny fraction of a percent.
> >
> >> I have run comparisons in the past on standard collections to see what happens with this "feature" and differences were very small. I think in practice people do far more damage by sharding their document collections but not using a distributed interchange of IDF, causing results from different shards to be incomparable :)
> >
> > Ok, this is not what I see on my data. I see *more than* a 10% degradation. This is not peanuts. Do we want to re-run experiments on standard collections? Don't forget that Lucene is now used as a baseline to compare against. People claim to beat BM25 while they beat something inferior.
> >
> >> As far as the bm25 tableization, that is *not* the justification for using an 8 byte encoding. The 8 byte encoding was already there long ago (to save memory/space) when bm25 was added, that small optimization just takes advantage of it. The optimization exists just so that bm25 will have comparable performance to ClassicSimilarity.
> >
> > Sorry, I don't understand this comment. What kind of 8-byte encoding are you talking about? Do you mean a single-byte encoding? This is what the current BM25 similarity seems to use.
> >
> > I also don't quite understand what is a justification for what; please clarify.
> >
> >> Either way, the document's length can be stored with more accuracy, without wasting space, especially if you don't use index-time boosting. But the default encoding supports features like that because lucene supports all these features.
> >
> > Sorry, I don't get this again. Which features should Lucene support? If you would like to use boosting in exactly the same way you used it before (though I won't recommend doing so), you can do this. In fact, my implementation tries to mimic this as much as possible. If you mean something else, please clarify.
> >
> > Also, how does one save the document length with more accuracy? Is there a special API or something?
> >
> > Thank you!
> >
> >> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
> >> > Hi everybody,
> >> >
> >> > Some time ago, I had to re-implement some Lucene similarities (in particular BM25 and the older cosine). I noticed that the re-implemented version (despite using the same formula) performed better on my data set. The main difference was that my version did not approximate the document length.
> >> >
> >> > Recently, I have implemented a modification of the current Lucene BM25 that doesn't use this approximation either. I compared the existing and the modified similarities (again on some of my quirky data sets).
> >> > The results are as follows:
> >> >
> >> > 1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in my tests).
> >> > 2) The modified Lucene BM25 is also more accurate.
> >> >
> >> > (I don't see a good reason as to why memoization and document length approximation result in any efficiency gain at all, but this is what seems to happen with the current hardware.)
> >> >
> >> > If this potential accuracy degradation concerns you, additional experiments using more standard collections can be done (e.g., some TREC collections).
> >> >
> >> > In any case, the reproducible example (which also links to a more detailed explanation) is in my repo:
> >> > https://github.com/searchivarius/AccurateLuceneBM25
> >> >
> >> > Many thanks!
> >> >
> >> > ---
> >> > Leo
> >>
> >> ---
> >
> > Leo
>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com

---
Leo
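P.S. To make the precision argument concrete, here is a small standalone sketch I threw together for this message (it is not code from Lucene or from my AccurateLuceneBM25 repo, and the class name and example lengths are made up). As far as I can tell from the 6.x sources, BM25Similarity encodes the norm as SmallFloat.floatToByte315(boost / sqrt(fieldLength)) and the scoring side recovers an approximate length as 1/f^2 from the decoded byte; the sketch assumes a boost of 1 and simply prints which byte a few exact lengths map to and what they decode back to.

import org.apache.lucene.util.SmallFloat;

// Standalone illustration: how the default one-byte norm encoding collapses
// distinct field lengths (class name and example lengths are made up).
public class NormPrecisionDemo {

  // Roughly what BM25Similarity.encodeNormValue(1.0f, fieldLength) does.
  static byte encode(int fieldLength) {
    return SmallFloat.floatToByte315(1f / (float) Math.sqrt(fieldLength));
  }

  // Roughly what the scoring side recovers: 1 / f^2, where f is the decoded small float.
  static float decodeLength(byte norm) {
    float f = SmallFloat.byte315ToFloat(norm);
    return 1f / (f * f);
  }

  public static void main(String[] args) {
    for (int len = 1000; len <= 1200; len += 50) {
      byte norm = encode(len);
      System.out.printf("exact length = %4d -> norm byte = %4d -> decoded length ~ %.1f%n",
          len, norm, decodeLength(norm));
    }
  }
}

Running it against a 6.x Lucene jar shows that some neighboring lengths map to the same byte and therefore decode to the same approximate length; this is exactly the information my modified similarity keeps instead of throwing away.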

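P.P.S. Regarding the Codec.normsFormat pointer: below is the bare skeleton of where a custom norms encoding would plug in. This is only a sketch I have not used in anger; the class and codec name are made up, and it still returns the delegate's format, so by itself it changes nothing. A real implementation would return a NormsFormat whose normsConsumer()/normsProducer() write and read exact field lengths. Note that a custom codec also needs an SPI registration (a META-INF/services entry for org.apache.lucene.codecs.Codec) so that segments written with it can be opened later, and it is installed via IndexWriterConfig.setCodec(...).

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.NormsFormat;

// Skeleton only: shows the hook for a custom norms encoding.
public class ExactLengthCodec extends FilterCodec {

  public ExactLengthCodec() {
    // Wrap the current default codec; only the norms format would be swapped out.
    super("ExactLengthCodec", Codec.getDefault());
  }

  @Override
  public NormsFormat normsFormat() {
    // Placeholder: a real implementation would return a custom NormsFormat
    // that stores exact field lengths instead of the one-byte approximation.
    return delegate.normsFormat();
  }
}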