I agree that using one byte by default is questionable on modern machines, and given common text field sizes as well. I think my earlier description of how norms are encoded/accessed was off. Lucene53NormsFormat supports a long, I see, and it's clever about observing the max bytes-per-value actually needed, so there's no need for a new format. It's the Similarity impls (BM25 is one, but others do this too) that choose to encode a smaller value (roughly sketched below). It would be nice to have this be toggle-able! Maybe just a boolean flag?
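For concreteness, here is roughly where that single-byte decision happens in the 6.x line, paraphrased from memory of the sources rather than copied verbatim, so treat it as an approximate sketch and not the shipped code: the norms format will happily persist a full long per document, but the BM25 similarity squeezes boost/sqrt(fieldLength) into one byte via SmallFloat before the norm ever reaches the codec.

```java
// Approximately what the 6.x BM25 similarity does when it writes a norm
// (paraphrased sketch; see the real Lucene sources for the authoritative version).
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.util.SmallFloat;

public class NormEncodingSketch {

  // The Similarity, not the norms format, decides to use one byte:
  // the field length is folded into boost / sqrt(length) and quantized to 8 bits.
  static byte encodeNormValue(float boost, int fieldLength) {
    return SmallFloat.floatToByte315(boost / (float) Math.sqrt(fieldLength));
  }

  // computeNorm() returns a long, so the codec could store up to 8 bytes,
  // but the value handed to it already fits in a single byte.
  static long computeNorm(FieldInvertState state) {
    int numTerms = state.getLength();  // (discountOverlaps handling omitted)
    return encodeNormValue(state.getBoost(), numTerms);
  }
}
```

A "boolean flag" version would presumably just skip the SmallFloat call here and return the raw length from computeNorm(), with the scoring side decoding accordingly.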
On Thu, Jul 7, 2016 at 9:52 PM Leo Boytsov <[email protected]> wrote:

> Hi David,
>
> thank you for picking it up. Now we are having a more meaningful discussion regarding the "waste".
>
> Leo,
>
>> There may be confusion here as to where the space is wasted. 1 vs 8 bytes per doc on disk is peanuts, sure, but in RAM it is not, and that is the concern. AFAIK the norms are memory-mapped in, and we need to ensure it's trivial to know which offset to go to on disk based on a document id, which precludes compression, but maybe you have ideas to improve that.
>
> First, my understanding is that all essential parts of the Lucene index are memory-mapped, in particular the inverted index (in the most common scenario at least). Otherwise, the search performance is miserable. That said, memory-mapping a few extra bytes per document shouldn't make a noticeable difference.
>
> Also, judging by the code in the class Lucene53NormsProducer and a debug session, Lucene only maps a compressed segment containing norm values. Norms are stored using 1, 2, 4, or 8 bytes and are uncompressed into an 8-byte long, probably on a per-slice basis.
>
> Anyways, situations in which you will get more than 65536 words per document are quite rare. Situations with documents having 4 billion words (or more) are exotic. If you have such enormous documents, again, saving on document normalization factors won't be your first priority; you would probably think about ways of splitting such a huge document containing every possible keyword into something more manageable.
>
> To sum up, for 99.999% of users, squeezing normalization factors into a single byte has absolutely no benefit. Memoization does seem to speed things up a bit, but I suspect this may disappear with new generations of CPUs.
>
>> To use your own norms encoding, see Codec.normsFormat. (Disclaimer: I haven't used this, but I know where to look.)
>
> Ok, thanks.
>
>> ~ David
>>
>> On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> for some reason I didn't get a reply from the mailing list directly, so I have to send a new message. I'd appreciate it if something could be fixed so that I get a reply as well.
>>>
>>> First of all, I don't buy the claim about the issue being well known. I would actually argue that nobody except a few Lucene devs knows about it. There is also a bug in Lucene's tutorial example; this needs to be fixed as well.
>>>
>>> Neither do I find your arguments convincing. In particular, I don't think that there is any serious waste of space. Please see my detailed comments below. Note that I definitely don't know all the internals well, so I'd appreciate it if you could explain them better.
>>>
>>>> The downsides are documented and known. But I don't think you are fully documenting the tradeoffs here: by encoding up to a 64-bit long, you can use up to *8x more memory and disk space* than what lucene does by default, and that is per-field.
>>>
>>> This is not true. First of all, the increase is only for the textual fields. Simple fields like integers don't use normalization factors, so there is no increase for them.
>>>
>>> In the worst case, you will have 7 extra bytes for a *text* field. However, this is not an 8x increase.
>>>
>>> If you do *compress* the length of the text field, then its size will depend on the size of the text field. For example, one extra byte will be required for fields that contain more than 256 words, two extra bytes for fields having more than 65536 words, and so on and so forth. Compared to the field sizes, a several-byte increase is simply *laughable*.
>>>
>>> If Lucene saves the normalization factor *without compression*, it should already use 8 bytes now, so storing the full document length won't make a difference.
>>>
>>>> So that is a real trap. Maybe throw an exception there instead if the boost != 1F (just don't support it), and add a guard for "supermassive" documents, so that at most only 16 bits are ever used instead of 64. The document need not really be massive; it can happen just from a strange analysis chain (n-grams etc) that you get large values here.
>>>
>>> As mentioned above, storing a few extra bytes for supermassive documents doesn't affect the overall storage by more than a tiny fraction of a percent.
>>>
>>>> I have run comparisons in the past on standard collections to see what happens with this "feature" and differences were very small. I think in practice people do far more damage by sharding their document collections but not using a distributed interchange of IDF, causing results from different shards to be incomparable :)
>>>
>>> Ok, this is not what I see on my data. I see *more than* a 10% degradation. This is not peanuts. Do we want to re-run experiments on standard collections? Don't forget that Lucene is now used as a baseline to compare against: people claim to beat BM25 while they beat something inferior.
>>>
>>>> As far as the bm25 tableization, that is *not* the justification for using an 8 byte encoding. The 8 byte encoding was already there long ago (to save memory/space) when bm25 was added; that small optimization just takes advantage of it. The optimization exists just so that bm25 will have comparable performance to ClassicSimilarity.
>>>
>>> Sorry, I don't understand this comment. What kind of 8-byte encoding are you talking about? Do you mean a single-byte encoding? That is what the current BM25 similarity seems to use.
>>>
>>> I also don't quite understand what is a justification for what; please clarify.
>>>
>>>> Either way, the document's length can be stored with more accuracy, without wasting space, especially if you don't use index-time boosting. But the default encoding supports features like that because lucene supports all these features.
>>>
>>> Sorry, I don't get this again. Which features should Lucene support? If you'd like to use boosting in exactly the same way you used it before (though I won't recommend doing so), you can do this. In fact, my implementation tries to mimic this as much as possible. If you mean something else, please clarify.
>>>
>>> Also, how does one save the document length with more accuracy? Is there a special API or something?
>>>
>>> Thank you!
>>>
>>>> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
>>>>
>>>>> Hi everybody,
>>>>>
>>>>> Some time ago, I had to re-implement some Lucene similarities (in particular BM25 and the older cosine). I noticed that the re-implemented version (despite using the same formula) performed better on my data set. The main difference was that my version did not approximate the document length.
>>>>>
>>>>> Recently, I have implemented a modification of the current Lucene BM25 that doesn't use this approximation either. I compared the existing and the modified similarities (again on some of my quirky data sets). The results are as follows:
>>>>>
>>>>> 1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in my tests).
>>>>> 2) The modified Lucene BM25 is also more accurate.
>>>>>
>>>>> (I don't see a good reason why memoization and document length approximation should result in any efficiency gain at all, but this is what seems to happen with the current hardware.)
>>>>>
>>>>> If this potential accuracy degradation concerns you, additional experiments using more standard collections can be done (e.g., some TREC collections).
>>>>>
>>>>> In any case, the reproducible example (which also links to a more detailed explanation) is in my repo: https://github.com/searchivarius/AccurateLuceneBM25
>>>>>
>>>>> Many thanks!
>>>>>
>>>>> ---
>>>>> Leo
>>
>> --
>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
>
> ---
> Leo

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
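For anyone curious how much precision the default single-byte norm gives up, here is a small illustration built around Lucene's SmallFloat (the float-to-byte scheme the 6.x BM25 similarity relies on). It is a standalone sketch written for this thread, not code from either implementation, and the exact bucket boundaries depend on the Lucene version; the round trip simply counts how many distinct document lengths collapse onto each encoded byte.

```java
// Standalone sketch: round-trip document lengths through the single-byte
// norm encoding (boost = 1) and count how many distinct raw lengths end up
// sharing the same encoded byte. Requires lucene-core on the classpath.
import java.util.Map;
import java.util.TreeMap;
import org.apache.lucene.util.SmallFloat;

public class NormPrecisionDemo {
  public static void main(String[] args) {
    Map<Byte, Integer> collisions = new TreeMap<>();
    for (int length = 1; length <= 10_000; length++) {
      // What the 6.x BM25 similarity stores: one byte of 1 / sqrt(length).
      byte norm = SmallFloat.floatToByte315(1f / (float) Math.sqrt(length));
      collisions.merge(norm, 1, Integer::sum);
    }
    // Each entry: encoded byte -> how many different true lengths map to it.
    // The "length" the scorer later reconstructs is 1 / decoded^2, i.e. a coarse bucket.
    collisions.forEach((norm, count) -> {
      float decoded = SmallFloat.byte315ToFloat(norm);
      float approxLength = 1f / (decoded * decoded);
      System.out.printf("byte %4d -> ~length %10.1f (%d distinct lengths)%n",
          norm, approxLength, count);
    });
  }
}
```

Because the decode side only ever sees 256 possible values, the BM25 similarity can precompute a per-byte cache (the "memoization" or "tableization" discussed above); widening the norm gives up that table in exchange for the exact length.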

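If someone wants to experiment with an exact-length similarity such as the one in the linked AccurateLuceneBM25 repo (or any custom Similarity), it has to be configured consistently at index time and at search time, because the norm encoding is baked into the index and changing it means re-indexing. A minimal wiring sketch under those assumptions (the "index" path and the stand-in BM25Similarity are just placeholders; swap in the custom implementation where noted):

```java
// Minimal wiring sketch (Lucene 6.x era APIs).
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SimilarityWiringSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder: replace with an exact-length Similarity implementation
    // (e.g., the one from the repo linked above).
    Similarity sim = new BM25Similarity();

    // Index time: norms are computed by the Similarity configured here,
    // so changing the encoding requires re-indexing.
    Directory dir = FSDirectory.open(Paths.get("index"));
    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    cfg.setSimilarity(sim);
    try (IndexWriter writer = new IndexWriter(dir, cfg)) {
      // ... add documents ...
    }

    // Search time: the same Similarity must decode the norms it wrote.
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      searcher.setSimilarity(sim);
      // ... run queries ...
    }
  }
}
```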