Hi David,
Submitting a patch wouldn't be a problem. But let me first do a couple more tests with more collections (this time I will try more standard and larger ones).
Thanks!
---
Leo

On Sat, Jul 9, 2016 at 10:20 AM, David Smiley <[email protected]> wrote:
> --ok; (they already have configuration parameters).
>
> Leo, if you can submit a patch to extend the BM25 similarity, I would welcome it.
>
> On Sat, Jul 9, 2016 at 7:11 AM Robert Muir <[email protected]> wrote:
>> Our similarities do not need a boolean flag. Instead we should focus on making them as simple as possible: there can always be alternative implementations.
>>
>> On Sat, Jul 9, 2016 at 1:08 AM, David Smiley <[email protected]> wrote:
>> > I agree that using one byte by default is questionable on modern machines and given common text field sizes as well. I think my understanding of how norms are encoded/accessed may be wrong from what I had said. Lucene53NormsFormat supports Long, I see, and it's clever about observing the max bytes-per-value needed. No need for some new format. It's the Similarity impls (BM25 is one, but others do this too) that choose to encode a smaller value. It would be nice to have this be toggle-able! Maybe just a boolean flag?
>> >
>> > On Thu, Jul 7, 2016 at 9:52 PM Leo Boytsov <[email protected]> wrote:
>> >> Hi David,
>> >> thank you for picking this up. Now we are having a more meaningful discussion regarding the "waste".
>> >>
>> >>> Leo,
>> >>> There may be confusion here as to where the space is wasted. 1 vs 8 bytes per doc on disk is peanuts, sure, but in RAM it is not, and that is the concern. AFAIK the norms are memory-mapped in, and we need to ensure it's trivial to know which offset to go to on disk based on a document id, which precludes compression, but maybe you have ideas to improve that.
>> >>
>> >> First, my understanding is that all essential parts of the Lucene index are memory-mapped, in particular the inverted index (in the most common scenario at least). Otherwise, the search performance is miserable. That said, memory-mapping a few extra bytes per document shouldn't make a noticeable difference.
>> >>
>> >> Also, judging by the code in the class Lucene53NormsProducer and a debug session, Lucene only maps a compressed segment containing norm values. Norms are stored using 1, 2, 4, or 8 bytes and are uncompressed into an 8-byte long, probably on a per-slice basis.
>> >>
>> >> Anyway, situations in which you get more than 65536 words per document are quite rare. Situations with documents having 4 billion words (or more) are exotic. If you have such enormous documents, saving on document normalization factors won't be your first priority; you would probably think about ways of splitting such a huge document, containing every possible keyword, into something more manageable.
>> >>
>> >> To sum up, for 99.999% of users, squeezing normalization factors into a single byte has absolutely no benefit. Memoization does seem to speed things up a bit, but I suspect this advantage may disappear with new generations of CPUs.
>> >>
>> >>> To use your own norms encoding, see Codec.normsFormat. (disclaimer: I haven't used this, but I know where to look)
>> >>
>> >> Ok, thanks.
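For readers following the norm-encoding discussion above, here is a minimal, self-contained illustration of the precision loss being debated. It is only a sketch, assuming the Lucene 6.x-era setup where the one-byte norm is derived from SmallFloat.floatToByte315 applied to 1/sqrt(fieldLength); if a given version encodes norms differently, the exact numbers will differ, but the idea is the same.

    import org.apache.lucene.util.SmallFloat;

    public class NormPrecisionDemo {
      public static void main(String[] args) {
        // Encode several document lengths the way the one-byte norm encoding
        // does (index-time boost assumed to be 1), then decode the byte and
        // recover the approximate length the scorer would actually see.
        for (int len : new int[] {250, 300, 1000, 1150, 1300, 5000, 70_000}) {
          byte norm = SmallFloat.floatToByte315(1f / (float) Math.sqrt(len));
          float decoded = SmallFloat.byte315ToFloat(norm);
          float approxLen = 1f / (decoded * decoded);
          System.out.printf("true length = %6d, norm byte = %4d, approx length = %10.1f%n",
              len, norm, approxLen);
        }
      }
    }

Running it shows that nearby lengths can collapse to the same byte, which is the document-length approximation the thread is about.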
>> >>> ~ David
>> >>>
>> >>> On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:
>> >>> > Hi,
>> >>> > for some reason I didn't get a reply from the mailing list directly, so I have to send a new message. I would appreciate it if something could be fixed so that I get replies as well.
>> >>> >
>> >>> > First of all, I don't buy the claim that the issue is well known. I would actually argue that nobody except a few Lucene devs knows about it. There is also a bug in Lucene's tutorial example; this needs to be fixed as well.
>> >>> >
>> >>> > Neither do I find your arguments convincing. In particular, I don't think there is any serious waste of space. Please see my detailed comments below. Note that I definitely don't know all the internals well, so I would appreciate it if you could explain them better.
>> >>> >
>> >>> >> The downsides are documented and known. But I don't think you are fully documenting the tradeoffs here: by encoding up to a 64-bit long, you can use up to *8x more memory and disk space* than what Lucene does by default, and that is per-field.
>> >>> >
>> >>> > This is not true. First of all, the increase applies only to textual fields. Simple fields like Integers don't use normalization factors, so there is no increase for them.
>> >>> >
>> >>> > In the worst case, you will have 7 extra bytes for a *text* field. However, this is not an 8x increase.
>> >>> >
>> >>> > If you do *compress* the length of the text field, then its size will depend on the size of the text field. For example, one extra byte will be required for fields that contain more than 256 words, two extra bytes for fields with more than 65536 words, and so on and so forth. Compared to the field sizes, a *several-byte* increase is simply *laughable*.
>> >>> >
>> >>> > If Lucene stored the normalization factor *without compression*, it would already be using 8 bytes, so storing the full document length wouldn't make a difference.
>> >>> >
>> >>> >> So that is a real trap. Maybe throw an exception there instead if the boost != 1F (just don't support it), and add a guard for "supermassive" documents, so that at most only 16 bits are ever used instead of 64. The document need not really be massive; it can happen just from a strange analysis chain (n-grams etc.) that you get large values here.
>> >>> >
>> >>> > As mentioned above, storing a few extra bytes for supermassive documents doesn't affect the overall storage by more than a tiny fraction of a percent.
>> >>> >
>> >>> >> I have run comparisons in the past on standard collections to see what happens with this "feature" and differences were very small. I think in practice people do far more damage by sharding their document collections but not using a distributed interchange of IDF, causing results from different shards to be incomparable :)
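The byte counts quoted above (one extra byte past 256 words, two extra bytes past 65536, and so on) follow directly from how many bits an exact length needs. A tiny stand-alone sketch, not part of any patch, that computes them:

    public class LengthBytesDemo {
      // How many whole bytes are needed to store an exact field length?
      static int bytesNeeded(long fieldLength) {
        int bits = 64 - Long.numberOfLeadingZeros(Math.max(fieldLength, 1));
        return (bits + 7) / 8;
      }

      public static void main(String[] args) {
        // Print the exact storage cost for a range of field lengths.
        for (long len : new long[] {200, 300, 65_000, 70_000, 5_000_000, 4_000_000_000L, 5_000_000_000L}) {
          System.out.printf("field length = %,13d -> %d byte(s)%n", len, bytesNeeded(len));
        }
      }
    }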
>> >>> > Ok, this is not what I see on my data. I see *more than* a 10% degradation. This is not peanuts. Do we want to re-run experiments on standard collections? Don't forget that Lucene is now used as a baseline to compare against: people claim to beat BM25 while in fact they beat something inferior.
>> >>> >
>> >>> >> As far as the bm25 tableization, that is *not* the justification for using an 8 byte encoding. The 8 byte encoding was already there long ago (to save memory/space) when bm25 was added; that small optimization just takes advantage of it. The optimization exists just so that bm25 will have comparable performance to ClassicSimilarity.
>> >>> >
>> >>> > Sorry, I don't understand this comment. What kind of 8-byte encoding are you talking about? Do you mean a single-byte encoding? This is what the current BM25 similarity seems to use.
>> >>> >
>> >>> > I also don't quite understand what is a justification for what; please clarify.
>> >>> >
>> >>> >> Either way, the document's length can be stored with more accuracy, without wasting space, especially if you don't use index-time boosting. But the default encoding supports features like that because Lucene supports all these features.
>> >>> >
>> >>> > Sorry, I don't get this either. Which features should Lucene support? If you would like to use boosting in exactly the same way you used it before (though I wouldn't recommend doing so), you can do this. In fact, my implementation tries to mimic the existing behavior as much as possible. If you mean something else, please clarify.
>> >>> >
>> >>> > Also, how does one save the document length with more accuracy? Is there a special API or something?
>> >>> >
>> >>> > Thank you!
>> >>> >
>> >>> >> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
>> >>> >> > Hi everybody,
>> >>> >> >
>> >>> >> > Some time ago, I had to re-implement some Lucene similarities (in particular BM25 and the older cosine). I noticed that the re-implemented version (despite using the same formula) performed better on my data set. The main difference was that my version did not approximate the document length.
>> >>> >> >
>> >>> >> > Recently, I have implemented a modification of the current Lucene BM25 that doesn't use this approximation either. I compared the existing and the modified similarities (again, on some of my quirky data sets). The results are as follows:
>> >>> >> >
>> >>> >> > 1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in my tests).
>> >>> >> > 2) The modified Lucene BM25 is also more accurate.
>> >>> >> > (I don't see a good reason why memoization and document-length approximation should result in any efficiency gain at all, but this is what seems to happen with the current hardware.)
>> >>> >> >
>> >>> >> > If this potential accuracy degradation concerns you, additional experiments using more standard collections can be done (e.g., some TREC collections).
>> >>> >> >
>> >>> >> > In any case, the reproducible example (which also links to a more detailed explanation) is in my repo: https://github.com/searchivarius/AccurateLuceneBM25
>> >>> >> >
>> >>> >> > Many thanks!
>> >>> >> >
>> >>> >> > ---
>> >>> >> > Leo
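For readers unfamiliar with the "tableization"/memoization mentioned above, here is a rough sketch of the idea. This is not the actual Lucene source: it only illustrates that a one-byte norm admits a 256-entry per-query cache of the BM25 length-normalization term, while an exact (long) length leaves no small table to index into. The names approxLengths, normByte, and the placeholder decoder are made up for the illustration.

    public class Bm25CacheSketch {
      public static void main(String[] args) {
        float k1 = 1.2f, b = 0.75f;
        float avgDocLen = 250f;

        // Pretend table of decoded lengths: with a one-byte norm there are only
        // 256 possible values, so whatever the decoder is, it has 256 outputs.
        float[] approxLengths = new float[256];
        for (int i = 0; i < approxLengths.length; i++) {
          approxLengths[i] = i + 1;  // made-up decoder; real decoders differ
        }

        // Memoization step: precompute k1 * (1 - b + b * docLen / avgDocLen)
        // once per query, one entry per possible norm byte.
        float[] cache = new float[256];
        for (int i = 0; i < cache.length; i++) {
          cache[i] = k1 * ((1 - b) + b * approxLengths[i] / avgDocLen);
        }

        // Per-hit scoring then needs only a table lookup instead of recomputing
        // the length-normalization term for every document.
        int normByte = 42;   // the stored one-byte norm of some document
        float freq = 3f;     // term frequency in that document
        float idf = 2.2f;    // pretend IDF weight
        float score = idf * freq * (k1 + 1) / (freq + cache[normByte & 0xFF]);
        System.out.println("score = " + score);
      }
    }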
>> >>> > ---
>> >>> > Leo
>> >>>
>> >>> --
>> >>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>> >>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
>> >>
>> >> ---
>> >> Leo
>> >
>> > --
>> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
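One practical note for anyone reproducing the comparison: whichever BM25 variant is used (the stock one or an exact-length modification such as the one in the repository linked above), it has to be installed both at index time and at search time, because the similarity's computeNorm() runs while indexing and the scorer must decode the norms it expects. A minimal sketch, assuming stock Lucene 6.x APIs and using the built-in BM25Similarity as a stand-in for the modified class:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.search.similarities.Similarity;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SimilarityWiring {
      public static void main(String[] args) throws Exception {
        // Stand-in for the modified similarity; an exact-length variant
        // would be instantiated here instead.
        Similarity sim = new BM25Similarity(1.2f, 0.75f);

        // Index time: computeNorm() decides what gets written into the norms,
        // so the similarity must be set on the IndexWriterConfig.
        Directory dir = FSDirectory.open(Paths.get("test-index"));
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setSimilarity(sim);
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
          // ... add documents here ...
        }

        // Search time: set the same similarity on the IndexSearcher,
        // e.g. searcher.setSimilarity(sim).
      }
    }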
